Cosmic-Ray neutron Sensor PYthon tool (crspy 1.2): An open-source tool for the processing of cosmic-ray neutron and soil moisture data

. Understanding soil moisture dynamics at the sub-kilometre scale is increasingly important especially with continuous development of hyper-resolution land-surface and hydrological models. Cosmic Ray Neutron Sensors (CRNS) are able to 10 provide estimates of soil moisture at this elusive scale and networks of these sensors have been expanding across the world over the previous decade. However, each network currently implements its own protocol when processing raw data into soil moisture estimates. As a consequence, this lack of a harmonized global dataset can ultimately lead to limitations in the global assessment of the CRNS technology from multiple networks. Here we present crspy, an open-source python tool that is designed to facilitate the processing of raw CRNS data into soil moisture estimates in an easy and harmonized way. We outline 15 the basic structure of this tool discussing the correction methods used as well as discussing the metadata that crspy can create about each site. Metadata can add value to global scale studies of field scale soil moisture estimates by providing additional routes to understanding catchment similarities and differences. We demonstrate that current differences in processing methodologies can lead to misinterpretations when comparing sites from different networks and having a tool to provide a harmonized dataset can help to mitigate these issues. By being open source, crspy can also serve as a development and testing 20 tool for new understanding of the CRNS technology as well as being used as a teaching tool for the community.


Context and Background
Soil moisture exerts a large influence on hydrological (Van Loon et al., 2015), biogeochemical (Schlesinger et al., 2015), and climatic processes (Dobriyal et al., 2012;Koster et al., 2004), agricultural systems (Fontanet et al., 2018;Dutta et al., 2014), landslide modelling (Zhuo et al., 2019) and earth system sciences (Fang and Lakshmi, 2014;Bonan, 2008). Its accurate 25 measurement is important to advance our understanding of these areas of research. In-situ point scale soil moisture estimates, such as Time Domain Reflectometry (TDR), can provide higher temporal resolution; however spatial resolution is still limited, on the order of centimetres. Soil heterogeneity can lead to uncertainties when upscaling to the field scale (Western et al., 1999), which would be required for regional or larger scale hydrological modelling. Alternatively, satellite remote sensing products such as Soil Moisture Active Passive (SMAP) and Soil Moisture and Ocean Salinity (SMOS) can provide global estimates of 30 soil moisture at a coarser spatial (~40km resolution) and temporal (~3 days) scale, and at much shallower depths (~5cm) (Entekhabi et al., 2010;Kerr et al., 2001). It is accepted that we will require a finer spatial resolution than currently achievable through remote sensing estimates for tasks such as increasing our understanding of sub-kilometre land-atmosphere interactions, or towards the improvements of farming practices (such as through the process of irrigation scheduling), and so there is a need for additional processing of ancillary data for the downscaling of these products (Portal et al., 2020;Alemohammad et al., 35 2018). In addition, the recent push for hyper resolution global modelling means we require measurements at a finer spatial resolution, on the order of sub-kilometer scales (Wood et al., 2011). Bierkens et al., (2015) discussed the implications of moving from a more standard resolution ~50km model to a hyper-resolution model that is sub-kilometre. The study further discussed the need to move from sub-grid paradigms, that represent a conceptualized form of earth system dynamics from within the standard 50km resolution model, to explicit dynamics of earth system processes at scales < 50km. This requires a 40 greater understanding of environmental functions at sub-kilometre, spatial scales, which in turn requires accurate measurements of environment states at the same scales.
Cosmic-Ray Neutron Sensors (CRNS) are a relatively new technology that allows estimates of soil moisture at the field scale (~600m diameter) at hourly temporal resolution. Zreda et al., (2008) demonstrated that fast neutrons are mainly moderated by 45 hydrogen atoms, which allows us to infer changes in water content in the soil profile. A tube attached to the sensor, filled with a gas such as helium or boron trifluoride, is able to detect fast neutrons that pass through it by inducing a voltage difference. Desilets et al., (2010) introduced an equation used to convert neutron counting rates into gravimetric soil moisture which has been further improved upon by Dong et al., (2014) and Hawdon et al., (2014) where !"# is volumetric soil moisture (cm 3 /cm 3 ); * , + , ' are coefficients obtained from neutron particle physics modelling (Zreda et al., 2008;Desilets et al., 2010) and assumed to be constants; is the lattice (chemically-bounded 55 mineral) water (g g -1 ), is the water equivalent of soil organic carbon (g of water per g of soil), ,-is the bulk density of the dry soil (g/cm 3 ) and is the density of water defined as 1 g/cm 3 . .$) is the measured raw, uncorrected, neutron count identified over the given integration time, usually set to one hour. / , 0 , 1 , ! represent correction factors applied to The detection of background neutrons in the atmosphere, as a method to infer estimates of field scale soil moisture, was first 65 described in Zreda et al., (2008). In that study, the authors demonstrated that neutron intensity above the surface was inversely correlated with the amount of moisture in the soil below. This was developed further in Desilets et al., (2010), where the initial form of equation 1 was first described and applications of this technology continued to be explored within the earth sciences community (Desilets 2011;Franz et al., 2012;Rivera Villarreyes et al., 2011). A large-scale network of these sensors was subsequently deployed across the United States leading to the Cosmic-Ray Soil Moisture Observing System (COSMOS) 70 .
After the establishment of the first national-scale network in the US , other countries such as Australia (Hawdon et al., 2014), Germany (Zacharias et al., 2011, Bogena et al., 2016, and the UK (Evans et al., 2016, Cooper et al., 2021 established their individual national networks, as well as additional sensors located in smaller networks or individual 75 sites. Sensors from these networks have in some cases been running for up to 10 years, which can be potentially valuable for the understanding of soil hydrology. As these networks have grown so has the literature surrounding best practices for calibration and correction of the sensor signals, allowing us to have a lower uncertainty in CRNS soil moisture estimates Rosolem et al., 2013;Hawdon et al., 2014;Baatz et al., 2015;Schrön et al., 2017). As a consequence of improvements to the signal correction and sensor calibration, a divergence in methods is noticeable between different networks. 80 Each network inevitably implements its own protocol when correcting the neutron signal to give soil moisture estimates leading to a less harmonized dataset among networks. This is in part due to the difficulties that would be encountered in quickly changing data processing pipelines within already established databases. The benefit of such structures is that live data is available to stakeholders through online portals. Unfortunately, the interdependencies of a database does not lend itself to quick changes and so a post processing method could alleviate some of these issues. 85 This lack of a harmonized global dataset can ultimately lead to limitations in the global assessment of this technology from multiple CRNS networks. Discrepancies in processing methodology can leave questions around information obtained, and uncertainty propagated, from analysis and comparison of sensors in different networks, such as whether soil moisture signals can be attributed solely to environmental differences or processing differences. By not necessarily following all recommended 90 correction steps, the estimated soil moisture products from these sensors or even networks can be seen as sub-optimal, potentially undermining their true value. An example of the impact of evaluation sub-processed cosmic-ray soil moisture data from the US network against land surface models is presented by Dirmeyer et al. (2016). There is a consensus to follow certain steps and guidelines which are not uniformly applied across all networks. Known corrections to account for changes in atmospheric pressure, neutron intensity, atmospheric water vapour and aboveground biomass are applied differently and on 95 occasion not at all on some networks which could lead to different estimates of soil moisture , Hawdon et al., (2014), Evans et al., 2016). For example, Rosolem et al., (2013) demonstrated the influence on the neutron signal that occurs from changes in atmospheric water vapour over time. When comparing processed soil moisture estimates with, and without this additional signal correction, they demonstrated a difference of up to 0.1 cm 3 /cm 3 at a site at Park Falls, USA.
Additionally, Hawdon et al., (2014) demonstrated the different approaches available in correcting neutron counts for incoming 100 cosmic-ray intensity and showed that there is a noticeable difference in neutron counts and ultimately soil moisture depending on the chosen method. Schrön et al., (2017) provided an improved approach to CRNS calibration demonstrating that their revised approach improves accuracy of soil moisture estimates. Using UK sites as an example, Schrön et al., (2017) found that the Root-Mean-Squared-Error (RMSE) of soil moisture estimates from the CRNS was reduced from 5.3% volumetric, using the conventional calibration approach, to 1.4% volumetric, using the revised calibration approach. Improvements in accuracy 105 were identified at all the sites they analysed. Although this revised approach is being adopted in more recent studies (Cooper et al., 2021), this is not always the case (such as the original sites in the COSMOS network) and can mean that sites in different networks have been calibrated using different methods.
In order to mitigate this ongoing issue of lack of harmonization in the soil moisture estimates from the CRNS technology, we 110 present here an open-source python tool to process raw CRNS data into soil moisture estimates, using the most current methods identified in the literature. It is designed to allow a user to apply consistent data processing methods across sensors that may be located in different networks. Section 2 will describe the structure of the tool along with the relevant correction and calibration methods. It will also describe the site metadata creation process which is an additional aspect to crspy that is built to facilitate data analysis of many sites. Section 3 will discuss the implications of differing processing methodologies on soil 115 moisture estimates, as well as the benefits of creating detailed metadata for post processing analysis.

The crspy tool
The Cosmic Ray neutron Sensor Python tool (crspy, pronounced "crispy"): is a tool written in Python3 that has been developed to facilitate the processing of the global networks of CRNS data in a uniform and harmonized way. It is available through an open-source repository and can be installed into a user's python environment. The tool is designed to allow the easy 120 implementation of the most up to date correction factors and calibration processes to any CRNS site globally, ultimately allowing for any user to access a harmonized dataset. Although it is designed for multiple sites from varied networks, crspy is versatile enough to process a single site as well. It is being provided to help facilitate research in the CRNS community and is not intended to state whether one networks processing methods are superior to another. It is the authors' opinion, however, that it is important for the community to consider creation of a best practice, as this will allow comparison of sensor data 125 around the world in the future. In addition, crspy is structurally designed to accommodate new corrections and processing steps that may become available in the future in an easy manner. By being open source, crspy can also serve as a development and testing tool for any new understanding of the CRNS technology, as well as a teaching tool for the community. Figure 1 is a visual representation of the processes within crspy that converts raw sensor data into corrected soil moisture 130 estimates. Due to the varied nature of input data, such as when different networks label data differently, it is first necessary for a user to correctly format input data following crspy's naming convention (see Table A1 in Appendix). Additionally, to organise the various input and output datasets a specific working directory folder structure is necessary. This allows crspy to automatically handle the numerous sources of data. After installing the package a user can build this folder structure easily with the function crspy.initial(wd) where wd is a string representing the working directory location. 135

140
(5) represents the calibration process (if this option is selected) (see Section 2.2). (6) highlights the quality assessment steps undertaken (see Section 2.3). Finally, (7) represents the step where soil moisture estimates are calculated from the neutron counting rates (refer to equation 1). DATE   To obtain soil moisture estimates, we need to apply equation 1 at each time step in the data. The values will be obtained from time varying sensor data, external data products, static site-specific values and static values that are not site-specific. The coefficients [ * + ' ] are constants with values of 0.0808, 0.372 and 0.115, respectively, defined in Desilets et al., (2010).
These values are fitting constants that describe the shape of the relationship between neutron counts and soil moisture, obtained from neutron particle physics modelling, and are the same for all sites. These values are stored in the config.ini file, which 150 stores constant values for crspy.

Site-specific soil properties
The site-specific soil parameters described in Equation 1 are , (obtained from soil organic carbon) and ,-. Due to the open data policies of many of the CRNS networks this data is usually available online (see data availability statement). 155 These values should be defined prior to running the main crspy function and are stored and read from the metadata file.
The parameter corresponds to the lattice water (%), which represents the hydrogen contained in the mineral structures of the soil (Hawdon et al., 2014). As fast neutrons are mitigated by hydrogen atoms, regardless of their source, this will have an overall impact on the neutron count rate. This value is usually obtained through analysis of soil samples taken from the footprint of the site sensor . The parameter represents the water equivalent of soil organic matter (g/cm 3 ). 160 Soil organic carbon (SOC) is obtained through analysis of soil samples and represents the total organic carbon in the soil at the site. Hawdon et al., (2014) discuss the need to convert this value into a water equivalent and provided a method for this (Equation 2). This is completed on the assumption that organic matter in the soil is cellulose which means that proportionally the water equivalent of this can be found by: The parameter ,-represents the dry soil bulk density (g/cm 3 ) and is a site-specific static value. It is obtained through analysis of soil samples and is used in the conversion of gravimetric soil moisture to volumetric soil moisture values. If dry soil bulk density data is unavailable for a site, crspy includes the option to obtain this value from the global data source SoilGridsv2 170 (see Section 2.4). In case of missing data, crspy takes advantages of built-in routines to fill out the information. In that case, if ,-or (used to calculate ) are missing, then crspy will use the estimates collected from SoilGridsv2, which is collected in the metadata process. If is unavailable a value of zero can be input into the metadata table by the user. There have been studies which demonstrate techniques to estimate using soil clay content, which could be used to provide estimates that can be input to the metadata table (Avery et al., 2016, McJannet et al., 2017. Notice that the other site-specific 175 static value is the * number. This number is found through the calibration process which is described in greater detail in section 2.2.

Time varying values and correction methods
The remaining values required to obtain !"# are .$) and / , 0 , 1 , ! which all vary with time. It is ultimately the 180 relationship between .$) and * which gives us our ability to estimate volumetric soil moisture once the additional corrections have been applied. The parameter .$) is obtained from the sensor data and will usually be representative of the number of neutrons counted over a one-hour time period. This is the measured raw (uncorrected) neutron count; however, we know that there are additional impacts on this count rate that require correction which are represented by the factors in equation 1. Changes in atmospheric pressure impact the neutron counting rate, the / term corrects for this so that .$) * / 185 gives the neutron count rate as if it were taken at the reference atmospheric pressure. Changes in incoming cosmic-ray intensity will directly influence neutron count rates as this is the source of fast neutrons and so the 0 term will correct this to match a reference date in time. Atmospheric water vapour and above ground biomass are additional sources of hydrogen, outside of the soil moisture source we are interested in, and so the 1 and ! terms adjust the count rate in consideration of this. These correction methods have been improved upon since the technologies first implementation with additional sources of 190 uncertainty identified and equations designed to mitigate their impact.
There are occasional data availability issues observed at some sites. For example, meteorological variables are a necessary part of converting neutron counts to soil moisture estimates because they are needed to account for the numerous impacts on the signal, such as pressure corrections and atmospheric water vapour corrections. On occasion, some of the sites do not measure 195 all the necessary variables considered to be essential to correct for additional sources on the neutron signal. External relative humidity sensors are essential in correcting for changes in atmospheric water vapour but are not always included in site data.
When data is unavailable through in-situ site sensors, ERA5-Land (Muñoz Sabater, J., 2019) data are used to replace missing sensor data. ERA5-Land is a dataset, based upon the ERA5 reanalysis data, that combines modelled data with real world observations, resulting in a gridded, global hourly product at 9km resolution, provided publicly by the European Centre for 200 Medium-Range Weather Forecasts (ECMWF). Previous iterations of the ERA reanalysis datasets (such as ERA-Interim) have proved useful by other global networks for the task of gap filling missing data, such as in the FLUXNET community (Vuichard and Papale, 2015). We are implementing a similar approach used by the FLUXNET community in crspy, and consequently to the global CRNS database, as we envision the potential of a merged database incorporating both flux tower and CRNS soil moisture data in the future. As the two measurement technologies show similar temporal and spatial footprints, their combined 205 use can eventually lead to a better understanding of land-atmosphere interactions at the field scale, for example (Iwema et al., 2017). It is important to note that although the resolution is spatially coarser when compared with CRNS sites, the ERA5-Land dataset was chosen as a source for replacing missing sensors for three main reasons: (1) it covers the lifetime of all the CRNS sites around the world which ensures all historical data to be used for gap-filling if necessary; (2) the dataset is produced at hourly resolution which matches the standard resolution of CRNS sites; (3) this is an open data source which aligns with our 210 desire to develop a full open-source tool for CRNS data processing.
The ERA5-Land dataset includes key variables such as precipitation, temperature, and dewpoint temperature which can be used to correct for influence on the neutron signal, such as using dewpoint temperature when relative humidity sensors are not available at the site . Our choice also follows previous studies that demonstrated that ERA-Interim tended 215 to perform best when compared with other global reanalysis products (Decker et al., 2012). ERA5, which ERA5-Land is derived from, has benefitted from a decade of research when compared to ERA-Interim and has been shown to be a great improvement (Hersbach et al., 2020).

(i) Atmospheric Pressure correction ( ) 220
Changes in atmospheric pressure can have an impact on neutron counting rates measured by the CRNS , Hawdon et al., 2014. This is attributed to the fact that higher atmospheric pressure reduces neutron counting rates as there are more particles in the air column that can slow fast neutrons down. In crspy this is corrected with the equation: where / is the pressure correction factor (defined in Equation 1), is a coefficient to account for mass attenuation length at the site, is the atmospheric pressure at the site (hPa) and 0 is a reference atmospheric pressure (hPa) for the site, commonly taken as the mean pressure for the site's elevation. The coefficient and the reference atmospheric pressure value are calculated for each location as a function of the latitude, elevation, and cut-off rigidity at the site as described in Desilets et al., 230 (2021).

(ii) Incoming High-Energy Neutron Intensity ( )
It is important to correct for incoming neutron intensity as this will have a direct impact on neutron counting rates. Changes in the incoming cosmic-ray intensity will affect the number of fast neutrons in the atmosphere as increased cosmic-ray intensity 235 will lead to an increased counting rate created through the cascade of reactions (Desilets et al., 2006). We use the data from the Neutron Monitoring Data Base (NMDB) available online, providing a collection of neutron monitoring sites from around the world. The NMDB provides neutron counting rates at hourly resolution from monitoring stations around the world, its data is considered the official distribution from each site principal investigator. The correction method currently varies across networks. For example, the COSMOS (USA) originally corrected the data by comparing neutron intensity to a pre-defined 240 reference date, in that case, assumed to be 01 May 2011. The Jungfraujoch neutron monitoring station in Switzerland was used as a reference site. The calculation for this is shown as follows: is the incoming cosmic-ray intensity at sensor measurement time and 0 is incoming neutron intensity at the decided reference date and 0 ′ here is used to define this particular incoming cosmic-ray intensity correction factor, to avoid confusion with 0 from equation 1.
The default approach in crspy, however, is to use the approach outlined in Hawdon et al., (2014) where the Jungfraujoch 250 monitoring station is used but an additional correction for differences in site cut-off rigidity is applied with: where is the correction for differences in cut-off rigidity (GV), is the cut-off rigidity at the sensor location and 255 is the cut-off rigidity at the Jungfraujoch monitoring station (which has a value of 4.49 GV). This is applied at each time step to give a final corrected value with: Ultimately 0 is the similar to 0 ′ but contains an additional correction to account for the difference in cut-off rigidity between the CRNS site being processed and the Jungfraujoch neutron monitoring reference site.
The Australian CosmOz network uses a different approach which does not always use the Jungfraujoch as the reference monitoring station. Instead, this network will change the reference station based on the stations which has the closest cut-off 265 rigidity (GV) to the sensor site from the Neutron Monitor Database (Hawdon et al., 2014). This option is also available in crspy when running the main processing function crspy.process_raw_data(fileloc, intentype="nearestGV") by invoking intentype with the "nearestGV" option. This involves identifying the NMDB site with the nearest cut-off rigidity and applying equation 4.

(iii) Atmospheric Water Vapour ( )
Hydrogen atoms can slow down fast neutrons leading to a reduction in the count rate with increasing atmospheric water vapour.
This signal needs to be removed to ensure that neutron counting rates are attributed to soil moisture and not moisture in the air. This is corrected at each time step with the following equation : where 1 is the atmospheric water vapour correction factor and is absolute humidity (g m -3 ). Some sites do not have external relative humidity sensors that can be used to calculate vapour pressure, which can be used to calculate absolute humidity along with temperature. When this is the case then ERA-5 Land data can be utilised by converting dewpoint 280 temperature ( o C) to vapour pressure (kPA) (for further information on the steps to obtain absolute humidity from standard meteorological variables, please refer to the appendix section in Rosolem et al., 2013).

(iv) Above Ground Biomass (AGB) ( )
Similar to other sources of hydrogen, biomass can also affect the neutron counting signal. There have been numerous attempts 285 to identify the relationship between AGB and neutron count rates (e.g. Rivera Villarreyes et al., (2011, Heidbüchel et al., (2016) and Tian et al., 2016). Unlike other sources of hydrogen, AGB is sometimes not available from local samples at each site. In order to reduce the impact of AGB on the measured neutron signal, crspy currently uses a static estimated value for each site from the European Space Agency (ESA) Climate Change Initiative (CCI) global dataset and apply the correction method based on the work of Baatz et al., (2015), who found a linear relationship between above ground 290 biomass and neutron counting rates.
The following equation is used: where ! is the above ground biomass correction factor and is the dry above ground biomass at the site (kg/m 2 ). The ESA CCI database provides above ground biomass estimates as a global gridded data product at 100m resolution (Santoro and Cartus, 2019). As the ESA CCI data currently used is a static value in time, it will not impact the soil moisture estimates, in principle, because the correction is applied on both the .$) and * numbers, hence mitigating any impact. Nevertheless, we 300 have included this routine in crspy in this form as we anticipate improvements to dynamical above ground biomass corrections in the future, at which point crspy can be updated to include the latest theory that can be applied across all sites (Franz et al., 2018;Vather et al., 2020;Fersch et al., 2020). Further improvements to be able to dynamically account for biomass changes at all CRNS sites will be important for reliable estimation of soil moisture dynamics, especially when analysing sites with land-use changes or cropping cycles. 305

Sensor calibration
The above steps give us all the values in equation 1 necessary to provide a soil moisture estimate, except for * . A required step in processing, and eventually using the data, is to calibrate the CRNS to the specific conditions found at the site of interest.
Without this step, the soil moisture can potentially have significant biases and deemed unusable. Alternatively, the uncalibrated measurement can only give you a rough idea about the dynamics of the soil wetness conditions in relative terms. The calibration 310 step typically requires multiple soil samples (typically 100s) taken from within the sensor footprint and oven dried to get an accurate representation of soil moisture at the calibration time. These samples are then weighted and averaged to give a field scale soil moisture estimate of the sensor footprint (note that we use dry soil bulk density, ,-, sampled within the footprint to estimate volumetric water content in cm 3 cm -3 ). The crspy tool uses the soil moisture averaging method obtained from field samples proposed by Schrön et al (2017), which is based upon the original work of Köhli et al., (2015). The method provides 315 an updated approach for weighting soil moisture samples taken within the footprint that considers spatial distance from the sensor of each sample as well as the influences of pressure and humidity during the sampling period. This allows for a more accurate estimate of independent soil moisture within the CRNS footprint for the calibration step. Schrön et al. (2017) suggested improved sampling strategies which included samples closer to the sensor (< 5m radius from the sensor) and sample locations guided by knowledge of local hydrological features. The data required for the calibration step includes the date of 320 the sample, an integer to represent each soil moisture profile (a core of soil taken from within the sensor footprint), the depth of each sample within each profile, the distance from the sensor, and the volumetric soil moisture of the sample. Again, these should be named following the template requirements by crspy (see Table A2 in Appendix). Calibration datasets are openly available from some of the networks at already established sites, such as CosmOz and COSMOS, and can be obtained from their respective websites. Alternatively, if a user was setting up their own sensor, then a sampling campaign would be required 325 such as that described in Schrön et al., (2017).
With regards to number of calibration days, crspy is flexible enough to process both single-day or multiple-day calibration campaigns. Multiple calibration campaigns were shown to improve the CRNS signal (Iweema et al., 2015). For the case of multiple-day calibration, all calibration days should be presented in a single table, ensuring that the correct dates of each sample 330 period are provided, and following the same formatting and naming requirements used for single-day calibration.
Finally, when running crspy for a single site, the user is able to turn on or off the calibration process. This is included because calibration only needs to be done once, as * does not vary with time. When the calibration step is turned on, crspy will call the calibration routine and write the output to the metadata table in the column 'N0'. If the calibration routine is turned off, crspy will skip this step and simply read the * number for the site from the metadata. Alternatively, the user can provide the 335

Quality assessment
All data should be checked for quality to ensure that erroneous data is not included, and crspy includes some automated steps to begin this process. All networks implement quality assessment on neutron counts in order to remove poor quality data (e.g., 340 Zreda et al., 2012;Hawdon et al., 2014;Evans et al., 2015). In crspy, we remove suspicious data points by applying flags to neutron counts that fall within four categories, the below rules are consistent with the application in other networks:

1.
Counts that differ by 20% from the previous time step are removed 2.
Counts above N0 are removed 4. Battery voltages below 10V are removed Flag 1 is applied on the raw, uncorrected neutron count as we are interested in identifying sudden jumps in counting rate in the sensor that a believed to be in error. Flags 2 and 3 are applied to the corrected neutron count. This is because the N0 number 350 is itself a corrected number (i.e. it is the maximum number of neutrons at the site under theoretical dry conditions, once additional environmental influences on N have been taken into account and removed from the signal). In the case of flags 2 and 3 the N and N0 number need to both be corrected to be comparable.
Additionally, crspy will output time series diagnostic plots of all variables used for identifying patterns in data that point 355 towards potential issues which may require a small subset of the data to be removed manually (this, of course, depends on the quality of the data from individual sites and, therefore, cannot be fully automated).

Metadata
Metadata is an important piece of information that allows the user to better describe each site characteristics beyond its soil 360 moisture dynamics. The information can be extremely useful especially when multi-site regional to global CRNS stations are to be analysed simultaneously. The metadata of each site is stored in a tabular format within the folder structure of the working directory, a full description of the columns is given in the Appendix (Table A3 in Appendix). It serves two main purposes.
Firstly, it stores static site-specific variables that are used in computing estimated soil moisture values (e.g., , and ,-). To provide an example, ,-is necessary to convert gravimetric soil moisture estimates into volumetric soil moisture 365 estimates in equation 1. The ,-value is collected during the calibration campaign at each site and will vary between sites. It represents an averaged value taken from the soil samples and it is stored in the metadata. The user should also give each site a country code which represents the country it is located in and a unique site number for each CRNS site. The country code is used to help identify geographic locations in analysis and helps when the site numbering of networks may overlap. Raw time series data should be titled with the country code and number in the following format: _SITE_ .txt. Where 370 is a capitalised letter code and the is a 3-digit number. For example, sensor data for a site in the UK could be titled: UK_SITE_101.txt. This (i.e., UK_SITE_101) is used to identify each site when organising the outputs, as well as a lookup code for constant variable values stored in the metadata.
A second purpose of the metadata is to act as a resource when analysing many sites together. The ability to classify catchments 375 by physical characteristics can allow us to understand key similarities and differences between sites, an important direction in hydrological research (Wagener et al., 2007). To increase the value of the metadata, as well as including data collected at the site, global data products have been integrated. These products are all public products that a user can download and store within the folder structure of the working directory. We realize that these global datasets are not a direct replacement for the invaluable information obtained at the site; however, in many cases, such pieces of information are not available, undermining 380 any multi-site analysis. We believe the use of the datasets described in detail below can provide us key information at regional and global level. In crspy, a simple function is used to extract the information from the data products below when provided with the location of the CRNS (i.e., latitude and longitude): (i)

ESA CCI Land Cover and Above Ground Biomass data: The European Space Agency (ESA) Climate Change 385
Initiative (CCI) provides numerous global data products that are useful in the earth sciences community. Land cover data and above ground biomass data are obtained from ESA CCI and stored in metadata for each site for analysis through identifying site differences and similarities. Both products are spatial consistent with the CRNS sensor range (100m-300m) and are available globally. The usefulness of ESA CCI datasets in land surface modelling continues to be established (Li et al., 2017). 390

(ii)
International Soil Reference and Information Centre (ISRIC): The ISRIC provides a global data product that gives estimates of soil properties on a 250m resolution grid. This is available as SoilGridv2 which is an updated (as of May 2020) iteration of the original SoilGrid product (Hengl et al,. 2017). The properties are estimated from collections of ground measurements that are compiled by the World Soil Information Service (WoSIS). WoSIS provide standardised soil profile data to facilitate the creation of products such as SoilGrid (Batjes et al., 2020). 395 (iii) ERA5-Land: As discussed previously meteorological variables from ERA5-Land data can be downloaded for each site. Mean annual precipitation and temperature data is stored, along with derived Köppen-Geiger classifications.

Running the tool.
Once the working environment has been prepared the data can be processed with 400 crspy.process_raw_data(fileloc, Calibration=True, intentype=None). Where fileloc is the location of the raw sensor data, the Calibration process can be turned on or off as a Boolean descriptor and intentype can be left as None to enact the default process for incoming neutron intensity correction or can be changed to "nearestGV" to utilise the alternative method. Once applied crspy will process the raw data using the provided information to give soil moisture estimates and will output figures and tables into the folder structure of the working directory. A description of the final output 405 file and what each of the standard columns represent is given in Table A4.

Benefits of data harmonization
As mentioned previously, one of the key purposes of crspy is the easy and harmonized processing of CRNS sites from around the globe, as there is currently no true consensus on what correction steps are implemented in different national networks. 410 These technical differences can lead to changes in outputs which may result in non-optimal conditions for regional/global analysis from multiple countries. Whereas some users may wish to understand changes at one particular site, inter-site comparisons are limited when each site could be processed in a different way. In this section, we highlight such impacts with one example related to the individual sensor corrections steps, and their impact on the final soil moisture estimates.

415
Table 1 outlines three identified methods that are currently employed across different networks. Method _ 1 is the method employed at the COSMOS (USA) network, which lacks the atmospheric water vapour correction and applies an intensity correction using only the Jungfraujoch neutron monitoring site directly. Method _ 2_ closely resembles the CosmOz (Australia) network methodology, which does apply the atmospheric water vapour corrections and an intensity correction that differs from method _ . In this case, the neutron monitoring station used as an incoming neutron intensity reference is 420 changed to the nearest station with a similar cut-off rigidity to the CRNS site being corrected. Method _ 3_ _ is the default crspy method, and resembles the methods used by COSMOS-UK, while also allowing for the above ground biomass correction to the neutron signal. In this final case, the intensity correction uses Jungfraujoch as its reference site but with an additional correction to account for differences in cut-off rigidity between Jungfraujoch and the site (equation 5).

435
With all these different correction approaches applied independently from each national network, we investigate both the impact on the measured neutron counts and consequently the propagated effects on the estimation of soil moisture. Figure 2 shows two sites with distinct hydroclimatic regimes, both taken from the COSMOS-USA network, that have been processed in the three identified methods (see highlighted star markers in Fig. 3 and Fig. 4). The Santa Rita Creosote site is shrubland dominated with a soil categorized predominantly by sandy loam type located in Arizona, USA. The site has a mean annual 440 temperature of 19 o C and a mean annual precipitation of 335 mm, which primarily falls in winter storms and monsoonal summers (Köppen-Geiger climate classification BSh, a hot semi-arid climate). Climate data taken from ERA5-Land and Köppen-Geiger classification derived from ERA5-Land data using the method outlined in Peel et al., (2007). The Wind River site is an old-growth mixed conifer forest site in Washington, USA. The site is much wetter than the Santa Rita Creosote site, with an annual precipitation of 2,200 mm, and much colder, with an average annual temperature of 8 o C. Precipitation at Wind 445 River tends to fall all year round but with slightly lower rates in the summer period (Köppen-Geiger classification is Csb, a Mediterranean climate mild with dry, warm summers). Climate data has been extracted from ERA5-Land and Köppen-Geiger classification derived from 10 years of ERA5-Land data using the method outlined in Peel et al., (2007). The raw neutron data from both sites were obtained directly from the COSMOS network, representing the _ case in Table 1. In addition, in order to compare the impact of the different correction approaches outlined in Table 1, the raw data from the CRNS at both sites, 450 have been processed in crspy to give the corrected signals for methods _ 2_ and _ 3_ _ . Table 1) are shown in panels (a) and (b).

Derived soil moisture estimates (cm 3 cm -3 ) are shown at daily and monthly timescales in panels (c) and (d) and panels (e) and (f), respectively.
It is clear to see the inverse relationship between neutron count rates and soil moisture, most noticeably at Santa Rita Creosote (Figures 2a and 2c). The soil moisture here tends to be low, such as in June when it was below 0.05 cm 3 cm -3 , which is to be expected in a hot semi-arid environment. Sudden spikes in soil moisture can be attributed to precipitation events, with the 460 summer monsoonal precipitation causing a sudden increase in the mean soil moisture values for the months of July, August, and September (and, inversely, periods corresponding to decreases neutron counting rates). It is also clear that the method chosen has an impact on soil moisture values. This is most notable when comparing the _ 1 method with both the

_ 2_
and _ 3_ _ methods. During the summer months, the _ 1 method appears to estimate higher soil moisture values compared to the other two methods (both appearing to be much more closely aligned to each other). This is 465 likely due to the fact that the _ 1 method does not account for changes in atmospheric water vapour. As a consequence, during the monsoonal summers when there is more hydrogen in the atmosphere from increased humidity, the relatively high water vapor in the atmosphere is incorrectly attributed to additional soil moisture. This is because the CRNS records wrongly attribute the decrease (attenuation) of neutron counts due to water vapor to an increase in soil moisture, causing an over estimation. For example, even early in March there is a sudden rise in soil moisture from the _ 1 estimates which does not 470 appear in the other two methods (Figure 2c). This suggests that rather than a sudden rise in soil moisture, this was actually a rise in atmospheric water vapour. This demonstrates the importance of removing external impacts on the neutron signal as they could be incorrectly attributed to soil moisture dynamics. The negative effect of neglecting such correction, for example, can be pronounced even more on monthly estimates of soil moisture due to the aggregated nature of this error (Figure 2e).

475
The Wind River site is a much wetter site when compared to Santa Rita with its driest month matching Santa Rita Creosotes wettest month. It is worth noting in that in the case of Wind River there is a much larger difference between the neutron count rate of method _ 3_ _ compared to the other methods (Figure 2b). This is because the _ 3_ _ method includes an above ground biomass correction, using the ESA CCI above ground biomass product to calculate a correction.
Currently, as this correction is applied using a static aboveground biomass value (constant with time), the impact of the 480 correction is not translated to differences in estimated soil moisture. This is due to the correction being applied to both the neutron counting rate and the * term as well. With dynamic data, that represents changes in above ground biomass over time, we would be able to improve our estimates of soil moisture, as the impact of changing above ground biomass could be removed from the neutron signal. One additional noticeable feature that crspy implements is the capping of soil moisture to more realistic values, in this case 0.68 cm 3 cm -3 . The _ 1 method does not do this and so there are physically impossible values of 485 volumetric soil moisture in February and December, as seen in figure 2d. In crspy maximum values for soil moisture are estimated by inferring porosity of the soil: where _ is the maximum volumetric soil moisture value (cm 3 cm -3 ), ,-is soil bulk density (g/cm 3 ) and is the density of ground material (estimated with an assumed density of quartz at 2.65 g/cm 3 ). If a user did not wish to enable this cut-off value, then the value for _ can be set to one in the metadata.
At the Wind River site, the differences between _ 2_ and _ 3_ _ are much more noticeable, especially when 495 the soil moisture estimates are aggregated to monthly timescales (Figure 2f). This observed difference is due to the fact that these methods do not apply the same correction for incoming cosmic-ray intensity ( 0 ). Such differences are caused by the choice of correction rather than physical controls on soil water dynamics. This can lead to inaccurate comparisons across sites from different national/regional networks. For example, identifying useful soil moisture signals that can be used to categorise the hydrology of sites will be an important tool for understanding differences and similarities with regards to hydrology. 500 Branger and McMillan (2020) demonstrated this in their paper looking to identify useful soil moisture signals that can be robust, discriminatory, and representative, with research continuing in this area of developing useful diagnostic soil moisture signatures (Araki and McMillan, 2020). When reducing large time series data into signatures, such differences can be aggregated which could begin to affect conclusions. However, the authors stress here that it is not within the scope of this work, nor the intention, to identify which method is better or worse than the other, but rather highlight the potential negative 505 impacts of the lack of a harmonized dataset for large-scale global assessment of the CRNS technology.

Usefulness of crspy metadata
Metadata can be used to describe the network of CRNSs around the world geographically, climatologically, and hydrologically.
To achieve this, crspy compiles relevant data obtained directly from the sensor, key data descriptors provided from each site 510 or network, and from global data products. Wagener et al. (2021) discuss the need for high quality metadata to improve our ability to understand the knowledge accumulation in the field of hydrology. The metadata can be valuable in separating relevant sites in different groups, for example researchers may be interested in understanding how soil moisture behaves at sites above 2000m elevation with certain land use types and given particular weather events (Chen et al., 2020); or sites where mean annual precipitation is above/below a certain threshold; or even grouping sites by different land cover or soil types. So called 515 meta-analyses can help a researcher identify which sites should be included in their studies and which can be excluded (Evaristo and McDonnell, 2017). The metadata provided by crspy allows the user to quickly obtain any grouping of interest in an easy and accessible way.
In order to demonstrate some of the features that can be easily accessed with the help of metadata, we show an example using 520 the compiled COSMOS network data for continental USA (CONUS). Some of this data are taken directly from the network website and then processed using the crspy.fill_metadata() function. This function collects information from global data products at specific site location (i.e., latitude and longitude), as well as using meteorological data from ERA5 Land to produce annual meteorological summaries (e.g., mean annual temperature, mean annual precipitation, Köppen-Geiger climate classification, etc). Figure 3 gives an example of how the metadata can be easily used to show the location of each sensor in 525 the continental USA domain (CONUS) based upon the supplied with additional information, in this case, main land cover classes obtained from the CCI ESA Land Use data. An important step here is that the user is not required to download and process the land cover data separately and individually. crspy incorporates that step for the user seamlessly. In addition to locating the CRNS stations and identified the main land cover type, Figure 4 shows a scatter-histogram of the 535 sites across CONUS providing additional annual meteorological summaries, namely mean annual temperature and mean annual precipitation. The scatterplot still retains the information about the main land cover type obtained from the ESA CCI global database. In addition, both meteorological variables are shown as side-histograms and were computed using the ERA5 Land data. Initial analysis indicates that CRNS classified as shrublands tend to be relatively warmer and drier. Grassland and forests tend to be wetter while showing a wider range of temperatures. Croplands are slightly warmer than grassland and forests 540 but still showing lower temperatures than those observed in shrublands. However, croplands also indicate a slightly wider range of wetness, when compared to the grassland and forest sites, as observed from the total annual precipitation. This could be useful when deciding which sites should be used in a particular study, such as a study on soil moisture dynamics in shrublands with low overall precipitation. Alternatively, it can be used in big data analytics when trying to identify the dominant mechanisms on soil moisture dynamics globally. 545 The objective of metadata in crspy is to easily collect a wide range of site characteristics information that can be used to improve our knowledge of soil moisture, and consequently, other hydrological and environmental processes beyond just a single site. By doing so, it allows for knowledge accumulation across multiple sites (from local to regional and even global) highlighting key similarities and any emergent patterns (e.g., hydroclimatic, ecological, etc). Metadata analysis has not 555 yet been fully exploited in hydrological sciences (Evaristo and McDonnell, 2017) but can also contribute to knowledge accumulation which can be translated to help the designing of improved or new perceptual or conceptual models (Wagener et al., 2021). An early example of that within the cosmic-ray neutron sensing community is clearly demonstrated by Shuttleworth et al. (2013) during the conceptual development of the COsmic-ray Soil Moisture Interaction Code (COSMIC). COSMIC was developed as a forward observational operator allowing for the conversion of simulated soil moisture profile by land surface 560 or hydrological models into equivalent neutron counting rates which facilitates data assimilation applications (Rosolem et al., 2014). By collecting and accumulating information from (at that time) 42 available COSMOS sites (see Table 1 in Shuttleworth et al., 2013), the authors were able to simplify the requirement for two of the prescribed parameters by establishing relationship with dry soil bulk density (see Figure 6 and Equations in 6 and 7 in Shuttleworth et al., 2013). crspy will certainly facilitate such efforts in the future to help both experimental and modelling scientists with the 565 potential to reach other disciplines beyond traditional hydrological and environmental sciences. For example, a prototype version of crspy has already used for Space Weather application recently (Hands et al., 2021).

570
In this paper we have presented crspy, an open-source python tool for the processing of cosmic-ray neutron sensors. Our aim in developing crspy is to provide a tool to the community that can provide methods to process CRNS data easily and that can be updated in the future to keep pace with our increasing understanding of the sensor signal. Due to this evolving understanding of the sensor, we expect to be updating crspy regularly in the future to accommodate our new understanding of the technology along the years. 575 Köhli et al., (2021) recently presented research that demonstrates a revised formulation of the key equation that converts neutron counts into soil moisture estimates (see Equation 1). This emphasises the need to be able to update CRNS estimates to keep pace with the research as well as to test newer formulations across a range of sites quickly. In version 1.2 of crspy we maintain the Desilets et al., (2010) version of equation 1 as the default setting but have provided a supplementary document 580 that describes how a user could update crspy on their home machine to implement the revised approach. This document serves two functions, it is to demonstrate how to update crspy so that researchers may be able to test newer methods on a broad range of sites. But it also speaks of a more general need to agree on what is the standard approach in processing CRNS data. We believe it will be an important step in the future for the numerous stakeholders in CRNS measurements to agree upon a standard approach. This must be decided as a community and we should look towards the positive steps other communities have taken 585 in this regard, such as the flux community (Novick et al., 2017).
Another aspect of development in crspy will be in making it more accessible and user friendly. We consider one of the key functions of crspy is to act as a tool for researchers, giving a way to update processing methods and apply it quickly on a collection of data. On top of this we would hope it can be used as an education tool, helping newer users understand how the 590 sensor functions and what is required to fully correct it based on our current understanding. This could include developing crspy into frameworks such as Python Dash, which are powerful tools for exploring data. SUPPLEMENTAL Soil moisture is an important component of the hydrological cycle and understanding its dynamics at relevant spatiotemporal scales is critical especially with recent advances of global land surface and hydrological models. The CRNS technology is able 595 to provide estimates of soil moisture at the sub-kilometre scale and at hourly resolution. This is particularly relevant now as we continue to move towards hyper-resolution global modelling efforts. Over the years, with an increase of adoption to the technology, the CRNS community has acquired a better understanding about the benefits and limitations of this relatively novel technique. However, due to a lack of data harmonization across networks, undertaking global scale analyses is currently very limited and unexploited. Here, we introduced the crspy python package with the aim to facilitate users to process data 600 easily with the most current methods, and most importantly, in a harmonized fashion. crspy is an open-source tool aimed to integrate the latest developed methodologies for CRNS to be used both for research and teaching activities. It integrates highquality global data products (such as ERA5-Land) to ensure that all sites can be included in analysis. This is done in a similar way to other well-established global environmental networks such as the Ameriflux and Fluxnet.

605
Our examples of application demonstrated that when CRNS data is processed with different methodologies, it can ultimately lead to divergences in soil moisture estimates. This could potentially have a negative impact on the analysis and overall findings, especially when sites across multiple networks are evaluated simultaneously. By harmonizing data processes, we envisage that CRNS data will be used more widely by the global modelling and experimental communities, leading to further adoption of the technology. The objective of crspy is to provide an open and easy-to-use data processing platform that can 610 enable easy processing of CRNS data. Additionally, crspy data collection relies on the production of an extensive metadata archive. This archive can be shared and used within the community to better understand key aspects about soil moisture from typical sampling locations, to inform on signature behaviour by different groupings. Crspy has been developed to show the potential for easily and efficiently processing CRNS data in a harmonized way. The aim is to promote the usefulness of free and open access data and engage the CRNS and research communities in the continued improvement of this product in the 615 coming years.

Appendix
The appendix section consists of two appendices. Appendix A consists of four tables that outline the naming conventions required for crspy to run as well as the output table along with a description of each variable. When labelling input data, columns titles should match the styles below in the 'COLUMN NAME' column. This initial step will then allow crspy to run 620 smoothly, as it uses column titles to identify relevant data sources. Appendix B provides some examples of the automatically generated outputs of crspy along with a description of their purpose. Tables to describe variables names and outputs  Table A1. The naming convention for CRNS input data. Networks can occasionally have different naming conventions 625 (e.g., temperature is referred to as t1). Changing the column titles to the following format will allow crspy to function correctly.   The SM value with a 12-hour rolling average applied to it. Minimum number of values to calculate the 12-hour average is 6 hours of data within the 12-hour window D86avg cm The depth of the measurement -taken as the depth from which 86% of neutrons are estimated to be sourced from (Schron et al., 2017) D86avg_12h cm
We acknowledge the NMDB database (https://www.nmdb.eu), founded under the European Union's FP7 programme (contract no. 213007) for providing neutron count data. ESA CCI data including above ground biomass data and land cover data are available from (http://cci.esa.int/data last accessed 26/02/2021). The soil grids data are accessible online from https://soilgrids.org/. The ERA5-Land data are provided by ECMWF and are available at https://doi.org/10.24381/cds.e2161bac. 705

Code availability
The code discussed in this paper can be found at https://doi.org/10.5281/zenodo.5484111. The GitHub repository, where future updates will be uploaded can be found at https://github.com/danpower101/crspy. The GitHub repository also includes a wiki page which goes into greater detail on how to run the package. We have also generated an example walkthrough repository including example data that users can try available at: https://doi.org/10.5281/zenodo.5484102 710

Author contributions
All authors were involved in the conceptualisation of this work. Data curation was undertaken by DD, SD, and DP.
Investigation undertaken by RR, MRR, and DP. Software was written by DP. Manuscript drafted primarily by DP and subsequently written together with RR and MRR.