The impact of calibrating soil organic carbon model Yasso with multiple datasets

Soil Organic Carbon (SOC) models are important tools in determining global SOC distributions and how carbon stocks are affected by climate change. Their performance is, however, affected by the data and methods used to calibrate them. Here we study how the Yasso SOC model performs if calibrated with individual or multiple datasets and how the chosen calibration method affects the parameter estimation. We found that when calibrated with multiple datasets, the model showed a better global performance than when calibrated with a single dataset. Furthermore, our results show that more advanced calibration algorithms should be used for SOC models due to the multiple local maxima in the likelihood space.


Soil Organic Carbon (SOC) models are important tools in estimating current global soil carbon stocks and their future development (Manzoni and Porporato, 2009). Soils are the second largest global carbon pool, hence even small changes in this pool impact the global carbon cycle (Peng et al., 2008). However, SOC and SOC changes are difficult and laborious to measure (Mäkipää et al., 2008). They can also vary drastically over space due to differences in litter fall, site and soil type as well as climate (Jandl et al., 2014; Mayer et al., 2020). Hence, to quantify the global SOC stocks and estimate the effects of different drivers, such as changing environmental conditions, on SOC stocks (Sulman et al., 2018; Wiesmeier et al., 2019), numerous SOC models have been developed in the past decades (Parton et al., 1996; Camino-Serrano et al., 2018; Thum et al., 2019).
Three central research challenges have emerged in the development of SOC models. First, the majority of such models still rely on linear equations representing the movement of C within the soil. This approach has been questioned by the arguments that some of the SOC processes, such as the microbial influence or the response to large-scale environmental change, need to be represented by non-linear equations (Zaehle et al., 2014; Liang et al., 2017), or that the state structure of the model affects which kind of data can be used to calibrate it (Tang and Riley, 2020). More complicated SOC models addressing these arguments have been developed, for example

Millennial (Abramoff et al., 2018), and modules including additional drivers affecting the C pools have been included in existing SOC models, such as nitrogen (Zaehle and Friend, 2010) and phosphorus (Davies et al., 2016; Goll et al., 2017) cycles. Their implementation is hindered, though, by the second challenge: data are needed to constrain the model parameterization, but individual measurement campaign datasets are often limited in size and lacking in detail of the SOC state (Wutzler and Reichstein, 2007; Palosuo et al., 2012). While the initial Yasso07 calibration addressed the challenges regarding the variety of data required, it did not touch in detail on the issues affecting the actual SOC model parameterization process. First, the Yasso07 calibration did not estimate all the parameters simultaneously with all the data, but instead calibrated the parameters in segments, where the previously calibrated parameters were held constant when calibrating the next set of parameters (Tuomi et al., 2011). While this makes the calibration process easier, it naturally affects the results and the associated uncertainties as well. Second, no standard methods have been established to evaluate how the inclusion of additional datasets impacts the general performance of SOC models. In other words, does using multiple datasets improve the model estimates? Naturally, this applies to Yasso07 as well. Third, there have been studies indicating that the choice of parameterization method does matter in ecosystem modelling (Lu et al., 2017). It is reasonable to assume that this also holds true for SOC systems, where there could be multiple parameter sets that produce a local fit to the data. Last, but not least, the previous Yasso07 calibration workflow was not easily repeatable and reproducible enough to allow inclusion of new datasets and algorithms.
In this study, we build upon previous Yasso developments and present a new model formulation and a calibration protocol, which we hereinafter call Yasso20. Our redesigned calibration protocol leverages the BayesianTools R-package (Hartig et al., 2019), an open-source general-purpose tool for Bayesian model calibration. Using BayesianTools in our workflow, we not only established a more reproducible and standardized application of the Yasso20 calibration, but also leveraged interfacing with multiple calibration algorithms and examined the role of the calibration method.

Due to the nature of the available SOC related datasets we hypothesize: I) advanced parameter estimation methods account better for multiple potential likelihood maxima than simpler methods, and II) a SOC model performs better globally if multiple datasets are simultaneously used to constrain it, compared to a SOC model calibrated with an individual dataset.

The first hypothesis is tested by comparing the Yasso parameter values produced by parameter estimation methods of varying complexity and how well they converge. The second hypothesis is tested by calibrating Yasso with the individual datasets as well as with the combined datasets, and comparing the resulting performances using numerous validation datasets. All these calibrations are done for all the parameters simultaneously.

Furthermore, the more extensive calibration process has allowed constraining more details in the new Yasso formulation which is introduced here as well.

Yasso model description
The Yasso model is based on four basic assumptions on litter decomposition and the soil carbon cycle: 1) Litter consists of four groups of organic compounds (sugars, celluloses, wax-like compounds and lignin-like compounds) that decompose at their own rate independent of origin (Berg et al., 1982). 2) Decomposition of any group results either in formation of carbon dioxide (CO2) or another compound group (Oades, 1988). 3) The decomposition rate is affected by environment temperature and moisture (Olson, 1963; Meentemeyer et al., 1978; Liski et al., 2003). 4) The diameter size of woody litter determines the decomposition rate (Swift, 1977).
Yasso20 is the next version of the Yasso (Liski et al., 2005) and Yasso07 models (Tuomi et al., 2009, 2011b) and continues to build on these same assumptions. For the purposes of the calibration here, another assumption was necessary: 5) The most stable soil carbon compounds are only formed in the soil as a result of bonding with mineral surfaces (Stevenson, 1982).
Based on the previously established assumptions, litter can be divided into four fractions according to their chemical composition. Compounds soluble in a polar solvent (water) represent sugars (W) and those soluble in a non-polar solvent (ethanol or dichloromethane) represent wax-like compounds (E). Compounds hydrolyzable in acid (for example sulphuric acid) represent celluloses (A) and the non-soluble and non-hydrolyzable residue represents lignin-like compounds (N). Additionally, there is a fifth compartment, humus (H), which represents long-lived, stable soil organic carbon produced by interaction with mineral compounds in the soil. As the carbon compounds are broken down by the decomposition processes, they become either new compounds belonging to another compartment or CO2. The decomposition rate of each compartment is considered independent of the litter origin and affected by a temperature, moisture, and size component.

https://doi.org/10.5194/gmd-2021-273, Preprint. Discussion started: 24 September 2021. © Author(s) 2021. CC BY 4.0 License.
The masses (x) of the compartments at time t are denoted by the vector x(t) = (xA, xW, xE, xN, xH)^T. The Yasso model uses an annual timestep and determines the changes in those masses according to

dx(t)/dt = M(θ, c) x(t) + b(t),   (1)

where b(t) is the litter input to the soil at the time t and θ is the set of parameters driving decomposition as defined

in Table 1, and c contains the factors controlling the decomposition. Accurate soil moisture estimates are challenging to obtain for the measurements used here, and a vast majority of the measurements are from the surface. Thus, air temperature T and precipitation P were used as the environmental drivers along with the woody litter diameter d. Operator M is the product of the mass fluxes between compartments, as depicted by F, and the decomposition rates, as presented by K:

M(θ, c) = F K(θ, c),

where K(θ, c) = diag(kA, kW, kE, kN, kH) contains the decomposition rates and the flux matrix F has −1 on its diagonal and the transfer fractions pij as its off-diagonal elements. Here parameters pij ∈ [0,1] denote the flows from compartment i (i ∈ {A,W,E,N}) to j (j ∈ {A,W,E,N,H}) and are included in the parameter vector θ. The decomposition rates ki(θ, c) were calculated from pool-specific base rates modified by the temperature and precipitation responses and, for woody litter, by a size modifier of the form (1 + φ1 d + φ2 d^2)^r, where φ1, φ2, and r are parameters included in the parameter set θ.

Given initial state x0, average environmental conditions c and constant litter input b(t) = b, the model prediction can be computed by solving the differential equation in Eq. (1). The solution becomes

x(t) = e^{M(θ,c) t} x0 + M(θ,c)^{−1} (e^{M(θ,c) t} − I) b,   (7)

where the matrix exponential is determined numerically. In a steady state situation x = lim t→∞ x(t), equation 7 becomes

x = − M(θ, c)^{−1} b.
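As a numerical illustration, the steady state x = −M(θ, c)^{−1} b reduces to a linear solve and the transient solution to a matrix exponential. The sketch below uses a hypothetical five-pool system with made-up rates and transfer fractions, not the calibrated Yasso20 values:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical decomposition rates (1/yr) for A, W, E, N, H -- illustrative only
k = np.array([0.5, 3.0, 0.3, 0.2, 0.002])
K = np.diag(k)

# Flux matrix F: -1 on the diagonal, transfer fractions off-diagonal.
# F[j, i] is the fraction of the flux leaving pool i that enters pool j.
F = -np.eye(5)
F[0, 1] = 0.9      # hypothetical transfer from W to A
F[3, 0] = 0.1      # hypothetical transfer from A to N
F[4, 0:4] = 0.005  # hypothetical small transfers into the H pool

M = F @ K                                 # M = F K: fluxes times rates
b = np.array([0.2, 0.4, 0.1, 0.3, 0.0])  # constant litter input (kg C m-2 yr-1)

# Steady state: x = -M^{-1} b
x_ss = -np.linalg.solve(M, b)

# Transient solution starting from initial state x0:
# x(t) = expm(M t) x0 + M^{-1} (expm(M t) - I) b
def x_at(t, x0=np.zeros(5)):
    E = expm(M * t)
    return E @ x0 + np.linalg.solve(M, (E - np.eye(5)) @ b)
```

Because all eigenvalues of M are negative for valid rates and transfer fractions, x_at(t) approaches x_ss as t grows, which provides a quick sanity check of an implementation.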

Yasso20 improvements
Two main changes were introduced in the Yasso20 version compared to the earlier Yasso07 version. The first change was that the temperature input for Yasso20 is given as the mean monthly temperature for each month of the year instead of the mean annual temperature and an associated annual temperature amplitude. This was done in order to better represent the more nuanced global temperature profiles. For example, the previous scheme was indifferent to whether the winter was long or short, which is, however, expected to affect the annual decomposition. The second change was to differentiate the climate driver impacts between the AWE, N and H pools instead of using the same parameter values for all the model C pools. This was done because previous research established that more complex carbon compounds require more energy to be broken up (Davidson and Janssens, 2006), which indicates that the parameters representing those dynamics should also differ between pools. These changes are expected to affect the model performance and the calibration results themselves, especially as they allow the environmental conditions to impact the pools differently. Thus the changed model was designated a new version. We do not compare Yasso20 performance to Yasso07 here. All model parameters given in Table 1 were targeted in the calibration.
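The effect of the first change can be illustrated with a toy example: for a non-linear temperature response (here an assumed exponential form with an arbitrary sensitivity, not a calibrated Yasso20 response), averaging the response over the twelve months gives a different annual rate than evaluating the response at the mean annual temperature, so two climates with the same annual mean but different seasonality no longer decompose at the same rate:

```python
import math

beta = 0.1  # assumed temperature sensitivity (1/degC), illustrative only

def response(T):
    """Non-linear (exponential) temperature response of decomposition."""
    return math.exp(beta * T)

# A hypothetical seasonal climate with cold winters and warm summers
monthly_T = [-10, -8, -3, 3, 9, 14, 17, 15, 10, 4, -2, -9]
mean_T = sum(monthly_T) / 12

rate_monthly = sum(response(T) for T in monthly_T) / 12  # monthly-input scheme
rate_annual = response(mean_T)                           # mean-temperature scheme
```

With these numbers rate_monthly exceeds rate_annual, i.e. for a convex response, ignoring seasonality underestimates the annual decomposition.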

Datasets used in the calibration
Several datasets were simultaneously used to calibrate the model in order to represent different processes related to soil carbon cycling: decomposition bag time series data from the Canadian Intersite Decomposition Experiment (CIDET; Trofymow, 1998), the Long-Term Intersite Decomposition Experiment (LIDET; Gholz et al., 2000) and the European Intersite Decomposition Experiment (ED; Berg et al., 1991a, 1991b) projects, a collection of global soil organic carbon measurements gathered by Oak Ridge National Laboratory (Zinke et al., 1986) and a woody matter decomposition dataset from Mäkinen et al. (2006). In addition to these large datasets, a smaller litter bag decomposition dataset from Hobbie et al. (2005) was used both to evaluate how much the addition of a comparatively small number of data points affects the calibration results and as an independent validation dataset for the other calibrations. These datasets along with additional details are listed in Table 2. The global SOC measurement dataset from Oak Ridge National Laboratory (Zinke et al., 1986) is collected from the data of numerous unrelated projects that have measured SOC as a part of their campaign. As such,

there were no uniform protocols applicable to these measurements. For the purposes of the calibration, the data is assumed to represent the steady state SOC at that location, and each measurement is treated as independent from the others even if they are from the same location. Furthermore, we only used SOC measurements that were below 20 kg C m-2 in the calibration. Values higher than that were found in high latitudes and considered results of waterlogging, peat formation or permafrost, processes not described in Table 1. In addition, SOC chronosequence data from Liski et al. (1998) and plot level measurements of Liski and Westman (1995) were used as validation datasets.
The woody decomposition data used here is from Mäkinen et al. (2006), which has measurements of multiple trees in different stages of decomposition over several decades in Finland. There are no signifiers to connect the measurements from different years nor to indicate how much the tree diameter has been reduced over time 215 because the data was not chronosequence data of the same trees. As such, the measurements were considered independent and representative of decomposition of a tree trunk of that size.

The information on the uncertainty related to the measurements was limited, and we acknowledge the current limitations regarding assigning the uncertainties.
Systematic differences in the litter bag properties affected the use of the different datasets (Tuomi et al., 2009; Tuomi et al., 2011b). In general, high mass loss rates were positively correlated with a large mesh size of the litter bags and high precipitation in our datasets. This is because the decomposing material in the litter bags is partially

'washed away' into the surrounding soil by water flow and is thus removed from the bag due to processes other than decomposing. To correct for this, we added a leaching term to equation 1 as follows, where ωsite is the dataset-specific leaching term and I5 is a 5×5 identity matrix. This approach was simplified as there are multiple components expected to affect the leaching process and other systematic errors, but it was 240 necessary to establish even this simplistic initial approach for the work here.
Finally, the long-lived carbon compounds represented by the H pool in the Yasso model are not produced in decomposition litter bags, as they require organo-mineral associations which are unlikely to occur in the litter layer and are only possible in the soil. Because of this, pH (the transfer fraction from the AWEN pools to pool H) could have non-zero values only with the Oak Ridge global SOC dataset.

Calibration protocol
We used the BayesianTools R-package (Hartig et al., 2019) in our calibration workflow for its standardized and flexible implementation of Markov chain Monte Carlo (MCMC) algorithms with external models, as well as for its post-MCMC diagnostic functionality. While our main aim in this paper was not to compare MCMC algorithms, once the interface with BayesianTools was established, it was trivial to leverage the common setup and test the performances of the different MCMC flavors implemented by the package. We found this exercise helpful, as our calibration problem involves a relatively high-dimensional and irregular likelihood surface. It has been previously shown that for such systems the efficacy of the calibration may differ between algorithms (Lu et al., 2017). Thus, we tested two robust and efficient algorithms, the Differential Evolution Markov Chain with snooker updater (DEzs; ter Braak and Vrugt, 2008) and the Differential Evolution Adaptive Metropolis algorithm with snooker updater (DREAMzs; Vrugt et al., 2009; Laloy and Vrugt, 2012; Vrugt, 2016), in addition to the long-established adaptive Metropolis (AM) algorithm (Haario et al., 2001).
All three algorithms use Markov chains to explore the parameter space and generate samples from the posterior.
However, AM uses a single chain, whereas DEzs and DREAMzs use multiple interacting chains simultaneously.
While DREAM emerged from DE, DREAM further uses adaptive subspace sampling to accelerate convergence (Vrugt, 2016). AM adapts a multivariate Gaussian proposal distribution based on the past samples of the chain (Haario et al., 2001), an approach that is most efficient when the target distribution (a.k.a. posterior) is also Gaussian. The DEzs and DREAMzs algorithms, in contrast, use the differential evolution principle to optimize the multivariate proposals (with snooker jumps to increase the diversity of the proposals), automatically adjusting the scale and orientation of the proposal distribution according to the target distribution (Vrugt et al., 2009; Vrugt, 2016). As a result of these properties, especially when not tuned properly, AM can take much longer to complete a high-dimensional parameter search and can suffer from premature convergence when multiple distant local optima are present (Vrugt, 2016; Lu et al., 2017), whereas DEzs and DREAMzs can potentially resolve non-Gaussian, high-dimensional and multimodal target distributions more effectively without much configuration (Laloy and Vrugt, 2012; Lu et al., 2017).
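To make the contrast concrete, the core differential evolution proposal that DEzs and DREAMzs build on can be sketched as below. This is a bare-bones DE-MC sampler run on a stand-in Gaussian target, without the snooker updater, sample archive, or subspace sampling of the actual BayesianTools implementations:

```python
import numpy as np

rng = np.random.default_rng(42)

def log_target(x):
    """Stand-in posterior: standard bivariate Gaussian."""
    return -0.5 * float(x @ x)

def de_mc(n_chains=8, n_iter=4000, d=2):
    gamma = 2.38 / np.sqrt(2 * d)        # commonly recommended DE jump scale
    X = rng.normal(size=(n_chains, d))   # initialize the interacting chains
    logp = np.array([log_target(x) for x in X])
    kept = []
    for it in range(n_iter):
        for i in range(n_chains):
            # The difference of two other chains orients and scales the proposal
            a, b = rng.choice([j for j in range(n_chains) if j != i],
                              size=2, replace=False)
            prop = X[i] + gamma * (X[a] - X[b]) + rng.normal(scale=1e-4, size=d)
            lp = log_target(prop)
            if np.log(rng.uniform()) < lp - logp[i]:   # Metropolis acceptance
                X[i], logp[i] = prop, lp
        if it >= n_iter // 2:                          # discard burn-in half
            kept.append(X.copy())
    return np.concatenate(kept)

samples = de_mc()
```

Because the proposal is built from the current spread of the chain population, it adapts to the shape of the target without the hand-tuned proposal covariance that a single-chain Metropolis sampler would require.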

In our calibration protocol, we ran 3 chains for each algorithm, where DEzs and DREAMzs further tripled each chain internally. We initialized these chains from the prior distributions (Table 1) using the random sample generator of the BayesianTools package. Each chain was run for 1.5 × 10^6 iterations and the last 1.5 × 10^5 iterations were used to compute the posterior probability distributions after removing the burn-in. Convergence diagnostics were checked by visually inspecting the trace plots of the chains, as well as by calculating the multivariate R-statistic of Gelman and Rubin (1992).
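For reference, the univariate version of the Gelman-Rubin statistic used in such convergence checks compares between-chain and within-chain variances; a minimal sketch is given below (the multivariate version in BayesianTools generalizes this using covariance matrices):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for a single parameter.
    chains: array of shape (m_chains, n_samples), post burn-in."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)                # approaches 1 at convergence
```

Values close to 1 indicate that the chains have mixed; a common rule of thumb is to require R-hat below roughly 1.1 for every parameter.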
Initially the calibration was done with all the parameters associated with the Yasso20 model. However, if the estimated parameter values for the p-terms in eq. 3 were within three decimals of either 0 or 1, they were set to the nearest limit value of 0 or 1, after which the calibration was redone. During the calibration, the p value parameterization can never settle at exactly 0 or 1 and, hence, it is impossible to know what the real p value is that close to the limit. The calibration results presented here only had four p values that were not set: pWA, pWN, pEW and pEA. Parameters pAW and pNA were set to 1 and the other AWEN related p values were set to 0. Furthermore, since we assumed that only decomposition in the W pool results in CO2, we estimated only pEW and set pEA to be the E-pool remnant, 1 − pEW, with pEN set to 0.
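The clamping rule for the transfer fractions can be expressed as a small helper; the tolerance of three decimals follows the text, while the function name and interface are our own illustration:

```python
def clamp_transfer_fractions(p_values, tol=1e-3):
    """Set p-terms that converged within `tol` of 0 or 1 to the exact
    limit and report whether the calibration needs to be redone."""
    clamped = []
    redo = False
    for p in p_values:
        if p < tol:            # effectively zero flow
            clamped.append(0.0)
            redo = True
        elif p > 1.0 - tol:    # effectively the whole flow
            clamped.append(1.0)
            redo = True
        else:
            clamped.append(p)
    return clamped, redo
```

For example, clamp_transfer_fractions([0.0004, 0.35, 0.9996]) clamps the first and last values and signals that the calibration should be rerun with those parameters fixed.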

Validation protocol
Each of the litter decomposition experiments (CIDET, LIDET and ED) was randomly split into two: data used for calibration (80% of the measurements) and data used for validation (20% of the measurements).
Furthermore, the random division is done so that the whole measurement time series from one bag is always fully either in the calibration or the validation data. It was also verified that each site and species was approximately equally represented in both the calibration and the validation data. Due to the noise and bias in the global SOC measurement datasets, in addition to the separate processes included in those calibrations, we did not divide them into calibration/validation parts but used all the data for calibration (Table 2). Second, a SOC chronosequence from Liski (1998) was used to determine if Yasso20 is able to realistically simulate the SOC accumulation over time scales of hundreds of years. In this dataset there are 26 soil age gradient data points from the Finnish coast, which have been used to approximate the SOC accumulation in the soil over hundreds of years after the ice age. Tree litter and climate driver data from Hyytiälä, Finland were used here, as the main focus is on whether the simulated system reaches steady state in the same time window as the measurements. The climate driver data used for these validation runs is included in Supplemental Table 3.
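The bag-wise 80/20 split described above can be sketched as follows; the data layout (tuples of bag id, measurement time, and remaining mass) is our own illustration:

```python
import random

def split_by_bag(measurements, frac_cal=0.8, seed=1):
    """Randomly assign whole litter-bag time series to either the
    calibration or the validation set, never splitting one bag's series."""
    bag_ids = sorted({bag for bag, _, _ in measurements})
    rng = random.Random(seed)
    rng.shuffle(bag_ids)
    n_cal = round(frac_cal * len(bag_ids))
    cal_ids = set(bag_ids[:n_cal])
    calibration = [m for m in measurements if m[0] in cal_ids]
    validation = [m for m in measurements if m[0] not in cal_ids]
    return calibration, validation
```

A full protocol would additionally stratify the assignment so that each site and species is approximately equally represented in both sets, as verified in the text.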

With regard to the long-term SOC projections, the comparisons with the Hyytiälä forest plot measurements (Table 6; Figure 3) indicate that, at least in the Nordic forests, Yasso20 potentially slightly overestimates the steady state SOC, with the largest differences still being below 2 kg C m-2. It should be noted, though, that there is notable variance within the measurements in addition to the uncertainty related to the driver data. The chronosequence data (Figure 4) shows that the model projection saturates in approximately 1000 years, similarly to the measurements.

Residual analysis
When checking the residuals from the litter bag experiments against mean annual temperature, annual temperature variation and total annual precipitation (Figure 5), there appears to be a tendency for Yasso20 to increasingly underestimate the remaining litter bag C with growing mean temperature and precipitation. The error does not, though, show any signal when looking at the temperature variation within the year. With the woody decomposition residuals (Figure 6), there is a slight negative trend over time and a slight positive trend over size. Both are minor, though, and the residuals for the woody decomposition are relatively evenly distributed for the validation dataset.

Parameter values correlation
Analyzing the correlations between the different parameter values produced by the DEzs algorithm (Figure 7) shows that the correlations are the strongest between processes affecting the same pools. The p-terms which had been set to 0 and 1 were excluded from the correlation analysis since they did not vary during the calibration.
The AWE pool decomposition rates have strong positive correlations with each other as well as with the climate driver terms affecting decomposition in those pools. Similarly, there are strong negative correlations between the temperature terms affecting the same pools and a strong positive correlation between

Calibration method
Here we showed that by using the DEzs calibration algorithm, we were able to simultaneously use multiple different types of datasets to constrain the soil organic carbon (SOC) model Yasso and produce a converging parameter set. In contrast, with a more conventional model calibration approach, here the Adaptive Metropolis (AM) algorithm, further tuning would need to be applied to minimize the impact of the resulting uneven parameter space and to produce Gelman-Rubin values within more acceptable ranges (Gelman and Rubin, 1992). A curious aspect of our results was that DEzs converged better than DREAMzs (Table 3) despite the latter being a more state-of-the-art method (Vrugt, 2016). We were not able to determine the reason for this in our tests, specifically whether it was related to the behaviour of the parameter space or to how the method is implemented in BayesianTools.

One of the fundamental challenges for calibrating SOC models is the lack of experimental information regarding the model parameter value distributions. Therefore, we used generally broad uniform prior distributions for the calibration here. However, it is still important to evaluate the calibration results based on our understanding of the overall system behaviour. For example, initially we used wider priors for parameters pH and αH (results not shown), which in turn resulted in the calibration producing a pH value of ~0.08 and, consequently, a much higher H pool decomposition rate. As this did not fit with the system behavior seen, e.g., in the bare fallow experiments (Menichetti et al., 2019) or the soil chronosequence (Fig 3-3), we applied a narrower prior constraint on the related parameters. Another, more complicated, example is that when using wider prior constraints for the N pool decomposition rate parameter αN, the calibration resulted in unrealistic N pool behaviour. Due to the size of the global calibration dataset, smaller local datasets cannot effectively be used just by adding them to the joint calibration process. There are other options, though: either using the globally estimated parameter ranges as the priors for a calibration with the local data or employing a hierarchical calibration approach (Tian et al., 2020; Fer et al., 2021), but the impact of these approaches should be separately researched and tested. Our study still successfully provided a global parameter set that increases the applicability of the Yasso model.

Leaching
As established in section 2.2, in order to compare the measurements from different litter bag experiments, there needs to be a parameter that accounts for the litter bag types' impact on the mass loss rate (Tuomi et al., 2009).
When testing with independent litterbag data, we see that even with this added assumption, the global calibration produces a better fit than the calibrations based on individual litterbag campaigns (Table 5). This supports using data from multiple litterbag campaigns in model calibration. However, in the results it is evident that not only are the leaching parameters estimated to be essentially zero when calibrating only with individual decomposition bag datasets (Table 4), but also, when simultaneously calibrating with all the datasets, only the ED dataset ends up having a meaningfully non-zero value. First of all, this indicates that the current straightforward formulation for leaching is insufficient, as with the individual dataset calibrations the other parameter values are able to produce fits where there is no leaching despite the knowledge that it is a factor. Second, even when calibrating multiple datasets simultaneously, the calibration appears to apply the leaching effect to only one of the datasets even when it should affect all of them. Third, a further complication is that, by being a corrective term, the leaching term will also reflect all other systematic differences between the datasets. As a consequence of all these factors, the leaching impact needs to be further studied and the relevant equations need to first be formulated with experimental data specifically gathered for that purpose. There is an important point concerning the parameterized humus (H) formation term pH here. The long-term H formation can only take place in the soil itself as it requires the presence of mineral compounds (Schmidt et al., 2011), which is why only the global soil carbon dataset in this study could be used to constrain the H parameters.

However, these are only point measurements with no information on how the state changes over time. Therefore, we have to assume that the measurements represent the approximated steady state resulting from an assumed litter fall. This not only causes larger parameter uncertainties, but the estimated pH parameter value will also represent the fraction of the total litter fall that ends up in the H pool, while in reality, for surface vegetation litter, there needs to be an additional mechanism that transfers the carbon compounds into the soil, whereas root litter is already in that environment. Consequently, if examining litter decomposition taking place only in the soil, such as with roots, it is likely that pH for that soil system would be larger than what is estimated here. This would fit with previous research suggesting that root biomass specifically appears to be connected to the amount of long-term carbon in the soil, as more of it would be able to form H compounds than the surface vegetation (Clemmensen et al., 2013; Jackson et al., 2017). However, currently the amount of data that would allow efficiently separating the above- and below-soil decomposition processes during the calibration is limited. Additionally, the presence of mineral compounds and other conditions affect how efficiently H is formed, which should be included when formalizing H formation (Rasmussen et al., 2018). Better addressing the formation of H is a crucial development step for the model, but the current approach provides an initial way to estimate the H pool size.

Temperature and precipitation impact
At first glance it appears that the current version of Yasso20 overestimates SOC decomposition (i.e. underestimates the SOC amount) at higher precipitation and temperature values, as indicated by the negative trend in Fig 3-4. In the current formulation of the environmental drivers (eq. 5), only the lower precipitation values decrease the decomposition rate, with the system becoming insensitive to increases in precipitation after a certain threshold. However, it is known that at higher moisture levels the SOC decomposition rates decrease (Keiluweit et al., 2017). A more informative driver of moisture conditions (e.g. monthly soil moisture) and a more realistic response function could help disentangle the reasons behind this trend in the residuals in the future. The current version of Yasso20 uses precipitation as the driver instead of soil moisture because the decomposition bags from the datasets used as constraints here are on the surface and thus were expected to be primarily controlled by precipitation. In the light of the current findings, next steps in Yasso model development towards using soil moisture as a model driver are planned.
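The saturating behaviour described above can be illustrated with an assumed saturating exponential precipitation response of the form 1 − exp(−γP); the sensitivity γ below is an arbitrary illustrative value, not a calibrated Yasso20 parameter:

```python
import math

gamma = 1.2  # illustrative precipitation sensitivity (1/m), not calibrated

def precip_modifier(P):
    """Multiplier on the decomposition rate for annual precipitation P (m)."""
    return 1.0 - math.exp(-gamma * P)

# Steep response at low precipitation, near-flat at high precipitation
low = precip_modifier(0.5) - precip_modifier(0.0)
high = precip_modifier(3.5) - precip_modifier(3.0)
```

The modifier changes substantially between 0 and 0.5 m of annual precipitation but barely at all between 3.0 and 3.5 m, matching the insensitivity at high precipitation discussed above.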

Closer examination of the error distribution over the climate drivers, though, suggests some more complexity.
Even at the lower precipitation values, while both the CIDET and ED data errors cluster approximately equally around zero, the LIDET data points show a shift towards negative errors, similarly to the higher precipitation values. Thus, it appears that the issue is at least partially due to the dataset itself rather than the pure precipitation signal. Similar behaviour can be seen with mean temperature, although it is not as pronounced.

It is yet unclear whether this is due to the measurements themselves, the processes, or the climate driver data not being as representative of the conditions as with the other used datasets.

In the current Yasso20 implementation, the woody litter diameter does not change during the decomposition process, while in reality the wood shrinks as it decomposes. This explains why, when comparing the model results to the tree decomposition validation dataset (Fig 3-5), the model overestimates the decomposition rate for decades-old tree stems with a measured diameter of approximately 10 cm. In those cases, the model assumes that this was the size of the trunks when the decomposition started and, consequently, the size impact is smaller than it should be. While the model still performs well with the validation dataset regardless of this, it is an important aspect to consider when applying the Yasso20 model to woody decomposition.

Conclusions
Soil organic carbon (SOC) models should be constrained by data from multiple different ecosystems, reflecting the various dynamics affecting the SOC decomposition process. Using data from multiple datasets produced parameter sets which performed better in a global comparison than parameter sets calibrated with information from individual datasets, highlighting the necessity of using more data. However, the traditional AM calibration method had difficulties converging to a single parameter set when used with multiple datasets, most likely due to the numerous local likelihood maxima within the likelihood space and our deliberate choice to avoid detailed algorithm-specific configurations, which reduce repeatability and re-applicability.

the data used and the mathematical aspects of the methods, respectively. Prof. Jari Liski is the PI of the project this research was a part of and has created the Yasso model used here.

Data availability
The Yasso model used here can be downloaded from https://github.com/YASSOmodel/Ryassofortran. The