Bayesian inference of microbial soil respiration models is often based on the
assumptions that the residuals are independent (i.e., no temporal or spatial
correlation), identically distributed (i.e., Gaussian noise), and have
constant variance (i.e., homoscedastic). In the presence of model
discrepancy, as no model is perfect, this study shows that these assumptions
are generally invalid in soil respiration modeling such that residuals have
high temporal correlation, an increasing variance with increasing magnitude
of

Developing accurate soil respiration models is important for
realistic projection of global carbon (C) cycle, as global soils store
2300 Pg carbon, an amount more than 3 times that of the atmosphere (Schmidt
et al., 2011), and release 60–75 Pg C yr

The objectives of this study are to evaluate the impacts of data models on
Bayesian inference and predictive performance of three mechanistic soil
respiration models, and to use the evaluation results to make broader
recommendations. The three models were developed by Zhang et al. (2014) to
simulate the Birch effect (the peak soil microbial respiration pulses in
response to episodic rainfall pulses) at the site scale and at a short
temporal scale; understanding the Birch effect is important to gain a
mechanistic understanding of

Bayesian inference in general uses Bayes' theorem to update the prior distributions of model parameters to posterior parameter distributions given a likelihood function of data. The mathematical formulation of the (formal and informal) likelihood function requires a probabilistic data model; however, this probability model is intrinsically unknown due to unknown errors in all model components such as model structures, parameters, and driving forces. Bayesian inference of soil respiration models often adopts the assumption of independent, normally distributed, and homoscedastic residuals (e.g., Ahrens et al., 2014; Bagnara et al., 2015, 2018; Barr et al., 2013; Barron-Gafford et al., 2014; Braakhekke et al., 2014; Braswell et al., 2015; Correia et al., 2012; Du et al., 2015, 2017; Hararuk et al., 2014; Hashimoto et al., 2011; He et al., 2018; Keenan et al., 2012; Klemedtsson et al., 2008; Menichetti et al., 2016; Raich et al., 2002; Ren et al., 2013; Richardson and Hollinger, 2005; Steinacher and Joos, 2016; Tucker et al., 2014; Tuomi et al., 2008; Xu et al., 2006; Yeluripati et al., 2009; Yuan et al., 2012, 2016; Zhang et al., 2014; Zhou et al., 2010). These assumptions are conveniently adopted to satisfy the requirement of using an unknown probability model in Bayesian statistics, which was referred to as “a basic dilemma” by Box and Tiao (1992).

Postulating the data models is always based on assumptions about residual statistics, and the most widely used assumptions are paired as follows: (i) independent vs. correlated residuals, (ii) homoscedastic vs. heteroscedastic residuals, and (iii) Gaussian vs. non-Gaussian residuals. For soil respiration modeling, few studies have relaxed the non-correlation assumption (e.g., Cable et al., 2008, 2011; Q. Li et al., 2016), the homoscedasticity assumption (e.g., Berryman et al., 2018; Elshall et al., 2018; Ogle et al., 2016; Tucker et al., 2013), or both the Gaussian and homoscedasticity assumptions (e.g., Elshall et al., 2018; Ishikura et al., 2017; Kim et al., 2014). A recent study by Scholz et al. (2018) relaxed all three assumptions using the generalized likelihood function developed by Schoups and Vrugt (2010). However, few studies have focused on investigating the appropriateness and impact of these assumptions for soil respiration modeling by relaxing the independent residuals assumption (Ricciuto et al., 2011) and the Gaussian residuals assumption (Ricciuto et al., 2011; van Wijk et al., 2008). To our knowledge, by relaxing these three assumptions stepwise, this is the first study that systematically evaluates the impact of data model selection on Bayesian inference and predictive performance of soil respiration modeling. In addition, to our knowledge, this is the first soil respiration modeling study that investigates the impact of data models in relation to model fidelity.

Relaxing these three assumptions stepwise results in eight data models, which are shown in detail in Sect. 2. For example, combining the assumptions of independent, homoscedastic, and Gaussian residuals leads to the standard least squares data model. This model is the simplest of the eight data models, as it only requires one parameter, i.e., the constant variance of the Gaussian distribution. Note that there is a difference between the soil respiration model parameters and the data model parameters. They can technically be jointly estimated, but the former arise from assumptions about soil respiration processes and the latter from assumptions about the residuals. Relaxing the homoscedastic assumption to heteroscedastic gives the weighted least squares data model. It is more complex because it has extra parameters to account for multiple variances for multiple data. Whenever one or a combination of the three assumptions (independence, homoscedasticity, and normality) is relaxed, the resulting data model becomes more complex and requires more parameters. Such systematic evaluation of data models (McInerney et al., 2017; T. Smith et al., 2010, 2015) is necessary to evaluate the appropriateness of residual assumptions and their impacts on Bayesian inference.
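As a rough illustration of the difference between the two simplest data models, the sketch below evaluates an independent Gaussian log-likelihood with either one constant standard deviation or one standard deviation per observation. The function name, the example numbers, and the linear form used for the heteroscedastic case (coefficients `sigma0` and `sigma1`) are assumptions made for illustration only; they are not taken from the study's equations.

```python
import math


def gaussian_log_likelihood(obs, sim, sigmas):
    """Log-likelihood of independent Gaussian residuals.

    sigmas holds one standard deviation per observation; passing the same
    value everywhere gives the homoscedastic (least-squares-like) case.
    """
    total = 0.0
    for y, f, s in zip(obs, sim, sigmas):
        total += -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((y - f) / s) ** 2
    return total


obs = [1.0, 2.0, 4.0, 8.0]  # made-up observations
sim = [1.1, 1.8, 4.4, 7.2]  # made-up model output

# Homoscedastic case: a single constant standard deviation (hypothetical value).
ll_sls = gaussian_log_likelihood(obs, sim, [0.5] * len(obs))

# Heteroscedastic case: the standard deviation grows linearly with the
# simulated value (hypothetical coefficients sigma0 and sigma1).
sigma0, sigma1 = 0.05, 0.1
ll_wls = gaussian_log_likelihood(obs, sim, [sigma0 + sigma1 * f for f in sim])
```

In the homoscedastic call the only data model parameter is the constant standard deviation, whereas the heteroscedastic call needs the extra coefficients that describe how the spread changes across the data.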

The assumptions of heteroscedastic, correlated, and non-Gaussian residuals are accounted for using the method of Schoups and Vrugt (2010) in the following procedure: (i) the correlation is removed from the residuals using an autoregressive model; (ii) the resulting residuals are normalized by a linear model of variance; and (iii) the normalized residuals are characterized using the skew exponential power distribution. The data model parameters (i.e., coefficients of the autoregressive model, the linear variance model, and the skew exponential power distribution) are not specified by users, but are estimated along with the soil respiration model parameters during the Bayesian inference. The skew exponential power distribution is general in that by adjusting the values of its kurtosis and skewness parameters the distribution can produce distributions such as the Laplace distribution (van Wijk et al., 2008; Ricciuto et al., 2011) or the distributions from the study by Tang and Zhuang (2009), which utilized an exponential model with different kurtosis parameters. It is worth pointing out that other methods exist to account for the three assumptions. Evin et al. (2013) suggested accounting for residual heteroscedasticity before accounting for residual autocorrelation. Lu et al. (2013) developed an iterative two-stage procedure to separately estimate physical model parameters and data model parameters. Evin et al. (2014) developed a similar procedure to first estimate model parameters and then estimate heteroscedasticity and autocorrelation parameters. While this study uses the method from Schoups and Vrugt (2010), exploring other methods is warranted in future studies.
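Steps (i) and (ii) of the procedure above can be sketched as follows. This is a minimal illustration that assumes a first-order autoregressive filter and a standard deviation that is linear in the simulated value; the names `phi1`, `sigma0`, and `sigma1` are hypothetical labels, and step (iii), evaluating the skew exponential power density on the result, is deliberately not reproduced here.

```python
def normalized_residuals(obs, sim, phi1, sigma0, sigma1):
    """Return decorrelated, variance-normalized residuals (non-empty inputs).

    Step (i): remove lag-1 correlation with an AR(1) filter (coefficient phi1).
    Step (ii): divide by a standard deviation that is linear in the simulation.
    """
    resid = [y - f for y, f in zip(obs, sim)]
    # The first residual has no predecessor, so it is passed through unchanged.
    innovations = [resid[0]] + [
        resid[t] - phi1 * resid[t - 1] for t in range(1, len(resid))
    ]
    return [a / (sigma0 + sigma1 * f) for a, f in zip(innovations, sim)]


# Synthetic check: build residuals that follow an AR(1) process with
# coefficient 0.5, then confirm the filter recovers the underlying shocks.
shocks = [1.0, -1.0, 0.5, 0.25]
resid = [shocks[0]]
for a in shocks[1:]:
    resid.append(0.5 * resid[-1] + a)

sim = [0.0] * len(resid)  # zero simulation, so the observations equal the residuals
recovered = normalized_residuals(resid, sim, phi1=0.5, sigma0=1.0, sigma1=0.0)
```

In a joint inference these three coefficients would be sampled together with the soil respiration model parameters, so the transformation is re-evaluated for every proposed parameter set.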

After investigating the impacts of the data models on Bayesian inference,
this study evaluates the impacts of the data models on the predictive
performance of the three soil respiration models. Using random samples
generated during the Bayesian inference, a prediction ensemble is produced
for each soil respiration model. The ensemble is used to evaluate predictive
performance of the models in a stochastic sense by estimating the extent to which
the models can predict future events. The evaluation in this study is carried
out using cross-validation by splitting the

The remainder of the paper is organized as follows. Section 2 starts with a description of the evolving data models and their corresponding likelihood functions used in Bayesian inference, followed by a brief summary of the three soil respiration models. The results of Bayesian inference are discussed in Sects. 3 and 4, addressing the data model implications on parameter estimation and predictive performance, respectively. Section 5 summarizes the key findings and limitations of this study, and provides recommendations for approaching data model selection.

This section starts with a description of the eight data models that account
for the three pairs of assumptions about residuals in a stepwise manner in
Sect. 2.1. The data models are used to build the likelihood functions used in
Sect. 2.2 for Bayesian inference. The three soil respiration models and
observations of

This study considers eight evolving data models starting from a data model
that assumes independent, homoscedastic, and Gaussian residuals to a data
model that relaxes all three assumptions. The eight data models are
based on the generic normalized residual,

It is not uncommon that residuals are correlated in space and time, due to
the propagation of measurement errors (Tiedeman and Green, 2013) and model
structure errors (Evin et al., 2014; Kavetski et al., 2003; Lu et al., 2013).
The temporal correlation that occurs in the numerical example of this study
can be accounted for by using a

The next four data models are similar to the previous four models except that
the standard normal distribution of

Replacing

Consider a Bayesian inference problem for a nonlinear model,

The data models above can be used to construct the likelihood functions. For
the Gaussian data models given in Eqs. (2)–(5), the corresponding Gaussian
likelihood functions are straightforward (see Eq. 7 for an example). For the
SEP data models, the corresponding likelihood, which is called generalized
likelihood function, is (Schoups and Vrugt, 2010)

In this study, the posterior distributions of the data model parameters are
estimated along with the soil respiration model parameters using the
MT-DREAM

Zhang et al. (2014) studied the Birch effect (the peak soil microbial
respiration pulses in response to episodic rainfall pulses), and developed
five models, evolving from an existing four-carbon-pool model to models with
additional carbon pools and/or explicit representations of soil moisture
controls on carbon degradation and microbial uptake rates. Three of the five
models are used in this study, and they are denoted as 4C, 5C, and 6C. Note
that model 4C is model 4C_NOSM from Zhang et al. (2014), not their model 4C.
Figure 1 is the diagram of model 6C, the most complex of the five
models. The simplest model, model 4C, has four carbon pools, i.e., soil organic
carbon (SOC), dissolved organic carbon (DOC), microbial biomass (MIC), and
enzymes (ENZ), and does not consider the soil moisture control on carbon
degradation and microbial uptake rates. Models 5C and 6C have an explicit
representation of soil moisture controls on the rates. Based on the dual
Arrhenius and Michaelis–Menten kinetics model, the original SOC degradation
rate,

Diagram of model 6C representing the processes of (1) degradation of
soil organic carbon (SOC) to dissolved organic carbon (DOC) through the
catalysis of enzymes (ENZ) produced by microbes (MIC), (2) MIC uptake of DOC,
and (3) microbial (MIC) respiration to produce

In addition to using the new rate equations, models 5C and 6C have more
carbon pools. In model 5C, DOC is split into two sub-pools for the wet zone
and the dry zone of soil pores, and only the wet DOC is used by MIC, as shown in
Fig. 1. The moisture-controlled microbial uptake rate becomes

Because it accounts for the moisture control and includes more pools, model 5C is expected to be significantly better than model 4C at simulating the Birch effect. As the accumulated ENZ in dry soil is secondary, model 6C is expected to be slightly better than model 5C. In terms of model structural error, model 4C has the largest error, model 5C has significantly less, and model 6C has the smallest. In other words, model 6C has the highest model fidelity (i.e., lowest model discrepancy) among the three models. As shown below, the degree of model structural error is reflected in the process of Bayesian inference and verified by the cross-validation.

Figure 2 plots the time series of 17 016 observations of soil moisture and

Time series of soil moisture and efflux observations. The dashed line marks the divide of the dataset into calibration and validation periods.

The parameters estimated in this study include the parameters of the soil
respiration models (4C–6C) and the parameters of the data models described
in Sect. 2.1. The estimated parameters of models 4C and 5C include the
microbial carbon use efficiency (CUE) (g g

The DREAM-based MCMC simulation is conducted for a total of 24 cases, the
combinations of eight data models and three soil respiration models. For each
case, the parameter distributions are obtained after drawing a total of

Three criteria are used to evaluate the predictive performance of the soil respiration models and data models: the central mean tendency, the dispersion, and the reliability. Each criterion is measured by a single metric. In addition, a newly defined metric by Elshall et al. (2018) is also used to simultaneously measure the three criteria.
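Two of these criteria can be illustrated with a short sketch: the standard Nash–Sutcliffe efficiency for the central tendency, and the percentage of observations falling inside a central prediction interval for reliability. The function names, the interpolation rule inside `quantile`, and the 95 % default level are assumptions for illustration; the sharpness and combined-score definitions used in the study are not reproduced here.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def nash_sutcliffe(obs, sim):
    """1 minus the ratio of squared model error to the variance of the data."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((y - f) ** 2 for y, f in zip(obs, sim))
    sst = sum((y - mean_obs) ** 2 for y in obs)
    return 1.0 - sse / sst


def predictive_coverage(obs, ensemble, level=0.95):
    """Percentage of observations inside the central prediction interval.

    ensemble[i][t] is the prediction of ensemble member i at time t.
    """
    alpha = (1.0 - level) / 2.0
    inside = 0
    for t, y in enumerate(obs):
        column = [member[t] for member in ensemble]
        if quantile(column, alpha) <= y <= quantile(column, 1.0 - alpha):
            inside += 1
    return 100.0 * inside / len(obs)
```

An efficiency of 1 means the mean prediction matches the data exactly, and a value of 0 means it does no better than the mean of the observations.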

The central mean tendency is measured in this study using the Nash–Sutcliffe
model efficiency (NSME) coefficient (Nash and Sutcliffe, 1970),

In addition to the central mean tendency, it is also desirable that the
ensemble is precise, with small dispersion, and reliable to cover all of the data.
This study uses a nonparametric metric for dispersion, which is the
sharpness of a prediction interval (e.g., M. W. Smith et
al., 2010):

To account for the trade-off between the three metrics, Elshall et al. (2018)
defined the relative model score (RMS), which simultaneously measures all three
criteria. Scoring rules are commonly used in hydrology to assess predictive
performance (e.g., Weijs et al., 2010; Westerberg et al., 2011). The RMS is used
in this study to measure the relative predictive performance of the
combinations of soil respiration models and data models. For combination

This section analyzes the residuals of the best realization (with the highest likelihood value) of the MCMC simulation to understand whether the assumptions of the eight data models hold. The impacts of the data models on the posterior parameter distributions are also analyzed.
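A minimal sketch of two such residual checks is given below: the sample autocorrelation at a given lag, and the correlation between the absolute residual and the simulated magnitude as a crude indicator of heteroscedasticity. Both function names and the choice of Pearson correlation as the heteroscedasticity indicator are assumptions for illustration, not the diagnostics used to produce the figures.

```python
import math


def lag_autocorrelation(resid, lag=1):
    """Sample autocorrelation of the residuals at a positive integer lag."""
    mean = sum(resid) / len(resid)
    centered = [r - mean for r in resid]
    num = sum(centered[t] * centered[t - lag] for t in range(lag, len(centered)))
    den = sum(c * c for c in centered)
    return num / den


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def abs_residual_trend(resid, sim):
    """Correlation between |residual| and the simulated value.

    Values well above zero suggest that the residual spread grows with the
    magnitude of the simulated output.
    """
    return pearson([abs(r) for r in resid], sim)
```

A lag-1 autocorrelation near zero is consistent with independent residuals, while values near one point to strong temporal correlation.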

Residual analysis of the best realization (among multiple MCMC
realizations) for model 6C using the

Figure 3 shows residual plots for model 6C based on the SLS and WSEP-AC data
models. SLS is the simplest data model with the assumptions of homoscedastic,
independent, and Gaussian residuals, and WSEP-AC is the most complex model
without the assumptions. Model 6C is the most complex model and also the best
model as ranked by Zhang et al. (2014) using Bayesian model selection. The
variable

Residual quantile–quantile (Q–Q) plots of the best realization (among multiple MCMC realizations) for the three soil respiration models and eight data models.

Although the Gaussian assumption used in SLS is violated for model 6C
(Fig. 3c), this is not generally the case for other data models and soil
respiration models. This is shown in Fig. 4, which presents the
quantile–quantile (Q–Q) plot for the eight data models and the three soil
respiration models. For SLS, WLS, SLS-AC, and WLS-AC, the theoretical
quantiles are based on the standard normal distribution,

While Figs. 3 and 4 help understand the validity of the three assumptions used in the data models, the impacts of the data models on estimating model parameter distributions must be evaluated separately. This section discusses the impact of the data model selection on parameter estimation with the objective of understanding whether the incorrect specification of the data model necessarily leads to biased parameter estimates. Such assessment is not a trivial task for two main reasons. First, microbial soil respiration models aggregate complex natural processes and spatial details into simpler conceptual representations. As a result, several model parameters are effective values of several complex natural processes that cannot actually be measured in the field, as discussed by Vrugt et al. (2013). Second, even for model parameters that can be measured in the field, as the model structure is imperfect, calibrated parameter values are sometimes beyond their physically reasonable range, as discussed by Pappenberger and Beven (2006). This is often undesirable, if we seek to make the models more mechanistically descriptive.

We focus our discussion on the carbon use efficiency (CUE) for microbial
growth for two reasons: (1) the CUE is a fundamental parameter in
microbial soil respiration models, and (2) a physically reasonable range can
be estimated for the CUE. The concept of microbial CUE (Allison et al., 2010;
Bradford et al., 2008; Manzoni et al., 2012; Wieder et al., 2013) has been
used to represent fundamental microbial processes in recent microbial enzyme
models (Allison et al., 2010; German et al., 2011; Schimel and Weintraub,
2003; Wang et al., 2013). The microbial CUE, which is marked between MIC and

Marginal posterior parameter density of carbon use efficiency (CUE) for the three soil respiration models and eight data models. The yellow shaded areas represent the reasonable physical range of CUE (0–0.6).

Figure 5 plots the CUE posterior marginal density of the three soil respiration models obtained using the eight data models. The physical range between zero and 0.6 is marked in yellow. Figure 5 shows that the CUE posterior parameter distribution of model 6C obtained using the data models that do not account for autocorrelation is within the physically reasonable range. For models 4C and 5C, the posterior parameter samples are outside the range for six data models. For model 4C, the posterior parameters are only within the physical range for the SEP and WSEP data models; for model 5C, the two data models are WLS and WSEP. It is not surprising to find the posterior parameter distributions of models 4C and 5C, which have a certain degree of model structure error, to be outside of the physically plausible range. This can be attributed to two reasons. First, the model solution can be biased by the processes missing from the model structure, such as the additional carbon pool missing from both 4C and 5C or the explicit accounting for soil moisture missing from 4C. Second, biased parameter estimation can compensate for model structure inadequacy and other sources of discrepancy in both the physical models and the data models.
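The visual check against the physical range can also be expressed as a single number, the share of posterior samples that fall inside the range. The sketch below assumes the posterior samples are available as a plain list; the function name and the example values are hypothetical.

```python
def fraction_in_range(samples, lower=0.0, upper=0.6):
    """Share of posterior samples inside the physically reasonable range."""
    inside = sum(1 for s in samples if lower <= s <= upper)
    return inside / len(samples)


# Made-up CUE samples, for illustration only.
cue_samples = [0.1, 0.3, 0.7, 0.9]
share = fraction_in_range(cue_samples)
```

A share close to one corresponds to a posterior density that sits almost entirely within the shaded band.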

In addition, it is important to understand how accounting for autocorrelation, heteroscedasticity, and non-Gaussian residuals can affect the parameter estimation. First, Fig. 5e–h show that the parameter estimates are biased and fall outside of the physically reasonable range when autocorrelation is explicitly accounted for. This may suggest again that accounting for heteroscedasticity is desirable but accounting for autocorrelation is not. A possible reason is that filtering autocorrelation may reduce the residual space such that the transformed residual space cannot correspond to the parameter space of the models. In other words, parameter information may be lost due to filtering out autocorrelation. However, it is not fully understood why this does not occur for model 6C under data model SLS-AC (Fig. 5e), and more research is warranted. Second, unlike accounting for autocorrelation, accounting for heteroscedasticity alone (i.e., WLS and WSEP) only amplifies or reduces the variance without affecting the structure of the residual space. Figure 5c and d show that accounting for heteroscedasticity (i.e., WLS and WSEP) tends to improve the parameter estimation in comparison with the homoscedastic data models (i.e., SLS and SEP) shown in Fig. 5a and b. Finally, with respect to non-Gaussian residuals, Schoups and Vrugt (2010) suggested that, compared to the Gaussian PDF, the peaked PDF of the SEP with a longer tail is useful for making the parameter inference robust against outliers. To a certain degree, this can be substantiated by the results in Fig. 5a–d, in that SEP and WSEP provide more favorable parameter estimates than SLS and WLS.

Finally, Fig. 5a shows that the posterior parameter distributions of SLS are very narrow for the three soil respiration models. The narrow distributions can be attributed to several reasons. As a SEP distribution can have longer tails than a Gaussian distribution, this can further increase the sample's acceptance ratio from tails resulting in a wider distribution (Fig. 5b). In addition, accounting for heteroscedasticity will result in a wider posterior parameter distribution (Fig. 5c) due to accepting higher variances at peak effluxes. Moreover, filtering correlation (Fig. 5e–h) increases the entropy, and leads to wider distributions.

Based on the last one-third of the

In this section the ensemble is generated by running the soil respiration models with the posterior samples (obtained from the Bayesian inference) of the physical model parameters. In other words, the ensemble addresses parametric uncertainty of the soil respiration models only. Considering the relative contribution of parametric uncertainty only will provide insights for modeling approaches that attempt to segregate various sources of uncertainty (e.g., Thyer et al., 2009; Tsai and Elshall, 2013). The four statistics above (i.e., NSME, sharpness, coverage, and RMS) are calculated for the three soil respiration models and the eight data models. Taking the SLS and WSEP-AC data models as examples, Fig. 6 plots the data (for the calibration and cross-validation periods separately) along with the mean and 95 % credible intervals of the prediction ensemble for the three models.
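The construction of such a parametric ensemble can be sketched as follows. The `toy_model` function is a hypothetical stand-in with two made-up parameters, not any of the models 4C, 5C, or 6C, and the interpolation rule inside `quantile` is likewise an assumption for illustration.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def toy_model(params, forcing):
    """Hypothetical stand-in for a soil respiration model."""
    rate, base = params
    return [rate * x + base for x in forcing]


def ensemble_summary(posterior_samples, forcing, level=0.95):
    """Run the model once per posterior sample and summarize each time step."""
    ensemble = [toy_model(p, forcing) for p in posterior_samples]
    alpha = (1.0 - level) / 2.0
    mean, lower, upper = [], [], []
    for t in range(len(forcing)):
        column = [member[t] for member in ensemble]
        mean.append(sum(column) / len(column))
        lower.append(quantile(column, alpha))
        upper.append(quantile(column, 1.0 - alpha))
    return mean, lower, upper


# Made-up posterior samples (rate, base) and forcing values.
samples = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
mean, lower, upper = ensemble_summary(samples, [1.0, 2.0])
```

Because only the physical model parameters vary across ensemble members, the resulting interval reflects parametric uncertainty alone and contains no contribution from the data model.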

Observation data (blue dots), mean prediction (green line), and
95 % credible intervals (red line) of prediction ensembles for

Figure 6 shows that the data models affect model simulations for all of the models. The statistics, especially the RMS, indicate that WSEP-AC has better predictive performance than SLS. This is most visually obvious for model 6C during the cross-validation period after 330 d, as the prediction ensemble of SLS (Fig. 6k) cannot cover the observations, whereas the prediction ensemble of WSEP-AC can (Fig. 6l). This conclusion that WSEP-AC outperforms SLS agrees with the conclusion drawn from Figs. 3 and 4.

Figure 7 plots the four statistics for all of the soil respiration models and
data models. Figure 7a and b show the predictive performance with respect to
the central mean tendency measured by the NSME for both the calibration and
cross-validation periods, respectively. The results indicate that, under all
data models, the low-fidelity model 4C over-fits the data and results in
biased predictions, in that the NSME values become significantly worse (e.g.,
from 0.6 to

With respect to the parametric uncertainty estimation, Fig. 7c and d show that sharpness generally increases when the three assumptions in the data models are gradually relaxed from SLS to WSEP-AC. This is even more obvious during the validation period. Given that the prediction ensemble does not center on the data, the increasing sharpness is desirable as it improves reliability. This is confirmed by the reliability plots in Fig. 7e and f. The exceptions are once again for SLS-AC and SEP-AC that generally have the lowest coverage.

With respect to the overall predictive performance measured by the RMS, the same variation pattern and exception are also observed in the RMS plots in Fig. 7g and h. This is not surprising because the RMS is the metric that can be used to measure all three criteria (central mean tendency, sharpness, and reliability). As the prediction ensemble is not centered on the data, the sharpness and reliability are the decisive factors for evaluating the predictive performance.

In summary, while it is necessary to account for heteroscedasticity in a data model, caution is needed when accounting for autocorrelation in the manner described in Sect. 2.1. In addition, after comparing the RMS values of the residuals using the Gaussian and SEP distributions, the conclusion is that the SEP distribution outperforms the Gaussian distribution with respect to predictive performance. Finally, uncertainty underestimation is evidenced by the very small predictive coverage. The underestimation of uncertainty for all of the physical models with all of the data models is not unexpected because only parametric uncertainty is considered in this section. Considering the overall predictive uncertainty is the subject of the next section.

The simulated output

We start by undertaking a visual assessment of the predictive performance. Figure 8 is similar to Fig. 6 with the exception that Fig. 8 considers the overall predictive uncertainty (i.e., parametric and output uncertainty), whereas Fig. 6 only considers the parametric uncertainty. Figure 8 reveals a practical observation about accounting for the overall uncertainty using the lumped approach of sampling the data models. For example, Fig. 8b shows that, despite the wide prediction interval of model 4C, the model with significant model structure error cannot capture the Birch pulse around day 180. This indicates that properly using a data model for model residuals cannot compensate for significant model structure error.

Observation data (blue dots), mean prediction (green line), and
95 % credible intervals (red line) of prediction ensembles for

Figure 9 plots the four statistics (NSME, sharpness, predictive coverage, and RMS) of the three soil respiration models under the eight data models to assess the predictive performance. With respect to the central mean tendency, the NSME values in Fig. 9a and b are visually the same as those in Fig. 7a and b, indicating that the central mean accuracy under parametric uncertainty is the same as that under predictive uncertainty.

With respect to uncertainty, the values of sharpness and predictive coverage increase substantially (Fig. 9c–f). In particular, Fig. 9e and f show that, except for SLS and SEP, the predictive coverage of the remaining six data models is close to 100 % for all three soil respiration models, indicating that the prediction intervals cover almost all of the data. This is demonstrated in Fig. 6 for WSEP-AC. Similar to Figs. 7c and d, Figs. 9c and d also show a general pattern where the sharpness increases when the three assumptions in the data models are gradually relaxed from SLS to WSEP-AC. The data models that account for autocorrelation are still the exceptions.

With respect to the overall predictive performance, the RMS values are largely determined by the mean accuracy and sharpness as the predictive coverage is similar for different data models. Figure 9g and h of RMS show that the predictive performance of the four data models that account for autocorrelation is worse than that of the other four data models. This suggests again that one needs to be cautious when building autocorrelation into a data model. This is consistent with the finding of Evin et al. (2013, 2014) that accounting for autocorrelation before accounting for heteroscedasticity or jointly accounting for autocorrelation and heteroscedasticity can result in poor predictive performance. In summary, Fig. 9g and h show that accounting for heteroscedasticity in WLS and WSEP for both the calibration and prediction periods gives the best overall predictive performance, and accounting for autocorrelation without heteroscedasticity in SLS-AC and SEP-AC gives the worst overall predictive performance. Finally, for the three soil respiration models, RMS shows that model 4C has the worst predictive performance for both the calibration and cross-validation data. Generally speaking, the high-fidelity model 6C outperforms model 5C for both the calibration and cross-validation data, which justifies the complexity of model 6C.

To demonstrate the impacts of the data models on the predictive performance of the soil respiration models, Fig. 10 plots the model simulations and predictions given by model 6C during the calibration and cross-validation periods using all eight data models. Figure 10 is used to investigate predictive performance characteristics of the different data models. By examining the predictive performance of model 6C, specific predictive performance patterns can be identified. Figure 10a–d show that SLS and SEP have similar predictive performance, with SEP generally having better predictive performance, especially during the validation period. Not accounting for heteroscedasticity will underestimate the prediction uncertainty (Fig. 10b, d). This is mainly because the variance of the efflux residuals increases with the magnitude of the carbon effluxes (Fig. 3a); thus, assuming constant variance is not representative. Accordingly, accounting for heteroscedasticity using WLS (Fig. 10e) or WSEP (Fig. 10h) will make the predictions more sensitive to peak carbon effluxes. This will generally improve the predictive coverage at the expense of sharpness and the central mean tendency. While WLS and WSEP have similar predictive performance, WSEP has better central mean tendency and overall predictive performance than WLS. Figure 10i–l show that accounting for autocorrelation using SLS-AC and SEP-AC results in wider uncertainty bands and insensitivity to peak carbon effluxes compared with SLS and SEP (Fig. 10a–d), which may be due to the reduction in the information content of the residuals. This results in the deterioration of the sharpness, the central mean tendency, and the capturing of peak carbon fluxes, especially during the validation period. Figure 10m–p show that accounting for both heteroscedasticity and autocorrelation using WLS-AC and WSEP-AC makes the inference robust against peak carbon effluxes. However, due to the loss of information content, the uncertainty bands are still wider, and the uncertainty becomes overestimated, especially during the validation period, compared with WLS and WSEP (Fig. 10e–h). The results of models 4C and 5C, which are not shown here, also display the same prediction patterns with respect to non-Gaussian residuals, heteroscedasticity, and autocorrelation.

Observation data (blue dots), mean prediction (green line), and
95 % credible intervals (red line) for 6C for the eight likelihood
functions during the calibration period

Finally, we observe in Fig. 10 that the data models that have good overall predictive performance as measured by RMS during the calibration period will maintain this good predictive performance during the validation period. For model 6C, the RMS values for the calibration and validation periods are very well correlated with a correlation coefficient of 0.92. However, we note that for models 4C and 5C the overall predictive performances during the calibration and validation periods are not as well correlated as for 6C, with correlation coefficients of 0.52 for model 4C and 0.61 for model 5C. This suggests that model 6C is more robust than 4C and 5C for forecasting and hindcasting.

Accounting for autocorrelation can lead to biased parameter estimation
(Fig. 5) and poor predictive performance (Fig. 10). Autocorrelated residuals
may be attributed to model discrepancy, as shown in Lu et al. (2013). The
most obvious solution to handle the autocorrelation is to reduce the
autocorrelation by improving the soil respiration model. If model improvement
is difficult for practical reasons, we can improve the data model to better
characterize the autocorrelation. Addressing autocorrelation in a data model
is challenging, as it involves several interlinked factors as follows:

Non-stationarity could be a reason for this problem. By drawing on similarity from surface hydrology, the study of Ammann et al. (2018) suggests that autocorrelated residuals might be attributed to non-stationarity due to wet–dry periods with half-hourly data. Accounting for non-stationarity due to wet–dry periods could better address the problem of autocorrelated residuals (Ammann et al., 2018; T. Smith et al., 2010).

The way that autocorrelation is implemented could have an impact.
Autocorrelation could be directly applied to raw residuals (e.g., Li et
al., 2015), to transformed residuals based on the covariance matrix of
residuals

The autocorrelation model could have an impact. Using an autoregressive model is a popular technique to account for autocorrelated residuals. However, using an autoregressive model with either a joint inversion approach (e.g., this study and Schoups and Vrugt, 2010) or sequential approaches (e.g., Evin et al., 2013, 2014; Lu et al., 2013) removes correlation errors via a filter approach, which can lead to a loss of information content. As this may cause an overcorrection of predictions, especially at surge events, Li et al. (2015) developed a restricted autoregressive model to overcome this adverse effect. Other autocorrelation models include the moving average model and the mixed autoregressive moving-average model (Chatfield, 2003).

Joint vs. sequential inversion for autocorrelation could have an impact. Sequential inversion approaches include two-step procedures (e.g., Evin et al., 2013, 2014; Lu et al., 2013) or the multi-step procedure (M. Li et al., 2016). These sequential approaches estimate the autoregressive parameters in a later step, after estimating the physical model parameters and other data model parameters. Evin et al. (2013, 2014) used a sequential approach to avoid the interaction between the parameters of the heteroscedasticity model and the autocorrelation model. In addition, the autoregressive model parameters can be deterministically calculated as internal variables of the data model, similar to Lu et al. (2013), rather than treated as calibration parameters (e.g., Schoups and Vrugt, 2010; Evin et al., 2013, 2014). While the first step in the sequential approach would avoid the biased parameter estimation (Fig. 10a–d), the second step can still lead to poor predictive performance, as we are essentially using a filter approach to remove residual correlation. To address this problem, M. Li et al. (2016) utilize a multi-step procedure that is based on a Gaussian data model with a restricted autoregressive model. Generally, the study by Ammann et al. (2018) states that joint inversion is still preferred, and that the conditions under which accounting for autocorrelation can be achieved remain poorly understood.
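As an illustration of the filter approach discussed above, the following minimal sketch applies a first-order autoregressive filter to a residual series and estimates the autoregressive coefficient from the raw residuals in a separate step. It is not the implementation used in this study or in the cited studies: the first-order form, the function names `ar1_innovations` and `lag1_autocorrelation`, and the synthetic residual series are all assumptions made for illustration.

```python
import numpy as np

def ar1_innovations(residuals, phi):
    """Filter residuals e_t into innovations eta_t = e_t - phi * e_{t-1}."""
    e = np.asarray(residuals, dtype=float)
    eta = np.empty_like(e)
    eta[0] = e[0]  # no predecessor for the first residual; kept unfiltered here
    eta[1:] = e[1:] - phi * e[:-1]
    return eta

def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation coefficient of a series."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))

# Synthetic (hypothetical) residuals with strong temporal correlation.
rng = np.random.default_rng(0)
n, phi_true = 2000, 0.9
noise = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = phi_true * e[t - 1] + noise[t]

# Estimate phi from the raw residuals in a later step, then filter.
phi_hat = lag1_autocorrelation(e)
eta = ar1_innovations(e, phi_hat)

print("estimated phi:", round(phi_hat, 3))
print("lag-1 autocorrelation of innovations:", round(lag1_autocorrelation(eta), 3))
```

In this sketch the likelihood would be evaluated on the innovations `eta` rather than on the raw residuals `e`, which is the step through which information content can be lost.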

In parameter estimation and prediction of soil carbon fluxes to
the atmosphere, one often assumes that residuals, which include errors in
observations, model inputs, parameter estimates, and model structures, are
normally distributed, homoscedastic, and uncorrelated. We study these
assumptions by calibrating three soil respiration models, which have varying
degrees of model structure errors. We further explore eight data models that
statistically characterize the residuals; we start with the standard least
squares (SLS) and skew exponential power (SEP) data models that assume
homoscedastic and non-correlated residuals. For these two distributions, we
evaluate six other data models that account for heteroscedasticity (WLS and
WSEP), autocorrelation (SLS-AC and SEP-AC), and joint inversion of
heteroscedasticity and autocorrelation (WLS-AC and WSEP-AC). To our knowledge,
this is the first study that provides such a detailed analysis for soil
respiration inverse modeling. We also use three soil respiration models with
different degrees of model fidelity (i.e., model discrepancy) and model
complexity (i.e., number of model parameters) to understand the impact of
model discrepancy on the calibration results under different data models. We
analyze the results with respect to (1) residual characterization,
(2) parameter estimation, (3) predictive performance, and (4) impacts of
model discrepancy. The main findings of this study are summarized as follows:

With respect to residual characterization, residual analysis results suggest that the common assumption of not accounting for heteroscedasticity and residual autocorrelation in the SLS and SEP data models results in the poor characterization of residuals. Explicitly accounting for heteroscedasticity in WLS and WSEP results in the significantly improved characterization of the residuals, and the improvement is larger than that obtained by accounting for both heteroscedasticity and autocorrelation in WLS-AC and WSEP-AC. Accounting for autocorrelation only in SLS-AC and SEP-AC does not significantly improve the characterization of the residuals.

With respect to parameter estimation, the impacts of the data models are evaluated by focusing on the carbon use efficiency (CUE), which is a central parameter in soil respiration modeling. Using SLS yields relatively reasonable posterior parameter distributions for the CUE, yet very narrow posteriors. The SLS-AC, SEP-AC, WLS-AC, and WSEP-AC data models that consider autocorrelation tend to yield CUE estimates that are physically unreasonable. We speculate that filtering residual correlation can affect the mapping of the model physics (as implicitly included in the residuals) into the parameter space, which might result in biased parameter estimates that are physically unreasonable.

Predictive performance is measured by four statistical criteria: central mean tendency, sharpness, coverage, and relative model score for both the calibration and the cross-validation periods. Results show that accounting for autocorrelation in SLS-AC, SEP-AC, WLS-AC, and WSEP-AC reduces the predictive performance, such that the predictive performance is inferior to that of SLS in terms of the central mean tendency and overall predictive performance (measured by the relative model score), especially during the cross-validation period. Results also indicate that using the SEP distribution can potentially improve the predictive performance. The same is true for accounting for heteroscedasticity. Combining the SEP distribution with heteroscedasticity (i.e., WSEP) can also potentially improve the predictive performance.

With respect to the impact of model discrepancy, the high-fidelity model (6C) gives the best results with respect to parameter estimation and predictive performance. Model 6C generally maintains its superior performance under different data models. This justifies the complexity of model 6C relative to model 5C that has one less carbon pool. Model 4C, with the lowest fidelity, maintains its poor performance for different data models, because the model only has four carbon pools and lacks the explicit representation of soil moisture control.
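To make the distinction between the homoscedastic and heteroscedastic Gaussian data models concrete, the following minimal sketch compares an independent Gaussian log-likelihood under a constant residual standard deviation (as in SLS) with one under a standard deviation that grows with the simulated flux (in the spirit of WLS). The linear form `sigma0 + sigma1 * |sim|`, the function names, and the synthetic data are assumptions made for illustration; the sketch does not reproduce the data models or results of this study.

```python
import numpy as np

def gaussian_loglik(obs, sim, sigma):
    """Independent Gaussian log-likelihood; sigma may be a scalar or per-point array."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), obs.shape)
    z = (obs - sim) / sigma
    return float(-0.5 * np.sum(z ** 2)
                 - np.sum(np.log(sigma))
                 - 0.5 * obs.size * np.log(2.0 * np.pi))

def linear_sigma(sim, sigma0, sigma1):
    """Assumed heteroscedasticity model: sigma grows linearly with |sim|."""
    return sigma0 + sigma1 * np.abs(np.asarray(sim, dtype=float))

# Synthetic (hypothetical) data whose error variance increases with magnitude.
rng = np.random.default_rng(42)
sim = np.linspace(0.5, 10.0, 500)
obs = sim + rng.normal(size=sim.size) * linear_sigma(sim, 0.1, 0.2)

# Homoscedastic case: a single sigma taken from the residual spread.
ll_sls = gaussian_loglik(obs, sim, np.std(obs - sim))
# Heteroscedastic case: sigma follows the assumed linear model.
ll_wls = gaussian_loglik(obs, sim, linear_sigma(sim, 0.1, 0.2))

print("log-likelihood, constant sigma:", round(ll_sls, 1))
print("log-likelihood, linear sigma:  ", round(ll_wls, 1))
```

In an actual calibration, `sigma0` and `sigma1` would be inferred jointly with the physical model parameters rather than fixed at known values as they are here.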

Based on the empirical findings above, we conclude the following:

Not accounting for heteroscedasticity and autocorrelation using a Gaussian or non-Gaussian data model might not necessarily result in biased parameter estimates or biased predictions with respect to the central mean tendency, but will definitely underestimate uncertainty resulting in lower overall predictive performance.

Using a non-Gaussian data model can improve the parameter estimation and predictive performance with respect to the central mean tendency and the uncertainty quantification.

Accounting for heteroscedasticity improves the uncertainty estimation with respect to reliability at the cost of having a wider predictive interval.

This study confirms other empirical findings and theoretical analyses (Evin et al., 2013, 2014; Li et al., 2015; Ammann et al., 2018) which propose that separately accounting for autocorrelation or jointly accounting for autocorrelation and heteroscedasticity can be problematic. While the reasons remain poorly understood (Ammann et al., 2018), this might be attributed to non-stationarity due to wet–dry periods with half-hourly data (Ammann et al., 2018) or to the method of handling autocorrelation (e.g., Schoups and Vrugt, 2010; Evin et al., 2013, 2014; Lu et al., 2013; M. Li et al., 2015, 2016; Ammann et al., 2018). Further investigation to address autocorrelation in soil respiration modeling is warranted in a future study.

The above conclusions are subject to several limitations. First, the conclusions are specific to the soil respiration models developed and validated for semi-arid savannah landscapes. Performance variations across different soil respiration models with different levels of complexity are possible. Second, the conclusions are conditioned on data that were obtained at half-hourly intervals over a 1-year period. Different conclusions would be possible if the data were thinned to daily or weekly scales or if data from longer observation periods were used. Third, our study investigates the effects of the residual assumptions of formal likelihood functions via direct conditioning of the residuals model parameters, yet this can also be undertaken using other approaches such as residuals transformation (Thiemann et al., 2001), autoregressive bias models (Del Giudice et al., 2013), approximate Bayesian computation (Sadegh and Vrugt, 2013), and data assimilation (Spaaks and Bouten, 2013). Comparing different methods for accounting for the residual assumptions is beyond the scope of this work. Fourth, this study focuses on formal Bayesian computation using formal likelihood functions, and comparison with other inference functions such as informal likelihood functions or approximate Bayesian computation is warranted in a future study.

Based on the aforementioned conclusions and limitations, we recommend beginning the calibration of soil respiration models with a simple SLS or SEP likelihood function. If the residuals characterization is adequate (e.g., Scharnagl et al., 2011), then the underlying assumptions are met. Otherwise, the complexity of the data model can be increased until satisfactory results are obtained in terms of residuals characterization, posterior parameter estimation, and predictive performance. This is similar to the procedure given in Smith et al. (2015). Although the empirical findings of this study provide general guidelines for data model selection for soil respiration modeling, more comparative studies are needed to validate or refute the findings of this study.
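The residual check in the recommended procedure could be sketched as follows. This is a minimal illustration under stated assumptions, not the diagnostic procedure of this study or of Scharnagl et al. (2011): the function name `residual_diagnostics`, the choice of summary statistics, and the synthetic data are all hypothetical. Each statistic targets one of the three residual assumptions: lag-1 autocorrelation for independence, skewness and excess kurtosis for normality, and the correlation between absolute residuals and simulated magnitude for constant variance.

```python
import numpy as np

def residual_diagnostics(obs, sim):
    """Summary statistics for checking common residual assumptions."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    res = obs - sim
    d = res - res.mean()
    var = np.mean(d ** 2)
    return {
        # Independence: values near 0 suggest little temporal correlation.
        "lag1_autocorr": float(np.sum(d[1:] * d[:-1]) / np.sum(d * d)),
        # Normality: both are near 0 for Gaussian residuals.
        "skewness": float(np.mean(d ** 3) / var ** 1.5),
        "excess_kurtosis": float(np.mean(d ** 4) / var ** 2 - 3.0),
        # Constant variance: values near 0 suggest homoscedastic residuals.
        "abs_res_vs_sim_corr": float(np.corrcoef(np.abs(sim), np.abs(res))[0, 1]),
    }

# Synthetic (hypothetical) case in which the SLS assumptions hold by construction.
rng = np.random.default_rng(1)
sim = np.linspace(1.0, 10.0, 5000)
obs = sim + rng.normal(0.0, 0.5, size=sim.size)

diag = residual_diagnostics(obs, sim)
for name, value in diag.items():
    print(f"{name}: {value:.3f}")
```

Values far from zero in any of these statistics would indicate which assumption is violated and therefore which extension of the data model to consider next.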

The data, codes, and models used to produce this
paper are available from the corresponding author at mye@fsu.edu. We
cannot publicly share the workflow because the MT-DREAM

The supplement related to this article is available online at:

ASE developed and implemented the code for the eight data models for soil respiration modeling, and prepared the paper with contributions from all co-authors. MY developed the research idea and outline, and supervised the research implementation while ASE was a post-doc at Florida State University. GN developed the soil respiration models. GAB collected and processed the eddy-covariance data used for model calibration.

The authors declare that they have no conflict of interest.

The first two authors were supported by the U.S. Department of Energy grant no. DE-SC0008272. The first author was also partly supported by the U.S. National Science Foundation award no. OIA-1557349. The second author was also partly supported by U.S. Department of Energy grant no. DE-SC0019438 and U.S. National Science Foundation grant no. EAR-1552329. We thank the two anonymous reviewers for providing comments that helped to improve the paper.

This paper was edited by Christoph Müller and reviewed by two anonymous referees.