Data assimilation methods provide a rigorous statistical framework for constraining parametric uncertainty in land surface models (LSMs), which in turn helps to improve their predictive capability and to identify areas in which the representation of physical processes is inadequate. The increase in the number of available datasets in recent years allows us to address different aspects of the model at a variety of spatial and temporal scales. However, combining data streams in a DA system is not a trivial task. In this study we highlight some of the challenges surrounding multiple data stream assimilation for the carbon cycle component of LSMs. We give particular consideration to the assumptions associated with the type of inversion algorithm that is typically used when optimising global LSMs – namely, Gaussian error distributions and linearity in the model dynamics. We explore the effect of biases and inconsistencies between the observations and the model (resulting in non-Gaussian error distributions), and we examine the difference between a simultaneous assimilation (in which all data streams are included in one optimisation) and a step-wise approach (in which each data stream is assimilated sequentially) in the presence of non-linear model dynamics. In addition, we perform a preliminary investigation into the impact of correlated errors between two data streams for two cases: when the correlated observation errors are included in the prior observation error covariance matrix, and when they are ignored. We demonstrate these challenges by assimilating synthetic observations into two simple models: the first a simplified version of the carbon cycle processes represented in many LSMs and the second a non-linear toy model. Finally, we provide some perspectives and advice to other land surface modellers wishing to use multiple data streams to constrain their model parameters.

The carbon cycle is an important component of the Earth system, especially
when considering the climatic impact of rising greenhouse gas concentrations
from fossil fuel emissions and land use change. It is estimated that the
oceans and land surface absorb approximately half of the CO

Aside from model structural and forcing errors, one source of uncertainty is
related to the parameter (i.e. fixed) values of a model. Model–data fusion,
or data assimilation (DA), allows the calibration, or optimisation, of these
values by minimising a cost function that quantifies the model–data misfit
while accounting for the uncertainties inherent in both the model and data in
a statistically rigorous framework. The C cycle component of most LSMs is
complex and contains a large number of parameters; fortunately, however, there is
an increasing number of in situ and remote-sensing-based data streams that
can be used for parameter optimisation. These data bring information on
different spatial and temporal scales, such as

atmospheric CO

eddy covariance net CO

satellite-derived measures of vegetation dynamics, including “greenness” indices (i.e. the Normalised Difference Vegetation Index – NDVI), fraction of absorbed photosynthetically active radiation (FAPAR) and leaf area index (LAI): provided at global scales, and up to daily time steps spanning more than a decade, thus capturing IAV and long-term trends (though usually with a trade-off between spatial and temporal resolution);

satellite-derived measurements of soil moisture and land surface temperature: measured at the same temporal and spatial scales as the satellite-derived observations of vegetation dynamics;

aboveground biomass measurements: currently taken at only one or a few points in time at plot scale up to regional scale from aircraft and satellite data, or are estimated from allometric relationships at each site;

soil C stock estimates: usually only taken at one point in time at plot scale;

ancillary data on vegetation characteristics such as tree height or budburst (such data are only measured at certain well-instrumented sites).

This tutorial-style paper highlights some of the challenges of multiple data stream optimisation of carbon cycle models discussed above. Note that we do not aim to explore all possible issues related to a DA system, for example the choice of the cost function, minimisation algorithm, or the characterisation of the prior error distributions; indeed, previous studies have investigated such aspects at length (e.g. Fox et al., 2009; Trudinger et al., 2007), and therefore we refer the reader to these papers for more information. Section 2 reviews recent carbon cycle multiple data stream assimilation studies with reference to some of the aforementioned challenges. Section 3 demonstrates some of these issues related to multiple data stream assimilation with synthetic experiments using two simple models: one a simplified version of the carbon dynamics included in many LSMs and the other a “toy” model designed to demonstrate the issues that arise with complex, non-linear models. Finally, Sect. 4 provides some advice to land surface modellers wishing to carry out multiple data stream assimilation to constrain the parameters of their model.

Most site-based carbon cycle data assimilation studies have used eddy covariance measurements of NEE and LE fluxes to constrain the relevant parameters of ecosystem models. However, a few studies have also made use of chamber flux soil respiration data and field measurements of vegetation characteristics (e.g. tree height, budburst, LAI) or estimates of litterfall and carbon stocks as ancillary information (e.g. Fox et al., 2009; Keenan et al., 2012; Thum et al., 2016; Van Oijen et al., 2005; Richardson et al., 2010; Williams et al., 2005). Two recent studies combined high-resolution satellite-derived FAPAR data with in situ eddy covariance measurements to optimise parameters related to carbon, water and energy cycles of the ORCHIDEE and BETHY LSMs at a couple of sites (Bacour et al., 2015; Kato et al., 2013, respectively).

At global scales the number of studies that use multiple data streams from
satellites or large-scale networks to optimise LSMs has been increasing in
recent years, although this remains a relatively new area of research.
CCDAS-BETHY was the first global carbon cycle data assimilation system
(CCDAS) to make use of the high-precision measurements of the atmospheric
CO

Three other global CCDASs based on LSMs that are part of Earth system models
(ESMs) have been developed in recent years (Peylin et al., 2016; Raoult et
al., 2016; Schürmann et al., 2016), two of which used multiple data streams as constraints. Schürmann
et al. (2016) optimised model parameters and initial conditions of the land
component, JSBACH (Raddatz et al., 2007), of the MPI ESM (Giorgetta et al.,
2013) using atmospheric CO

Many of the aforementioned studies reported that adding extra data streams
helped to constrain unresolved sub-spaces of the total parameter space.
Richardson et al. (2010) and Keenan et al. (2012) concluded that using
ancillary information (e.g. woody biomass increment, field-based LAI and
chamber measurements of soil respiration), in addition to NEE
data, provided a valuable extra
constraint on many model parameters, which improved both the bias in model
predictions and reduced the associated uncertainties. The results of the
REFLEX model–data fusion inter-comparison project also indicated that
observations of the different carbon pools would help to constrain parameters
such as root allocation and woody turnover that were not well resolved using
NEE and LAI data alone (Fox et al., 2009). Similarly at global scale, Scholze
et al. (2016) found that assimilating SMOS soil moisture data in addition
CO

On the other hand, Williams et al. (2005) observed that one-off, or rarely taken, measurements of carbon stocks were unable to constrain components of the carbon cycle to which they were not directly related. This raises the issue of the relative influence of different data streams in a joint assimilation, particularly if the number of observations for each is vastly different, which will be the case when assimilating both half-hourly C flux data and C stock observations that are typically available at an annual timescale or greater. The spatial distribution of each data stream is also important, especially for heterogeneous landscapes (Barrett et al., 2005; Alton, 2013).

Although a number of multiple data stream assimilation studies exist at various scales, very few studies have specifically investigated the added benefit of different combinations of data streams in a factorial study, with a few notable exceptions (Barrett et al., 2005; Richardson et al., 2010; Kato et al., 2013; Keenan et al., 2013; Bacour et al., 2015; Schürmann et al., 2016). Kato et al. (2013) and Bacour et al. (2015) both evaluated the complementarity of eddy covariance and FAPAR data streams at site level, i.e. the impact of assimilating one individual data stream on the other model state variable, as well as when both data streams were included in the optimisation (see discussion in Sect. 2.2). The study of Keenan et al. (2013) was particularly notable in its aim to quantify which data streams provide the most information (in terms of model–data mismatch) and how many data streams are actually needed to constrain the problem. They reported that of the 17 field-based data streams available, projections of future carbon dynamics were well-constrained with only 5 of the data sources and, crucially, not with eddy covariance NEE measurements alone. These results may be specific to this site or type of ecosystem, but their study highlights the need for further research in this area and, in particular, for synthetic data experiments that allow us to understand which data will be the most useful for a given scientific question. This will also enable researchers to plan more efficient measurement campaigns with experimentalists, as also pointed out by Keenan et al. (2012).

Despite the theoretical benefit of adding data streams into an assimilation system as additional constraints, several of the aforementioned studies at both site and global scale have reported a bias or inconsistency either between the different observation data streams, or between the observations and the model. This is easily detected when the optimisation of one data stream results in a worse fit than the prior in one or more of the other data streams. Thum et al. (2016) found that the addition of aboveground biomass stocks brought a longer-term constraint on allocation parameters, but they noted an incompatibility when assimilating both annual increment and total biomass data to optimise the longer timescale mortality/turnover parameter. This was because the total stocks take into account losses related to disturbance and management (e.g. canopy thinning) – processes that were not included in that version of the model.

Kato et al. (2013) assimilated SeaWiFS FAPAR (Gobron et al., 2006) and eddy covariance LE measurements at the FLUXNET site in Maun, Botswana. They showed that the individual assimilation of each of the two data streams resulted in a perfect (i.e. within the observational uncertainty) fit to the assimilated dataset, but a considerable degradation of the fit to the non-assimilated dataset compared to the prior. A comparison against eddy covariance measurements of gross carbon uptake (gross primary production – GPP) pointed to a bias in the FAPAR data because the fit to the independent GPP data was degraded after assimilating FAPAR data only, while the fit improved after assimilating the LE data only. Nevertheless, the simultaneous assimilation of both data streams achieved a compromise between the two suboptimal results achieved after assimilating only one data stream. The calibration further limited the number of parameters with correlated errors and yielded a higher theoretical reduction in parameter uncertainty and a decrease in the RMSD by 16 % for the GPP data compared to the prior.

Bacour et al. (2015) also noted that when assimilating in situ and satellite-derived FAPAR data and in situ NEE and LE flux data from two French FLUXNET sites into the ORCHIDEE LSM, both separately and together, the posterior parameter values changed significantly for the photosynthesis- and phenology-related parameters, depending on the bias between the model and the observations and the correlation between the parameter errors. When NEE data were assimilated alone there was an even stronger positive bias (model–observations) in the start of leaf onset in the FAPAR data than in the prior simulations, and no improvement in the maximum value. This was likely due to the fact that there were enough degrees of freedom to fit the NEE without changing the phenology-related parameters. Similarly, the fit to the NEE was degraded when the model was only optimised with FAPAR data. The model was able to fit the maximum FAPAR, but this resulted in an adverse effect on the carbon assimilation capacity of the vegetation. The authors argued this was related to incompatibilities between the FAPAR and both the model and NEE measurements, possibly due to the larger spatial footprint of the satellite-derived FAPAR data and/or inaccuracies in the retrieval algorithm. However, given that assimilating in situ FAPAR also degraded the fit to the NEE, they also speculated that the culprit may be an inconsistency between the model and the data due to the different characterisation of FAPAR or LAI in the model compared to the satellite retrieval algorithm. For example, satellite-derived greenness measures (FAPAR/NDVI) also contain information on the non-green elements of vegetation, but the model only simulates green LAI.
Furthermore parameters and processes in models have been developed at certain temporal and spatial scales; vegetation is often simply represented as a “big leaf” model in LSMs, taking no account of vertical canopy structure or the spatial heterogeneity in a scene, thus presenting an additional source of inconsistency compared to what is measured. The joint (simultaneous) assimilation of all three data streams in Bacour et al. (2015) reconciled the different sources of information, with an improvement in the model–data fit for NEE, LE and FAPAR. However, the compromise achieved in the joint assimilation was only possible when the FAPAR data were normalised to their maximum and minimum values, which partially accounted for any bias in the magnitude of the FAPAR or inconsistency with the model.
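The normalisation mentioned above can be illustrated with a short sketch. It assumes a simple min-max rescaling of each series to its own range; the function name and all numerical values are hypothetical and are not taken from Bacour et al. (2015). The sketch shows that an offset and a gain error in magnitude disappear after such a rescaling, while the timing of the seasonal cycle is retained. Biases of any other form would not be removed, which is consistent with the normalisation only partially accounting for inconsistencies.

```python
def normalise_min_max(series):
    """Rescale a series to the range [0, 1] using its own minimum and maximum."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("constant series: min-max normalisation is undefined")
    return [(value - lo) / (hi - lo) for value in series]


# Hypothetical modelled FAPAR-like seasonal cycle (illustrative values only).
model_fapar = [0.10, 0.20, 0.50, 0.80, 0.60, 0.20]
# Hypothetical observations with an offset and a gain error relative to the model.
biased_obs = [0.30 + 0.50 * value for value in model_fapar]

norm_model = normalise_min_max(model_fapar)
norm_obs = normalise_min_max(biased_obs)

# After normalisation the two series agree to within rounding error.
max_diff = max(abs(m - o) for m, o in zip(norm_model, norm_obs))
print(f"largest difference after normalisation: {max_diff:.2e}")
```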

The story of biases and apparent inconsistencies in FAPAR data does not end
there. A bias correction was also necessary in the study by Kaminski et
al. (2012) with CCDAS-BETHY using the MERIS FAPAR product in addition to
atmospheric CO

Resolving these apparent inconsistencies was beyond the scope of most of these studies, aside from applying a bias correction where one was evident. Aside from simple corrections, Quaife et al. (2008) and Zobitz et al. (2014) suggested that LSMs should be coupled to radiative transfer models to provide a more realistic and mechanistic observation operator between the quantities simulated by the model and the raw radiance measured by satellite instruments. This proposition follows experience gained in the case of atmospheric models for several decades (Morcrette, 1991).

The paper by Alton (2013) documents the only previous study to have used a
step-wise assimilation approach with more than two data streams, and they
found that the final parameter values were independent of the order of data
streams assimilated. No studies in the LSM community to date have explicitly
examined a step-wise vs. simultaneous assimilation framework with the same
optimisation system and model. The step-wise assimilation with the
ORCHIDEE-CCDAS detailed in Peylin et al. (2016) has been compared to a
simultaneous optimisation using the same three data streams (as well as the
same model and inversion algorithm) as part of an ongoing study. In the
simultaneous optimisation, the addition of NEE or atmospheric CO

The three sub-sections in Sect. 2 highlight examples within a carbon cycle modelling context of the three main challenges faced when performing a multiple data stream assimilation, namely, (i) the possible negative influence of including additional data streams on other model variables; (ii) the impact of bias in the observations, missing model processes or inconsistency between the observations and model (as discussed in Sect. 2.2); and (iii) the difference between a step-wise and simultaneous optimisation (and the order of data stream assimilation) if the assumptions of the inversion algorithm are violated, which is more likely to be the case with non-linear models when using derivative-based algorithms and least-squares formulation of the cost function (as discussed in Sect. 2.3). The latter point is important because derivative methods (compared to global search) are the only viable option for large-scale, complex LSMs given the time taken to run a simulation. In addition to the above three challenges we have performed a preliminary investigation into the impact of correlated errors between the data streams, which is a topic that has not yet been studied in the context of carbon cycle models to our knowledge.

This section aims to demonstrate these challenges using simple toy models and synthetic experiments where the true values of the parameters are known. Thus the following sections include a description of the toy models together with the derivation of synthetic observations, the inversion algorithm used to optimise the model parameters and the experiments performed, followed by the results for each test case.
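As a rough illustration of what a synthetic experiment involves, the sketch below generates pseudo-observations by running a model with known parameter values and adding random Gaussian noise. The model used here (`toy_model`) is a hypothetical stand-in, not one of the two models used in this study, and the parameter values, the noise level and the optional constant `bias` argument are assumptions chosen only for illustration.

```python
import random


def toy_model(params, n_steps):
    """Hypothetical stand-in model: relaxation of a state towards a plateau."""
    rate, plateau = params
    state, output = 0.0, []
    for _ in range(n_steps):
        state += rate * (plateau - state)
        output.append(state)
    return output


def synthetic_observations(true_params, n_steps, obs_sigma, bias=0.0, seed=0):
    """Run the model with the 'true' parameters and add Gaussian noise.

    A non-zero `bias` adds a constant offset, mimicking a model-data bias
    that the assigned (zero-mean) error distribution does not account for.
    """
    rng = random.Random(seed)
    truth = toy_model(true_params, n_steps)
    return [value + bias + rng.gauss(0.0, obs_sigma) for value in truth]


true_params = (0.1, 10.0)  # assumed 'true' values, for illustration only
obs = synthetic_observations(true_params, n_steps=50, obs_sigma=0.5, seed=1)
print(f"{len(obs)} pseudo-observations, first three: {obs[:3]}")
```

Because the true parameter values are known, the posterior values retrieved by any assimilation of `obs` can be compared directly with them.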

To demonstrate the challenges of multiple data stream assimilation in a
carbon cycle context, we have chosen a test model that represents a
simplified version of the carbon cycle dynamics typically implemented in most
LSMs. The model has been well-documented in Raupach (2007) and has been used
previously in the OptIC DA inter-comparison project (Trudinger et al., 2007).
It is based on two equations that describe the temporal evolution (on a daily
time step) of two living biomass (carbon) stores,

Although the simple carbon model contains a non-linear term it is
essentially still a quasi-linear model. In order to illustrate the
challenges associated with multiple data stream data assimilation for more
complex non-linear models, especially when using derivative methods, we
defined a simple non-linear toy model based on two equations with two
unknown parameters:

Most data assimilation approaches follow a Bayesian formalism which, simply
put, allows prior knowledge of a system (in this case the model parameters)
to be updated, or optimised, based on new information (from the
observations). In order to achieve this we define a “cost function” that
describes the misfit between the data and the model, taking into account
their respective uncertainties, as well as the uncertainty on the prior
information. If we follow a Bayesian formalism and least-squares
minimisation approach, and assume Gaussian probability distributions for the
model parameter and observation error variance/covariance, we derive the
following cost function (Tarantola, 1987):
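A minimal sketch of a cost function of this kind is given below. It assumes the generic Gaussian least-squares form, i.e. half the sum of the uncertainty-weighted squared model–data misfit and the uncertainty-weighted squared departure of the parameters from their prior values. For brevity it also assumes uncorrelated errors, so both covariance matrices are diagonal and are passed as lists of standard deviations. The function and argument names and the linear example model are hypothetical.

```python
def cost(params, prior, prior_sigma, obs, obs_sigma, model):
    """Gaussian least-squares cost, assuming diagonal error covariance matrices.

    params, prior, prior_sigma: sequences with one entry per parameter.
    obs, obs_sigma: sequences with one entry per observation.
    model: callable mapping a parameter list to simulated observations.
    """
    simulated = model(params)
    # Model-data misfit, weighted by the observation uncertainty.
    j_obs = sum(((s - o) / so) ** 2 for s, o, so in zip(simulated, obs, obs_sigma))
    # Departure from the prior, weighted by the prior parameter uncertainty.
    j_prior = sum(((p - pb) / sp) ** 2 for p, pb, sp in zip(params, prior, prior_sigma))
    return 0.5 * (j_obs + j_prior)


def linear_model(params):
    """Hypothetical example model: a straight line sampled at four time steps."""
    return [params[0] + params[1] * t for t in range(4)]


value = cost(
    params=[1.0, 0.0],
    prior=[0.0, 0.0],
    prior_sigma=[1.0, 1.0],
    obs=[0.0, 0.0, 0.0, 0.0],
    obs_sigma=[1.0, 1.0, 1.0, 1.0],
    model=linear_model,
)
print(f"cost = {value}")
```

With correlated errors the two sums become quadratic forms involving the inverses of the full covariance matrices.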

The aim of the inversion algorithm is to find the minimum of this cost
function, thereby achieving the best possible fit between the model
simulations and the measurements, conditioned on their respective
uncertainties and prior information. For cases where there is a strong linear
dependence of the model to the parameters (at least for variations in

Of course no inversion algorithm is perfect, and therefore if the characterisation of the error distribution is inaccurate, or when optimising strictly non-linear models, it is possible that the true “global” minimum of the cost function has not been found. Derivative methods in particular can get stuck in so-called “local minima”, preventing the algorithm from finding the true minimum. To address this issue we carry out a number of assimilations with different random first-guess points in the parameter space. If they all result in the same reduction in cost function value, we can have more confidence that the true minimum has been found.
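The multi-start check described above can be sketched as follows. This is an illustration under stated assumptions, not the inversion algorithm used in this study: the minimiser is a plain gradient descent with a finite-difference gradient and a step-halving line search, and `double_well` is a hypothetical one-parameter cost function with one global and one local minimum.

```python
import random


def double_well(x):
    """Hypothetical cost: global minimum near x = -1.04, local minimum near x = 0.96."""
    return (x[0] ** 2 - 1.0) ** 2 + 0.3 * x[0]


def numerical_gradient(f, x, h=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def descend(f, x0, n_iter=200):
    """Gradient descent that halves the step until the cost decreases."""
    x, fx = list(x0), f(list(x0))
    for _ in range(n_iter):
        grad = numerical_gradient(f, x)
        step = 1.0
        while step > 1e-12:
            trial = [xi - step * gi for xi, gi in zip(x, grad)]
            f_trial = f(trial)
            if f_trial < fx:
                x, fx = trial, f_trial
                break
            step *= 0.5
        else:
            break  # no improving step found: treat as converged
    return x, fx


def multi_start(f, starts):
    """Run one minimisation per first-guess point; return (parameters, cost) pairs."""
    return [descend(f, start) for start in starts]


rng = random.Random(42)
starts = [[rng.uniform(-2.0, 2.0)] for _ in range(20)]
results = multi_start(double_well, starts)
best = min(results, key=lambda r: r[1])
n_best = sum(1 for _, fx in results if fx - best[1] < 1e-6)
print(f"{n_best} of {len(results)} first guesses reached the lowest cost {best[1]:.4f}")
```

If only a fraction of the first guesses reach the lowest cost, the remaining runs have probably stopped in a local minimum.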

Once the minimum of the cost function has been found, the posterior parameter
error covariance can be approximated (using the linearity assumption) from
the inverse Hessian of the cost function around its minimum, which is
calculated using the Jacobian of the model at the minimum of

In the step-wise approach each data stream (in our cases

Step 1 is assimilation of the first data stream,

Step 2 is assimilation of the second data stream,

Both data streams

The optimisation set-up for both models, including the true
parameter values, their range and the observation uncertainty (1

In this study we used synthetic observations that were generated by running
the model with known (or “true”) parameter values and adding random
Gaussian noise corresponding to the defined observation error for both

The true values of all parameters for both models are given in Table 1,
together with their upper and lower bounds (following Trudinger et al.,
2007). We have not performed a prior sensitivity analysis to decide which
parameters are important to include in the optimisation, as the model
variables are sensitive to all of the (small set of) parameters. However, in
the case of a more complex, large-scale LSM it is advisable to carry out such
an analysis, particularly given the computational burden of optimising many
parameters. In this study the parameter uncertainty (1

The specific objective of the following experiments was to test the impact
of a bias in the observations that is not accounted for in the

Table 2 details the experiments that were carried out based on all possible
combinations for assimilating the two data streams. Three approaches were
compared: (i) separate – where only one data stream was included in the
optimisation; (ii) step-wise – where each data stream was assimilated
sequentially (both orders:

List of experiments performed for both models with synthetic data. All parameters are optimised in all cases (therefore in both steps for the step-wise approach).

The differences in the parameter values and the theoretical reduction in
their uncertainty (1

In a second stage the impact of an unknown, unaccounted for bias in the
model was examined. This bias could be a systematic bias in the observations
due to the algorithm used for their derivation, the result of missing or
incomplete processes in the model, or an incompatibility between the
observations and the model, for example due to differences in spatial
resolution or an inconsistent characterisation of a variable between the
model and the observations. To test the impact of such an occurrence, we
introduced a constant scalar bias into the modelled

In all experiments for both models
we used 15 iterations of the inversion algorithm, and 20 assimilations were
performed starting from different random “first-guess” points in the
parameter space. As discussed in Sect. 3.1.3, this was done to test the
ability of the algorithm to converge to the global minimum of the cost
function. Note that the global minimum and possible reduction in

For all the above tests we assumed independence (i.e. uncorrelated errors)
for both the observation and parameter prior error covariance matrices; thus
the

Prior and posterior model simulations compared to the synthetic
observations for the simple carbon model for test case 3a for

The 20 random first-guess assimilations were examined for each set of
experiments for both models (before the results for each test were examined
in more detail), in order to check that the algorithm converged to a global
minimum. As shown in the Supplement (Fig. S1), a high proportion of the 20
first-guess assimilations across all test cases for both models resulted in a
similar reduction in

Figures 1a and b show the simple carbon model simulations for test case 3a
(in which both data streams are assimilated simultaneously) for the

Reduction in RMSE for all test cases for simulations with a bias in
the

In Sect. 3.2.1 we saw that there is little difference between a step-wise and
simultaneous optimisation if there is no bias in the model or observations,
and if the model is quasi-linear and therefore the critical assumptions
behind the inversion approach were not violated. However, it is not uncommon
to have a bias between your observations and model that is not obvious and,
therefore, not accounted for in the optimisation, as the cost function used in
most inversion algorithms (and in this study) assumes Gaussian error
distributions with a mean of zero. Note that this is also the case when
defining a likelihood function for accepting or rejecting parameter values in
a global search method. To test the impact of a bias, we added a constant
value to the simulated

The main impact of the bias in the modelled

Even though the posterior parameter values are incorrect, and despite the
fact that the first step results in a degradation, the final reductions in
RMSE are largely the same as the situation with no bias for all variables
when

Posterior parameter values of both the non-linear toy model

Prior and posterior model simulations compared to the synthetic
observations for the non-linear toy model (with no bias) for both the

The analysis of the impact of the bias presented here is specific to this
model and the type and magnitude of the bias that was added, but the broader
findings can be generalised to any situation in which there is a bias or
inconsistency between a model and data that is not accounted for in the
assigned error distributions. Exactly what might constitute a bias or
inconsistency is discussed more in Sect. 2.2. Note also that it is important
to examine the impact on the other variables. For the separate test case 1b
in which only

As discussed in Sect. 3.2.1, there is little difference between the step-wise and the simultaneous assimilation approaches for simple, relatively linear models, unless the observation error (including measurement and model errors) distribution deviates strongly from the Gaussian assumption. However in reality, large-scale, complex LSMs may contain highly non-linear responses to certain model parameters. To demonstrate the impact of non-linearity in a multiple data stream assimilation context we used a non-physically based toy model chosen for its non-linear characteristics (see Sect. 3.1.2).
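For reference, the strictly linear, Gaussian limit of this statement can be checked analytically. The sketch below assumes a hypothetical two-parameter linear model, two data streams with independent errors and a Gaussian prior; all values are illustrative. In that limit a step-wise assimilation that propagates the full posterior parameter error covariance matrix from step 1 to step 2 reproduces the simultaneous result, whereas propagating only the variances (discarding the off-diagonal terms) does not. The sketch says nothing about the non-linear case.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Product of a 2 x 2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def assimilate(x_prior, cov_prior, rows, obs, sigmas):
    """Linear Gaussian update for observations y = h . x + noise (independent errors)."""
    precision = inv2(cov_prior)
    rhs = matvec(precision, x_prior)
    for h, y, sigma in zip(rows, obs, sigmas):
        weight = 1.0 / sigma ** 2
        for i in range(2):
            rhs[i] += weight * h[i] * y
            for j in range(2):
                precision[i][j] += weight * h[i] * h[j]
    cov_post = inv2(precision)
    return matvec(cov_post, rhs), cov_post


x0, cov0 = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical streams: stream 1 sees the sum of the parameters, stream 2 only the first.
rows1, obs1, sig1 = [[1.0, 1.0]], [2.0], [1.0]
rows2, obs2, sig2 = [[1.0, 0.0]], [3.0], [1.0]

x_sim, _ = assimilate(x0, cov0, rows1 + rows2, obs1 + obs2, sig1 + sig2)

x1, cov1 = assimilate(x0, cov0, rows1, obs1, sig1)
x_step, _ = assimilate(x1, cov1, rows2, obs2, sig2)  # full covariance propagated

cov1_diag = [[cov1[0][0], 0.0], [0.0, cov1[1][1]]]
x_diag, _ = assimilate(x1, cov1_diag, rows2, obs2, sig2)  # variances only

print("simultaneous:            ", x_sim)
print("step-wise, full matrix:  ", x_step)
print("step-wise, diagonal only:", x_diag)
```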

Figure 4a shows the posterior parameter values for both the

Assimilating each data stream individually (test cases 1a and b)
results neither in an accurate retrieval of the posterior parameters (Fig. 4a) nor in
a strong constraint on either parameter, as shown by the lack of theoretical
reduction in the parameter uncertainty after the optimisation (Fig. 4b).
Despite this, there is a

Only the simultaneous case, in which all

In the simultaneous optimisation in which all observations are included (test
case 3a), the posterior fit to the data dramatically improves for both the

In the second step the optimisation of

The fact that the final reduction in RMSE values after both steps was

Comparing the step-wise cases 2a and b with 2c and d for the non-linear toy
model reveals that neither order in the assimilation,

Reduction in RMSE for all test cases for both

From a mathematical standpoint the most rigorous approach is to propagate the
full parameter error covariance matrices between each step. Without that
constraint not only is information lost in the second step, but the
information contained in the second data stream may have a stronger influence
compared to a simultaneous assimilation, or step-wise case with a propagated
error covariance matrix. The inversion may therefore be more vulnerable to
any strong biases or incompatibilities between the model and the observations
of the second data stream, or indeed the particular sensitivity of its
corresponding model state variable to the parameters. This is one possible
explanation for the degradation seen in

However, the reverse is also true – if the first data stream contains strong
biases, then the associated error correlations will also be propagated with

In a final test we introduced time-invariant correlated noise between the two
data streams (see Sect. 3.1.6). We investigated the impact of ignoring
cross-correlation between two data streams by comparing the results of (i) an
optimisation in which the correlated errors were included in the off-diagonal
elements of the prior observation error covariance matrix,

Median difference (across 20 first-guess parameters) between
including correlated observation errors in the

The presence of correlated errors increases observation redundancy in the
inversion, which would therefore reduce the expected theoretical error
reduction compared to uncorrelated observations (experiments not shown). We
would expect a further limitation on the expected error reduction with a
sub-optimal system, as represented by optimisation (ii) in which there was
cross-correlation between the data streams, but the correlated observation
errors were ignored in the

Figure 7 shows the difference between the two optimisations (i.e. including
off-diagonal elements in the

At low observation error there is no discernible difference between
accounting for the correlated observation errors in the

The key finding of this preliminary investigation into the impact of correlated observation errors is that it becomes increasingly important to properly characterise and account for correlations between data streams if the observations do not contain enough information (i.e. high observation uncertainty or a limited number of observations). However, this is a wide topic that has received little to no attention in the carbon cycle data assimilation literature to date, aside from 2 out of 21 experiments in the wider-ranging study of Trudinger et al. (2007). We therefore suggest that an investigation such as this should be extended in order to fully understand the impact of cross-correlation between data streams; however, this is beyond the scope of this paper.
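The redundancy argument can be illustrated with a deliberately simple linear example, which is an assumption-laden sketch rather than a reproduction of the experiments in this study. A single parameter is observed once by each of two data streams, both streams respond to the parameter in the same way, and their errors have a correlation `rho`. For this configuration a positive correlation reduces the information content of the pair of observations, so the posterior variance is larger than for uncorrelated errors, and ignoring the correlation gives an over-confident posterior variance. With other sensitivities (for example, streams responding in opposite directions) the effect of the correlation can differ.

```python
def obs_information(sigma1, sigma2, rho):
    """h^T R^-1 h for two observations of one parameter, with h = [1, 1]."""
    cov12 = rho * sigma1 * sigma2
    det = sigma1 ** 2 * sigma2 ** 2 - cov12 ** 2
    r_inv = [
        [sigma2 ** 2 / det, -cov12 / det],
        [-cov12 / det, sigma1 ** 2 / det],
    ]
    return r_inv[0][0] + r_inv[0][1] + r_inv[1][0] + r_inv[1][1]


def posterior_variance(prior_var, sigma1, sigma2, rho):
    """Posterior variance of the parameter in the linear Gaussian case."""
    return 1.0 / (1.0 / prior_var + obs_information(sigma1, sigma2, rho))


prior_var = 1.0  # hypothetical prior variance
for sigma in (0.1, 1.0, 3.0):  # hypothetical observation errors
    accounted = posterior_variance(prior_var, sigma, sigma, rho=0.5)
    ignored = posterior_variance(prior_var, sigma, sigma, rho=0.0)
    print(f"sigma = {sigma}: variance with correlation {accounted:.4f}, "
          f"ignoring it {ignored:.4f}")
```

In this example the gap between the two variances widens as the observation error grows relative to the prior uncertainty.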

Although it is clear that in many cases the addition of different
observations in a model optimisation provides additional constraints,
challenges remain that need to be considered. Many of the issues that we have investigated are relevant to any
data assimilation study, including those only using one data stream. However,
most are more pertinent when considering more than one source of data. Based
on the simple toy model results presented here, in addition to lessons
learned from existing studies, we recommend the following points when
carrying out multiple data stream carbon cycle data assimilation
experiments:

If technical constraints require that a step-wise approach be used, it is preferable (from a mathematical standpoint) to propagate the full parameter error covariance matrix between each step. Furthermore, it is important to check that the order of assimilation of observations does not affect the final posterior parameter values, and that the fit to the observations included in the previous steps is not degraded after the final step (e.g. Peylin et al., 2016).

Devote time to carefully characterising the parameter and observation error covariance matrices, including their correlations (Raupach et al., 2005), although we appreciate this is not an easy task (but see Kuppel et al., 2013, for practical solutions). In the context of multiple data stream assimilation, accounting for the error correlations between data streams is increasingly important with higher observation uncertainty (or a limited number of observations). Note that it is not possible to account for error correlations between data streams in a step-wise assimilation.

The presence of a bias in a data stream, or an incompatibility between
the observations and the model, will limit the utility of using multiple
observation types in an assimilation framework. Therefore it is imperative to
analyse and correct for biases in the observations and to determine whether
there is an incompatibility or inconsistency between the model and data.
Alternatively, it may be possible to account for any bias/inconsistency
in the observation error covariance matrix,

Most optimisation studies with a large-scale LSM require the use of
derivative-based algorithms based on a least-squares formulation of the cost
function and, therefore, rely on assumptions of Gaussian error distributions
and quasi-model linearity. However, if these assumptions are not met it may
not be possible to find the true global minimum of the cost function and the
resultant calculation of the posterior probability distribution will be
incorrect. This is a particular problem if the posterior parameter error
covariance matrix is propagated multiple times in a step-wise approach,
although these issues are relevant to both step-wise and simultaneous
assimilation. Therefore it is important to assess the non-linearity of your
model, and if the model is strongly non-linear, use global search algorithms
for the optimisation – although at the resolution of typical LSM simulations
(

Several diagnostic tests exist to help infer the relative level of constraint brought about by different data streams, including the observation influence and degrees of freedom of signal metrics (Cardinali et al., 2004). Performing these tests was beyond the scope of this study, particularly given that the simple toy models contained so few parameters, but such tests may be instructive when optimising many hundreds of parameters in a large-scale LSM with a number of different data streams. Furthermore, we strongly suggest performing synthetic experiments with pseudo-observations, as in this study, as such tests can help determine the possible constraint brought by different data streams, and the impact of a possible bias and of inconsistencies between observations or between the observations and the model.
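Two of the recommendations above (propagating the full parameter error covariance matrix in a step-wise assimilation, and checking the order of assimilation) can be illustrated with a minimal synthetic sketch. It assumes a linear model, Gaussian errors and no error correlation between the two data streams, so it is not representative of the non-linear cases discussed in this paper. The function `assimilate`, the operators `H1` and `H2` and all numerical values are hypothetical. The last lines compute a degrees of freedom for signal value as the number of parameters minus tr(B⁻¹A), which is one common definition and is given here only as an illustration of the kind of diagnostic mentioned above.

```python
import numpy as np

def assimilate(x_b, B, H, R, y):
    """One linear Gaussian least-squares update.
    Returns the posterior parameter vector and its error covariance."""
    A = np.linalg.inv(np.linalg.inv(B) + H.T @ np.linalg.inv(R) @ H)
    x_a = x_b + A @ H.T @ np.linalg.inv(R) @ (y - H @ x_b)
    return x_a, A

# Hypothetical synthetic experiment with known "true" parameters
rng = np.random.default_rng(1)
n_par = 3
x_true = np.array([1.0, -0.5, 2.0])
x_b, B = np.zeros(n_par), np.eye(n_par)
H1 = rng.normal(size=(6, n_par)); R1 = 0.2**2 * np.eye(6)
H2 = rng.normal(size=(4, n_par)); R2 = 0.5**2 * np.eye(4)
y1 = H1 @ x_true + rng.normal(scale=0.2, size=6)
y2 = H2 @ x_true + rng.normal(scale=0.5, size=4)

# Simultaneous: both streams in one optimisation (block-diagonal R)
H = np.vstack([H1, H2])
R = np.block([[R1, np.zeros((6, 4))], [np.zeros((4, 6)), R2]])
x_sim, A_sim = assimilate(x_b, B, H, R, np.concatenate([y1, y2]))

# Step-wise, propagating the full posterior covariance between steps
x_1, A_1 = assimilate(x_b, B, H1, R1, y1)
x_step, A_step = assimilate(x_1, A_1, H2, R2, y2)

# Step-wise in the reverse order, to check the order does not matter
x_2, A_2 = assimilate(x_b, B, H2, R2, y2)
x_rev, A_rev = assimilate(x_2, A_2, H1, R1, y1)

# Step-wise, propagating only the variances (off-diagonal terms dropped)
x_diag, A_diag = assimilate(x_1, np.diag(np.diag(A_1)), H2, R2, y2)

print("simultaneous:        ", x_sim)
print("step-wise (full cov):", x_step)
print("step-wise (reversed):", x_rev)
print("step-wise (variances):", x_diag)

# Degrees of freedom for signal of the simultaneous assimilation
dfs = n_par - np.trace(np.linalg.inv(B) @ A_sim)
print("degrees of freedom for signal:", dfs)
```

In this linear Gaussian setting the step-wise result with full covariance propagation matches the simultaneous result in either order, whereas propagating only the variances generally does not. With a non-linear model or non-Gaussian errors this equivalence is not guaranteed, which is why the checks recommended above remain necessary.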

Aside from multiple data stream assimilation, other promising directions could also be considered to constrain the problem of lack of information in resolving the parameter space within a data assimilation framework, including the use of other ecological and dynamical “rules” that limit the optimisation (see for example Bloom and Williams, 2015), or the addition of different timescales of information extracted from the data such as annual sums (e.g. Keenan et al., 2012). Finally we should also seek to develop collaborations with researchers in other fields who may have advanced further in a particular direction. Members of the atmospheric and hydrological modelling communities, for example, have implemented techniques for inferring the properties of the prior error covariance matrices, including the mean and variance, as well as potential biases, autocorrelation and heteroscedasticity, by including these terms as “hyper-parameters” within the inversion (e.g. Michalak et al., 2005; Evin et al., 2014; Renard et al., 2010; Wu et al., 2013). Of course this extends the parameter space – making the problem harder to solve unless sufficient prior information is available (Renard et al., 2010), but such avenues are worth exploring.

In this study we have attempted to highlight and discuss some of the challenges associated with using multiple data streams to constrain the parameters of LSMs, with a particular focus on the carbon cycle. We demonstrated some of the issues using two simple models constrained with synthetic observations for which the “true” parameters are known. We performed a variety of tests in Sect. 3 to demonstrate the differences between assimilating each data stream separately, sequentially (in a step-wise approach) and together in the same assimilation (simultaneous approach). In particular we focused on difficulties that may arise in the presence of biases or inconsistencies between the data and the model, as well as non-linearity in the model equations and the importance of accounting for observation error correlations.

Many of the issues faced are inherent to all optimisation experiments, including those in which only one data stream is used. It is of utmost importance to determine whether the observations contain biases and/or whether inconsistencies or incompatibilities exist between the model and the observations, and to correct for this or properly account for this in the error covariance matrices. Secondly, it is crucial to understand the assumptions and limitations related to the inversion algorithm used. Unless these two points are addressed, there is a greater risk of obtaining incorrect parameter values, which may not be obvious from examining the posterior uncertainty and model–data RMSE reduction. Furthermore, it is more likely that the implementation of a step-wise vs. simultaneous approach will lead to different results. Finally, we note that the consequence of not accounting for cross-correlation between data streams in the prior error covariance matrix becomes more critical with higher observation uncertainty.
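The point that incorrect parameter values may not be visible in the posterior uncertainty can be illustrated with a minimal sketch, again assuming a linear model with Gaussian errors rather than the models used in this study. The operator `H`, the constant offset `bias` and all numerical values are hypothetical. In the linear case the posterior error covariance does not depend on the observed values, so it is identical with and without the bias, while the estimated parameters shift by the gain matrix applied to the bias.

```python
import numpy as np

# Hypothetical synthetic experiment with known "true" parameters
rng = np.random.default_rng(2)
n_par, n_obs = 2, 8
x_true = np.array([0.8, 1.5])
x_b, B = np.zeros(n_par), np.eye(n_par)
H = rng.normal(size=(n_obs, n_par))
sigma = 0.3
R = sigma**2 * np.eye(n_obs)
noise = rng.normal(scale=sigma, size=n_obs)
bias = np.full(n_obs, 1.0)   # assumed constant offset in the data stream

# Posterior covariance and gain: neither depends on the observed values
A = np.linalg.inv(np.linalg.inv(B) + H.T @ np.linalg.inv(R) @ H)
K = A @ H.T @ np.linalg.inv(R)

# Same noise realisation, with and without the bias
y_clean = H @ x_true + noise
y_biased = y_clean + bias
x_clean = x_b + K @ (y_clean - H @ x_b)
x_biased = x_b + K @ (y_biased - H @ x_b)

shift = x_biased - x_clean        # parameter shift caused by the bias
post_sd = np.sqrt(np.diag(A))     # reported posterior standard deviations

print("posterior s.d. (same in both cases):", post_sd)
print("estimate without bias:", x_clean)
print("estimate with bias:   ", x_biased)
print("shift due to bias:    ", shift)
```

Comparing `shift` with `post_sd` gives a rough indication of whether the bias moves the estimate by more than the uncertainty that the inversion reports.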

This study could not examine an exhaustive list of all the challenges that may be faced when assimilating multiple data streams, but we hope that this tutorial-style paper will serve as a guide for those wishing to optimise the parameters of LSMs using the variety of C-cycle-related observations that are available today. We also hope that by increasing awareness of the possible difficulties of model–data integration we can bring the modelling and experimental communities more closely together to focus on these issues.

The model and inversion code is available via the ORCHIDEE LSM Data
Assimilation System (ORCHIDAS) website:

We acknowledge the support from the International Space Science Institute (ISSI). This publication is an outcome of the ISSI's Working Group on “Carbon Cycle Data Assimilation: How to Consistently Assimilate Multiple Data Streams”. Natasha MacBean was also funded by the GEOCARBON Project (ENV.2011.4.1.1-1-283080) within the European Union's 7th Framework Programme for Research and Development. The authors wish to thank collaborators in the atmospheric inversion and carbon cycle DA communities with whom they have had numerous past conversations that have led to an improvement in their understanding of the issues presented here. Finally we thank the two anonymous referees whose comments have helped to improve the clarity and breadth of the manuscript.

Edited by: C. Sierra
Reviewed by: two anonymous referees