We use a variational method to assimilate multiple data streams into the terrestrial ecosystem carbon cycle model DALECv2 (Data Assimilation Linked Ecosystem Carbon). Ecological and dynamical constraints have recently been introduced to constrain unresolved components of this otherwise ill-posed problem. Here we recast these constraints as a multivariate Gaussian distribution to incorporate them into the variational framework and we demonstrate their advantage through a linear analysis. Using an adjoint method we study a linear approximation of the inverse problem: firstly we perform a sensitivity analysis of the different outputs under consideration, and secondly we use the concept of resolution matrices to diagnose the nature of the ill-posedness and evaluate regularisation strategies. We then study the non-linear problem with an application to real data. Finally, we propose a modification to the model: introducing a spin-up period provides us with a built-in formulation of some ecological constraints which facilitates the variational approach.

Carbon is a fundamental constituent of life and understanding its global cycle is a key challenge for the modelling of the Earth system. Through the processes of photosynthesis and respiration, ecosystems play a major role in the carbon cycle and thus in the dynamics of the global climate system. Our knowledge of the biogeochemical processes of ecosystems and an ever-growing number of Earth observation systems can be combined using inverse modelling strategies to improve model predictions and uncertainty quantification.

The Data Assimilation Linked Ecosystem Carbon (DALEC) model is a simple box model
for terrestrial ecosystems simulating a large range of processes occurring at
different timescales from days to millennia. The work of

As with many inverse problems, assimilating Earth observations into DALEC is an ill-posed problem: the model–observation operator which relates parameters and initial carbon stocks to the observations is rank deficient and not all variables can be estimated, or the model–observation operator is ill-conditioned and small observational noise may lead to a solution we can have little confidence in. Solving the problem amounts first to transforming it into a tractable problem in order to ensure a robust, meaningful and stable solution. This can be achieved by using regularisation techniques; the most popular one involves combining the observations and prior information, assuming it exists, through Bayesian inference. The choice of regularisation method depends on the nature of the problem and on the inverse modelling approach adopted.
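The effect of ill-conditioning can be illustrated with a toy 2×2 linear system that is unrelated to DALEC itself. In the sketch below the matrix `G`, the true state and the noise level are illustrative assumptions; the point is only that a perturbation of the data at the 1e-4 level changes the recovered state by order one.

```python
# Toy illustration (not DALEC): a nearly rank-deficient 2x2 operator G.
# All numbers are hypothetical and chosen only to make the effect visible.

def solve_2x2(G, y):
    """Solve G x = y exactly for a 2x2 matrix using Cramer's rule."""
    (a, b), (c, d) = G
    det = a * d - b * c
    return [(d * y[0] - b * y[1]) / det, (a * y[1] - c * y[0]) / det]

G = [[1.0, 1.0],
     [1.0, 1.0001]]          # rows are almost identical -> ill-conditioned
x_true = [1.0, 1.0]
y_clean = [G[0][0] * x_true[0] + G[0][1] * x_true[1],
           G[1][0] * x_true[0] + G[1][1] * x_true[1]]

# Perturb the second observation by a small amount of "noise".
noise = 1e-4
y_noisy = [y_clean[0], y_clean[1] + noise]

x_clean = solve_2x2(G, y_clean)
x_noisy = solve_2x2(G, y_noisy)

# Largest error in the recovered state, relative to the size of the noise.
error = max(abs(x_noisy[i] - x_true[i]) for i in range(2))
amplification = error / noise
print("clean solution:", x_clean)
print("noisy solution:", x_noisy)
print("noise amplification factor:", amplification)
```

Without noise the true state is recovered; with noise the two components move far from their true values while still fitting the data, which is the behaviour that regularisation is meant to control.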

So far, off-the-shelf methods such as the ensemble Kalman filter (EnKF) and
Markov chain Monte Carlo (MCMC) have been adopted to perform model–data fusion with
DALEC. Owing to its ability to accommodate non-linearity and any kind of
probability distribution, the MCMC method, in the limit of a large number of
samples, may be considered the gold standard. However, despite being well suited to this type of small-scale problem, the computational cost of
the MCMC method makes it intractable for more complex situations. Here we adopt a
variational approach (4DVAR) where a cost function measuring the mismatch
between the model and observations is minimised using a gradient method based
on the adjoint of the model. At AmeriFlux sites (see

The paper is organised as follows. In Sect. 2 we present DALECv2 and the
observation streams used in this study, review the EDCs introduced in

DALECv2 depicts a terrestrial ecosystem as a set of six carbon pools (labile

DALECv2 links the carbon pools (C) via allocation fluxes (green), litterfall fluxes (red) and decomposition (black). Respiration is represented by the blue arrows. The orange arrow represents the feedback of foliar carbon to gross primary production (GPP).

DALECv2 dynamical variables and parameters with their respective ranges. The units of the dimensional quantities are given in brackets.

The meteorological drivers are extracted from 0.125

In the remainder of the paper the main focus is on the vector

Over the last decade many inverse modelling studies have used NEE measurements from the FLUXNET network, together with other types of observations when available, to provide information about processes controlled by parameters to which NEE is only weakly sensitive. Though it contains an ever-increasing amount of information, the flux tower network only provides sparse coverage of terrestrial ecosystems. On the other hand, despite good spatial and temporal coverage, MODIS LAI monthly mean observations only constrain a limited set of DALECv2 state variables, and additional information is required in order to regularise the ill-posed problem and obtain a meaningful solution.

Additional information can be obtained by imposing priors on the variables or
by adding other observation streams (biomass, soil organic matter, etc.). As
an alternative,

Turnover rate constraints, which ensure that turnover rate ratios are
consistent with knowledge of the carbon pool residence times.

where

Root–foliar allocation which allows for a strong correlation between
parameters controlling allocation to foliage and roots.

where the allocation fractions

Root–foliar mean dynamics

where

The yearly growth rate of each carbon pool is limited to 10 %.

where for each pool

Carbon pools are not expected to show rapid exponential decay;
therefore,
parameter sets are required to satisfy the condition that the half-life
period of carbon pools is more than 3 years.

The trajectory of each carbon pool is approximated using an exponential decay
curve

Carbon pools are expected to be within an order of magnitude of a
steady-state attractor.

where for each of the carbon pools

where

Mean normalised sensitivities (MNS): 100 parameter sets satisfying the EDCs
are sampled at the Morgan Monroe State Forest. Parameters are ranked in
decreasing order according to their sensitivity. The blue dots represent the
mean of the MNS (dimensionless quantity), the intervals represent
1

Sensitivity analysis studies how the variations of the output

We denote

We consider the Morgan Monroe State Forest over a 12-year period. We
sample 100 parameter sets satisfying the ecological constraints. For each
parameter set, we compute the MNS for DALEC simulated mean fluxes LAI and NEE. In
Fig.

Here our focus is on the mean of the time series of DALEC fluxes (LAI, NEE)
over a 12-year period. Finer analysis could be carried out by looking
at seasonal aspects of the carbon cycle, identifying what variables are the
most sensitive at certain times of the year, for example as studied in

In this section we introduce concepts and methods that allow for close mathematical scrutiny of inverse problems and we present the variational method that we will apply in the following sections.

A generic inverse problem consists of finding a

Inverse problems are generally presented in a probabilistic framework where most methods can be expressed through a Bayesian formulation. The Bayesian approach provides a full characterisation of all possible solutions, their relative probabilities and uncertainties.
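For reference, in the standard Gaussian setting this Bayesian formulation leads to the familiar variational cost function. The block below is a generic sketch in assumed notation, which may differ from the symbols and term ordering used in the paper's own equations: $\mathbf{x}$ is the state (parameters and initial carbon stocks), $\mathbf{x}_b$ and $\mathbf{B}$ are the prior mean and covariance, $\mathbf{y}$ and $\mathbf{R}$ are the observations and their error covariance, and $h$ is the model–observation operator.

```latex
\begin{aligned}
p(\mathbf{x}\mid\mathbf{y}) &\propto p(\mathbf{y}\mid\mathbf{x})\,p(\mathbf{x}), \\
J(\mathbf{x}) &= \tfrac{1}{2}(\mathbf{x}-\mathbf{x}_b)^{\mathrm T}\mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_b)
  + \tfrac{1}{2}\bigl(h(\mathbf{x})-\mathbf{y}\bigr)^{\mathrm T}\mathbf{R}^{-1}\bigl(h(\mathbf{x})-\mathbf{y}\bigr), \\
\nabla J(\mathbf{x}) &= \mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_b)
  + \mathbf{H}^{\mathrm T}\mathbf{R}^{-1}\bigl(h(\mathbf{x})-\mathbf{y}\bigr).
\end{aligned}
```

Here $\mathbf{H}$ denotes the tangent linear of $h$ at $\mathbf{x}$ and $\mathbf{H}^{\mathrm T}$ its adjoint. When both the prior and the likelihood are Gaussian, $J$ is the negative logarithm of the posterior PDF up to an additive constant, so its minimiser is the maximum a posteriori estimate.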

From Bayes' theorem, the probability density function (PDF) of the model state

The first term in the cost function (Eq.

From an optimisation point of view, the EDCs can easily be incorporated
by considering an inequality-constrained optimisation problem where
we aim at solving

We are seeking a multivariate Gaussian distribution that would encode the
EDCs. At a forest site, we start by sampling the parameter space to obtain an
ensemble of 1000 parameter sets satisfying the EDCs; each parameter set

Using Bayes' theorem we can then write

Distribution and Gaussian fit for EDC
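The construction described above (sample parameter sets that satisfy the constraints, then summarise the ensemble by a Gaussian) can be sketched on a toy problem. The two-parameter space, its bounds and the inequality used below are hypothetical stand-ins for the DALECv2 parameters and the EDCs, and any transformation of the variables applied before the fit in the actual study is not reproduced here.

```python
import random

# Toy sketch: summarise a constrained parameter ensemble by a Gaussian.
# The bounds and the inequality below are hypothetical stand-ins for the EDCs.
random.seed(0)  # fixed seed so the sketch is reproducible

LOWER, UPPER = 0.0, 1.0   # assumed prior bounds for both parameters
N_SAMPLES = 1000          # ensemble size quoted in the text

def satisfies_constraints(p):
    """Hypothetical constraint: p[1] stays within a factor of 2 of p[0]."""
    return 0.5 * p[0] <= p[1] <= 2.0 * p[0]

def sample_ensemble(n):
    """Rejection sampling: draw uniformly in the bounds, keep feasible sets."""
    accepted = []
    while len(accepted) < n:
        p = [random.uniform(LOWER, UPPER) for _ in range(2)]
        if satisfies_constraints(p):
            accepted.append(p)
    return accepted

def gaussian_fit(samples):
    """Return the sample mean and the unbiased sample covariance matrix."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dim)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    return mean, cov

ensemble = sample_ensemble(N_SAMPLES)
mean, cov = gaussian_fit(ensemble)
print("mean:", mean)
print("covariance:", cov)
```

The off-diagonal covariance term is positive: the correlation implied by the inequality is what the Gaussian term carries into the cost function. A Gaussian can only approximate the feasible region, so this encoding acts as a soft constraint rather than a hard one.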

Considerable theoretical insights into the nature of the inverse problem, and
the ill-posedness, can be obtained by studying a linearisation of the
operator

We consider a singular value decomposition of

As an example we consider the problem of assimilating NEE observations into
DALECv2 to estimate model parameters and initial conditions at Morgan Monroe
State Forest. We linearise Eq. (

Singular values of the observability matrix for NEE (log scale).

For a signal-to-noise ratio, namely

Solution of the linearised inverse problem for NEE. The red points represent the observations, the red curve is the true trajectory, the green curve is the trajectory obtained using the unstable solution and the blue curve is obtained using the truncated singular value decomposition (TSVD) solution.

Results of the linear inverse problem showing (1) the solution
components for the least squares solution

To reduce the impact of observational noise on the solution, regularisation
is required. The truncated singular value decomposition (TSVD) is a
simple and popular method for regularisation. TSVD consists of truncating the
pseudo-inverse in Eq. (

In our example with NEE we use Hansen's regularisation tools (see

In the next section we consider the concept of a resolution matrix, which allows for finer analysis of the solution of the linear problem.

As suggested by Eqs. (

While the model resolution matrix allows us to see how a solution strategy
maps the true state variables to the solution of the inverse problem, and to
see how well and how independently the state variables can be recovered, one
also needs to assess the uncertainty of the solution. This can be studied
using the so-called

We now study the model resolution matrix for the LAI observation operator at Morgan Monroe State Forest. In the first case we demonstrate the resolving power of the LAI signal without regularisation, using the pseudo-inverse as generalised inverse, and then apply TSVD to show how the truncated pseudo-inverse affects resolution. In the second case we study the added value of the EDCs in terms of resolution.
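In SVD notation these quantities take a compact form. The block below is a sketch using assumed symbols, which may differ from those in the paper's equations: $\mathbf{G}$ is the linearised operator, $p$ its numerical rank, $k \le p$ the truncation rank, $\mathbf{R}_m$ the model resolution matrix and $\mathbf{C}_u$ the unit covariance matrix (the solution covariance for uncorrelated observation errors of unit variance). The subscripts $p$ and $k$ denote restriction to the leading $p$ or $k$ singular triplets.

```latex
\begin{aligned}
\mathbf{G} &= \mathbf{U}\mathbf{S}\mathbf{V}^{\mathrm T}, &
\mathbf{G}^{\dagger} &= \mathbf{V}_p \mathbf{S}_p^{-1} \mathbf{U}_p^{\mathrm T}, &
\mathbf{G}_k^{\dagger} &= \mathbf{V}_k \mathbf{S}_k^{-1} \mathbf{U}_k^{\mathrm T}, \\
\mathbf{R}_m &= \mathbf{G}^{\dagger}\mathbf{G} = \mathbf{V}_p\mathbf{V}_p^{\mathrm T}, &
\mathbf{R}_{m,k} &= \mathbf{G}_k^{\dagger}\mathbf{G} = \mathbf{V}_k\mathbf{V}_k^{\mathrm T}, &
\mathbf{C}_u &= \mathbf{G}^{\dagger}\bigl(\mathbf{G}^{\dagger}\bigr)^{\mathrm T}
  = \mathbf{V}_p\mathbf{S}_p^{-2}\mathbf{V}_p^{\mathrm T}.
\end{aligned}
```

A diagonal entry of the resolution matrix close to 1 indicates a variable that is well and independently resolved. Lowering the truncation rank $k$ moves the resolution matrix further from the identity but removes the largest terms $s_i^{-2}$ from the unit covariance, which is the trade-off between resolution and stability.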

Model resolution matrix for the LAI operator.

As previously, we linearise Eq. (

Diagonal elements (log scale) of the unit covariance matrix for the LAI operator: using the pseudo-inverse shown in green and TSVD shown in yellow.

For the study of the unit covariance matrix we restrict ourselves to the
sensitive variables. This amounts to removing the columns corresponding to
the non-sensitive variables, containing only null elements, from the
observability matrix. The dependency of the solution on observational noise can be studied by
looking at Fig.

As previously, we illustrate a simple regularisation strategy, TSVD, and show
its effects on both resolution and stability. Figure

We now consider the effect of incorporating the static EDCs into the
variational framework in terms of resolution. The static EDCs are given by
the first seven EDCs; the linear problem is then given by

Model resolution matrix for the LAI operator using TSVD with
truncation rank

This example shows clearly the benefit of introducing the static EDCs to help estimate poorly constrained or otherwise undetermined components.
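The same mechanism can be seen on a toy linear problem that is unrelated to the DALECv2 operator. In the sketch below all matrices are hypothetical, and the constraint is added as an exact linear row, which is a simplification of the Gaussian encoding of the EDCs used in this study. A single observation senses only the sum of two variables, so neither is resolved independently; adding a constraint on their difference resolves both, while a third variable that appears in neither row remains unresolved.

```python
# Toy sketch (not the DALECv2 operator): how an extra constraint row changes
# the model resolution matrix R = G^+ G. For a full-row-rank G the
# pseudo-inverse is G^+ = G^T (G G^T)^{-1}. All matrices are hypothetical.

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]

def resolution_matrix(G):
    """R = G^T (G G^T)^{-1} G, valid when G has full row rank."""
    Gt = transpose(G)
    return matmul(matmul(Gt, inverse(matmul(G, Gt))), G)

# One observation that only senses the sum of the first two variables.
G_obs = [[1.0, 1.0, 0.0]]
# Same observation plus a hypothetical constraint on their difference.
G_aug = [[1.0, 1.0, 0.0],
         [1.0, -1.0, 0.0]]

R_obs = resolution_matrix(G_obs)
R_aug = resolution_matrix(G_aug)
print("diag(R), observation only:", [R_obs[i][i] for i in range(3)])
print("diag(R), with constraint :", [R_aug[i][i] for i in range(3)])
```

The diagonal of the resolution matrix for the first two variables increases once the constraint row is added, whereas the entry for the third variable stays at zero: a constraint only helps the components it actually involves.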

Model resolution matrix for LAI and static EDCs as defined by
Eq. (

We now consider a real experiment at the Morgan Monroe State Forest. At this AmeriFlux site, 12 years of MODIS LAI monthly mean observations from 2001 to 2013 are available, together with NEE, GPP and thus RESP observations from 2001 to 2005. Our goal is to study two different aspects. The first is the impact of using multiple data streams: how does it affect the uncertainty of the predicted fluxes and how well do we predict non-observed fluxes? The second is the use of the static EDCs and their utility in constraining poorly sensitive variables.

When including all terms the cost function,

We perform six experiments summarised in Table

Summary of the experimental set-up: in Exp. 1 we use LAI and bound constraints (BDS), in Exp. 2 we use LAI, NEE and BDS, and so on.

DALECv2 monthly estimates for LAI and NEE at Morgan Monroe State Forest. The red dots are the observations, the blue trajectories are obtained using the 4DVAR analysis, and the grey trajectories are ensemble runs obtained from a 95 % confidence sample of the posterior PDF.

The results of the experiments are presented in Table

Figures

Costs for the results of the inverse modelling experiments. The last column reports the computation time in seconds for the experiment.

DALECv2 monthly estimates for GPP and RESP at Morgan Monroe State Forest. The red dots are the observations, the blue trajectories are obtained using the 4DVAR analysis, and the grey trajectories are ensemble runs obtained from a 95 % confidence sample of the posterior PDF.

Posterior parameter distributions for parameters

In the previous section we used EDCs within 4DVAR and showed their advantage in drastically reducing the uncertainty of otherwise undetermined variables. However, we only included the static EDCs, which do not require a model run. Including more EDCs often leads to convergence issues, so the resulting solution and its uncertainty must be treated with caution.

As shown in

Results of the inverse modelling experiments. The solution components
together with their relative variance, in brackets, are given for each experiment.
The vector

Pseudo-periodical seasonal cycle for DALECv2. Using a given set of
parameters and initial values for

Here we consider ecosystems with no recent major disturbance, where the fast
carbon pools are expected to be close to their pseudo-periodical cycle. To
model these ecosystems, one can either restrict the parameter space by using
the dynamic EDCs or introduce a spin-up period during which the
carbon pools reach their attractor. Given parameters

DALEC-SP offers several advantages: some of the EDCs such as those
controlling the growth and the half-life period of carbon pools are almost
automatically satisfied. This greatly reduces the time required to generate
the PDF

To our knowledge, this paper presents the first application of variational
methods for an inverse modelling experiment using DALEC. Over the last
15 years many studies have validated the use of DALEC together with
various types of data streams to infer ecological parameters at the site
level. However, first the ensemble Kalman filter and then Monte Carlo methods
were favoured. At the same time 4DVAR has been successfully used at the global scale to
constrain ecosystem parameters in the carbon cycle data assimilation system (CCDAS).
In

The complexity of global-scale experiments still limits the application of
fully non-linear methods such as MCMC. In

As with most variational methods, the analysis and application presented in this paper rely heavily on the possibility of deriving the tangent linear model and its adjoint. DALECv2 was designed to take this requirement into account, in particular by replacing the phenology process of the DALEC deciduous model in order to obtain differentiable processes. The model resolution matrix and the gradient of the cost function, including the additional term encoding the EDCs, are computed using adjoint techniques. Despite the increasing capabilities offered by automatic differentiation tools, deriving and maintaining an adjoint code can be a complicated task, and, besides its limiting hypothesis, this is certainly one of the main reasons for choosing alternatives to 4DVAR. In a forthcoming paper, we use ensemble methods to approximate the gradient of the cost function and to derive approximate resolution matrices, and we reproduce the experiments presented in this paper. The approach, which no longer requires the adjoint, shows very promising results: firstly in terms of estimating parameters, and secondly in terms of computation time by using graphics processing units (GPUs) to perform massively parallel computations.

Designing a global-scale experiment involving a coupling between DALEC and a
transport model has been considered but is still at an early stage. As
presented in

We used DALECv2 and combined multiple data streams – MODIS monthly LAI and monthly NEE, GPP and RESP at an AmeriFlux site – together with ecological constraints to estimate model parameters and initial conditions and to provide uncertainty characterisation for predicted fluxes. DALECv2 is a simple model. It represents the basic processes at the heart of more sophisticated models of the carbon cycle, and, in addition to its considerable modelling skill, its simplicity allows for close mathematical scrutiny. Here we adopted a variational approach where the tangent linear model and its adjoint play a major role in (1) facilitating a linear analysis which allows one to understand the nature of the ill-posed problem and to evaluate strategies to regularise it and (2) finding a posterior distribution for the state variables.

We performed a sensitivity analysis using a direct method that consists of studying the first-order derivatives of the output computed using an adjoint method. A sensitivity analysis is a prerequisite to any work with a model, but there is a paucity of literature on this topic in connection with DALEC. Our analysis reveals generic issues that will be encountered in many inverse modelling strategies. Studying the first-order inverse problem, we discussed how noise affects the stability of the solution and we illustrated a simple regularisation method. We then introduced the notion of a model resolution matrix and showed how this can be used to diagnose the ill-posedness of an inverse problem and evaluate the result of regularisation strategies. While some of our findings may be anticipated in the framework of a simple model, it is important to describe these tools and their interpretation, as similar analyses can be readily applied to a wide range of more complex models.

This study did not aim to provide an exhaustive account of the capability of variational tools or to explore all aspects of the EDCs for the inverse problem for DALEC. The objectives were to use 4DVAR and show that it offers a suitable framework to solve the inverse problem for DALEC efficiently, robustly and quickly, and to present a methodology to analyse some issues that affect most methods based on Bayesian inference.

The model and inversion code, together with the drivers, observational data
and experiment results are available at

The authors declare that they have no conflict of interest.

The authors would like to thank the anonymous referees for their valuable comments which helped to improve the manuscript. This project was funded by the NERC National Centre for Earth Observation, NCEO, UK. We acknowledge US-MMS AmeriFlux site for its data records. Research at the MMSF site was supported by the Office of Science (BER), US Department of Energy, Grant No. DE-FG02-07ER64371. AmeriFlux is funded by the United States Department of Energy (DOE – TES), Department of Commerce (DOC – NOAA), the Department of Agriculture (USDA – Forest Service), the National Aeronautics and Space Administration (NASA) and the National Science Foundation (NSF). We are grateful to J. Exbrayat and A. Bloom for providing us with meteorological drivers, MODIS LAI observations and DALECv2 code. Finally we are grateful to M. Williams, J. Exbrayat, A. Bloom and T. Hill for comments and useful discussions.

Edited by: Carlos Sierra
Reviewed by: two anonymous referees