Over the Anthropocene, atmospheric methane has increased dramatically. Wetlands are one of
the major sources of methane to the atmosphere, but the role of changes in
wetland emissions is not well understood. The Community Land Model (CLM) of
the Community Earth System Models contains a module to estimate methane
emissions from natural wetlands and rice paddies. Our comparison of CH

Methane is the second most important greenhouse gas in terms of radiative
forcing

To compute an objective function value, we have to run a computationally
expensive CLM4.5bgc simulation to obtain the methane emission
predictions at each observation site. CLM4.5bgc and related codes are
deterministic models, i.e., the simulated CH

These characteristics of the objective function (computationally expensive, black-box, possibly multi-modal) rule out gradient-based optimization algorithms for two reasons. First, the derivatives would have to be computed numerically, which may be inaccurate and requires many expensive function evaluations. Second, gradient-based algorithms generally stop at a local minimum unless the initial guess is close to the global minimum.

For calibrating the parameters of other CLM modules, Markov Chain Monte Carlo
(MCMC) methods and Kalman filters have been used in the literature

Other methods that have recently gained interest for parameter tuning are
based on data assimilation (see, for example,

We use surrogate-model-based global optimization algorithms because they have
been shown to find near-optimal solutions within a few hundred function
evaluations for computationally expensive multimodal black-box problems

Several surrogate model algorithms have been developed in the literature that
use different surrogate model types. The efficient global optimization
algorithm by

The remainder of this paper is organized as follows. In Sect.

We used the Community Land Model Version 4.5 (CLM4.5), a land component of
the Community Earth System Model (CESM)

We selected the latest version of CLM with improved biogeochemistry
(CLM4.5bgc) over CLM4.0-CN. The major improvements in CLM4.5bgc include the
incorporation of vertically resolved soil carbon dynamics, an alternate
decomposition cascade from the Century soil model, and a more detailed
representation of nitrification and denitrification based on the Century
nitrogen model

In previous versions, simulation of ecosystem productivity was too low in
high latitudes and perhaps too high in low latitudes

We used a mechanistic methane emission model, which is a module integrated in
CLM4.5bgc

The model has been compared to the limited site-level observations of methane
emissions (many of the sites have very sparse spatial and temporal data
coverage, and directly measured climate forcing was unavailable at any of the
sites)

Although the land model can be used interactively within CESM, we use it at
specific points driven by appropriate meteorology

In this study we used a total of six natural wetland sites and ten rice paddy
sites (see Tables

The water table depth is one of the critical factors for methane emissions
from natural wetlands because it determines the extent of anoxic and oxic
soil zones where methane is produced and oxidized, respectively

Most of these wetland sites usually have peat soils with varying depths
underlain by mineral soil. We also forced each wetland site with measured pH
and a specific plant functional type (PFT). The PFT reflects the phenological
and physiological characteristics of the vegetation

For the point simulations at the rice paddy sites we only considered the rice
growing season. The flooding and drainage dates are shown in
Table

To bring the terrestrial carbon and nitrogen cycles close to steady-state
conditions, we spun up both wetland and rice paddy sites for 1850 conditions
(atmospheric CO

Additionally, we conducted global simulations of methane emissions from
natural wetlands for 1993–2004. For these simulations, we considered the
grid-cell-averaged methane emissions, which account for methane
emissions from both the inundated and non-inundated portions of the grid cell.
Since the CLM4.5 simulated saturated fraction (an index of inundation) was
substantially greater than the estimates from satellite observations and did
not match the spatio-temporal pattern of variability

CH

The goal of our study is to improve the methane emission predictions of
CLM4.5bgc by tuning the methane-related parameters such that the model better
fits the observations. We use the CH

CLM4.5bgc has 21 parameters related to the methane emission predictions. The
parameter names, their upper and lower bounds, and default values are shown
in Table

Optimization problems become increasingly difficult to solve as the number of parameters increases (curse of dimensionality). We therefore first determine which of these 21 parameters are the most sensitive and hence the most important for the optimization. By sensitive we mean parameters for which a slight change leads to a significant change in the emission predictions. Changing insensitive parameters, on the other hand, alters the emission predictions not at all or only very mildly, so these parameters can be excluded from the optimization, which decreases the problem dimension.
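The screening idea described above can be illustrated with a minimal one-at-a-time sketch. It is not the authors' procedure: the perturbation size (`rel_step`), the RMS change used as a sensitivity score, and the function `toy_model` (a stand-in for a CLM4.5bgc site simulation) are all assumptions made for illustration.

```python
import numpy as np

def oat_sensitivity(model, defaults, lower, upper, rel_step=0.1):
    """One-at-a-time screening: perturb one parameter, keep the others at default.

    `model` is a hypothetical stand-in for one site simulation; it maps a
    parameter vector to a vector of predicted emissions.
    """
    defaults = np.asarray(defaults, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    base = np.asarray(model(defaults), dtype=float)
    scores = np.zeros(defaults.size)
    for j in range(defaults.size):
        # Assumed perturbation: a fixed fraction of the parameter range.
        step = rel_step * (upper[j] - lower[j])
        changes = []
        for sign in (-1.0, 1.0):
            x = defaults.copy()
            x[j] = np.clip(x[j] + sign * step, lower[j], upper[j])
            pred = np.asarray(model(x), dtype=float)
            # RMS change of the prediction relative to the default run.
            changes.append(np.sqrt(np.mean((pred - base) ** 2)))
        scores[j] = max(changes)
    return scores

def toy_model(p):
    """Toy model: parameter 0 matters, parameter 1 barely, parameter 2 not at all."""
    t = np.linspace(0.0, 1.0, 5)
    return p[0] * t + 0.001 * p[1] + 0.0 * p[2]

scores = oat_sensitivity(toy_model, [0.5, 0.5, 0.5], [0, 0, 0], [1, 1, 1])
ranking = np.argsort(scores)[::-1]  # most sensitive parameter first
print(ranking, scores)
```

Parameters at the end of such a ranking are candidates for exclusion from the optimization. A one-at-a-time scan does not capture interactions between parameters, so it is only a first screening.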

Parameters that are sensitive for most observation sites (out of 16).

For each observation site, we analyzed to which of these 21 parameters the
methane emission predictions of CLM4.5bgc are the most sensitive. We altered
the value of each parameter

There are several parameters that are relatively important to the sensitivity
test for all 16 observation sites, but there are also parameters that are
important for some locations and less important for others.
Tables

Parameters that are least sensitive for observation sites (out of 16).

Surrogate models are used in optimization algorithms that aim to solve
computationally expensive black-box problems. Surrogate models serve as
computationally cheap approximations of the expensive simulation model

There are different surrogate model types such as radial basis functions
(RBFs)

An RBF interpolant is defined as follows:
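In its standard textbook form, an RBF interpolant built from $n$ sampled points $\mathbf{x}_1,\dots,\mathbf{x}_n$ with function values $\mathbf{f}$ can be written as below. The notation is an assumption and may differ from the authors' equation; a common choice, also assumed here, is the cubic kernel $\phi(r)=r^3$ with a linear polynomial tail $p(\mathbf{x})=\mathbf{b}^{T}\mathbf{x}+a$.

```latex
s(\mathbf{x}) = \sum_{i=1}^{n} \lambda_i \,
  \phi\bigl(\lVert \mathbf{x} - \mathbf{x}_i \rVert\bigr) + p(\mathbf{x}),
\qquad
\begin{bmatrix} \Phi & P \\ P^{T} & 0 \end{bmatrix}
\begin{bmatrix} \boldsymbol{\lambda} \\ \mathbf{c} \end{bmatrix}
=
\begin{bmatrix} \mathbf{f} \\ \mathbf{0} \end{bmatrix},
\qquad
\Phi_{ij} = \phi\bigl(\lVert \mathbf{x}_i - \mathbf{x}_j \rVert\bigr)
```

Here the $i$th row of $P$ is $[\mathbf{x}_i^{T},\ 1]$ and $\mathbf{c}=[\mathbf{b}^{T},\ a]^{T}$ collects the coefficients of the polynomial tail. For the cubic kernel, the linear system has a unique solution when the sampled points are distinct and $P$ has full column rank.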

Surrogate global optimization algorithms follow in general the steps shown in Algorithm

General surrogate global optimization algorithm

Step 1: Select points from the variable domain to create an initial experimental design.

Step 2: Do the expensive objective function evaluations (here the CLM4.5bgc simulations) at the points selected in Step 1.

Step 3: Fit the surrogate model (here the RBF model) to the data from Steps 1 and 2.

Step 4: Use the information from the surrogate model to select the new evaluation point.

Step 5: Do the expensive evaluation at the new evaluation point.

Step 6: Update the surrogate model and go to Step 4.

Step 7: Return the best solution found during the optimization.
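The general algorithm above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the DYCORS implementation used in the study: the initial design is plain uniform random sampling, the surrogate is a cubic RBF with a linear tail, new points are chosen by the best surrogate prediction among random perturbations of the current best point, and `toy_objective` is a cheap stand-in for a CLM4.5bgc simulation. All variables are assumed to be scaled to [0,1].

```python
import numpy as np

def fit_rbf(X, f):
    """Fit a cubic RBF with a linear polynomial tail to the data (X, f)."""
    n, d = X.shape
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    P = np.hstack([X, np.ones((n, 1))])
    A = np.block([[r ** 3, P], [P.T, np.zeros((d + 1, d + 1))]])
    rhs = np.concatenate([f, np.zeros(d + 1)])
    coef = np.linalg.solve(A, rhs)
    return coef[:n], coef[n:]  # RBF weights, polynomial coefficients

def predict_rbf(X, lam, c, Y):
    """Evaluate the fitted surrogate at the rows of Y."""
    r = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)
    return (r ** 3) @ lam + Y @ c[:-1] + c[-1]

def surrogate_optimize(objective, dim, n_init, n_evals, seed=0):
    rng = np.random.default_rng(seed)
    # Initial experimental design and its expensive evaluations.
    X = rng.random((n_init, dim))
    f = np.array([objective(x) for x in X])
    while len(f) < n_evals:
        # Fit (or update) the surrogate with all data collected so far.
        lam, c = fit_rbf(X, f)
        # Assumed selection rule: perturb the best point, pick the candidate
        # with the lowest surrogate prediction.
        best = X[np.argmin(f)]
        cand = np.clip(best + 0.1 * rng.standard_normal((200, dim)), 0.0, 1.0)
        dist = np.min(
            np.linalg.norm(cand[:, None, :] - X[None, :, :], axis=2), axis=1
        )
        pred = predict_rbf(X, lam, c, cand)
        pred[dist < 1e-3] = np.inf  # avoid resampling (near-)duplicate points
        x_new = cand[np.argmin(pred)]
        # Expensive evaluation at the new point.
        X = np.vstack([X, x_new])
        f = np.append(f, objective(x_new))
    i = np.argmin(f)
    return X[i], f[i]  # best solution found

def toy_objective(x):
    """Cheap stand-in for the expensive simulation (assumption)."""
    return float(np.sum((x - 0.3) ** 2))

x_best, f_best = surrogate_optimize(toy_objective, dim=2, n_init=6, n_evals=30)
print(x_best, f_best)
```

The loop stops when the evaluation budget is exhausted, which mirrors how the number of expensive simulations is usually the limiting factor in this setting.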

We use the DYCORS algorithm by

We create a symmetric Latin hypercube initial experimental design with

We use the same criteria as in DYCORS for determining the best candidate
point: we use the RBF approximation to predict the objective function values
at the candidate points, compute the distance of each candidate point to the
set of already sampled points, and compute a weighted score of these two
measures, where the weights cycle through a predefined pattern. In order to
guarantee that the matrix in Eq. (
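The weighted score of the two criteria described above (surrogate prediction and distance to already sampled points) can be sketched as follows. The scaling of both measures to [0,1] follows a common formulation of such scores and is an assumption here, as are the function name `select_candidate` and the weight pattern `WEIGHT_PATTERN`, which is purely illustrative.

```python
import numpy as np

def select_candidate(pred, min_dist, weight):
    """Return the index of the candidate with the lowest weighted score.

    pred:     surrogate predictions at the candidate points (lower is better)
    min_dist: distance of each candidate to the nearest sampled point
              (larger is better, since it favors unexplored regions)
    weight:   weight in [0, 1] on the prediction criterion
    """
    pred = np.asarray(pred, dtype=float)
    min_dist = np.asarray(min_dist, dtype=float)
    p_span = pred.max() - pred.min()
    d_span = min_dist.max() - min_dist.min()
    # Scale both criteria to [0, 1]; 0 is best for each of them.
    v_pred = (pred - pred.min()) / p_span if p_span > 0 else np.ones_like(pred)
    v_dist = (
        (min_dist.max() - min_dist) / d_span if d_span > 0 else np.ones_like(min_dist)
    )
    score = weight * v_pred + (1.0 - weight) * v_dist
    return int(np.argmin(score))

# Illustrative weight pattern (assumption): small weights favor exploration,
# large weights favor points with low predicted objective values.
WEIGHT_PATTERN = (0.3, 0.5, 0.8, 0.95)

pred = [1.0, 2.0, 4.0]
min_dist = [0.1, 0.4, 0.9]
picks = [
    select_candidate(pred, min_dist, WEIGHT_PATTERN[k % len(WEIGHT_PATTERN)])
    for k in range(4)
]
print(picks)
```

With a small weight the score prefers the candidate farthest from the sampled points, and with a large weight it prefers the candidate with the lowest predicted value, so cycling through the pattern alternates between global and local search.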

In this section we discuss the setup and results of the numerical
experiments. In a first set of experiments (pseudo data case), we generate
synthetic (pseudo) data and treat it as if it were the real measurement data
in order to assess how well our optimization approach performs. For these
experiments we know the optimal solution. In the second set of experiments
(real data case), we use the measured methane emission data and apply the
optimization algorithm. The goal in the second set of experiments is to find
a parameter set that reduces the objective function value (the weighted RMSE
in Eq.

We did experiments with

For each set of experiments we ran the optimization algorithm three times in
order to examine the influence of the random component in the algorithm
(random initial experimental design and random generation of candidate
points). We allowed 800 function evaluations for the five-dimensional problem
and 1000 evaluations for the 11-dimensional problem. The question of how many
function evaluations need to be performed in order to obtain a fixed level of
solution accuracy is problem dependent. For computationally expensive
optimization problems, such as the problem we consider here, the time needed
to evaluate the objective function and the total time available for
obtaining a solution usually determine how many evaluations can be done with
any algorithm. Results for many difficult computationally expensive
optimization problems (for example, problems with multiple local minima)
indicate that surrogate global optimization methods can usually obtain more
accurate results compared to non-surrogate methods with the same limited
number of evaluations (see, for example,

The weights

Solving problem (

Progress plot that shows the development of the best objective
function value found vs. the number of function evaluations for the pseudo
data case with

We assessed the performance of the optimization algorithm by investigating
how well the algorithm could find the model parameters that were used for
creating the pseudo data. For this purpose, we ran CLM4.5bgc with default
parameter values

Figure

Default and optimized parameter values of optimization trials T1, T2, and T3 for the five-dimensional pseudo data case. We report four decimal places because the model output is sensitive to very small changes for some variables. Note that we scaled the numbers to the interval [0,1].

Figure

Table

Progress plot that shows the development of the best objective
function value found vs. the number of function evaluations for the pseudo
data case with

Default and optimized parameter values of optimization trial T3 and parameter values for the point CP that was sampled during the same optimization trial and that is closer to the default point, but that has a worse objective function value (11-dimensional pseudo data case). Bold numbers indicate the parameters for which CP is closer to the default value than T3 (but CP has a worse objective function value).

In order to examine the impact of the differences between default and
optimized parameter values on the model prediction, we use the best parameter
vector of each trial and plot the corresponding CH

CLM4.5bgc CH

This result indicates that the calibration problem is not “identifiable” for
all parameter sets; that is, more than one parameter set can give a
very similar objective function value. For example,
for the model

It would be desirable to have an identifiable model, but CLM (and
probably other climate modules) has a number of interacting parameters and
multiplicative nonlinearities, and thus there is no guarantee that all
parameters are identifiable. This is reinforced by the data in
Table

Progress plot that shows the development of the best objective
function value found vs. the number of function evaluations for the real
data case with

In the real data case, we use the actual methane emission measurements at
each of the 16 observation sites for computing the objective function value.
Since we only have very few observations for each site and no information
about measurement errors, we did not exclude any of the measurements from the
optimization although there might be outliers. Also for the real data case we
examine the case for

The development of the best objective function value found in the three
trials T1, T2, and T3 is illustrated in
Fig.

The parameter values of the best solutions found in the three trials are
shown in Table

Default and optimized parameter values of optimization trials T1, T2, and T3 for the five-dimensional real data case. Bold indicates optimized parameters that are on (or close to) the variable boundary (all variables are scaled to [0,1]).

CH

Figures

CH

Progress plot that shows the development of the best objective
function value found vs. the number of function evaluations for the real
data case with

Figure

Default and optimized parameter values of optimization trials T1, T2, and T3 for the 11-dimensional real data case. Bold indicates optimized parameters that are on the variable bound (all variables are scaled to [0,1]).

Table

All three solutions have approximately the same objective function value, but the points differ greatly. This indicates that we have either a multi-modal surface on which several minima attain approximately the same objective function value, or a very flat valley in which many points attain similar objective function values. Both possibilities make it very difficult for gradient-based optimization algorithms to find the global optimum. In the first case, the algorithm will get trapped in a local optimum unless it is started close to the global minimum. In the second case, it would require many function evaluations because very small step sizes necessitate many steps and gradient computations. The surrogate optimization algorithm avoids both difficulties because it requires no gradients and samples the parameter space globally.

CH

CH

Table

Unweighted RMSE values for each site using the best parameters found during optimization
trial T1 of the

CH

Figures

Scatterplot showing the mean values of the CH

Average methane emissions (mg CH

The temporal variability in the model's predictions does not necessarily
follow the temporal variability in the observation data (see, for example,
Fig.

Figure

Figure

We ran CLM4.5bgc to obtain predictions for the CH

Comparison of total methane emissions (Tg CH

Figure

As indicated in the previous section, the observation data drive the model
to predict more CH

The spatial plots of the differences between the average methane emissions
when using default and optimized parameters for the unweighted trial are
shown in panel c of Fig.

In this paper we used a surrogate optimization approach for calibrating the parameters of the methane module of the Community Land Model (CLM4.5bgc). Given only relatively few measurements at 16 observation sites (wetlands and rice paddies), our goal was to explore whether a surrogate optimization method can improve the model's prediction capability in a computationally efficient way by minimizing the root mean squared error between the measurements and the model's predictions. We identified the important methane-related parameters in CLM4.5bgc with a sensitivity analysis and were thus able to reduce the problem dimension from 21 to 11. We then used the surrogate optimization approach to tune the most important parameters. We investigated two cases: a problem with five of the most important parameters and a problem with all 11 parameters.

We first used pseudo data in order to assess how well the surrogate
optimization performs and showed that we are able to closely match the pseudo
observations. We were able to reduce the RMSE to less than a fifth within the
first 150 function evaluations for both pseudo data cases. The objective
function was shown to have multiple local minima, which indicates that the
problem is probably not identifiable when 11 parameters are optimized.
Although the optimization greatly reduced the RMSE for the 11-parameter pseudo
data case, in some cases it did not recover the parameter values that were
used to generate the pseudo data. This is a problem with the model, not with
the optimization method used. The multiple local minima detected in
Table

By conducting the simulations globally and comparing the average predicted
emissions with default and optimized parameters, we could show that the total
global CH

However, the
distribution of the predicted emissions between latitudes changed
significantly. The observation data force the optimized model's CH

The methane biogeochemical model used in this study is integrated in the
Community Land Model version 4.5 (CLM4.5), which is the land component of the
Community Earth System Model (CESM,

In the following sections we consider each of these terms in more detail.

Methane production (

We adjusted the fractional inundation in each grid cell to account for a
changing redox potential.

The adjusted fractional inundation

In the non-inundated fraction of a grid cell, we estimated the delay in
methane production as the water table depth increases by estimating an
effective depth below which CH

Additionally, we constrained the methane production using the soil pH
function

We used a scaling factor (

Methane oxidation (

The diffusive transport through aerenchyma

Here, aerenchyma porosity is parameterized based on the plant functional
types (PFTs). The aerenchyma porosity of upland vegetation is multiplied by a
ratio that compares it with inundated systems:

If the PFT is c3_arctic_grass, c3_nonarctic_grass, or c4_grass, then

A minimum aerenchyma porosity is set to 0.05. Therefore,

The aerenchyma area varies over the course of the growing season. Therefore,
it is parameterized using the simulated leaf area index as

The aerenchyma area

The representation of the ebullition fluxes in the methane model is based
on

Gaseous diffusivity in the soil depends on several factors such as molecular
diffusivity, soil structure, porosity, and organic matter content. The
relationship between effective diffusivity (

Tables

Wetland site data.

Rice paddy site data.

Table

Parameter names, descriptions, ranges, and literature references.

Table

ID, name of observation sites, and associated weights for real data and pseudo data case (Eq. 1 of the main document).
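To illustrate how per-site weights such as those listed in this table could enter the objective function, the following minimal sketch combines per-site RMSE values into one weighted value. The exact form of Eq. 1 is not reproduced here, so the weighted sum of per-site RMSEs, the function name `weighted_rmse`, and the numbers in the example are assumptions.

```python
import numpy as np

def weighted_rmse(observed, predicted, weights):
    """Weighted sum of per-site RMSE values (assumed form of the objective).

    observed, predicted: one sequence of values per site (lengths may differ
    between sites, but must match within a site)
    weights: one non-negative weight per site
    """
    total = 0.0
    for obs, pred, w in zip(observed, predicted, weights):
        obs = np.asarray(obs, dtype=float)
        pred = np.asarray(pred, dtype=float)
        total += w * np.sqrt(np.mean((obs - pred) ** 2))
    return total

# Two hypothetical sites with made-up numbers, for illustration only.
obs = [[1.0, 2.0, 3.0], [10.0, 12.0]]
sim = [[1.0, 2.0, 5.0], [11.0, 11.0]]
value = weighted_rmse(obs, sim, weights=[1.0, 0.5])
print(value)
```

Weights of this kind keep sites with large emission magnitudes from dominating the objective and allow sites with few or uncertain observations to be down-weighted.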

The authors acknowledge the funding sources DOE SciDAC DE-SC0006791, NSF 1049031, NSF 1049033, and NSF CISE 1116298. The first author also acknowledges partial support by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract number DE-AC02-05CH11231. We thank the anonymous reviewers for their helpful comments and suggestions for improvement.

Edited by: A. Sandu