Improving predictive understanding of Earth system variability and change requires data–model integration. Efficient data–model integration for complex models requires surrogate modeling to reduce model evaluation time. However, building a surrogate of a large-scale Earth system model (ESM) with many output variables is computationally intensive because it involves a large number of expensive ESM simulations. In this effort, we propose an efficient surrogate method capable of using a few ESM runs to build an accurate and fast-to-evaluate surrogate system of model outputs over large spatial and temporal domains. We first use singular value decomposition to reduce the output dimensions and then use Bayesian optimization techniques to generate an accurate neural network surrogate model based on limited ESM simulation samples. Our machine-learning-based surrogate methods can build and evaluate a large surrogate system of many variables quickly. Thus, whenever the quantities of interest change, such as a different objective function, a new site, and a longer simulation time, we can simply extract the information of interest from the surrogate system without rebuilding new surrogates, which significantly reduces computational efforts. We apply the proposed method to a regional ecosystem model to approximate the relationship between eight model parameters and 42 660 carbon flux outputs. Results indicate that using only 20 model simulations, we can build an accurate surrogate system of the 42 660 variables, wherein the consistency between the surrogate prediction and actual model simulation is 0.93 and the mean squared error is 0.02. This highly accurate and fast-to-evaluate surrogate system will greatly enhance the computational efficiency of data–model integration to improve predictions and advance our understanding of the Earth system.

Improving predictive understanding of Earth system variability and change requires data–model integration. For example, Bilionis et al. (2015) improved the Community Land Model (CLM) prediction of crop productivity after model calibration, Müller et al. (2015) improved the CLM prediction of methane emissions after parameter optimization, and Fox et al. (2009) and Lu et al. (2017) improved the terrestrial ecosystem model predictive credibility of carbon fluxes after uncertainty quantification. However, data–model integration methods are usually computationally expensive, involving a large ensemble of model simulations, which prohibits their application to complex Earth system models (ESMs) with lengthy simulation time. To reduce computational costs, surrogate modeling is widely used (Razavi et al., 2012; Gong et al., 2015; Ray et al., 2015; Huang et al., 2016; Lu et al., 2018; Ricciuto et al., 2018). The surrogate model, which is a set of mathematical functions, approximates the actual simulation model based on pairs of simulation model input–output samples and then replaces the simulation model in the data–model integration. Because ESM evaluation is expensive, it is desirable to build an accurate surrogate from a limited number of ESM simulation samples. Because the surrogate model is evaluated many times in data–model integration, it must also be fast to evaluate. In this study, we use very few simulation model runs to build an accurate and quickly evaluated surrogate system of a large-scale problem based on advanced machine-learning methods.

In Earth system modeling, we usually need to build a surrogate system of many output variables over large spatial and temporal domains. ESMs tend to be simulated on a regional or global scale with many grid cells for several years, producing a large number of output variables. In addition, ESMs are used to solve versatile scientific problems, so the quantities of interest (QoIs) often change. Moreover, the development of a surrogate requires expensive ESM runs, and a large number of runs are often needed to capture the complex model input–output relationship. Therefore, it is reasonable to build a surrogate system for all possible model outputs to reduce the efforts of rerunning ESMs for a new surrogate development when the QoIs change. In this way, whenever we simulate the outputs in a new site or for additional sites at a different time or for a longer period, we can simply extract the information of interest from the large surrogate system without spending extra efforts to build new surrogates, which significantly reduces computational costs.

Building and evaluating a surrogate system of a large number of model
outputs can be very computationally intensive for almost all the surrogate
methods. Polynomials and artificial neural networks are widely used for
surrogate modeling (Razavi et al., 2012; Viana et al., 2014). Polynomial
methods, such as polynomial regression and radial basis functions, need to
solve polynomial coefficients in the surrogate construction and to calculate
matrix multiplications in the surrogate evaluation. Using a

Reducing the model output dimensions also improves computational efficiency
in the evaluation of the polynomial-based surrogates. For example, evaluating
the third-order polynomial-based surrogate of the model with 40 parameters
and 300 000 outputs for one parameter sample, we need to calculate two
matrix multiplications for which matrix

Surrogate modeling assisted by a neural network (NN) also suffers from high computational costs when applied to a large-scale problem with many QoIs. To approximate a complex ESM with many outputs, a complicated NN with many wide hidden layers is usually needed to capture the complex relationship between the model inputs and outputs because each spatial and temporal output variable is driven by different meteorological forcing such as air temperature, humidity, wind speed, precipitation, and radiation. The full connections between nodes in the input layer and the first hidden layer, between nodes of the hidden layers, and between nodes in the last hidden layer and a large number of nodes on the output layer involve a large number of NN weights and biases that need to be solved. For the same example discussed above, to approximate the model with 40 parameters and 300 000 model outputs, an NN with two hidden layers and each layer having 100 nodes has over 30 million weights and biases. Calculation of these weights and biases requires many samples to train the NN for a good fit. Each training sample involves one model evaluation. However, ESM simulation is time consuming, which usually takes several hours or days and can be up to months or even years. A limited sample size is not enough to train a deep and wide NN for convergence, and a simple NN trained by a small sample size may not capture underlying Earth systems accurately. Thus, reducing model output dimensions is needed to advance NN-based surrogate modeling. A small output size reduces the width of the output layer and also simplifies the relationship between the model inputs and outputs so that a simple NN architecture can be appropriate and a small sample size can be sufficient to accurately train the simple NN. In addition, a simple NN can also be evaluated fast with small weight matrix multiplications.

In this work, we propose using singular value decomposition (SVD) to reduce model output dimensions and to improve the computational efficiency of both building and evaluating the surrogates. ESM outputs usually show periodic changes along time and strong correlations between locations, which promises a fast decay of singular values. So, we can use a small number of singular value coefficients to capture a great amount of output information, enabling a significant output dimension reduction. We use the NN for surrogate modeling because, compared to polynomial methods, NNs have shown less difficulty in fitting highly nonlinear and discontinuous functions that are usually observed in ESM response surfaces. For example, carbon flux state variables, such as gross primary productivity (GPP), are strongly affected by vegetation-related parameters. When the parameter samples cause zero vegetation growth, GPP has zero values, whereas when the parameter samples cause high vegetation growth, GPP has large positive values. This leads to a discontinuous GPP response surface jumping from zeros to nonzeros.

NNs can theoretically fit any functions, but their practical performance strongly depends on the NN's architectures and hyperparameters. An NN has many hyperparameters such as the number of layers, number of nodes in each layer, type of activation functions, and learning rate of the stochastic gradient descent optimization. A slight change in the hyperparameter value can result in dramatically different NN performance. Development of a high-performing NN is time-intensive and usually requires trial-and-error tuning by machine-learning experts. In this work, we use Bayesian optimization techniques to optimize the NN architecture and hyperparameters to produce an accurate NN model for the training data. Bayesian optimization searches the hyperparameter space to iteratively minimize the validation errors of the NN by balancing exploration and exploitation (Shahriari et al., 2016). Research has suggested that Bayesian hyperparameter optimization of NNs is more efficient than manual, random, or grid search with better overall performance on test data and less time required for optimization (Bergstra et al., 2011; Snoek et al., 2012). Bayesian optimization involves a large ensemble of NN fittings, and it is a sequential model-based optimization; thus, fast training of the NN models is important. Our proposed SVD method can simplify the NN architecture to advance the NN training and improve the Bayesian optimization performance.

In this effort, we propose an SVD-enhanced, Bayesian-optimized, and NN-based surrogate method and aim to build an accurate and fast-to-evaluate surrogate system of a large-scale model using few model runs to improve computational efficiency in surrogate modeling and thus advance the data–model integration. We apply the method to a simplified land model in the Energy Exascale Earth System Model (sELM) to improve the model predictive capability of carbon fluxes. We build a surrogate system of 42 660 model output variables, which are annual GPPs at 1422 locations simulated for 30 years. The sELM is a regional-scale terrestrial ecosystem model that simulates terrestrial water, energy, and biogeochemical processes in terrestrial surfaces. Simulation of sELM is important for improving our understanding of ecosystem responses to climate change. However, sELM requires lengthy times for hydrologic and carbon cycle equilibration, and these high computational costs limit the affordable number of simulations in data–model integration, thus resulting in poor model performance. The proposed machine-learning-assisted surrogate method makes sophisticated data–model integration computationally feasible and promises an improvement of the sELM predictions.

The major contributions of this work are the following: (1) using SVD to reduce model output dimensions to improve computational efficiency in both building and evaluating an accurate surrogate of a large-scale ESM; (2) using Bayesian optimization techniques to quickly generate an accurate NN-based surrogate; and (3) applying the proposed method to build a large surrogate system of a regional-scale ESM to advance data–model integration. To our knowledge, the method of using SVD to enhance surrogate modeling is novel and we have not seen the application of Bayesian optimization to improve NN-based surrogates in Earth system modeling.

The paper is organized as follows. In Sect. 2, we first describe the sELM, the model parameters, and the QoIs we build surrogates for; following that, we introduce the SVD, NNs, and Bayesian optimization methods. In Sect. 3, we apply the methods to the sELM and analyze the surrogate accuracy. In Sect. 4, we discuss strategies to improve surrogate accuracy and investigate our method's performance in the application of these strategies. In Sect. 5, we end this paper by drawing our conclusions.

We developed a simplified version of the Energy Exascale Earth System Model (E3SM) land model (ELM), or sELM, to simulate carbon cycle processes relevant for Earth system models in a computationally efficient framework. This framework allows us to perform large regional ensembles that are computationally infeasible using offline land surface models such as ELM. The sELM is a combination of model elements from the Data Assimilation Linked Ecosystem Carbon model (DALEC; Williams et al., 2005) and the Community Land Model version 4.5 (CLM4.5; Oleson and Lawrence, 2013). The sELM consists of five process-based submodels that simulate carbon fluxes between five major carbon pools using 49 parameters in total. Based on a previous sensitivity analysis using ELM (Ricciuto et al., 2018), this study considers the eight most sensitive parameters, which are associated with four of the five submodels. We summarize all five process-based submodels and their interactions below and in Fig. 1.

Schematic of sELM, with processes shown using blue boxes with dependencies on environmental data; eight uncertain parameter inputs are listed in orange ovals, and model state variables are indicated by green shapes. Parameters are input to one or more processes as indicated by blue arrows. Model state variables may be outputs for some processes and input for other processes as indicated by red arrows.

The sELM consists of five major submodels: photosynthesis, autotrophic
respiration, allocation, deciduous phenology, and decomposition.
Photosynthesis is driven by the aggregate canopy model (ACM) from the DALEC,
which itself is calibrated against the soil–plant–atmosphere model (Williams
et al., 2005). ACM predicts GPP as a function of carbon dioxide
concentration, leaf area index, maximum and minimum daily temperature, and
photosynthetically active radiation. Here the GPP predicted by ACM is
modified by BTRAN, which reduces GPP when soil water is insufficient to
support transpiration. Because sELM does not predict soil moisture, BTRAN is
calculated in a full ELM simulation and is fed into sELM as an input. ACM
shares one parameter, the ratio of leaf carbon to nitrogen (

The remaining four submodules are based on ELM. The autotrophic respiration
model computes the growth and maintenance respiration components and is
controlled by four parameters, the

The sELM can simulate several carbon state and flux variables, as shown in
Fig. 1 with green shapes. GPP, which represents the total plant carbon
uptake, is considered in this study. Here we use sELM to predict annual GPP
in deciduous forest systems in the eastern region of the United States for
30 years between 1981 and 2010. The carbon state variables are spun up to steady
state by cycling the GSWP3 input meteorology (Kim, 2017) from 1981 to 2010 for five cycles, and the
sixth cycle is used as the output for our surrogate modeling study. The region of
interest covers 1422 land grid cells (locations) as shown in Fig. 2. Given 30
outputs at each location (annual values over 30 years), a total of 42 660
GPP variables are simulated. The model uses one plant functional type, and
phenological drivers such as air temperature, solar radiation, vapor pressure
deficit, and

Locations of interest for which we build surrogates of GPP
(g C m

We consider eight uncertain parameter inputs whose ranges are shown as axis limits. The 20 training and 1000 testing data are randomly drawn from the parameter space.

In this section, we introduce our SVD-enhanced, Bayesian-optimized, and NN-based surrogate methods. We first describe the SVD for reducing data dimensionality, then introduce the NN techniques for building a surrogate model, and last depict the Bayesian optimization algorithm for producing a high-performing NN-based surrogate.

We build a surrogate system of model outputs by fitting a data matrix whose columns are output variables and rows are output samples. For a model with 100 000 output variables, the columns of this matrix span a 100 000-dimensional space. Encoding this matrix on a computer takes quite a lot of memory and evaluating this matrix takes a large number of calculations. We are interested in approximating this matrix with some low-rank matrix but retaining most of its information to reduce data transfer and accelerate matrix calculation.
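As a minimal illustration of such a low-rank approximation, the sketch below applies a truncated SVD to a synthetic data matrix using NumPy. The matrix sizes, the three latent factors, the noise level, and the choice of measuring retained information by the cumulative squared singular values are all assumptions made for this example; they are not sELM data or necessarily the definition used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a training matrix: 20 samples (rows) by 500 output
# variables (columns), built from 3 latent factors plus small noise so that
# the singular values decay quickly.
n_samples, n_outputs, n_latent = 20, 500, 3
latent = rng.normal(size=(n_samples, n_latent))
loadings = rng.normal(size=(n_latent, n_outputs))
Y = latent @ loadings + 0.01 * rng.normal(size=(n_samples, n_outputs))

# Thin SVD: Y = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

k = 3
coeffs = U[:, :k] * s[:k]      # reduced representation, shape (20, k)
Y_approx = coeffs @ Vt[:k, :]  # rank-k reconstruction, shape (20, 500)

# Fraction of the squared Frobenius norm captured by the leading components.
energy = np.cumsum(s**2) / np.sum(s**2)
rel_error = np.linalg.norm(Y - Y_approx) / np.linalg.norm(Y)

print(f"energy captured by {k} components: {energy[k - 1]:.6f}")
print(f"relative reconstruction error: {rel_error:.6f}")
```

Only the `k` columns of `coeffs` need to be modeled for each sample; the rows of `Vt[:k, :]` are stored once and reused to map any set of coefficients back to the full output space.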

Singular value decomposition (SVD) decomposes a matrix

In this study, we use SVD to reduce training data dimensions. The training
data matrix

Artificial neural networks (NNs) consist of fully connected hierarchical layers with nodes that can be flexibly used for function approximation (Yegnanarayana, 2009). The first layer is the input layer and each node in the input layer represents one model input variable. The last layer is the output layer and each node in the output layer represents one model output variable. The layers between input and output layers are hidden layers that are used to approximate the relationship between model inputs and outputs. When the relationship is complex, a complicated NN with many wide hidden layers is usually needed. The input layer first assigns model parameter values to its nodes. Then each node in the first hidden layer takes multiple weighted inputs, applies the activation function to the summation of these inputs, and calculates the node's value. Next, the second hidden layer takes the values on the first hidden layer nodes as inputs and calculates its nodes' values in the same way. This process moves forward until we get values of all nodes in the output layer, i.e., obtaining NN predictions for the given model parameter input values. The nodes in each layer are fully connected to all the nodes in its previous and subsequent layers. Each of these connections has an associated weight, and each node in the hidden and output layers has an associated bias. A complicated NN results in a large number of weights. By tuning these weights and biases based on some training data, we improve the NN approximation of the underlying simulation model.
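The forward calculation described above can be sketched in a few lines of NumPy. This is an illustrative example rather than the implementation used in this study: the layer sizes (8 inputs, two hidden layers of 10 nodes, 5 outputs), the ReLU activation, the linear output layer, and the random initialization are all assumptions chosen for the sketch.

```python
import numpy as np


def relu(x):
    """Rectified linear activation (an assumed choice for this sketch)."""
    return np.maximum(x, 0.0)


def init_network(layer_sizes, rng):
    """Create one (weights, biases) pair for each pair of adjacent layers."""
    return [
        (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def forward(params, x):
    """Propagate inputs x of shape (n_samples, n_inputs) through the network."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)   # weighted sum plus bias, then activation
    W_out, b_out = params[-1]
    return h @ W_out + b_out  # linear output layer for regression


rng = np.random.default_rng(0)
params = init_network([8, 10, 10, 5], rng)
x = rng.uniform(size=(4, 8))  # four hypothetical parameter samples
y = forward(params, x)
print(y.shape)
```

Each row of `y` is the network prediction for one input sample; training consists of adjusting the arrays in `params` so that these predictions match the simulation model outputs.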

NN uses the stochastic gradient descent (SGD) method to optimize its weights and biases (Bottou, 2012). SGD optimizes variables by minimizing some loss function based on the function's gradients to these variables. The loss function is usually defined as the mean squared error (MSE) between the NN predictions and model simulations for the same set of model parameter samples in the training data. SGD iteratively updates the optimized variables at the end of each training epoch. In the process, the learning rate, which specifies how aggressively the optimization algorithm jumps between iterations, greatly affects the algorithm's performance and has to be tuned. A small learning rate will take a long time to reach the optimum, causing a slow convergence, whereas a big learning rate will bounce around the optimum, causing unstable results and a difficult convergence. Using SGD to optimize a complex NN with many weights requires a great amount of computational effort and has difficulty in convergence. First, many training data are required to tune a large number of weights. Small training data can easily cause overfitting; i.e., the NN “perfectly” fits the training data but performs badly on new data, thus deteriorating the NN prediction accuracy. In addition, a large number of weights involves massive matrix calculations in evaluating the loss function, slowing down the training process. Furthermore, a complicated NN has difficulty in convergence and can easily get stuck in local minima. In this work, we use SVD to reduce the model output dimensions so as to decrease the number of nodes in the output layer and simplify the NN architecture, thus reducing the size of the weights, enabling a reasonable NN training from small training data, and ultimately improving the computational efficiency.
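The sensitivity to the learning rate can be seen even in a toy problem. The sketch below uses plain (non-stochastic) gradient descent on the one-dimensional function f(w) = w^2, which is a deliberate simplification of SGD on an NN loss; the function, starting point, and learning rates are assumptions chosen only for illustration.

```python
def gradient_descent(learning_rate, n_steps=50, w0=1.0):
    """Minimize f(w) = w**2 with fixed-step gradient descent; return final w."""
    w = w0
    for _ in range(n_steps):
        grad = 2.0 * w             # derivative of w**2
        w -= learning_rate * grad  # step against the gradient
    return w


for lr in (0.01, 0.4, 1.1):
    final_w = gradient_descent(lr)
    print(f"learning rate {lr}: |w| after 50 steps = {abs(final_w):.3e}")
```

With the smallest rate the iterate is still far from the optimum at w = 0 after 50 steps, the intermediate rate converges quickly, and the largest rate overshoots on every step so the iterate grows instead of shrinking.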

NN involves a lot of hyperparameters that dramatically affect its
performance such as the number of layers, the number of nodes in each layer,
and the learning rate of the SGD algorithm. Hyperparameter optimization is
needed to produce a high-performing NN. This requires optimizing an
objective function

Bayesian optimization first prescribes a prior belief over the possible
objective functions and then sequentially updates this prior distribution to
posterior distributions as points are evaluated via Bayesian posterior
updating. The prior and posterior distributions are the probabilistic model
that approximates the unknown objective function we are optimizing. With
this probabilistic model, we can sequentially induce an acquisition function
that leverages the uncertainty in the posterior to guide exploration of new
data points for updating the model. The acquisition function evaluates the
utility of candidate points for the next evaluation of

Depending on the choice of the probabilistic model, we have different Bayesian optimization algorithms (Shahriari et al., 2016). The Gaussian process approach, using the Gaussian process as a probabilistic model and expected improvement as an acquisition function, has been widely used for parameter optimization (Bardenet and Kegl, 2010; Niranjan et al., 2010). However, this approach has a few disadvantages when applied to the optimization of NN hyperparameters. First, it does not work well for categorical variables such as the type of activation functions in the NN. Second, it selects a new set of parameter points based on the best evaluation data. However, NN training usually involves randomization, so running the NN with the same parameter values can lead to different performance, which suggests that our best point could be just a lucky output for the specific setting of randomness. Third, the Gaussian process itself involves several hyperparameters such as the kernel of the covariance function; the choice of these hyperparameters can significantly affect the optimization, but selecting them is difficult. Last, the calculation of the Gaussian process is rather slow, especially for a large number of parameter searches (Snoek et al., 2012).

In this work, we use a tree-structured Parzen estimator (TPE) for NN hyperparameter optimization (Bergstra et al., 2013). TPE first performs a few iterations of random search, and then it divides the collected parameter points into two groups. The first group contains points that give the best scores after evaluation, which can be the top 10 %–25 % of all the points, and the second group has all other points. Next, TPE finds a set of parameters that are more likely to be in the first group and less likely to be in the second group through the following steps: (1) estimate the likelihood probability for each of the two groups based on Parzen window density estimators (Archambeau et al., 2006); (2) sample a set of candidate points using the likelihood probability from the first group; and (3) select the point having the largest probability ratio of being in the first group to the second group as the next iteration point. Last, we continue the search until we reach the maximum number of evaluations and choose the parameter combination that gives the best NN accuracy on the validation data.
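The steps listed above can be sketched for a single continuous variable as follows. This is a simplified illustration under our own assumptions, not the TPE implementation of Bergstra et al. or the one used in this study: it uses a fixed Gaussian kernel width, a fixed top fraction of 20 %, and a toy quadratic objective in place of the NN validation error. All function and argument names are hypothetical.

```python
import numpy as np


def parzen_density(x, centers, bandwidth):
    """Gaussian Parzen-window density estimate evaluated at each point of x."""
    z = (x[:, None] - centers[None, :]) / bandwidth
    norm = len(centers) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def tpe_minimize(objective, low, high, n_random=10, n_iter=40,
                 top_fraction=0.2, n_candidates=64, seed=0):
    rng = np.random.default_rng(seed)
    bandwidth = 0.1 * (high - low)  # fixed kernel width: a simplification
    # Start with a few iterations of random search.
    xs = [float(x) for x in rng.uniform(low, high, size=n_random)]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        # Split evaluated points into the best group and all other points.
        order = np.argsort(ys)
        n_good = max(2, int(np.ceil(top_fraction * len(xs))))
        points = np.array(xs)
        good, rest = points[order[:n_good]], points[order[n_good:]]
        # Sample candidates from the density of the best group.
        centers = rng.choice(good, size=n_candidates)
        noise = bandwidth * rng.normal(size=n_candidates)
        candidates = np.clip(centers + noise, low, high)
        # Keep the candidate with the largest density ratio good/rest.
        ratio = parzen_density(candidates, good, bandwidth) / (
            parzen_density(candidates, rest, bandwidth) + 1e-12)
        x_next = float(candidates[np.argmax(ratio)])
        xs.append(x_next)
        ys.append(objective(x_next))
    best = int(np.argmin(ys))
    return xs[best], ys[best], ys


best_x, best_y, history = tpe_minimize(lambda x: (x - 2.0) ** 2, low=-5.0, high=5.0)
print(f"best x = {best_x:.3f}, objective = {best_y:.5f}")
```

In an NN setting, `objective` would train a network with the proposed hyperparameter value and return its validation error, so each call is expensive and the number of evaluations is the main cost.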

The TPE algorithm exhibits significant improvement over classic hyperparameter optimization methods. TPE works well for all types of NN hyperparameter variables; it considers a set of top parameters to avoid the influence of NN randomization; its implementation is straightforward and requires no additional hyperparameters to be specified; and it is computationally fast (Bergstra et al., 2011).

In this section, we present the results of building the surrogate system of 42 660 GPP variables for sELM. First, we demonstrate that our method using SVD can efficiently build and evaluate a large surrogate system by comparing the results with and without application of SVD. We then investigate the influence of NN's architecture on surrogate performance and show that our method using hyperparameter optimization can quickly generate an accurate NN. Last, we evaluate surrogate accuracy on large-scale spatial and temporal GPPs.

We consider three sets of data: the training data for fitting the NN, the validation data to detect overfitting in the NN training and to select the best-performing NN in the hyperparameter optimization, and the test data to evaluate the NN prediction accuracy. Each dataset contains pairs of parameters and GPP samples. The parameter samples are randomly drawn from the parameter space defined in Fig. 3. To assess the effectiveness of our proposed surrogate method for a small dataset, we consider only 20 training data (Fig. 3). The validation data are chosen as a 0.3 fraction of the training data. The NN model does not train on the validation data but evaluates the loss function on them at the end of each epoch. In each epoch, the training data are shuffled, and the validation data are always selected from the last 0.3 fraction. Thus, in any given epoch, only 14 samples are used to tune the NN weights. Owing to shuffling, these 14 samples can be a different subset of the 20 training data in each epoch, and thus we make full use of the limited 20 data for building the surrogates. We use 1000 test data (Fig. 3) to evaluate the NN prediction accuracy, which makes a reasonable assessment of our proposed method within an affordable computational cost. Note that the 1000 test data are not needed for building the surrogates but are used to demonstrate the effectiveness and efficiency of our method. When using our method to build the surrogates of the 42 660 GPPs, only 20 sELM model simulations are used.
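The splitting scheme described above can be sketched as follows. The helper `split_epoch` is hypothetical and reflects our reading of the description (shuffle the 20 samples, then hold out the last 0.3 fraction); it is not taken from the training code used in this study.

```python
import numpy as np


def split_epoch(n_samples, val_fraction, rng):
    """Shuffle sample indices and hold out the last fraction for validation."""
    n_val = int(round(n_samples * val_fraction))
    order = rng.permutation(n_samples)
    return order[: n_samples - n_val], order[n_samples - n_val:]


rng = np.random.default_rng(0)
for epoch in range(3):
    train_idx, val_idx = split_epoch(20, 0.3, rng)
    print(f"epoch {epoch}: {len(train_idx)} for training, "
          f"validation indices {sorted(val_idx.tolist())}")
```

Each epoch uses 14 indices for updating the weights and 6 for evaluating the validation loss, and the membership of the two groups changes from epoch to epoch.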

We define the loss function as the mean squared error (MSE) between the NN predictions and the sELM simulations based on the parameter samples for training. We use the Adam algorithm (Kingma and Ba, 2015) for stochastic optimization of the NN and run it for 800 epochs to minimize the loss function and update the NN weights. Adam has been shown to be a superior stochastic optimization algorithm for training NNs (Basu et al., 2018). There is no single optimal number of epochs: a small number of epochs could result in underfitting and a large number of epochs may lead to overfitting. Here we consider a large number of epochs and use early stopping to avoid overfitting. During the training, when the validation loss has not improved for 100 epochs, we stop the training and choose the weights at the epoch with the smallest validation loss as the optimal weights and the associated NN as the best-trained NN under the given setting.
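The early-stopping rule can be expressed as a small function operating on a sequence of per-epoch validation losses. The sketch below is illustrative: the function name is hypothetical, and the loss curve is synthetic (decreasing for 50 epochs and then slowly increasing) rather than taken from our training runs.

```python
def early_stopping(val_losses, patience=100):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses.

    Training stops once `patience` epochs have passed without the
    validation loss improving on its best value so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Synthetic validation curve over 800 epochs.
val_losses = [1.0 / (e + 1) for e in range(50)] + [0.05 + 0.001 * e for e in range(750)]
stop_epoch, best_epoch = early_stopping(val_losses, patience=100)
print(stop_epoch, best_epoch)
```

The weights saved at `best_epoch` are the ones kept, so the 100 additional epochs only serve to confirm that the validation loss has stopped improving.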

We then use the trained NN to predict the 1000 test data and compare the
predictions with the corresponding sELM simulation results to evaluate the
NN accuracy. We define two metrics for evaluation: the MSE and the
coefficient of determination. The MSE computes the expected value of the
squared prediction errors; the smaller the MSE value, the better the
prediction. The coefficient of determination, also called

We consider two scenarios when building the surrogate system of the 42 660 GPP outputs: Case I involves building the surrogates of reduced data after SVD, and Case II involves building the surrogates of all GPPs directly. In Case I, we first apply SVD to reduce the training data dimensionality, then build surrogates of the singular value coefficients, and last transfer the surrogate system back to the original QoIs (i.e., the 42 660 GPP variables).
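The three steps of Case I can be sketched end to end on synthetic data. Everything in this example is an assumption made for illustration: the toy simulator is linear with three latent factors, the sizes are arbitrary, and, most importantly, an ordinary least-squares fit stands in for the NN surrogate so that the example stays short and exactly checkable.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, n_inputs, n_outputs, k = 20, 50, 8, 1000, 3

# Hypothetical toy simulator (not sELM): outputs depend on 3 latent factors.
B = rng.normal(size=(n_inputs, k))
L = rng.normal(size=(k, n_outputs))


def toy_model(X):
    return (X @ B) @ L


X_train = rng.uniform(size=(n_train, n_inputs))
X_test = rng.uniform(size=(n_test, n_inputs))
Y_train, Y_test = toy_model(X_train), toy_model(X_test)

# Step 1: SVD of the training outputs; keep k coefficients per sample.
U, s, Vt = np.linalg.svd(Y_train, full_matrices=False)
C_train = U[:, :k] * s[:k]  # shape (20, k)
basis = Vt[:k, :]           # shape (k, n_outputs)

# Step 2: surrogate of the coefficients. Least squares with an intercept
# is a stand-in for the NN used in the paper.
design = np.hstack([X_train, np.ones((n_train, 1))])
beta, *_ = np.linalg.lstsq(design, C_train, rcond=None)


def surrogate(X):
    """Predict coefficients, then map them back to the original outputs."""
    C_pred = np.hstack([X, np.ones((len(X), 1))]) @ beta
    return C_pred @ basis   # Step 3: back-transform to all outputs


Y_pred = surrogate(X_test)
print(Y_pred.shape, float(np.max(np.abs(Y_pred - Y_test))))
```

The surrogate itself has only `k` outputs; the full set of output variables is recovered by one multiplication with the stored basis, so any subset of outputs can be extracted afterwards without refitting.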

The goal of this study is to develop a surrogate method that builds an accurate surrogate system with small training data to reduce the computational costs of simulating expensive ESMs. To demonstrate the effectiveness and efficiency of our method, we compare the surrogate performance of the two cases in predicting the 1000 test data from two aspects: (1) for the same number of training data, the predictive accuracy of the two surrogates, and (2) the number of training data used to achieve similar predictive accuracy.
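For reference, the two accuracy metrics used in these comparisons can be computed as in the sketch below. It assumes the standard definition of the coefficient of determination (one minus the ratio of the residual sum of squares to the total sum of squares) applied to flattened arrays; how the metric is aggregated over many output variables is an assumption here, and the numbers are made up for illustration.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between simulations and surrogate predictions."""
    return float(np.mean((y_true - y_pred) ** 2))


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (standard definition)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values, not sELM results.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(mse(y_true, y_pred), r2_score(y_true, y_pred))
```

A perfect surrogate gives an MSE of 0 and a coefficient of determination of 1, and a surrogate no better than predicting the mean of the simulations gives a coefficient of determination of 0.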

Figure 4 shows the singular value decay of decomposition of the training data
matrix having 20 samples and 42 660 GPP variables. The figure indicates that
the singular values decay very fast. The first two singular values drop by about
1 order of magnitude, and the first five singular values can capture 97 % of the
information from the training data matrix. To choose a suitable number of
singular value coefficients (

Singular value decay and the information contained in the largest singular values. The top five singular values contain 97 % of the information from the training data matrix with 42 660 GPP variables and 20 samples.

Performance of the NNs trained by 20 data with consideration of the different number of singular value coefficients after SVD.

Changes of loss function values along epochs for training and
validation data

In Case I, our method is able to use 20 training data to build a highly
accurate surrogate of 42 660 GPP variables with a small MSE of 0.02 and a
high

To make a fair comparison, we use the same NN architecture in Case II as in
Case I except that the output layer of NN in Case II has all 42 660 GPPs
and the output layer in Case I has only five singular value coefficients.
Figure 6b indicates that the simple NN with 20
hidden nodes is not sophisticated enough to capture the complex relationship
between the eight inputs and 42 660 outputs. As we can see in
Fig. 6b, both training and validation losses
are relatively high, suggesting an underfitting. The validation loss is
always larger than the training loss, suggesting that the fitted NN does not
generalize well and may result in poor performance in predicting new data.
Figure 7 shows

Comparison of NN performance between Case I (building surrogates of
five singular value coefficients after SVD based on 20 training data;

On the other hand, the poor performance in Case II suggests that a wider and
deeper NN is needed when we consider the large outputs directly. We thus
increase the number of nodes in each hidden layer to 100 and use this complex
NN with a total of 200 hidden nodes to approximate the relationship of the
eight inputs and 42 660 outputs in Case II. This complex NN dramatically
increases its number of parameters (including weights and biases) to
4.3 million from 255 in Case I. To fit this wide NN and calibrate its large
parameters, 20 training data are too few to get a reasonable fit. No matter
how we adjust the NN hyperparameters, we cannot get a stable solution in
training. We then increase the number of training data to 50, and Fig. 6c
shows that the increased number of data greatly decreases the training and
validation losses; the validation loss is slightly higher than the training
loss, implying a good fit. Figure 7 indicates that the complex NN with 200
hidden nodes trained by 50 data in Case II significantly improves the
prediction accuracy with the

By simplifying the NN architecture through data compression, our method not only reduces the number of training data but also decreases the training time. Using 20 data to train a simple NN with 255 parameters, our method takes about 4 s. In comparison, the traditional surrogate method without data compression requires great effort in training the complex NN with 4.3 million parameters. As shown in Fig. 7, Case II takes 270 s to fit the NN based on 50 training data and 967 s for the 200 training data, showing an approximately linear increase in computing time. The long training time leads to high computational costs in NN hyperparameter optimization, in which a large number of NN trainings are needed to search the wide hyperparameter space for a high-performing NN model, as discussed in Sect. 3.2.

NN has a large number of hyperparameters. Here we adjust five hyperparameters
and use Case I to investigate their influence on surrogate prediction
accuracy. The five hyperparameters are the following: the number of hidden
layers (

Different sets of NN hyperparameters result in different

We use TPE to evaluate 100 sets of hyperparameters, and the set giving the
best validation score, i.e., the smallest MSE on the validation data, is chosen
as the optimal set of hyperparameters. Results indicate that the combination of

Although NN performance differs significantly across combinations of
hyperparameters, the TPE algorithm can efficiently find high-performing NNs
based on previous sample information. As shown in Fig. 8, well-performing NNs
prefer simple architectures with two hidden layers; e.g., most blue lines
have

Using only 20 expensive sELM runs, we quickly build an accurate surrogate
system of 42 660 GPPs at 1422 locations for 30 years. Therefore, for a
data–model integration problem with QoIs within given spatial and temporal
ranges, we can directly extract the information of interest from the
surrogate system to advance the analysis. The best-performing NN generated
from our method gives an overall accurate prediction of the 42 660 GPPs with
an averaged MSE of 0.02 and

Figure 9 shows averaged

Simulations of annual GPPs (g C m

We pick three locations to closely evaluate the surrogate accuracy
(Fig. 9). Location 1046 has the best prediction
with the highest

Figure 11 plots the averaged

Building a surrogate of a discontinuous response surface, e.g., one in which vegetation turns from alive to dead and the GPP jumps from nonzero to zero, is difficult for almost all state-of-the-art surrogate methods. Research has shown that NNs, because of their layered architecture and nonlinear activation functions, can perform better than other surrogate approaches (Luo and Lu, 2014; Razavi et al., 2012). To improve the surrogate accuracy for strongly nonlinear and discontinuous problems, one strategy is to use physics-informed domain decomposition methods to build surrogate models separately in different response surface regimes. This strategy requires the surrogate methods to be closely tied to the simulation model, and the methods are generally problem-specific, requiring expert interaction. Another strategy is to increase the number of training data to better explore the complex response surface. This strategy increases the computational costs because more expensive model simulations are needed. In Sect. 4, we investigate these two strategies and discuss their influence on surrogate accuracy.

ESMs are complex, with response surfaces that often display strong nonlinearity and discontinuity, which creates a challenge for surrogate modeling. In this section, we consider two strategies for improving surrogate accuracy: physics-informed learning and an increase in the number of training data. We conduct two corresponding experiments to investigate our method's performance under these strategies. In experiment I, we divide the parameter space into two parts, one producing zero GPPs and the other producing nonzero GPPs, and we use 20 training data to build surrogates of the 42 660 GPPs in the regime that generates nonzero GPPs. In experiment II, we build the surrogates of the 42 660 GPPs in the original parameter domain (Fig. 3) but with an increased number of training data (200 and 1000).
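The partition used in experiment I can be illustrated with a small sketch. It assumes that a parameter sample belongs to the zero-GPP regime when every simulated output is zero (within a tolerance); this criterion, the function name `split_by_regime`, and the toy samples are hypothetical and are not taken from the sELM code or the experiments.

```python
def split_by_regime(params, outputs, tol=0.0):
    """Partition (parameter, output) samples into two regimes.

    A sample is placed in the zero regime when all of its output values
    have magnitude <= tol; otherwise it is placed in the nonzero regime.
    """
    nonzero, zero = [], []
    for p, y in zip(params, outputs):
        if all(abs(v) <= tol for v in y):
            zero.append((p, y))
        else:
            nonzero.append((p, y))
    return nonzero, zero


# Toy stand-in data: three parameter samples with three outputs each
# (the real problem has eight parameters and 42 660 outputs).
toy_params = [[0.1, 0.5], [0.9, 0.2], [0.4, 0.4]]
toy_outputs = [[120.0, 135.5, 128.1], [0.0, 0.0, 0.0], [310.2, 0.0, 295.7]]

nonzero_samples, zero_samples = split_by_regime(toy_params, toy_outputs)
print(len(nonzero_samples), len(zero_samples))  # 2 1
```

Only the samples in the nonzero regime would then be passed to surrogate training, so that the NN is fitted on the continuous part of the response surface.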

We use the results of Case I as a baseline to investigate our method's
performance in the two experiments. Figure 12 shows averaged

Simulations of annual GPPs (g C m

Simulations of annual GPPs (g C m

Figure 14 shows averaged

The analysis of the two experiments suggests that our method is data-efficient for continuous problems. To improve the surrogate accuracy in discontinuous and highly nonlinear problems, we can use physics-informed domain decomposition to focus on the continuous and smooth regions of the response surface. If the discontinuity is an inherent feature of the underlying function for which we need a surrogate, increasing the number of training data is a good way to improve surrogate accuracy.

Building a surrogate system of many GPP variables over large spatial and temporal domains provides great flexibility for subsequent predictive analytics tasks. For example, the surrogate model can be used to analyze the sensitivity of any set of spatial and temporal GPP variables to the model parameters, as well as for parameter optimization and uncertainty quantification based on single-site, multiple-site, single-year, or multiple-year GPP observations using any defined objective function. In addition, with newly collected observations from additional sites or further time periods, we can use the same surrogate system for analysis as long as the QoIs are within the surrogate simulation ranges. In a future study, we will pursue data–model integration using the constructed surrogate system.
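As an illustration of extracting QoIs from the surrogate system, the sketch below selects the GPPs of chosen locations and years from a surrogate output of 1422 x 30 = 42 660 values and evaluates a simple sum-of-squared-errors objective. The location-major ordering of the output, the zero-based indices, the function names, and the stand-in prediction and observation values are all assumptions made for illustration; the actual output layout of the surrogate is not specified here.

```python
N_LOCATIONS = 1422
N_YEARS = 30


def flat_index(location, year):
    """Map (location, year) to a position in the flattened output.

    ASSUMES location-major ordering: all years of location 0 come first.
    """
    if not (0 <= location < N_LOCATIONS and 0 <= year < N_YEARS):
        raise IndexError("location or year outside the surrogate range")
    return location * N_YEARS + year


def extract_qoi(prediction, locations, years):
    """Pick the predicted GPPs for the requested locations and years."""
    return [prediction[flat_index(loc, yr)] for loc in locations for yr in years]


def sum_squared_error(simulated, observed):
    """One possible objective function for calibration."""
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))


# Stand-in for one surrogate evaluation (values are placeholders).
surrogate_output = [float(i) for i in range(N_LOCATIONS * N_YEARS)]

# Single-site, multiple-year QoI with hypothetical observations.
qoi = extract_qoi(surrogate_output, locations=[1046], years=[0, 1, 2])
observed = [31380.0, 31382.0, 31381.0]
misfit = sum_squared_error(qoi, observed)
print(qoi, misfit)  # [31380.0, 31381.0, 31382.0] 2.0
```

Changing the site, the years, or the objective function only changes the arguments passed to these helpers; the surrogate itself does not need to be rebuilt.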

In this work, we develop an SVD-enhanced, Bayesian-optimized, and NN-based surrogate method to improve the computational efficiency of large-scale surrogate modeling and to advance model–data integration studies in Earth system model simulations. Our method is data-efficient in that only 20 model simulations are needed to build an accurate surrogate system. This is a promising result because large Earth system model ensembles are often computationally infeasible, and 20 is a reasonable and affordable number of simulations to consider. In addition, our method is general purpose and can be efficiently applied to a wide range of Earth system problems with different spatial scales (local, regional, or global) and different simulation periods. It is highly effective for smooth problems and scales well for highly nonlinear and discontinuous problems.

We apply our surrogate method to a regional ecosystem model. The results
indicate that using only 20 model runs, we can build an accurate surrogate
system of 42 660 spatially and temporally varied GPPs with the

The sELM is presented here in version 1.0, which is implemented
in Python. It is open-source code that can be
accessed freely from

All the data used in this study are model simulation data, which can be generated by running the sELM.

DL developed the methods and carried them out. DR developed the model code and performed the model simulations. DL prepared the paper with contributions from the coauthor.

The authors declare that they have no conflict of interest.

Primary support for this work was provided by the Scientific Discovery through Advanced Computing (SciDAC) program, funded by the U.S. Department of Energy (DOE), Office of Advanced Scientific Computing Research (ASCR) and Office of Biological and Environmental Research (BER). Additional support was provided by BER's Terrestrial Ecosystem Science Scientific Focus Area (TES-SFA) project. The authors are supported by Oak Ridge National Laboratory, which is supported by the DOE under contract DE-AC05-00OR22725.

This paper was edited by David Topping and reviewed by Tianfang Xu and Xiankui Zeng.