Large computer models are ubiquitous in the Earth sciences. These models often have tens or hundreds of tuneable parameters and can take thousands of core hours to run to completion while generating terabytes of output. It is becoming common practice to develop emulators as fast approximations, or surrogates, of these models in order to explore the relationships between model inputs and outputs, understand uncertainties, and generate large ensemble datasets. While the purpose of these surrogates may differ, their development is often very similar. Here we introduce ESEm: an open-source tool providing a general workflow for emulating and validating a wide variety of models and outputs. It includes efficient routines for sampling these emulators for the purpose of uncertainty quantification and model calibration. It is built on well-established, high-performance libraries to ensure robustness, extensibility and scalability. We demonstrate the flexibility of ESEm through three case studies: reducing parametric uncertainty in a general circulation model, exploring precipitation sensitivity in a cloud-resolving model, and examining scenario uncertainty in the CMIP6 multi-model ensemble.

Computer models are crucial tools for their diagnostic and predictive power and are applied to every aspect of the Earth sciences. These models have tended to increase in complexity to match the increasing availability of computational resources and are now routinely run on large supercomputers producing terabytes of output at a time. While this added complexity can bring new insights and improved accuracy, sometimes it can be useful to run fast approximations of these models, often referred to as surrogates (Sacks et al., 1989). These surrogates have been used for many years to allow efficient exploration of the sensitivity of model output to its inputs (L. A. Lee et al., 2011; Ryan et al., 2018), generation of large ensembles of model realizations (Holden et al., 2014, 2019; Williamson et al., 2013), and calibration of models (Holden et al., 2015a; Cleary et al., 2021; Couvreux et al., 2021). Although relatively common, these workflows invariably use custom emulators and bespoke analysis routines, limiting their reproducibility and use by non-statisticians.

Here we introduce ESEm, a general tool for emulating Earth systems models and a framework for using these emulators, with a focus on model calibration, broadly defined as finding model parameters that produce model outputs compatible with available observations. Unless otherwise stated, model parameters in this context refer to constant, scalar model inputs rather than, e.g. boundary conditions. This tool builds on the development of emulators for uncertainty quantification and constraint in the aerosol component of general circulation models (Regayre et al., 2018; L. A. Lee et al., 2011; Johnson et al., 2020; Watson-Parris et al., 2020) but is applicable much more broadly, as we will show.

Figure 1 shows a schematic of a typical model calibration workflow that ESEm enables, assuming a simple “one shot” design. Once the gridded model data have been generated they must be co-located (resampled) onto the same temporal and spatial locations as the observational data that will be used for calibration in order to minimize sampling uncertainties (Schutgens et al., 2016a, b). The Community Intercomparison Suite (CIS; Watson-Parris et al., 2016) is an open-source Python library that makes this kind of operation very simple. The output is an Iris (Met Office, 2020b) cube-like object, a representation of a Climate and Forecast (CF)-compliant NetCDF file, which includes all of the necessary coordinates and metadata to ensure traceability and allow easy combination with other tools. ESEm uses the same representations throughout to allow easy input and output of the emulated datasets, plotting and validation, and also allows chaining operations with other related tools such as Cartopy (Met Office, 2020a) and Xarray (Hoyer and Hamman, 2016). Once the data have been read and co-located, they are split into training and validation (and optionally test) sets before performing emulation over the training data using the ESEm interface. This emulator can then be validated and used for inference and calibration.

A schematic of a typical workflow using CIS and ESEm to perform model emulation and calibration. Note that only the locations of the observed data are used for resampling the model data.
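As an illustration of the first step of this workflow, the following sketch shows how point observations and gridded model output might be read and co-located before emulation. It is only a sketch: the file and variable names are placeholders, and the `collocated_onto` call reflects our reading of the CIS Python interface.

```python
import cis

# Read the gridded model output and the point observations
# (file and variable names are illustrative placeholders)
model_aaod = cis.read_data("echam_ham_output_*.nc", "ABS_550nm")
obs_aaod = cis.read_data("aeronet_stations_2017.lev20", "AAOD_550nm")

# Resample the model onto the observation locations so that sampling
# errors are minimized before comparison or calibration
collocated_model = model_aaod.collocated_onto(obs_aaod)
```

The resulting cube-like object carries the coordinates and metadata of the observations and can then be split into training and validation sets and passed to an emulator.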

Emulation is essentially a multi-dimensional regression problem and ESEm provides three main options for performing these fits – Gaussian processes (GPs), convolutional neural networks (CNNs) and random forests (RFs). GPs build on kriging, a technique formalized by Matheron (1963) for estimating gold concentrations in South African mines from sparse samples, and have become a popular tool for non-parametric interpolation and an important method within the field of supervised machine learning. Kennedy and O'Hagan (2001) first described the use of GPs for the calibration of computer models, which forms the basis of current approaches. GPs are particularly well suited to this task since they provide robust estimates of non-linear responses, together with associated uncertainties, even in cases with limited training data. Despite initial difficulties with their scalability as compared to, e.g. neural networks, recent advances have allowed for deeper, more expressive (Damianou and Lawrence, 2013) GPs that can be trained on ever larger volumes of training data (Burt et al., 2019). Despite their prevalent use in other areas of machine learning, CNNs and RFs have not been widely used in model emulation. Here we include both as examples of alternative approaches to demonstrate the flexible emulation interface and to motivate broader usage of the tool. For example, Sect. 5.1 shows the use of an RF emulator for exploring precipitation susceptibility in a cloud-resolving model.

One common use of an emulator is to perform model calibration. By definition, any computer model has a number of inputs and outputs. The model inputs can be high-dimensional boundary conditions or simple scalar parameters, and while large uncertainties can exist in the boundary conditions, our focus here is on the latter. These input parameters can often be uncertain, either due to a lack of physical analogue or lack of available data. Assuming that suitable observations of the model output are available, one may ask which values of the input parameters give the best output as measured against the observations. This model “tuning” is often done by hand, leading to ambiguity and potentially sub-optimal configurations (Mauritsen et al., 2012). The difficulty in this task arises because, while the computer model is designed to calculate the output based on the inputs, the inverse process is normally not possible directly. In some cases, this inverse can be estimated and the process of generating an inverse of the model, known as inverse modelling, has a long history in hydrological modelling (e.g. Hou and Rubin, 2005). The inverse of individual atmospheric model components can be determined using adjoint methods (Partridge et al., 2011; Karydis et al., 2012; Henze et al., 2007), but these require bespoke development and are not amenable to large multi-component models. Simple approaches can be used to determine chemical and aerosol emissions based on atmospheric composition, but these implicitly assume that the relationship between emissions and atmospheric concentration is reasonably well predicted by the model (C. Lee et al., 2011). More generally, attempting to infer the best model inputs to match a given output is variously referred to as “calibration”, “optimal parameter estimation” and “constraining”. In many cases finding these optimum parameters requires many evaluations of the model, which may not be feasible for large or complex models, and thus emulators are used as a surrogate. ESEm provides a number of options for performing this inference, from simple rejection sampling to more complex Markov chain Monte Carlo (MCMC) techniques.

Despite their increasing popularity, no general-purpose toolset exists for model emulation in the Earth sciences. Each project must create and validate their own emulators, with all of the associated data handling and visualization code that necessarily accompanies them. Further, this code remains closed source, discouraging replication and extension of the published work. In this paper we aim not only to describe the ESEm tool but also to elucidate the general process of emulation with a number of distinct examples, including model calibration, in the hope of demonstrating its usefulness to the field. A description of the pedagogical example used to provide context for the framework description is provided in Sect. 2. The emulation workflow and the three models included with ESEm are provided in Sect. 3. We then discuss the sampling of these emulators for inference in Sect. 4, before providing two more specific example uses in Sect. 5 and some concluding remarks in Sect. 6.

While we endeavour to describe the technical implementation of ESEm in general terms, we will refer back to a specific example use case throughout in order to aid clarity. This example case concerns the estimation of absorption aerosol optical depth (AAOD) due to anthropogenic black carbon (BC), which is highly uncertain due to limited observations and estimates of both pre-industrial and present-day biomass burning emissions, and large uncertainties in key microphysical processes and parameters in climate models (Bellouin et al., 2020).

Briefly, the model considered here is ECHAM6.3-HAM2.3 (Tegen et al., 2018;
Neubauer et al., 2019), which calculates the distribution and evolution of
both internally and externally mixed aerosol species in the atmosphere and
their effect on both radiation and cloud processes. We generate an ensemble
of 39 model simulations for the year of 2017 over three uncertain input
parameters: (1) a scaling of the emissions flux of BC by between 0.5 and 2
times the baseline emissions, (2) a scaling on the removal rate of BC
through wet deposition (the main removal mechanism of BC) by between 0.33 and
3 times the baseline values, and (3) a scaling of the imaginary refractive
index of BC (which determines its absorptivity) between 0.2 and 0.8. The
parameter sets are created using maximin Latin hypercube sampling (Morris
and Mitchell, 1995) where the scaling parameters (1 and 2) are sampled from
log-uniform distributions, while the imaginary part of the refractive index
is sampled from a normal distribution centred around 0.7. These parameter
ranges were determined by expert elicitation and designed to cover the
broadest plausible range of uncertainty. Unless otherwise stated, five of the
simulations are retained for testing while the rest are used for training
the emulators (see Sect. 3.1 for more details). The model AAODs are
emulated at their native resolution of approximately 1.8°.

For simplicity, in this paper we then compare the monthly mean model-simulated aerosol absorption optical depth with observations of the same quantity in order to better constrain the global radiative effect of these perturbations. A full analysis including in-situ compositional and large-scale satellite observations, as well as an estimation of the effect of the constrained parameter space on estimates of effective radiative forcing will be presented elsewhere.

Here we step through each of the emulation and inference procedures used to determine a reduced uncertainty in climate model parameters, and hence AAOD, by maximally utilizing the available observations.

Given the huge variety of geophysical models and their applications and the broad (and rapidly expanding) range of statistical models available to emulate them, ESEm uses an object-oriented (OO) approach to provide a generic emulation interface. This interface is designed to encourage the contribution of additional model engines, either in the core package through pull requests or more informally as a community resource. The inputs include an Iris cube or xarray DataArray with the leading dimension representing the stack of training samples and any other keyword arguments the emulator may require for training. Using either user-specified or default options for the model hyper-parameters and optimization techniques, the model is then easily fit to the training data and validated against the held-back validation data.
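As a sketch of this interface, an emulator might be constructed and validated as follows; the `gp_model` constructor and the `train`/`predict` methods follow our reading of the ESEm documentation, and the variable names are placeholders.

```python
from esem import gp_model  # cnn_model and rf_model follow the same pattern

# X_train: (n_samples, n_parameters) array or DataFrame of input parameters
# Y_train: Iris cube or xarray DataArray whose leading dimension indexes
#          the same n_samples training simulations
emulator = gp_model(X_train, Y_train)
emulator.train()

# The emulator returns both a mean prediction and an estimate of its
# own uncertainty for new parameter combinations
pred_mean, pred_var = emulator.predict(X_test)
```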

In this section we describe the inputs expected by the emulator and the three emulation engines provided by default in ESEm.

In many circumstances the observations we would like to use to compare and calibrate our model against are provided on a very different spatial and temporal sampling from the model itself. A model typically uses a discretized representation of space-time, whereas observations are often point-like measurements or retrievals. Naively comparing point observations with gridded model output can lead to large sampling biases (Schutgens et al., 2017). By collocating the models and observations (e.g. by using CIS), we can minimize this error.

In Earth sciences these (resampled) model values are typically very large datasets with many millions of values. With sufficient computing power these can be emulated directly; however, there is often a lot of redundancy in the data due to, e.g. strong spatial and temporal correlations, and this brute-force approach is wasteful and can make calibration difficult (as discussed in Sect. 4). The use of summary statistics to reduce this volume while retaining most of the information content is a mature field (Prangle, 2018) and is already widely used (albeit informally) in many cases. The summary statistic can be as simple as a global weighted average, or it could be an empirical orthogonal function (EOF)-based approach (Ryan et al., 2018). Although some techniques for automatically finding such statistics are becoming available (Fearnhead and Prangle, 2012), this usually requires knowledge of the underlying data, and we leave this step for the user to perform using the standard tools available (e.g. Dawson, 2016) as required.
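For instance, an area-weighted global mean can be computed directly with Iris before training, reducing each ensemble member to a single value; this is only a sketch, assuming the co-located data sit in an Iris cube with latitude and longitude coordinates.

```python
import iris.analysis
import iris.analysis.cartography

# Ensure the horizontal coordinates have bounds so grid-cell areas can be computed
for name in ("latitude", "longitude"):
    if not cube.coord(name).has_bounds():
        cube.coord(name).guess_bounds()

# Collapse the spatial dimensions to an area-weighted mean, leaving one
# summary value per ensemble member along the leading 'sample' dimension
weights = iris.analysis.cartography.area_weights(cube)
global_mean = cube.collapsed(["latitude", "longitude"],
                             iris.analysis.MEAN, weights=weights)
```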

Once the data have been resampled and summarized they should be split into training, validation and test sets. The training data are used to fit the models, while the validation portion of the data are used to measure their accuracy while exploring and tuning hyper-parameters (including model architectures). The test data are held back for final testing of the model. Typically, only a small fraction of the ensemble members needs to be reserved for this purpose (for example, the 5 of 39 simulations held back in Sect. 2).
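A minimal sketch of such a split over ensemble members (rather than over individual grid points, which would leak information between the sets) is shown below; the five held-back members mirror the example of Sect. 2, and `X` and `Y` are placeholder arrays.

```python
import numpy as np

rng = np.random.default_rng(42)
n_members = X.shape[0]              # leading dimension indexes ensemble members
shuffled = rng.permutation(n_members)

n_test = 5                          # e.g. the five members held back in Sect. 2
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]
```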

The input parameter space can also be reduced to enable more interpretable and robust emulation (also known as feature selection). ESEm provides a utility for filtering parameters based on the Bayesian (or Akaike) information criterion (BIC; Akaike, 1974) of the regression coefficients for a lasso least-angle regression (LARS) model, using the scikit-learn implementation. This provides an objective estimate of the importance of the different input parameters and allows the removal of any parameters that do not affect the output of interest. A complementary approach may be to apply feature importance tests to trained emulators to determine their sensitivity to particular input parameters.
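A sketch of this style of parameter filtering using the scikit-learn lasso LARS implementation with a BIC criterion is given below; the coefficient threshold and variable names are illustrative choices rather than ESEm defaults.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

# X: (n_samples, n_parameters) input parameters
# y: (n_samples,) summary of the output of interest (e.g. global mean AAOD)
selector = LassoLarsIC(criterion="bic").fit(X, y)

# Parameters whose coefficients are shrunk to (near) zero contribute little
# to the output of interest and can be dropped before emulation
keep = np.abs(selector.coef_) > 1e-6
print("Retained parameter indices:", np.flatnonzero(keep))
```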

Gaussian processes (GPs) are a popular choice for model emulation due to
their simple formulation and robust uncertainty estimates, particularly in
cases of relatively small amounts of training data. Many excellent texts are
available to describe their implementation and use (Rasmussen and Williams,
2005), and we only provide a short description here. Briefly, a GP is a
stochastic process (a distribution of continuous functions) and can be
thought of as an infinite dimensional normal distribution (hence the name).
The statistical properties of the normal distributions and the tools of
Bayesian inference allow tractable estimation of the posterior distribution
of functions given a set of training data. For a given mean function, a GP
can be completely described by its second-order statistics, and thus the choice
of covariance function (or kernel) can be thought of as a prior over the
space of functions it can represent. Typical kernels include constant,
linear, radial basis function (RBF or squared exponential) and Matérn kernels.

A number of libraries are available that provide GP fitting with varying degrees of maturity and flexibility. By default, ESEm uses the open-source GPFlow (Matthews et al., 2017) library for GP-based emulation. GPFlow builds on the heritage of the GPy library (GPy, 2012) but is based on the TensorFlow (Abadi et al., 2016) machine learning library with out-of-the-box support for the use of graphics processing units (GPUs), which can considerably speed up the training of GPs. It also provides support for sparse and multi-output GPs. By default, ESEm uses a zero mean and a combination of linear, RBF and polynomial kernels that are suitable for the smooth and continuous parameter response expected for the examples used in this paper and related problems. However, given the importance of the kernel for determining the form of the functions generated by the GP, we have also included the ability for users to specify combinations of other common kernels and mean functions. For a clear description of some common kernels and their combinations, as well as work towards automated methods for choosing them, see, e.g. Duvenaud (2014). For stationary kernels, GPFlow supports automatic relevance determination (ARD), allowing length scales to be learnt independently for each input dimension. The user is also able to specify which dimensions should be active for each kernel in the case where the input dimension can be reduced (as discussed above).
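Such kernel combinations can be written directly with GPflow by adding (or multiplying) kernel objects; the sketch below fits a regression model with a linear-plus-RBF-plus-polynomial kernel. ESEm wraps this kind of construction behind its own interface, so the exact defaults and priors may differ, and `X`, `Y` and `X_new` are placeholder arrays.

```python
import gpflow

# X: (n_samples, n_parameters) and Y: (n_samples, n_outputs) float64 arrays
kernel = (gpflow.kernels.Linear()
          + gpflow.kernels.RBF()
          + gpflow.kernels.Polynomial())

model = gpflow.models.GPR(data=(X, Y), kernel=kernel)
gpflow.optimizers.Scipy().minimize(model.training_loss,
                                   model.trainable_variables)

# Predictive mean and variance at new parameter combinations
mean, variance = model.predict_y(X_new)
```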

The framework provided by GPFlow also allows for multi-output GP regression,
and ESEm takes advantage of this to automatically provide regression over
each of the output features provided in the training data. Figure 2 shows
the emulated response from the ESEm-generated GP emulation of absorption aerosol optical depth (AAOD) using a combination of constant, linear and RBF kernels.

Example emulation of absorption aerosol optical depth (AAOD) for a given set of three model parameters (broadly scaling emissions of black carbon, removal of black carbon and the absorptivity of black carbon) as output by the ECHAM-HAM model and by the GP and CNN emulators.

Through the development of automatic differentiation and batch gradient descent it has become possible to efficiently train very large (millions of parameters), deep (dozens of layers) neural networks (NNs) using large amounts (terabytes) of training data. The price of this scalability is the risk of overfitting and the lack of any information about the uncertainty of the outputs. However, both of these shortcomings can be addressed using a technique known as “dropout” whereby individual weights are randomly set to zero and effectively “dropped” from the network. During training this has the effect of forcing the network to learn redundant representations and reduce the risk of overfitting (Srivastava et al., 2014). More recently it was shown that applying the same technique during inference casts the NN as approximating Bayesian inference in deep Gaussian processes and can provide a well-calibrated uncertainty estimate on the outputs (Gal and Ghahramani, 2015). The convolutional layers within these networks also take into account spatial correlations that cannot currently be directly modelled by GPs (although dimension reduction in the input can have the same effect). The main drawback with a CNN-based emulator is that they typically need a much larger amount of training data than GP-based emulators.

While fully connected neural networks have been used for many years, even in climate science (Knutti et al., 2006; Krasnopolsky et al., 2005), the recent surge in popularity has been powered by the increases in expressibility provided by deep, convolutional neural networks (CNNs) and the regularization techniques (such as early stopping) that prevent these huge models from over-fitting the large amounts of training data required to train them. Many excellent introductions can be found elsewhere, but (briefly) a neural network consists of a network of nodes connecting (through a variety of architectures) the inputs to the target outputs via a series of weighted activation functions. The network architecture and activation functions are typically chosen a priori, and following this the model weights are determined through a combination of back-propagation and (batch) gradient descent until the outputs match (defined by a given loss function) the provided training data. As previously discussed, the random dropping of nodes (by setting the weights to zero), termed dropout, can provide estimates of the prediction uncertainty of such networks. The computational efficiency of such networks and the rich variety of architectures available have made them the tool of choice in many machine learning settings, and they are starting to be used in climate sciences for emulation (Dagon et al., 2020), although the large amounts of training data required have so far limited their use somewhat.

ESEm uses the Keras library (Chollet, 2015) with the TensorFlow back end to provide a flexible interface for constructing and training CNN models, and a simple, fairly shallow architecture is included as an example. This default model takes the input parameters and passes them through an initial fully connected layer before passing through two transposed convolutional layers that perform an inverse convolution and act to “spread out” the parameter information spatially. The results of this default model are shown in Fig. 2c, which shows the predicted AAOD from a specific set of three model parameters. While the emulator clearly has some skill and produces the large-scale structure of the AAOD, the error compared to the full ECHAM-HAM output is larger than the GP emulator at around 10 % of the absolute values. This is primarily due to the limited training data available in this example (34 simulations). In addition, this “simple” network still contains nearly a million trainable parameters, and thus an even simpler network would probably perform better given the linearity of the model response to these parameters.
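The sketch below shows a Keras model in the same spirit as this default architecture (a dense layer followed by transposed convolutions), though not the exact network shipped with ESEm; the grid size, dropout rate and training settings are illustrative assumptions. Calling the model with `training=True` at prediction time keeps the dropout layer active so that repeated forward passes yield a predictive mean and spread (MC dropout).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

n_params, ny, nx = 3, 96, 192           # illustrative parameter count and grid size

inputs = tf.keras.Input(shape=(n_params,))
x = layers.Dense((ny // 8) * (nx // 8) * 8, activation="relu")(inputs)
x = layers.Reshape((ny // 8, nx // 8, 8))(x)
x = layers.Dropout(0.2)(x)
x = layers.Conv2DTranspose(16, 3, strides=4, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(1, 3, strides=2, padding="same")(x)

model = tf.keras.Model(inputs, x)
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, Y_train, epochs=200, verbose=0)   # Y_train: (n, ny, nx, 1)

# Monte Carlo dropout: repeated stochastic forward passes give a mean and spread
samples = np.stack([model(X_test, training=True).numpy() for _ in range(50)])
pred_mean, pred_std = samples.mean(axis=0), samples.std(axis=0)
```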

ESEm also provides the option for emulation with random forests using the open-source implementation provided by scikit-learn. Random forest estimators are composed of an ensemble of decision trees; each decision tree is a recursive binary partition over the training data, and the predictions are an average over the predictions of the decision trees (Breiman, 2001). As a result of this architecture, random forests (along with other algorithms built on decision trees) have three main attractions. Firstly, they require very little pre-processing of the inputs as the binary partitions are invariant to monotonic rescaling of the training data. Secondly, and of particular importance for climate problems, they are unable to extrapolate outside of their training data because the predictions are averages over subsets of the training dataset. As a result of this, a random forest trained on output from an idealized climate model was shown to automatically conserve water and energy (O'Gorman and Dwyer, 2018). Finally, their construction as a combination of binary partitions lends itself to model responses that might be non-stationary or discontinuous.

These features are of particular importance for problems involving the parameterization of sub-grid processes in climate models (Beucler et al., 2021), and as such, although parameterization is not the purpose of ESEm, we include a simple random forest implementation and hope to build on this in the future.
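The non-extrapolation property is easy to demonstrate with the scikit-learn implementation that ESEm uses: because every prediction is an average over training targets, predictions outside the training domain saturate at the training extremes rather than following the trend. A small illustrative sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = 2.0 * X.ravel()                      # a simple linear response on [0, 1]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Inside the training range the fit is good; outside it the predictions
# are clipped to the range of the training targets (roughly [0, 2])
print(rf.predict([[0.5], [1.5], [5.0]]))
```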

Having trained a fast, robust emulator, we can use it to calibrate our model against available observations. Generally, this problem involves estimating the model parameters that could give rise to (or best match) the available observations. More formally, we can define a model as a function F that maps a set of input parameters θ onto an output Y, i.e. Y = F(θ); calibration then amounts to finding the inverse mapping from the observed Y back to the parameters θ.

This inverse is unlikely to be well defined since many different combinations of parameters could feasibly result in a given output, and thus we take a probabilistic approach. In this framework we would like to know the posterior probability distribution of the input parameters given the observations which, by Bayes' theorem, can be written p(θ|Y) ∝ p(Y|θ) p(θ), where p(Y|θ) is the likelihood of the observations given the parameters and p(θ) represents our prior beliefs about the parameters.

Depending on the purpose of the calibration and assumptions about the form of the likelihood p(Y|θ), different approaches to sampling this posterior can be taken, ranging from simple approximate Bayesian computation (ABC) rejection sampling to Markov chain Monte Carlo (MCMC) methods.

The simplest ABC approach seeks to approximate the likelihood using only samples from the simulator and a discrepancy function ρ: a proposed set of parameters is accepted whenever the discrepancy between the (emulated) model output and the observations lies within some tolerance ε.

In practice, however, the simulator proposals will never exactly match the observations and we must make a pragmatic choice for both the discrepancy measure ρ and the tolerance ε.

Framed in this way, ε can be interpreted as the total mismatch we are prepared to tolerate between model and observations, and it should therefore reflect the observational, emulator and structural (model) uncertainties involved in the comparison.

This approach is closely related to the approach of “history matching” (Williamson et al., 2013) and can be shown to be identical in the case of a fixed tolerance ε.

When multiple (N) observations are used, requiring every single observation to lie within tolerance quickly becomes overly restrictive, so in practice a candidate set of parameters is retained if a sufficiently large fraction of the observations is matched within the chosen tolerance.
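As an illustration of this style of rejection, the sketch below evaluates the standard history-matching implausibility for a set of emulated candidate parameter sets and retains those that lie within tolerance for a sufficient fraction of the observations; the tolerance of 3 and the 90 % fraction are common choices from the literature rather than ESEm defaults, and the input arrays are placeholders.

```python
import numpy as np

# pred_mean, pred_var: (n_candidates, n_obs) emulator mean and variance at the
#                      observation locations for each candidate parameter set
# obs, obs_var:        (n_obs,) observed values and their uncertainties (variance)
# struct_var:          structural (model discrepancy) variance, scalar or (n_obs,)
implausibility = np.abs(obs - pred_mean) / np.sqrt(obs_var + pred_var + struct_var)

# Retain candidates that are plausible (implausibility < 3) for at least 90 %
# of the observations; both thresholds are illustrative choices
plausible_fraction = np.mean(implausibility < 3.0, axis=1)
retained = plausible_fraction >= 0.9
print(f"{retained.sum()} of {retained.size} candidate parameter sets retained")
```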

In order to illustrate this approach, we apply AERONET (AErosol RObotic
NETwork) observations of AAOD to the problem of constraining ECHAM-HAM model
parameters as described in Sect. 2. The AERONET sun photometers directly
measure solar irradiances at the surface in clear-sky conditions, and by
performing almucantar sky scans are able to estimate the single-scattering
albedo, and hence AAOD, of the aerosol in its vicinity (Dubovik and King,
2000; Holben et al., 1998). Daily average observations are taken from all
available stations for 2017 and co-located with monthly model outputs using
linear interpolation. Figure 3 shows the posterior distribution for the parameters described in Sect. 2 when uniform priors are assumed and a Gaussian process emulator is calibrated against these observations. Of the million points sampled from this emulator, 729 474 (73 %) are retained as being compatible with the observations.

The posterior distribution of parameters representing the plausible space of parameters for the example perturbed parameter ensemble experiment having been calibrated with a GP against observed absorbing aerosol optical depth measurements from AERONET. The diagonal histograms represent marginal distributions of each parameter while the off-diagonal scatterplots represent samples from the joint distributions. The colour represents the (average) emulated AAOD for each parameter combination.

The matrix of implausibilities computed during this calibration (one value for each candidate parameter set and observation) can also be inspected directly, for example to explore which observations provide the strongest constraint.

The ABC method described above is simple and powerful but somewhat inefficient as it repeatedly samples from the same prior. In reality, each rejection or acceptance of a set of parameters provides us with extra information about the “true” form of the posterior p(θ|Y), which a more efficient sampler could use to concentrate its proposals in regions of high posterior probability. This is the aim of Markov chain Monte Carlo (MCMC) approaches.

Given the joint probability distribution of parameters and observations introduced above and an initial choice of parameters θ0, an MCMC algorithm proposes a new set of parameters and accepts or rejects the proposal with a probability determined by the ratio of the posterior probabilities of the two parameter sets, generating a chain of samples whose distribution converges to the posterior.

In the default implementation of MCMC calibration ESEm uses the TensorFlow Probability implementation of Hamiltonian Monte Carlo (HMC) (Neal, 2011), which uses the gradient information automatically calculated by TensorFlow to inform the proposal of new parameters and thereby improve the efficiency of the sampling.

Note that for this implementation the distance metric between the emulated output and the observations is treated as a (log-)likelihood so that the posterior probability of each proposed parameter set can be evaluated.

The implementation will then return the requested number of accepted samples and report the acceptance rate, which provides a useful metric for tuning the algorithm. It should be noted that MCMC algorithms can be sensitive to a number of key parameters, including the number of burn-in steps used (and discarded) before sampling occurs and the step size. Each of these can be controlled via keyword arguments to the sampler.

This approach can provide much more efficient sampling of the emulator and improved parameter estimates, especially when used with informative priors that can guide the sampler.
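For reference, the TensorFlow Probability HMC machinery that ESEm builds on can be driven directly as in the sketch below; the Gaussian log-posterior evaluated through a placeholder `emulator_predict` call, and the step-size and burn-in settings, are illustrative assumptions rather than the ESEm implementation.

```python
import tensorflow as tf
import tensorflow_probability as tfp

def target_log_prob(theta):
    # Illustrative log-posterior: Gaussian likelihood of the observations
    # given the emulator prediction, with an implicit uniform prior on theta
    pred_mean, pred_var = emulator_predict(theta)      # placeholder emulator call
    total_var = obs_var + pred_var
    return -0.5 * tf.reduce_sum((obs - pred_mean) ** 2 / total_var
                                + tf.math.log(total_var))

hmc = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob, step_size=0.01, num_leapfrog_steps=5)

samples, is_accepted = tfp.mcmc.sample_chain(
    num_results=10000,
    num_burnin_steps=1000,
    current_state=tf.constant([0.5, 0.5, 0.5], dtype=tf.float64),
    kernel=hmc,
    trace_fn=lambda _, results: results.is_accepted)

print("Acceptance rate:", tf.reduce_mean(tf.cast(is_accepted, tf.float64)).numpy())
```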

While ABC and MCMC form the backbone of many parameter estimation techniques, there has been a large amount of research on improved techniques, particularly for complex simulators with high-dimensional outputs. See Cranmer et al. (2020) for an excellent recent review of the state-of-the-art techniques, including efforts to emulate the likelihood directly utilizing the “likelihood ratio trick” and even including information from the simulator itself (Brehmer et al., 2020). The sampling interface for ESEm has been designed to decouple the emulation technique from the sampler and enable easy implementation of additional samplers as required.

In order to demonstrate the generality of ESEm for performing emulation and/or inference over a variety of Earth science datasets, here we introduce two further examples.

In this example, we use an ensemble of large-domain simulations of realistic shallow cloud fields to explore the sensitivity of shallow precipitation to local changes in the environment. The simulation data we use for training the emulator are taken from a recent study (Dagan and Stier, 2020a) that performed ensemble daily simulations for a 1-month period during December 2013 over the ocean to the east of Barbados, sampling the variability associated with shallow convection. Each day of the month consisted of two runs, both forced by realistic boundary conditions taken from reanalysis but with different cloud droplet number concentrations (CDNCs) to represent clean and polluted conditions. The altered CDNC was found to have little impact on the precipitation rate in the simulations, and thus we simply treat the CDNC change as a perturbation to the initial conditions and combine the two CDNC runs from each day together to increase the amount of data available for training the emulator. At hourly resolution, this provides 1488 data points.

However, given that precipitation is strongly tied to the local cloud
regime, not fully controlling for cloud regime can introduce spurious
correlations when training the emulator. As such we also filter out all
hours that are not associated with shallow convective clouds. To do this,
we consider domain-mean vertical profiles of total cloud water content (liquid and ice) and retain only those hours in which the cloud water remains confined to the lower troposphere, indicating purely shallow convection.

As our predictors we choose five representative cloud-controlling factors
from the literature (Scott et al., 2020), namely, in-cloud liquid water path (LWP), geopotential height at 700 hPa (Z700) and three further domain-mean environmental factors.

We then develop a regression model to predict shallow precipitation as a function of these five domain-mean features using the scikit-learn random forest implementation within ESEm. After validating the model using leave-one-out cross-validation, we retrain the model using the full dataset and use it to predict the precipitation across a wide range of environmental values.
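A sketch of this leave-one-out validation using scikit-learn directly is shown below; the arrays `X` (the five domain-mean features for each retained hour) and `y` (the corresponding mean precipitation) are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rf = RandomForestRegressor(n_estimators=200, random_state=0)

# Held-out prediction for every hour in turn (cf. Fig. 4a)
y_pred = cross_val_predict(rf, X, y, cv=LeaveOneOut())
print("Fraction of variance explained:", 1.0 - np.var(y - y_pred) / np.var(y))

# Retrain on the full dataset before predicting across the wider
# range of environmental values
rf.fit(X, y)
```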

Finally, for the purpose of plotting, we reduce the dimensionality of our
final prediction by averaging over all features excluding LWP and Z700, so that the emulated precipitation can be shown as a function of these two factors.

The mean precipitation emulated by ESEm using a random forest
model trained on the five environmental factors diagnosed from an ensemble
of cloud-resolving models as described in the text. Panel (a) compares the emulated precipitation with that simulated by the cloud-resolving model, while panel (b) shows the emulated mean precipitation as a function of two of the controlling factors.

Figure 4a illustrates how the random forest regression model can capture
most of the variance in shallow precipitation from the cloud-resolving
simulations.

While emulators have previously been used to investigate the behaviour of shallow cloud fields in high-resolution models (e.g. using GPs, Glassmeier et al., 2019), this example demonstrates that random forests are another promising approach, particularly due to their extrapolation properties.

The sixth Coupled Model Intercomparison Project (CMIP6; Eyring et al., 2016) coordinates a large number of formal model intercomparison projects (MIPs), including ScenarioMIP (O'Neill et al., 2016), which explored the climate response to a range of future emissions scenarios. While internal variability and model uncertainty can dominate the uncertainties in the temperature response to these emissions scenarios over the next 30–40 years, uncertainty in the scenarios themselves dominates the total uncertainty by the end of the century (Hawkins and Sutton, 2009; Watson-Parris, 2021). Efficiently exploring this uncertainty can be useful for policy makers to understand the full range of temperature responses to different mitigation policies. While simple climate models are typically used for this purpose (e.g., Smith et al., 2018; Geoffroy et al., 2013), statistical emulators can also be of use.

Here we provide a simple example of emulating the global mean surface
temperature response to a change in CO2 and aerosol optical depth (AOD) across the CMIP6 multi-model ensemble.

Global mean surface temperature response to a change in aerosol
optical depth (AOD) or cumulative atmospheric CO2.

Using an MCMC sampler, we are able to generate a joint probability distribution for the required change in AOD and CO2 to produce a given change in global mean surface temperature.

While more physically interpretable emulators are appropriate for such important estimates, the advantage these statistical emulators have over simple impulse response models, for example, is the ability to generalize to high-dimensional outputs, such as those shown in Fig. 2 (in addition, see, e.g. Mansfield et al., 2020). They can also account for the full complexity of Earth system models and the many processes they represent. This is straightforward to achieve with ESEm.

The joint probability distribution for a change in aerosol optical
depth (AOD) or cumulative atmospheric CO2.

We present ESEm – a Python library for easily emulating and calibrating
Earth system models. Combined with the popular geospatial libraries Iris and
CIS, ESEm makes reading, collocating and emulating a variety of model and
Earth system data straightforward. The package includes Gaussian process,
neural network and random forest emulation engines, and a minimal, clearly defined interface allows for simple extension; tools for validating these emulators are also provided. ESEm also includes three popular techniques for calibration
(or inference), optimized using TensorFlow to enable efficient sampling of
the emulators. By building on fast and robust libraries in a modular way we
hope to provide a framework for a variety of common workflows.
We have demonstrated the use of ESEm for model parameter constraint and
optimal estimation with a simple perturbed parameter ensemble example. We
have also shown how ESEm can be used to fit high-dimensional response
surfaces over an ensemble of cloud-resolving model simulations in order to
determine the sensitivity of precipitation to environmental parameters in
these simulations. Such approaches can also be useful in marginalizing over
potentially confounding variables in observational data. Finally, we
presented the use of ESEm for the emulation of the multi-model CMIP6
ensemble in order to explore the global mean temperature response to changes
in aerosol loading and CO2.

There are many opportunities to build on this framework, to introduce the latest inference techniques (Brehmer et al., 2020) and to bring this approach to parameter estimation closer to the large body of work in data assimilation. While data assimilation has historically focussed on improving estimates of time-varying boundary conditions (the model “state”), recent work has explored using these approaches to concurrently estimate constant model parameters (Brajard et al., 2020). We hope this tool will provide a useful framework with which to explore such ideas.

We strive to ensure reliability in the library through the use of automated unit tests and coverage metrics. We also provide comprehensive documentation and a number of example notebooks to ensure usability and accessibility. Through the use of a number of worked examples we hope also to have shed some light on this at times seemingly mysterious sub-field.

The ESEm code, including that used to generate the plots in this paper, is available here:

The BC PPE data are available here:

DWP designed the package and led its development. AD contributed the precipitation example and random forest module. LD provided the AAOD example and dataset. PS provided supervision and funding acquisition. DWP prepared the manuscript with contributions from all co-authors.

The contact author has declared that neither they nor their co-authors have any competing interests.

While every effort has been made to make this tool easy to use and generally applicable, the example models provided make many implicit (and explicit) assumptions about the functional form and statistical properties of the data being modelled. Like any tool, the ESEm framework can be misused. Users should familiarize themselves with the models being used and consult the many excellent textbooks on this subject if in any doubt as to their appropriateness for the task at hand. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work has evolved through numerous projects in collaboration with Ken Carslaw, Lindsay Lee, Leighton Regayre and Jill Johnson, and we thank them for sharing their insights and their R scripts, which inspired this package. Those previous collaborations were funded by Natural Environment Research Council (NERC) grants NE/G006148/1 (AEROS), NE/J024252/1 (GASSP) and NE/P013406/1 (A-CURE), which we gratefully acknowledge.

For this work specifically, Duncan Watson-Parris and Philip Stier acknowledge funding from NERC projects NE/P013406/1 (A-CURE) and NE/S005390/1 (ACRUISE), as well as from the European Union's Horizon 2020 research and innovation programme iMIRACLI under Marie Skłodowska-Curie grant agreement no. 860100. Philip Stier additionally acknowledges support from the ERC project RECAP and the FORCeS project under the European Union's Horizon 2020 research programme with grant agreement nos. 724602 and 821205. Andrew Williams acknowledges funding from the Natural Environment Research Council, Oxford DTP, award NE/S007474/1. Lucia Deaconu acknowledges funding from NERC project NE/P013406/1 (A-CURE).

The authors also gratefully acknowledge useful discussions with Dino Sedjonovic, Shahine Bouabid and Daniel Partridge, as well as the support of Amazon Web Services through an AWS Machine Learning Research Award and NVIDIA through a GPU research grant. We further thank Victoria Volodina and one anonymous reviewer for their thorough and considered feedback that helped improve this paper.

This research has been supported by the Natural Environment Research Council (grant nos. NE/P013406/1, NE/S005390/1, and NE/S007474/1), the European Research Council, H2020 European Research Council (RECAP, grant no. 724602; and FORCeS, grant no. 821205), and the H2020 Marie Skłodowska-Curie Action (grant no. 860100).

This paper was edited by Steven Phipps and reviewed by Victoria Volodina and one anonymous referee.