Volume 18, issue 12 | https://doi.org/10.5194/gmd-18-3681-2025
Development and technical paper | 20 Jun 2025

Tuning the ICON-A 2.6.4 climate model with machine-learning-based emulators and history matching

Pauline Bonnet, Lorenzo Pastori, Mierk Schwabe, Marco Giorgetta, Fernando Iglesias-Suarez, and Veronika Eyring
Abstract

In climate model development, “tuning” refers to the important process of adjusting uncertain free parameters of subgrid-scale parameterizations to best match a set of Earth observations, such as the global radiation balance or global cloud cover. This is traditionally a computationally expensive step as it requires a large number of climate model simulations. This step also becomes more challenging with increasing spatial resolution and complexity of climate models. In addition, the manual tuning relies strongly on expert knowledge and is thus not independently reproducible. To reduce subjectivity and computational demands, tuning methods based on machine learning (ML) have become an active research subject. Here, we build on these developments and apply ML-based tuning to the atmospheric component of the Icosahedral Nonhydrostatic Weather and Climate Model (ICON) at 80 km resolution. Our approach follows a workflow similar to other proposed ML-based tuning methods: (1) creating a perturbed parameter ensemble (PPE) of limited size with randomly selected parameters, (2) fitting an ML-based emulator to the PPE and using it to generate a large emulated ensemble, and (3) shrinking the parameter space to regions compatible with observations using a method inspired by history matching. However, in contrast to previous works, we apply a sequential approach: the selected set of tuning parameters is updated in successive phases depending on the results of a sensitivity analysis with Sobol indices. We tune for global radiative properties, cloud properties, zonal wind velocities, and wind stresses on the ocean surface. With one iteration of this method, we achieve a model configuration yielding a global top-of-atmosphere net radiation budget in the range of [0, 1] W m−2, and global radiation metrics and water vapour path consistent with the reference observations. Furthermore, the resulting ML-based emulator allows us to identify the parameters that most impact the outputs that we target with tuning. The parameters that we identified as most influential for the physics output metrics are the critical relative humidity in the upper troposphere and the conversion coefficient from cloud water to rain, which influence the radiation metrics and global cloud cover, together with the coefficient of the sedimentation velocity of cloud ice, which has a strong non-linear influence on all the physics metrics. The existence of non-linear effects further motivates the use of ML-based approaches for parameter tuning in climate models.

1 Introduction

Climate and Earth system models are developed and continuously improved to understand the behaviour of the Earth system and to project climate change (Tebaldi et al.2021). Due to their complexity, as well as constraints on computational resources, the resolution of climate models is relatively coarse so that a number of key processes occur on scales smaller than the model grid scale. These non-resolved processes, such as convection, radiation, turbulence, cloud microphysics, and gravity waves, are described statistically for each grid cell through so-called parameterizations, which are a cause of biases and uncertainties in climate projections (Gentine et al.2021) due to uncertainties in their formulation and in the selection of the underlying free parameters. To constrain the values of the free parameters involved in the parameterizations, tuning is an important step in the development of climate models (Hourdin et al.2017), where these parameters are adjusted such that the outputs of the climate model reproduce the observed states of the Earth system reasonably well.

Model tuning is typically a very time-consuming and computationally expensive step. It has to be conducted for all components of a climate model (such as the atmosphere, ocean, and land) and for the coupled model (see, for instance, the tuning of the coupled Icosahedral Nonhydrostatic Weather and Climate Model (ICON) Earth system model by Jungclaus et al.2022).

Traditionally, tuning in climate models is done manually; i.e. the parameters are changed individually (or a few at a time) in a sequential manner, with expert knowledge guiding the successive choices in the tuning of the parameters (Hourdin et al.2017; Mauritsen et al.2012; Schmidt et al.2017; Giorgetta et al.2018; Mignot et al.2021). Such manual approaches may retain some form of subjectivity and are therefore hard to replicate. There is also the risk of neglecting interactions among the processes affected by the changed parameters, which may lead to compensating errors; e.g. a model's low climate sensitivity might be paired with weak aerosol cooling, resulting in an apparent match with historical data but potentially inaccurate future projections (see, for example, Fig. 3 of Hourdin et al.2017).

In this work we investigate how machine learning (ML) techniques can help in addressing the aforementioned challenges faced in model tuning using the atmospheric component of the ICON model (Giorgetta et al.2018) as an example. In recent years, ML-based “automatic” tuning methods have been widely investigated. These methods intend to tune the climate models in fewer manual steps for the user compared to fully manual approaches and aim to improve the accuracy and reproducibility of parameter tuning by giving it a mathematical formulation amenable to numerical treatment. The goal is to find the regions of parameter space for which the model outputs are consistent with observation-based reference datasets (see Sect. 2.3), where consistency is defined based on a suitably defined distance between outputs and observations and accounts for a tolerance given by observational uncertainties and model structural errors. A number of mathematical tools have been developed to tackle inverse problems such as model tuning. The one we focus on in this work belongs to the family of Bayesian approaches (this is not the only possible choice; refer to Zhang et al. (2015) for more details on other possibilities). In a Bayesian setting, this is achieved by an iterative and efficient exploration of the space of the parameters being tuned, which is enabled by the construction of an ML-based surrogate or emulator of the climate model that aims to approximate the climate model outputs at much lower computational costs. In its most general formulation, this procedure consists of iterating the following steps: (1) generate a perturbed parameter ensemble (PPE), i.e. an ensemble of climate model simulations obtained by sampling configurations of tuning parameters within the valid parameter ranges; (2) train a computationally cheap ML-based emulator on the PPE output to approximate the parameter-to-output relationship; and (3) use the emulator for a denser sampling of the parameter space, and shrink the space of the allowed parameter configurations to the most promising one, i.e. the parameters most likely to yield a tuned version of the climate model. A commonly adopted method for selecting promising parameter configurations is history matching (Williamson et al.2013, 2017). History matching aims to minimize the number of required model simulations in the search for acceptable parameters by balancing the sampling of unexplored parameter regions with the sampling close to configurations found to be potentially compatible with observations. This is achieved using a metric that weights both the distance of the emulator predictions from the observational references (small meaning close to observationally compatible configurations) and the uncertainty of the emulator (high in unobserved parameter regions). The three steps described above are repeated until the model outputs used as tuning metrics converge to the corresponding observational range, thus yielding one or multiple tuned parameter configurations or a distribution thereof (Watson-Parris et al.2021).

Several implementations of the ideas above have been proposed for tuning models of different complexity. History matching has been implemented to constrain parameters in the coupled climate model HadCM3 (Williamson et al.2013) and to estimate parametric uncertainty in the NEMO ocean model (Williamson et al.2017). It has also been used to tune parameters of the turbulence scheme of a single-column-model version of ARPEGE-Climat 6.3 using large-eddy simulations as a reference (Couvreux et al.2021). History matching in combination with single-column models was also employed to constrain convective parameters for their subsequent use in the LMDZ atmospheric model of the IPSL Earth system model (Hourdin et al.2021). Furthermore, Hourdin et al. (2023) showed another successful application to the IPSL model, finding an ensemble of tuned parameter configurations as good as the manually tuned version, IPSL-CM6A-LR, used for CMIP6. Besides their use in history matching, ML-based emulators also find applications in parameter tuning in combination with ensemble methods (Cleary et al.2021) (with test applications on Lorenz '63 and '96 models (Cleary et al.2021), convection schemes in idealized global circulation models (Dunbar et al.2021), and gravity wave parameterizations (Mansfield and Sheshadri2022)) and with approximate Bayesian computation (Watson-Parris et al.2021).

Building on these previous tuning efforts, here, we design a tuning approach assisted by history matching for the atmospheric component of the Icosahedral Nonhydrostatic Weather and Climate Model (ICON-A version 2.6.4) (ICON2015; Zängl et al.2014). The model's icosahedral grid has a resolution of approximately 80 km (R2B5 grid), offering an improvement in spatial detail compared to previous applications of these tuning approaches in global climate models. For instance, Williamson et al. (2013) used a resolution of 96×73 grid points in latitude and longitude (approximately 417 km × 278 km at the Equator), while Hourdin et al. (2021, 2023) utilized 144×143 grid points (approximately 160 km at the Equator). From an algorithmic perspective, a further distinctive feature of our ICON-A tuning method is that we incorporate history matching in a sequential approach, where we separate tuning into phases in which different sets of tuning parameters are sequentially constrained with history matching. This approach reduces the number of parameters being tuned in each phase and allows us to reduce the required size of the PPEs and, therefore, the computational costs, which is particularly relevant given the total number of tuning parameters and the relatively high resolution (approx. 80 km) we target here. In our sequential approach, we first focus on global radiative and cloud properties, referred to as physics outputs (Giorgetta et al.2018), and then on outputs related to atmospheric-circulation properties, referred to as dynamics outputs (Giorgetta et al.2018). For the physics tuning, we apply history matching in the sequential manner explained before and show that the ICON-A physics outputs converge towards observational references in a few iterations. The ML-based tuning of the physics outputs serves as the basis for the second step targeting the dynamics outputs. For this step, we follow the approach of Giorgetta et al. (2018) by generating a PPE and selecting the best-performing model configurations, where our criteria for evaluating the model's performance keep the highest priority on achieving a nearly balanced global annual net radiation flux at the top of the atmosphere (TOA) while aiming to achieve a high performance in terms of the dynamics outputs. Our results are compared to the manually tuned version of the ICON-A model that was presented in Giorgetta et al. (2018) and Crueger et al. (2018), with a grid size of approximately 160 km (R2B4 grid), which is 2 times coarser than the resolution we focus on in this paper (grid size of approximately 80 km, R2B5 grid). In the remainder of the paper, we refer to this manually tuned ICON version as ICON-aes-1.3.

The article is organized as follows. We first introduce the ICON-A model, the ML-based tuning method, and the reference datasets used in this study in Sect. 2. We then present the results of the ML-based tuning approach for ICON-A in Sect. 3 and an evaluation of our selected runs in Sect. 4, and we conclude in Sect. 5, where we also discuss potential issues of our proposed approach and an outlook on how to possibly overcome them.

2 Methods

2.1 ICON-A modelling framework

The Icosahedral Nonhydrostatic Weather and Climate Model (ICON) is a modelling framework for climate and numerical weather prediction developed jointly by the German Weather Service (DWD) and the Max Planck Institute for Meteorology (MPI-M) (ICON2015; Zängl et al.2014). We use ICON's atmospheric component (ICON-A) (Zängl et al.2014; Giorgetta et al.2018), version 2.6.4, and conduct AMIP experiments with the icosahedral grid R2B5 (≈80 km in the horizontal; for details, see Table 1 in Giorgetta et al.2018) with an implicitly coupled land model. The top height of the atmospheric model is 83 km with 47 full vertical levels and numerical damping starting at 50 km. Subgrid-scale processes are described by parameterizations and include radiative effects, moist convection, vertical diffusion, cloud microphysics, cloud cover, and orographic and non-orographic gravity waves (Giorgetta et al.2018). The time steps used in the model simulations are 1 h for the radiation scheme and 6 min for the atmospheric scheme. For the PPEs used to tune the physics outputs, we run ICON-A for 1 year of spin-up (1979) followed by 1 simulated year (1980); for the dynamics outputs, we run 1 year of spin-up (1979) followed by 10 years (1980–1989), as described in the following sections.

2.2 Parameters and outputs

The first step of ML-based tuning, as for manual tuning, is to select the tuning parameters and the output metrics that are to be fitted. Our choice of the metrics is informed by the manual tuning of the ICON model by Giorgetta et al. (2018) and Crueger et al. (2018). There, the authors worked on model versions preceding ICON-aes-1.3 (the configuration resulting from their work), with the coarser resolution R2B4 of ≈160 km; 47 vertical layers, resolving the atmosphere up to a height of 83 km; and time steps of 2 h for the radiation scheme and 10 min for the atmospheric scheme.

Table 1 reports the output metrics and the corresponding reference datasets and values that we focus on in this study, representing global radiative and cloud properties and referred to as the physics outputs. These physics output metrics are all global and multi-year averages. In particular, as shown in Table 1, we use the annual average over 1980 in our PPEs (apart from our last PPE, as discussed later) and compare it with the multi-year averages of the reference datasets.
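To make the definition of these global metrics concrete, the following minimal sketch (in Python with xarray) shows how a global, area-weighted annual mean of the TOA net radiation budget could be computed from gridded model output. The file name, the CMOR-style variable names (rsdt, rsut, rlut), and the presence of a cell-area field are assumptions for illustration, not the actual ICON-A post-processing chain used in this work.

```python
# Minimal sketch (not the actual ICON-A post-processing): global, area-weighted
# annual mean of the TOA net radiation budget for one year. Assumed inputs: a
# file with CMOR-style variables rsdt, rsut, rlut and a cell-area field.
import xarray as xr

ds = xr.open_dataset("icon_a_1980_toa_fluxes.nc")    # hypothetical file name
weights = ds["cell_area"] / ds["cell_area"].sum()    # normalized area weights

net_toa = ds["rsdt"] - ds["rsut"] - ds["rlut"]       # downward net flux at TOA
annual_mean = net_toa.sel(time="1980").mean("time")  # annual mean per grid cell
global_mean = float((annual_mean * weights).sum())   # global mean in W m-2
print(f"Global TOA net radiation, 1980: {global_mean:.2f} W m-2")
```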


Table 1. Physics outputs together with the respective observational datasets (CERES EBAF, NASA/LARC/SD/ASDC2019; ERA5, Dee et al.2011; CLARA-AVHRR, Karlsson et al.2020; and ESA CCI Cloud, Stengel et al.2017) and target ranges used in this work. All the outputs in this table are globally averaged (for both the reference datasets and the ICON-A simulations we conduct). The averaging period used for both reference datasets and our simulations (PPEs) is reported in the third column. TOA stands for top of the atmosphere.

The output metrics related to atmospheric-circulation properties, the dynamics outputs, are given in Table 2. There, the zonal mean velocity at 60° N and S at 10 hPa serves as a proxy for the representation of high-latitude jets. This is a widely used target for evaluating simulations of the polar jets in models resolving the stratosphere (e.g. as seasonal means in Tripathi et al.2014; Domeisen et al.2020a, b; Rao et al.2020; Baldwin et al.2021). The surface downward eastward wind stress means over the North Atlantic Ocean and the Southern Ocean (defined in the AR6 database; Iturbide et al.2020) are proxies for the forcing on the ocean surface. These dynamics output metrics are multi-year averages. In particular, as shown in Table 2, we use the average over the period 1980–1989 in our PPEs and compare it to the multi-year averages of the reference datasets reported in Table 2. We use different averaging periods for physics and dynamics outputs because of the different year-to-year variability and equilibration times of the associated variables. As substantiated in Sect. 3.3.1, the physics outputs have lower year-to-year variability compared to the dynamics ones, meaning that 1 simulated year is sufficient to obtain a representative value for the annual averages. Conversely, for dynamics metrics, the annual averages need to be estimated from multi-year simulations due to their larger variability and sensitivity to geographic patterns.
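For illustration, a corresponding sketch for one of the dynamics metrics (the zonal mean zonal wind at 10 hPa and 60° N/S) is given below. The file and variable names and the regular latitude–longitude output grid are assumptions, not the actual post-processing used here.

```python
# Minimal sketch for one dynamics metric: zonal mean zonal wind at 10 hPa and
# 60 deg N/S. Assumed inputs: zonal wind "ua" on a regular latitude-longitude
# grid with a pressure coordinate "plev" in Pa (file name is hypothetical).
import xarray as xr

ds = xr.open_dataset("icon_a_ua_1980-1989.nc")
ua_10hpa = ds["ua"].sel(plev=1000.0, method="nearest")   # 10 hPa = 1000 Pa

u60n = ua_10hpa.sel(lat=60.0, method="nearest").mean(("time", "lon"))
u60s = ua_10hpa.sel(lat=-60.0, method="nearest").mean(("time", "lon"))
print(f"u(60N) = {float(u60n):.1f} m/s, u(60S) = {float(u60s):.1f} m/s")
```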

Table 2. Dynamics outputs together with respective observational datasets (ERA5, Hersbach et al.2020) used in this work. The North Atlantic Ocean (NAO) region and the Southern Ocean (SOO) region are those defined in the AR6 database (Iturbide et al.2020).

Following Giorgetta et al. (2018), the parameterizations we select for tuning the physics outputs are moist convection, vertical diffusion, cloud microphysics, and cloud cover. In Table 3, we report the parameters from these parameterizations (which we refer to as physics parameters) which we select for our tuning experiment. The parameterizations we select for tuning the dynamics outputs are the orographic and non-orographic gravity wave schemes. In Table 4, we report the parameters from these parameterizations (referred to as dynamics parameters) which we select for our tuning experiment.

Table 3. Tuning parameters related to physics parameterizations alongside the corresponding name in the ICON source code (second column from left), the range of values tested (third column from left), and the corresponding parameterization scheme they belong to (right column). The range of the parameters was inferred from the default value of the parameters given in the source code of ICON-A version 2.6.4.

Table 4. Tuning parameters related to dynamics parameterizations alongside the corresponding name in the ICON source code (second column from left), the range of values tested (third column from left), and the corresponding parameterization scheme they belong to (right column). SSO stands for subgrid-scale orography.

2.3 Reference datasets

To tune ICON-A, we use reference values for the output metrics from Earth observations and reanalysis data. As in Giorgetta et al. (2018), the main goal here is to obtain a slightly positive global annual mean downward net radiation flux at the top of the atmosphere (TOA), between 0 and 1 W m−2, based on a net shortwave flux and an outgoing longwave radiation close to observational estimates. For the two radiation fields (rsdt − rsut) and rlut (see Table 1 for definitions), the typical interval [240, 241] W m−2 is used as a reference value, as estimated in Giorgetta et al. (2018) from observational datasets (CERES EBAF Ed4.0, 2000–2016; Kato et al.2013; Loeb et al.2009). For cloud cover, we use CLARA-AVHRR (Karlsson et al.2020) and ESA CCI CLOUD (Stengel et al.2017), and for the water vapour path, we use ERA5 (Hersbach et al.2020). For the dynamics outputs, we use ERA5, ERA-Interim (Dee et al.2011), and MERRA2 (Gelaro et al.2017). We refer the reader to Appendix A for the time series of some of the observational products used in this work.

2.4 ML-based tuning approach

Our ML-based tuning method is built on the history matching technique (Williamson et al.2013, 2017) and follows a similar workflow to that in Couvreux et al. (2021); Hourdin et al. (2021, 2023). The goal is to find a region in the parameter space where the model outputs are compatible (within the observational uncertainty) with the observational data (observationally compatible). In performing this exploration, history matching aims to find a balance between exhaustively exploring or sampling the parameter space and minimizing the number of samples required for it. Since, in our case, each sample corresponds to a computationally expensive climate model simulation, we consider this method to be particularly well suited to our tuning task. In tuning ICON-A, we embed history matching in a sequential protocol, where, at each step, we add or remove tuning parameters based on the outcomes of the history matching iterations. We now start by outlining the steps of the history-matching-inspired method that constitutes the basis of our protocol (see also steps 1 to 4 in Fig. 1).

  1. For a given set of tuning parameters 𝒫 with K elements, draw an initial Latin hypercube (LHC) sampling of size N. Using LHC sampling, all parameters are simultaneously changed, and the different samples fill the K-dimensional parameter space (within the allowed ranges specified in Tables 3 and 4) approximately uniformly. Typically, N is chosen as N ≈ 10K, i.e. 10 times the number of tuning parameters (Loeppky et al.2009). Using these selected parameters, generate a PPE of ICON-A runs. The PPE consists of N members or runs, one for each sampled parameter configuration xi (with i=1, …, N). For each run, we calculate all the output metrics described before. This results in sets of input–output training pairs 𝒯Y={xi, Ymodel(xi)}i=1,…,N, one set per output metric Y (e.g. annual average of global TOA radiation balance). A simplified code sketch of steps 1 to 4 is given at the end of this subsection.

  2. Fit an emulator to the generated PPE, i.e. to the training sets 𝒯Y for all the output metrics Y of interest. For a given metric Y, the emulator evaluated at a configuration of tuning parameters x returns Yemul(x), the approximation of the true model output metric Ymodel(x). Our choice for the model emulator is Gaussian process (GP) regression (Rasmussen and Williams2005). GPs are models typically used in Bayesian regression tasks and are very well suited to our case since (i) they have only a few parameters and hence require relatively little training data for fitting, and (ii) they, by construction, return the uncertainty associated with their prediction, which is measured by the variance Var(Yemul(x)). This is a central quantity used in the steps below. Further details on the choice of the GP are given in Appendix B. In our implementation, we train one GP per model output.

  3. Generate a large emulated metric ensemble of size M (typically ranging from 10^5 to 10^6; here, M = 3 × 10^5) using the trained GP emulator. For each emulated sample, calculate the implausibility measure ρ for each metric Y, with reference value Y0 (from observations or re-analysis data), as follows:

    (1) $\rho\left(Y_{\mathrm{emul}}(x),\,Y_0\right) = \dfrac{\left|Y_0 - Y_{\mathrm{emul}}(x)\right|}{\sqrt{\mathrm{Var}\left(Y_{\mathrm{emul}}(x)\right)}}$.

    The idea behind this definition is that a small distance |Y0-Yemul(x)| or a large emulator variance Var(Yemul(x)) (typically true when x is far from already sampled points) will lead to a small value of ρ, hence balancing exploitation with exploration of the parameter space. Note that, typically, a measure of the observational uncertainty Var(Y0) is included in the denominator of the implausibility measure and defines a tolerance for assessing the convergence of history matching. This is an important distinction between traditional history matching and our implementation, which we motivate in the next point. In our case, the observational uncertainty is accounted for in the evaluation of the tuned model configurations, where we assess whether the outputs of the parameter configurations sampled with our procedure (see next points) are within the spread of the observational datasets used as the reference. This is explained in Sect. 4.

  4. Select N parameter configurations that satisfy the following constraints on the outputs (see Tables 1 and 2 for output definitions):

    • ρ(Yemul(x),Y0)<ρ1 (for the three physics metrics of TOA shortwave radiation, TOA longwave radiation, and TOA net incoming radiation)

    • ρ(Yemul(x),Y0)<ρ2 (for the two other physics metrics of cloud cover and liquid water path and the five dynamics metrics).

    The choice of a smaller threshold for the three radiation metrics is necessary in order to give a higher weight to the constraint on the balanced TOA radiation than on the other metrics. We use ρ2=2ρ1. The value of ρ1 is automatically adjusted in order to select only N parameter sets out of the ensemble of size M. Given that we are interested in drawing parameter configurations that are representative of the space of plausible tuned parameters in only a few iterations, our choice of the implausibility measure, as in Eq. (1), provides stricter constraints on the selected parameters, with the observational means Y0 being the target values for the corresponding metrics.

  5. Going back to step 1, generate a new PPE of size N with ICON-A for the parameter ensemble defined in the previous step, and repeat step 2 to step 4.

The iterations stop when one of the model configurations generated in the PPEs is compatible with observations or when a new set 𝒫 of tuning parameters is used. Compatibility with observations is defined based on a weighted distance of the model output metrics from their reference value, with a tolerance given by the corresponding observational uncertainty. The highest weight is given to the global TOA net radiation balance, our main tuning goal. In general, in the earlier iterations of history matching, not all the members of the next round are expected to be compatible with the observational references. The configurations that are found to be compatible with observations are considered to be representative of the space of plausible tuned parameters and are subsequently evaluated based on additional evaluation metrics to assess their quality as tuned configurations (see Sect. 4). The parameter set 𝒫 is changed when the spread of the PPE generated in the last history matching iteration is too far from the observational range. The new parameter set consists of new tuning parameters together with the most influential parameters from the previous 𝒫 for better steering the model outputs towards the observational references. The influence of the parameters on the model outputs is estimated by performing an emulator-based sensitivity analysis with Sobol indices, the details of which are provided in Sect. 3.2.2. This results in a sequential tuning approach, integrating history matching as its core component for constraining the parameters in the sets 𝒫 selected in the different phases. This is schematically shown in Fig. 1.
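To make steps 1 to 4 concrete, the following simplified sketch runs one such iteration for a toy setup. The parameter names, ranges, target values, and GP kernel are placeholders, the climate model is replaced by a stand-in function, and the selection logic is reduced to a single metric; it is meant to illustrate the structure of the procedure, not to reproduce the exact ICON-A implementation.

```python
# Simplified sketch of one history-matching iteration (steps 1 to 4 above).
# Parameter names/ranges, targets, and the kernel are placeholders, and the
# climate model is replaced by a stand-in function `run_icon`.
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

rng = np.random.default_rng(0)
ranges = {"crt": (0.6, 0.9), "pr0": (0.5, 2.0)}   # hypothetical ranges
names = list(ranges)
K = len(names)
N = 10 * K                                        # PPE size (N ~ 10K)
M = 300_000                                       # emulated ensemble size

# Step 1: Latin hypercube sample of the parameter space -> PPE inputs
lo = np.array([ranges[p][0] for p in names])
hi = np.array([ranges[p][1] for p in names])
X = qmc.scale(qmc.LatinHypercube(d=K, seed=0).random(N), lo, hi)

def run_icon(x):
    """Stand-in for an ICON-A simulation returning one output metric."""
    return {"rnet_toa": 0.5 + rng.normal(scale=2.0)}

ppe = [run_icon(x) for x in X]                    # in practice: N model runs
targets = {"rnet_toa": 0.5}                       # reference value Y0 (W m-2)

# Step 2: fit one GP emulator per output metric
gps = {}
for metric in targets:
    y = np.array([member[metric] for member in ppe])
    kernel = ConstantKernel() * Matern(length_scale=np.ones(K), nu=2.5)
    gps[metric] = GaussianProcessRegressor(kernel=kernel,
                                           normalize_y=True).fit(X, y)

# Step 3: dense emulated ensemble and implausibility (Eq. 1)
X_cand = qmc.scale(qmc.LatinHypercube(d=K, seed=1).random(M), lo, hi)
rho = {}
for metric, y0 in targets.items():
    mean, std = gps[metric].predict(X_cand, return_std=True)
    rho[metric] = np.abs(y0 - mean) / np.maximum(std, 1e-12)

# Step 4: keep the N least implausible candidates. With several metrics, the
# threshold rho_1 would instead be adjusted (with rho_2 = 2 rho_1) until
# exactly N candidates satisfy all constraints.
best = np.argsort(rho["rnet_toa"])[:N]
X_next = X_cand[best]                             # parameter sets for next PPE
```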

Figure 1. Schematic of the method used for the ML-based tuning of the physics parameters of ICON-A: history matching technique combined with a sensitivity analysis and a sequential parameter selection. The first set of tuning parameters is chosen (A), and history matching is employed to shrink the associated parameter space to an observationally compatible region (B). If the PPEs are far from observational references, a new parameter set is chosen with the help of sensitivity analysis (C). The new parameter set (D) is used for a new phase of the tuning experiment. When one or more of the model configurations generated in the last PPE are compatible with observations, the iterations of this tuning approach stop. The model configurations compatible with observations are then evaluated.

This sequential approach incorporating the previously explained history-matching-inspired method is used for the tuning of the physics outputs. The resulting model configuration then serves as the basis for the next step, which is the simultaneous tuning of physics and dynamics parameters and metrics. Also in this case, we use a sensitivity analysis to select which physics parameters to keep in this next tuning step. In this step for the tuning of physics and dynamics parameters and metrics, we follow the manual tuning approach of Giorgetta et al. (2018). We generate a PPE and select the best-performing model configurations, where our criterion for evaluating the model's performance keeps the highest priority on achieving a nearly balanced global annual net radiation flux at the top of the atmosphere (TOA). Separating the tuning of physics-only metrics from that also involving dynamics outputs allows us to use different durations of the ICON-A simulations for the two steps and to further reduce the computational costs. Specifically, as substantiated in Sect. 3.3.1, the physics outputs have lower year-to-year variability and shorter equilibration timescales compared to the dynamics outputs. This means that, for physics outputs, shorter simulations are needed for obtaining a representative value for the annually averaged variables used as metrics.

Finally, before moving on to the Results section, we offer a technical note on the construction and evaluation of the GP emulators: we implemented the GP emulator in Python using scikit-learn (https://doi.org/10.5281/zenodo.7711792, Grisel et al.2024) and used the built-in routines to optimize the GP parameters at each iteration of the above procedure (see details in Appendix B). In this work, we measure the performance of the GP regression model via the R2 value, which, for a given output Y, is defined as follows:

(2) $R^2(Y) = 1 - \dfrac{\overline{\left(Y_{\mathrm{emul}} - Y_{\mathrm{model}}\right)^2}}{\mathrm{Var}\left(Y_{\mathrm{model}}\right)}$,

where $\overline{\left(Y_{\mathrm{emul}} - Y_{\mathrm{model}}\right)^2}$ denotes the mean squared error of the emulator over a set of testing parameters, and Var(Ymodel) denotes the variance of the true model output over the same test set.
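As an illustration, a minimal sketch of such a 5-fold cross-validated R2 evaluation with scikit-learn is given below. The kernel choice is illustrative, and X and Y stand for the PPE parameter matrix and a dictionary of output-metric arrays, which are assumed to be available.

```python
# Sketch of the 5-fold cross-validated R^2 check of Eq. (2), assuming X is the
# (N, K) array of PPE parameter configurations and Y is a dict mapping each
# physics metric name to its length-N array of ICON-A outputs. The kernel is
# illustrative, not necessarily the one used for the results reported here.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern
from sklearn.model_selection import cross_val_score

def mean_cv_r2(X, Y, folds=5):
    """Average cross-validated R^2 over all output metrics (cf. Table 6)."""
    scores = []
    for metric, y in Y.items():
        gp = GaussianProcessRegressor(
            kernel=ConstantKernel() * Matern(nu=2.5), normalize_y=True)
        scores.append(cross_val_score(gp, X, y, cv=folds, scoring="r2").mean())
    return float(np.mean(scores))
```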

3 Results

3.1 Summary of the generated PPEs

The PPEs generated in this work are summarized in Table 5. PPE1 to PPE4 are generated for the tuning of the physics output metrics from single-year ICON-A runs (1980) after a 1-year spin-up. PPE1 is generated from an LHC sampling of size 30 based on the (physics) parameter set:

(3) 𝒫p1 = {entrpen, entrmid, entrdd, cmftau, crt, pr0},

denoting the physics parameters used in Giorgetta et al. (2018). PPE2 is produced by applying history matching to the results of PPE1. After PPE2, a new phase of our sequential approach starts: for PPE3, we perform a new LHC sampling based on a modified parameter set,

(4) 𝒫p2 = {cmfctop, cprcon, ccsaut, csecfrl, cvtfall, crt, pr0},

in order to increase the globally averaged cloud cover, which is consistently lower than the observational references in PPE1 and PPE2. The parameters in 𝒫p2 were selected from those that, in the ICON-A manual tuning history (unpublished), were deemed to be most influential for cloud cover. Our criterion to decide which parameters to keep from 𝒫p1 to 𝒫p2 follows from the sensitivity analysis based on Sobol indices, which we present later in Sect. 3.2.2. Specifically, the parameters crt and pr0, associated with higher first and total Sobol indices for the cloud and water vapour metrics, have been kept from 𝒫p1 to 𝒫p2. For generating PPE3 and PPE4, the values of the parameters in 𝒫p1 that are not present in 𝒫p2 are fixed at their best value from PPE2 (see the right column of Table 5 and the magenta star in Figs. 2 and 3). The set 𝒫p2 is used to generate PPE3, consisting of 30 members drawn with LHC sampling. PPE4 is produced by applying history matching to the results of PPE3. The sizes of the PPEs are chosen to be smaller than the typical value of 10 times the number of parameters (six parameters in 𝒫p1 and seven parameters in 𝒫p2) (Loeppky et al.2009). This size reduces the computational cost while being large enough to train an emulator that enables convergence of the PPEs towards the reference observations, as explained in Sect. 3.2.1.
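The following sketch illustrates how such a phase could be set up in practice: the new parameter set 𝒫p2 is sampled with an LHC design while the remaining 𝒫p1 parameters are held at fixed values. All numerical ranges and "best" values below are placeholders, not the values used in our experiments.

```python
# Sketch of the setup of a new tuning phase (here PPE3): LHC-sample the new
# parameter set P_p2 while fixing the remaining P_p1 parameters at their best
# PPE2 values. All ranges and fixed values below are placeholders.
import numpy as np
from scipy.stats import qmc

p_p2_ranges = {                                   # hypothetical ranges
    "cmfctop": (0.1, 0.3), "cprcon": (1e-4, 6e-4), "ccsaut": (3.0, 9.0),
    "csecfrl": (1e-6, 5e-5), "cvtfall": (1.0, 4.0),
    "crt": (0.6, 0.9), "pr0": (0.5, 2.0),
}
fixed_from_ppe2 = {"entrpen": 3e-4, "entrmid": 2e-4,   # placeholder "best"
                   "entrdd": 4e-4, "cmftau": 7200.0}   # PPE2 values

names = list(p_p2_ranges)
lo = np.array([p_p2_ranges[p][0] for p in names])
hi = np.array([p_p2_ranges[p][1] for p in names])
samples = qmc.scale(qmc.LatinHypercube(d=len(names), seed=0).random(30),
                    lo, hi)

# One parameter dictionary (e.g. for namelist generation) per PPE3 member
ppe3_configs = [dict(zip(names, row), **fixed_from_ppe2) for row in samples]
```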

Figure 2. Physics output variables for PPE1 (blue stars) and PPE2 (black squares) compared to ICON-aes-1.3 (orange triangle) and observational datasets (green). Signs of convergence of history matching are visible already after one iteration (the distribution of the members of PPE2 is slightly shifted towards higher cloud cover values and is narrower). The magenta star marks the best-performing configuration from PPE2 (see right column of Table 5) used in the generation of the subsequent PPEs.

Figure 3. Sampled parameter values for PPE1 (blue stars) and PPE2 (black squares) compared to ICON-aes-1.3 (orange triangle). For each panel, two parameters are plotted on the two axes (see Table 3). Signs of convergence of history matching are visible already after one iteration (in the distribution of the members of PPE2 being slightly shifted and narrower). The magenta star marks the best-performing configuration from PPE2 (see also right column of Table 5 for values) used in the generation of the subsequent PPEs.

In PPE5, we then also address the tuning of dynamics outputs by varying physics and dynamics parameters simultaneously in the parameter set,

(5) 𝒫pd = {entrmid, cvtfall, crt, crs, csatsc, rmscon, gkdrag, gkwake, gpicmea, gstd},

and keeping the other parameters fixed at their best values in PPE2 (see the right column of Table 5 and the magenta star in Figs. 2 and 3). Also for 𝒫pd, we follow the same strategy and keep the parameters with the highest influence on the radiation and water metrics, as can be seen from the Sobol analysis in Sect. 3.2.2, with the addition of crs and csatsc after further advice from ICON experts. The parameters rmscon, gkdrag, and gkwake are the same dynamics parameters used in Giorgetta et al. (2018), and we added gpicmea and gstd following ICON expert advice. PPE5 consists of 10-year ICON-A simulations from 1980 to 1989 (after a 1-year spin-up).

Table 5. Summary of perturbed parameter ensembles (PPEs) generated in this work. The PPEs have been sequentially generated from 1 to 5. PPE3 is obtained from an LHC sampling of parameter set 𝒫p2, where the parameters in 𝒫p1 that are not included in 𝒫p2 are kept fixed at their best values from PPE2 (listed in the right column), which are then used further in PPE4 and PPE5.

3.2 ML-based tuning of physics outputs with history matching

In this section, we present the results of the tuning of the physics parameters. We start by considering PPE1 and PPE2. As explained before, PPE2 is generated by applying history matching after having trained a GP emulator on the outputs of PPE1. The constructed GP emulator, in this case, has a good predictive performance (measured by an average R2 score of 0.81, as discussed in more detail in Sect. 3.2.1 below) and can therefore accurately guide the parameter choices for PPE2. Thanks to this, the application of only one iteration of history matching to PPE1 is already sufficient to generate configurations in PPE2 that achieve a balanced TOA radiation. This is demonstrated in Fig. 2a, which shows the net shortwave (SW) versus the net longwave (LW) TOA radiation for PPE1 and PPE2. There, we can clearly see that, after history matching based on PPE1, PPE2 can achieve configurations that match or get close to the observational ranges denoted by the green triangle (and to ICON-aes-1.3). The convergence of the output metrics towards their reference values can also be observed in Fig. 2b for the other two physics output metrics (global cloud cover versus water vapour path) for PPE1 and PPE2. There, the distribution of the PPE2 outputs converges towards the observational references (green markers). The convergence of history matching towards the observational references can also be seen in the distribution of the sampled parameters for the two PPEs (Fig. 3). However, Fig. 2b shows that global cloud cover still remains lower than the observational data (by approximately 1 % compared to CLARA-AVHRR and 3 % compared to ESA CCI Cloud) despite PPE2 yielding a slightly higher cloud cover (closer to the observed range) than PPE1. In Fig. 2, the magenta star marks the selected best-performing model configuration in PPE2. Following Giorgetta et al. (2018), our criterion for evaluating the model performance prioritizes the global radiation metrics, particularly the net TOA radiation budget, over cloud cover and water vapour path. The selected run is the only one falling within the observational range for both radiation metrics (green triangle in Fig. 2a).

Figure 4. Physics output variables for PPE3 (red circles) and PPE4 (grey triangles) compared to ICON-aes-1.3 (orange triangle) and observational datasets. Signs of convergence of the outputs to their observational values can also be seen here (in the distribution of the members of PPE4 being slightly shifted and narrower).

ICON-aes-1.3 exhibits a higher value of global cloud cover (orange triangle in Fig. 2b) than our PPE1 and PPE2. The resolution of ICON-aes-1.3 (approximately 160 km) is coarser than that of PPE1 and PPE2 (approximately 80 km). The authors of Giorgetta et al. (2018) investigated the six tuning parameters used in 𝒫p1; here, with these six parameters, we are not able to reach a similar performance for the cloud cover metric. This supports the need to repeat the tuning process when the model resolution is changed (Crueger et al.2018). Moreover, in addition to the parameters in 𝒫p1, the authors of Giorgetta et al. (2018) explored other tuning parameters, but those results were not published because these parameters had a negligible influence on their tuning process (as explained in their Sect. 5). In the next generation of PPEs (the second phase of our sequential approach), we investigate the impact of some of these parameters. Therefore, the parameter set 𝒫p2 contains parameters that potentially have a stronger effect on cloud cover at the present resolution.

Parameter set 𝒫p2 is used to generate PPE3 with LHC sampling. A GP emulator is then trained on the outputs of PPE3. The constructed GP emulator in this case also has a good predictive performance (measured by an average R2 score of 0.75, as discussed in more detail in Sect. 3.2.1 below), and we therefore use it for performing history matching and generating PPE4. Also in this case, history matching shrinks the space of promising parameter configurations and the related output distribution. This can be seen in Fig. 4, where we show the distribution of the radiation metrics (in Fig. 4a) and that of global cloud cover versus water vapour path (in Fig. 4b) for both PPE3 and PPE4 (we refer the reader to Appendix C for plots of the related parameter distributions). While the new parameter set 𝒫p2 allows us to reach a global cloud cover consistent with observations, we also see that the spread of the PPE outputs is more than doubled compared to that of the previous PPEs (see yellow-shaded rectangles in Fig. 4, showing the extent of Fig. 2). This increased spread also potentially increases the number of history matching iterations needed to converge towards the observational references. Given the high computational costs of generating these PPEs, we use the best-performing model configuration sampled so far, which belongs to PPE2.

3.2.1 Performance of the GP emulator

We now analyse the performance of the GP emulator for the physics outputs considered. We refer the reader to Appendix B for details on Gaussian processes and the choice of the underlying hyperparameters. In Table 6, we show the average performance (R2 score) of the GP emulators trained on the PPEs used for the tuning of the physics parameters (corresponding to PPE1, PPE2, PPE3, and PPE4). The value reported in Table 6 is the average R2 over all five of the physics output metrics (defined in Table 1) and is computed using a 5-fold cross-validation (https://doi.org/10.5281/zenodo.7711792, Grisel et al.2024). From these values, we conclude that the constructed emulators are indeed able to approximate the ICON-A physics outputs, which is also reflected in the fact that history matching already shows signs of convergence after the first iteration, as shown in the previous section. The number of PPE samples required for the GP regression to achieve the reported R2 score is shown in Fig. 5.

Table 6. Performance of the GP emulator based on PPE1 to PPE4. The R2 value reported here is the average R2 of the emulators for all physics variables (see Table 1). For each emulator, the R2 is calculated via 5-fold cross-validation on the training set (PPE points).

Figure 5. Average R2 score of the physics output emulators as a function of the size N of the PPE used for training. For each N tested, 50 random samples of size N were drawn from the entire set of ICON PPEs of size 60. The R2 score is calculated for each size N sample, and the mean (solid lines) and standard deviation (shaded areas) are estimated from these scores of the 50 samples. The red curve shows the R2 for emulators trained on PPE1 and PPE2, and the blue curve shows the R2 for emulators trained on PPE3 and PPE4.

3.2.2 Sensitivity analysis for the physics parameters and outputs

In this section, we show the sensitivity analysis for the physics parameters and outputs, supporting our selection of parameters in the subsequent steps of our sequential approach as presented in Sect. 3.1. The analysis presented here is based on the calculation of Sobol indices, which, in turn, are calculated using the emulator constructed in the previous section. Generally speaking, Sobol indices quantify the impact of one specific feature (tuning parameter, in our case) on the overall variance of the model output (the output metrics, in our case). Specifically, we focus on the first-order Sobol index and on the total Sobol index. Given an emulator Yemul for metric Y, the first-order and total Sobol indices for the ith parameter xi are defined as follows (Saltelli et al.2010):

(6) $S_{1,(i,Y)} = \dfrac{1}{\mathrm{Var}_x\left(Y_{\mathrm{emul}}\right)}\,\mathrm{Var}_{x_i}\!\left(\mathrm{E}_{x_{\sim i}}\!\left[\,Y_{\mathrm{emul}} \mid x_i\,\right]\right)$,  (7) $S_{\mathrm{tot},(i,Y)} = \dfrac{1}{\mathrm{Var}_x\left(Y_{\mathrm{emul}}\right)}\,\mathrm{E}_{x_{\sim i}}\!\left[\mathrm{Var}_{x_i}\!\left(Y_{\mathrm{emul}} \mid x_{\sim i}\right)\right]$,

where Varx(Yemul) denotes the sample variance of the emulator over the distribution of all parameters x, Varxi(⋅) denotes the sample variance over the distribution of parameter xi, Ex∼i denotes the expected value over all parameters except xi, and Yemul|xi (respectively Yemul|x∼i) denotes the emulator with xi (respectively all parameters except xi) kept fixed. The first-order Sobol index S1,(i,Y) corresponds to the effect of varying xi alone, averaged over all other input (parameter) variations, while Stot,(i,Y) measures the total effect of varying xi, which includes the variance coming from interactions of xi with other parameters. In Fig. 6, we show S1,(i,Y) (on the x axis) and Stot,(i,Y) (on the y axis) for the physics parameters and outputs. We use the GP emulator trained on PPE1 for Fig. 6a–e and the one trained on PPE3 for Fig. 6f–j. The higher the values of the first-order and total Sobol indices for a parameter and a given output, the stronger the influence of that parameter on that output. Looking at Fig. 6d and e, we see that the two most influential parameters in 𝒫p1 for the cloud cover and water vapour metrics are crt and pr0, which are the ones we keep among the tuning parameters in 𝒫p2. In Fig. 6f–j, obtained from the emulator trained on PPE3, we see that cvtfall has, overall, a large effect on all physics metrics and the largest effect on cloud cover, while crt has the largest effect on the TOA net radiative budget; we therefore decide to keep these tuning parameters in 𝒫pd for PPE5.
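A minimal sketch of such an emulator-based Sobol analysis is given below. It assumes the SALib package and a trained GP emulator gp for a single output metric; the parameter names, bounds, and number of base samples are placeholders chosen only for illustration (the analysis shown in Fig. 6 used 70 000 samples).

```python
# Sketch of the emulator-based Sobol sensitivity analysis, assuming the SALib
# package and a trained GP emulator `gp` for one output metric. Parameter
# names, bounds, and the number of base samples are illustrative only.
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 2,
    "names": ["crt", "pr0"],              # hypothetical subset of parameters
    "bounds": [[0.6, 0.9], [0.5, 2.0]],
}

# Saltelli sampling generates N_base * (2 * num_vars + 2) parameter sets
X_s = saltelli.sample(problem, 8192)
Y_s = gp.predict(X_s)                     # cheap emulator calls, not ICON-A

indices = sobol.analyze(problem, Y_s)
for name, s1, st in zip(problem["names"], indices["S1"], indices["ST"]):
    print(f"{name}: S1 = {s1:.2f}, S_tot = {st:.2f}")
```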

Figure 6. First-order Sobol index S1 (x axis) and total Sobol index Stot (y axis) for the physics parameters (in legend) and outputs: net SW radiation at TOA (panels a and f), net LW radiation at TOA (panels b and g), net radiative budget at TOA (panels c and h), cloud cover (panels d and i), and water vapour path (panels e and j). We use the GP trained on PPE1 for panels (a)–(e) and that trained on PPE3 for panels (f)–(j). To calculate the Sobol indices, the sampling method of Saltelli et al. (2010) was used, with 70 000 samples, allowing for a converged value of the indices.

3.2.3 Visualization of the parameter-to-output maps

The previously trained emulator can also be used for the visualization of the parameter-to-output dependencies. These visualizations complement the sensitivity analysis presented in the previous section and further helped us in the selection of the tuning parameters to be kept across the phases of our sequential tuning approach. Generally, such visualizations are very useful for informing the user of the effect of a parameter on the outputs: they can help in selecting the most influential parameters and the corresponding plausible ranges, potentially reducing the computational costs of tuning exercises.

Here, we construct these parameter-to-output maps, similarly to what has been done by Mauritsen et al. (2012), with the important difference being that the use of GP emulators in our case allows for a more extensive or denser exploration of the selected parameter space. We exemplify such visualizations in Fig. 7, constructed from GP emulators for physics outputs trained on PPE1 and PPE2 in the first two rows (Fig. 7a–h) and on PPE3 and PPE4 in the last two rows (Fig. 7i–p). The parameters that are not being changed are kept fixed at their best-performing value from PPE2 (marked with the magenta star in Figs. 2 and 3 – although we emphasize that, with the trained emulators, one can very quickly generate new maps for different parameters). The red-shaded areas in each plot denote the allowed output ranges from the observational data. When the parameters from 𝒫p1 are varied, the value of global cloud cover (second row of Fig. 7) remains below the lower bound given by the observational data (at 62.7 %), which is consistent with our observations in Fig. 2. This is the reason why we selected the modified parameter set 𝒫p2 for the next PPEs, which, indeed, has a stronger influence on the global cloud cover (fourth row of Fig. 7). We refer the reader to Appendix E for the parameter-to-output map constructed from PPE1 and PPE2, showing the effect of the six parameters in 𝒫p1 on all physics metrics (Fig. E1). Likewise, the parameter-to-output map constructed from PPE3 and PPE4, showing the effect of all parameters in 𝒫p2, is shown in Fig. E1.

Together with Sect. 3.2.2, these maps allow us to identify which parameters are likely to be the most influential for our physics tuning metrics. The parameters that we identified as most influential for the physics output metrics are the critical relative humidity in the upper troposphere (crt) and the conversion coefficient from cloud water to rain (cprcon), influencing the radiation metrics and global cloud cover, together with the coefficient of the sedimentation velocity of cloud ice (cvtfall). These parameters have a strong linear influence (crt in Fig. 7d and h) or non-linear influence (cprcon in Fig. 7j and n; cvtfall in Fig. 7l and p) on the physics metrics. Note that parameters governing cloud microphysical processes (e.g. fall velocities such as cvtfall) were identified as tuning parameters widely shared among climate models in the synthesis paper of Hourdin et al. (2017) (see Table ES4 therein).
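A short sketch of how such a one-dimensional parameter-to-output map can be generated from a trained emulator is given below; the trained GP gp, the parameter names, the reference configuration x_best, and the sweep range are assumptions for illustration.

```python
# Sketch of a one-dimensional parameter-to-output map (cf. Fig. 7): vary one
# parameter with a trained GP emulator `gp` while keeping the others fixed at
# a reference configuration. Names, values, and ranges are placeholders.
import numpy as np
import matplotlib.pyplot as plt

names = ["cmfctop", "cprcon", "ccsaut", "csecfrl", "cvtfall", "crt", "pr0"]
x_best = np.array([0.2, 3e-4, 6.0, 1e-5, 2.5, 0.75, 1.0])   # placeholder values
i, lo, hi = names.index("cvtfall"), 1.0, 4.0                # parameter to sweep

sweep = np.linspace(lo, hi, 200)
X_map = np.tile(x_best, (sweep.size, 1))
X_map[:, i] = sweep
mean, std = gp.predict(X_map, return_std=True)              # one metric's GP

plt.plot(sweep, mean)
plt.fill_between(sweep, mean - std, mean + std, alpha=0.3)  # emulator spread
plt.xlabel("cvtfall")
plt.ylabel("global cloud cover (%)")
plt.show()
```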

Figure 7. Parameter-to-output maps predicted with GP emulators trained on PPE1 and PPE2 (a)–(h) and GP emulators trained on PPE3 and PPE4 (i)–(p). In the first and third rows (a–d and i–l) the net SW and LW radiation at the TOA are shown. In the second and fourth rows (e–h and m–p) the global cloud cover is shown. Panels (a) and (e) show the effect of entrmid, (b) and (f) show that of entrpen, (c) and (g) show that of pr0, (d) and (h) show that of crt, (i) and (m) show that of cmfctop, (j) and (n) show that of cprcon, (k) and (o) show that of ccsaut, and (l) and (p) show that of cvtfall.

3.3 Tuning of the dynamics outputs

We now discuss the simultaneous tuning of the physics and dynamics outputs. Due to the expected large variability in dynamics outputs (see Sect. 3.3.1), which can potentially hinder the training of regression models, we expect history matching to require a large number of iterations and costly ICON simulations. Therefore, we adopt a similar approach to Giorgetta et al. (2018) in that we generate a PPE (PPE5) and select the best-performing model configurations. Also, in this case, our criterion for evaluating the model performance gives a higher importance to the global radiation metrics, which are our primary tuning goals, and puts less stringent requirements on the other tuning metrics.

The ML-based tuning of the physics output metrics discussed in the previous section serves as a basis for the second tuning step addressing the dynamics outputs. PPE5 is generated by simultaneously varying the parameters in the set 𝒫pd (with LHC sampling) while keeping the other parameters fixed to their best configuration obtained with history matching from PPE2 (see Table 5 and the magenta star in Figs. 2 and 3). The physics parameters in 𝒫pd are selected based on a sensitivity analysis with Sobol indices, as explained in Sect. 3.2.2. The choice of the dynamics parameters follows Giorgetta et al. (2018), with gkdrag and gkwake being chosen for tuning the zonal wind stresses on the ocean surface and rmscon affecting the zonal mean winds. In Fig. 8, we show the physics (Fig. 8a and b) and the dynamics (Fig. 8c and d) outputs from PPE5 and highlight the two model configurations (the cyan and the red markers) which achieve the best model performance within PPE5. The selected configurations are those closest to the observational range in Fig. 8a, given that achieving a balanced TOA radiation has higher importance in our tuning experiment (Giorgetta et al.2018). The values of the parameters for these two selected simulations are given in Table 7. These configurations also achieve results comparable with the tuned ICON-aes-1.3, with the TOA radiation balance being within the interval [0, 1] W m−2 and the TOA longwave and shortwave radiation metrics within 1 W m−2 of the observational range. Also, for the other two physics output metrics, the performance of the two selected configurations is comparable to ICON-aes-1.3 as they show less than 1 % difference in global cloud cover compared to the observational range and less than 0.5 kg m−2 difference in the water vapour path. The differences with respect to reference data and ICON-aes-1.3 become more apparent when looking at the dynamics metrics. In Fig. 8c and d, it can indeed be seen that the values of these metrics from the reference dataset are not covered by the generated PPE. For most of the metrics, the differences between the selected configurations and the reference dataset remain comparable to those of ICON-aes-1.3, except for the mean zonal wind stress over the Southern Ocean (tauu SOO – see Fig. 8c), where the difference increased from roughly 0.005 N m−2 to roughly 0.02 N m−2. Given the different settings used in the manual tuning for ICON-aes-1.3 (160 km instead of the 80 km resolution used here, along with the different time steps used), the differences in the optimal model configurations are not surprising. For instance, the model resolution strongly affects the unresolved orography and, thus, the values of the corresponding parameters (Giorgetta et al.2018).

In the next section, we analyse the variability of the dynamics outputs, and we identify a possible explanation for the difficulty in matching them in our tuning. Afterwards, in Sect. 4, we evaluate the results from PPE5 regarding model outputs not targeted during the tuning experiment for a better assessment of the results and a better comparison with the previously tuned ICON-aes-1.3.

Figure 8. Physics (a, b) and dynamics (c, d) output variables for PPE5 (blue triangles) compared to ICON-aes-1.3 (orange triangle) and observational datasets. Two selected PPE members corresponding to the best-performing configurations are highlighted (cyan square and red triangle). For comparison, two other runs are also highlighted (black circles).

Table 7. Values of the parameters for the two members of PPE5 yielding the best output metrics, shown as cyan square and red triangle in Fig. 8. For comparison, the values of the parameters tuned by Giorgetta et al. (2018) are given as well.

Figure 9. The 10-year mean (1980–1989, y axis) against the mean of 1 particular year (here 1980, x axis) for the physics (top row, panels a–d) and dynamics (bottom row, panels e–h) output variables for 30 runs of PPE5, represented by different colours. For each data point, the dotted vertical line shows the spread of the annual mean across the 10 years (maximum and minimum values), and the solid vertical line denotes 1 standard deviation, calculated based on the 1980–1989 period.

3.3.1 Analysis of output variability

We now use PPE5 to analyse the internal variability of the investigated output metrics and compare it to the effects of the parameters. The year-to-year variability of the output metrics is shown in Fig. 9, where we plot the long- vs. short-time averages of the considered outputs for 30 runs of PPE5. Additional data complementing the information of Fig. 9 can be found in Appendix D. In Fig. 9, it can be clearly seen that the dynamics outputs (panels in the lower row) have a larger variability across years compared to the physics ones (upper row), which is apparent from the larger spread around the diagonal (no spread would signify no variance) and the larger error bars (which represent the standard deviation over the yearly averages). In each panel, we also report the ratio between the mean spread across years Syrs and the PPE spread SPPE, which, for each output metric Y, are defined as follows:

(8) $S_{\mathrm{yrs}} = \dfrac{1}{n}\sum_{i=1}^{n}\sqrt{\mathrm{Var}_{\mathrm{years},i}(Y)}$,  (9) $S_{\mathrm{PPE}} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left(\overline{Y}_i - \langle Y\rangle\right)^2}$,

where n denotes the size of the PPE, Varyears,i(Y) denotes the variance of output Y over the simulated years for the ith PPE member, $\overline{Y}_i$ denotes the 10-year mean of output Y for the ith PPE member, and $\langle Y\rangle$ denotes the average of $\overline{Y}_i$ over all PPE members. The ratio Syrs/SPPE gives a quantitative measure of the comparison between the yearly output variability and the effects of changing parameters in the PPE. It is clear that, for the dynamics outputs, especially the zonal wind stresses on the ocean surface, this ratio is almost 1 order of magnitude larger than for the physics ones.
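A minimal sketch of this variability ratio, following Eqs. (8) and (9), is given below; it assumes an array yearly of annual means with one row per PPE member.

```python
# Sketch of the variability ratio of Eqs. (8)-(9), assuming `yearly` is an
# array of shape (n_members, n_years) holding the annual means of one output
# metric for each PPE5 member.
import numpy as np

def spread_ratio(yearly):
    s_yrs = np.sqrt(yearly.var(axis=1)).mean()     # mean within-member spread
    s_ppe = yearly.mean(axis=1).std()              # spread of 10-year means
    return s_yrs / s_ppe
```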

An additional source of uncertainty in the dynamics output metrics is their restricted geographical location, which exposes them to biases in spatial patterns. The low variability in the physics variables, which are global means, is consistent with the common observation that simulations as short as 1 year can already give good tuning results, though using more years, for instance a full decade as in Giorgetta et al. (2018), has the benefit of including a larger variation of prescribed boundary conditions, for example El Niño, La Niña, or neutral years.

The analysis in Fig. 9 shows that, for the dynamics outputs, the internal variability is almost of the same order of magnitude as the PPE spread and can therefore partly hide the effects of changing parameters, as discussed above.

4 Evaluation of the selected runs

Now, we test our selected model configurations on different variables that were not targeted during the tuning. We call these “evaluation metrics”. Specifically, we assess whether the outputs of our selected parameter configurations are also compatible with the references for these evaluation metrics, i.e. within the spread of the reanalysis and observational datasets used as a reference. This evaluation step allows us to check whether the tuning process has induced significant biases in metrics not targeted during the tuning (i.e. over-tuning of the target metrics). The evaluation metrics that we inspect are the global multi-annual averages (from 1980 to 1989) of the surface temperature (ts), the total precipitation (pr), the pressure at sea level (psl), the vertically integrated cloud ice (clivi), and the vertically integrated cloud condensed-water content (clwvi). The results of this evaluation step are shown in Fig. 10. For most of the computed evaluation metrics, our selected model configurations are within the observational range given by the spread of the reanalysis and observational datasets used as a reference (green symbols and lines in Fig. 10), thus indicating that our tuning experiment did not introduce significant biases in the evaluation metrics that were not targeted by the tuning. This is the case for the two selected runs and the two highlighted runs from the ICON-A PPE5. These selected model configurations show a slight positive bias of < 0.1 °C in terms of the global average of the surface temperature compared to the reference values. We conclude that our tuning experiment successfully produced configurations largely comparable to ICON-aes-1.3, while it did not show substantial improvement over the manually tuned version, which is difficult to improve upon. We discuss the limitations of our approach and propose potential improvements in the next section.


Figure 10 Five evaluation metrics averaged over the 1980–1989 period (inclusive) for PPE5 (blue, cyan, red, black, grey), ICON-aes-1.3 (orange triangle), and the reanalysis and observational datasets (green). For the datasets starting after 1980, the earliest available 10-year period is used: 1982–1991 for CLARA (AVHRR) and ESA CCI Cloud (AVHRR-fv3.0) and 2002–2011 for MODIS.


5 Discussion and conclusions

In this work, we develop an ML-based tuning approach and apply it to the atmospheric component of the ICON climate model (ICON-A). Our approach is inspired by history matching (Williamson et al., 2013, 2017), which balances an extensive exploration of the tuning parameter space with the need to minimize the number of required ICON-A model simulations. This exploration is aided by building and using emulators – here, Gaussian processes (GPs) – for each of the considered output metrics. The emulator approximates the climate model simulation outputs for arbitrary values of the tuning parameters and can be used to create large emulated metric ensembles at much lower computational cost. We integrate a history-matching-inspired method in a sequential approach, in which a different set of parameters is constrained in each phase. We first apply our approach to the tuning of physics output metrics (globally averaged radiation and cloud properties), and, in a second step, we also tune for dynamics output metrics (related to geographically specific atmospheric-circulation properties) using a PPE consisting of 80 10-year ICON-A runs. The ML-based tuning of physics parameterizations, with just one iteration and a total of 60 model simulations, is already sufficient to achieve a model configuration yielding a global TOA net radiation budget in the range of [0, 1] W m−2, global radiation metrics and water vapour paths consistent with the reference observations, and a globally averaged cloud cover differing by only 2 % with respect to the observations. Note that these results, particularly the number of iterations necessary to converge to the observational range, generally depend on the specific setup. Furthermore, we remark that our approach differs in some respects from traditional history matching implementations. While it allowed us to identify configurations whose outputs are compatible with the observations for several metrics, a thorough characterization of the space of plausible parameters (the not-ruled-out-yet space; Williamson et al., 2013) is beyond the scope of our work and would require several iterations of standard history matching.
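
For readers unfamiliar with the ruling-out step, the sketch below shows a generic implausibility-based cut of the kind used in standard history matching (Williamson et al., 2013); the exact criterion applied in this work is Eq. (1) in the main text, and the threshold and array names here are illustrative only.

```python
import numpy as np

def implausibility(y_emul, var_emul, y_obs, var_obs):
    # Mismatch between emulator prediction and observation, normalized by the
    # combined emulator and observational uncertainty (generic form).
    return np.abs(y_emul - y_obs) / np.sqrt(var_emul + var_obs)

def nroy_mask(y_emul, var_emul, y_obs, var_obs, threshold=3.0):
    # y_emul, var_emul: arrays of shape (n_samples, n_metrics) from the emulators;
    # y_obs, var_obs: arrays of shape (n_metrics,) for the observational targets.
    # A candidate parameter set is kept ("not ruled out yet") if its largest
    # implausibility over all metrics stays below the chosen threshold.
    imp = implausibility(y_emul, var_emul, y_obs[None, :], var_obs[None, :])
    return imp.max(axis=1) < threshold
```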

In the simultaneous PPE-based tuning of physics and dynamics parameterizations, we achieve a TOA radiation balance within the interval [0, 1] W m−2, with TOA longwave and shortwave radiation metrics within 1 W m−2 of the targeted range, but we are not able to reduce the biases in the dynamics output metrics with respect to the previously manually tuned ICON-aes-1.3. The PPE for this tuning step allows us to analyse the variability of the physics and dynamics outputs and to compare it with the parameters' effects. This analysis reveals a larger year-to-year variability in the dynamics output metrics compared to the physics ones. This, combined with the sensitivity of the dynamics metrics to geographic patterns, highlights potential limitations that emulator-based approaches may face when tuning for these dynamics metrics. At the same time, this suggests that metrics averaged over broader spatial regions may suffer less from these issues and be more amenable to emulator-based approaches, although too much spatial averaging would make the tuning target less specific. For the dynamics variables that serve as proxies for the polar stratospheric vortices (zonal mean zonal wind at 60° N and 60° S at 10 hPa, averaged over 10 years), a possible way to reduce the noise would be to increase the simulation duration and to average the field over only winter or summer months. A further evaluation of the selected model configurations based on metrics that were not targeted during tuning suggests that our approach does not cause over-tuning with respect to the tuning targets and, for our use case, results in a model configuration with performance similar to that of the previously tuned ICON-aes-1.3.

Our sequential approach, in which only a small subset of parameters is varied in each phase, allows us to keep the costs of the PPEs relatively low (with 30 members, we could reach good emulator accuracies) and to obtain ICON-A model configurations with an overall performance comparable to that of ICON-aes-1.3 for most of the selected tuning metrics. However, such an approach risks neglecting some of the (non-linear) parameter interdependencies and possible feedbacks. In situations where such parameter interactions and their hierarchy of importance are largely unknown, we would recommend tuning all parameters simultaneously when computationally feasible. Indeed, while our analysis allows us to identify which parameters are influential for the chosen metrics (see Sect. 3.2.3), we cannot establish a clear hierarchy in terms of which of these should be tuned in a sequential manner. This is exemplified by Figs. 4 and 8, with the PPEs showing a large spread in the global radiative metrics despite some of the physics parameters being kept fixed. Furthermore, accounting for all parameter dependencies and feedbacks could be particularly important for tuning coupled models, e.g. for properly accounting for the interactions between atmosphere and ocean. The number of parameters that can be tuned simultaneously is ultimately limited by the available computational resources since the required size of the PPEs scales with the size of the tuning parameter space. Sensitivity analysis as presented here therefore becomes a crucial tool to identify and retain only the most important parameters in each model component.
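
As a sketch of how such a variance-based sensitivity analysis can be carried out on a cheap surrogate, the example below uses the SALib package (which implements the estimators of Saltelli et al., 2010) with placeholder parameter names, bounds, and a toy stand-in for the trained emulator; none of these correspond to the actual ICON-A parameters.

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

# Placeholder parameters and bounds; in practice these would be the tuning
# parameters and ranges, and the function below would be a trained GP emulator.
problem = {
    "num_vars": 3,
    "names": ["p1", "p2", "p3"],
    "bounds": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
}

def toy_emulator(x):
    # Non-linear toy response with an interaction term between p1 and p3.
    return np.sin(3.0 * x[0]) + 2.0 * x[1] ** 2 + x[0] * x[2]

X = saltelli.sample(problem, 1024)              # Saltelli sampling scheme
Y = np.array([toy_emulator(x) for x in X])
Si = sobol.analyze(problem, Y)                  # first-order and total Sobol indices
print(Si["S1"], Si["ST"])
```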

We also note that, even though history matching is constructed to minimize the number of climate model simulations in the PPEs, this number remains the major computational bottleneck in tuning, and it gets worse when tuning models at resolutions higher than the one considered here. Again, it will be important to include as much prior knowledge as possible in the choice of the parameters, which, in a Bayesian setting, amounts to selecting a prior distribution for the optimal parameter values. Such knowledge of a prior distribution may, for instance, be obtained from the computationally cheaper tuning of the same model at lower resolutions, provided the same parameterization schemes are used. Incorporating such prior knowledge could reduce the size of the PPEs and the number of history matching iterations required to converge to an optimal model configuration (Fletcher et al., 2022), compared to starting from general uninformative priors, as we did here (with LHC sampling).
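
A minimal sketch of such uninformative sampling, based on SciPy's quasi-Monte Carlo module, is given below; the parameter bounds are placeholders. Prior knowledge from a lower-resolution tuning could be injected by narrowing these bounds or by mapping the unit-cube sample through non-uniform marginal distributions.

```python
from scipy.stats import qmc

# Placeholder ranges for three tuning parameters.
l_bounds = [0.6, 0.7, 1.0]
u_bounds = [0.9, 1.0, 3.0]

# Latin hypercube sample of 30 PPE members in [0, 1]^3, scaled to the ranges
# (a uniform, i.e. uninformative, prior over the parameter space).
sampler = qmc.LatinHypercube(d=3, seed=42)
ppe_params = qmc.scale(sampler.random(n=30), l_bounds, u_bounds)
print(ppe_params.shape)  # (30, 3)
```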

Finally, while we have explored here the feasibility of ML-based tuning approaches to improve the tuning of climate models, the seamless integration of such methods within a specific climate modelling framework – to practically enable an automatic application – is an aspect that needs to be addressed in further studies. Some aspects of model tuning, such as the choice of tuning metrics, will remain subjective, being highly dependent on the details and complexity of the model, as well as on its intended uses. Other steps, however, such as the sensitivity analysis, the selection and exploration of tuning parameters, and the evaluation of the outcomes, could be incorporated, at least partly, into an automated approach. It is therefore important to understand which design choices are best suited for such automatic approaches, as we foresee that these will lead to more accurate and potentially computationally cheaper model tuning, also making this important step in climate model development more objective and reproducible.

Appendix A: Time series of the observational products used

Figure A1 shows the time series of the observational products used for the cloud cover and the water vapour path. The 10-year period 1980–1989 was used for the tuning of the dynamics outputs of ICON-A. For the cloud cover observational datasets, the earliest available year is 1982; therefore, we added the years 1990–1991 to our tuning analysis. The year-to-year variability illustrates the internal climate variability. We remark that other observational products exist for these outputs but do not cover the studied years: for example, ESA CCI Water Vapour starts in 2002, MODIS starts in 2002, and CloudSat starts in 2006.


Figure A1 Time series of the observational products used for the cloud cover and the water vapour path.


Appendix B: Details on GP emulators and choice of the underlying hyperparameters

In this Appendix, we give a brief description of the Gaussian process (GP) regression framework used to construct emulators in this work and provide the relevant details regarding the hyperparameters used in their implementation. Gaussian processes are widely used in the context of Bayesian optimization as they are a method for describing distributions over unknown functions and can be efficiently updated or trained using samples from the ground-truth distribution (Rasmussen and Williams, 2005). In our case, the function we want to approximate with GP regression is that describing the dependence of a specific output Y of the climate model on a set of tuning parameters x, which we call Y_model(x). The output of a Gaussian process trained on a set 𝒯 = {(x_i, Y_model(x_i))}_i of ground-truth samples (ICON-A model runs, in our case) can be written as follows:

(B1) $f(\mathbf{x}) \,|\, \mathcal{T} \sim \mathcal{GP}\left(\mu(\cdot),\, C(\cdot,\cdot)\right)$,

where 𝒢𝒫 denotes the GP function distribution, with μ(x) and C being, respectively, the mean function and the covariance matrix that implicitly depend on 𝒯, i.e. that have been updated with the knowledge of the training data 𝒯 using Bayes' rule. Closed-form expressions for these functions are available and can be found in Rasmussen and Williams (2005). That is to say, given a new configuration x of tuning parameters, a GP trained on an ICON-A PPE for a given variable Y would output a normally distributed random variable with mean μ(x) and variance σ²(x), which can also be explicitly calculated from the covariance matrix C (Rasmussen and Williams, 2005). We therefore interpret μ(x) as our GP emulator prediction for Y and σ²(x) as the associated uncertainty and write

(B2) $Y_\mathrm{emul}(\mathbf{x}) \equiv \mu(\mathbf{x})$,

(B3) $\mathrm{Var}\left[Y_\mathrm{emul}(\mathbf{x})\right] \equiv \sigma^{2}(\mathbf{x})$,

which we use in Eq. (1) in the main text.
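
As a concrete illustration of Eqs. (B2)–(B3), the sketch below fits a scikit-learn Gaussian process to a synthetic stand-in for a PPE and reads off the emulator prediction and its uncertainty; the training data are placeholders and do not correspond to actual ICON-A runs.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Synthetic stand-in for a PPE: 30 configurations of 2 tuning parameters and
# the corresponding 10-year-mean output metric.
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(30, 2))
y_train = np.sin(3.0 * X_train[:, 0]) + X_train[:, 1] ** 2

gp = GaussianProcessRegressor(kernel=Matern(nu=1.5), normalize_y=True)
gp.fit(X_train, y_train)

# Emulator prediction mu(x) (Eq. B2) and uncertainty sigma(x) (Eq. B3)
# at new parameter configurations.
X_new = rng.uniform(size=(5, 2))
y_emul, y_std = gp.predict(X_new, return_std=True)
```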

Importantly, the properties of the GP, particularly of the covariance matrix C, depend on the choice of a kernel function k(x, x′), which describes how the predictions at points x and x′ are correlated. Kernel functions may also contain trainable hyperparameters, which are typically optimized by maximizing the log-marginal likelihood with respect to the training dataset (Rasmussen and Williams, 2005).

For our implementations, we used the GP regression library implemented in the scikit-learn package (https://doi.org/10.5281/zenodo.7711792, Grisel et al., 2023). We found Matérn kernels to yield the highest prediction accuracy (which we measure via the R² coefficient). Matérn kernels have two hyperparameters: a length scale l and a smoothness parameter ν. The length scale is typically the distance over which one can extrapolate outside the training data points: smaller values of l correspond to more rapidly varying functions that the GP can fit. This hyperparameter, together with the overall scale of the kernel, is optimized using the L-BFGS-B optimizer (Nocedal and Wright, 2006) pre-implemented in scikit-learn. For the smoothness parameter ν, four values were tested: ν=0.5 corresponds to the absolute exponential kernel, ν=1.5 corresponds to a once-differentiable function, ν=2.5 corresponds to a twice-differentiable function, and ν→∞ corresponds to a radial basis function (RBF) kernel. These four values of ν allow for a computational cost that is around 10 times smaller than that of other values since they do not require evaluating the modified Bessel function (Rasmussen, 2006). The values ν=2.5 and ν→∞ yield large negative R² scores and are therefore not shown here. In Fig. B1a, we observe a comparable performance of the GP emulator for ν=0.5 (absolute exponential kernel) and ν=1.5.

Other hyperparameters in the GP optimization are the noise level α (which can be interpreted as the variance of Gaussian noise added to the training data, with the aim of increasing the numerical stability of GP evaluations) and the number of random hyperparameter initializations for the log-marginal likelihood optimization (denoted with n_restart). Several values of α between 10⁻¹⁵ and 10⁻⁵ were tested. We show these tests in Fig. B1b. The values of α < 10⁻¹⁰ yield large negative R² scores. A change in α within 10⁻¹⁰ < α < 10⁻⁵ does not have a significant effect on the performance of the GP emulator. Finally, we also tested several values of n_restart, between 0 and 100, as shown in Fig. B1c. From the tests presented in Fig. B1, the following values for the three hyperparameters are chosen (which are also the default values in scikit-learn): ν = 1.5, α = 10⁻¹⁰, and n_restart = 0.
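
A sketch of how this hyperparameter comparison can be reproduced with scikit-learn is given below, assuming the PPE is available as arrays X (parameters) and y (one output metric); the synthetic data stand in for a real PPE, and the 5-fold R² score mirrors the procedure behind Fig. B1.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a PPE: 30 members, 5 tuning parameters, one metric.
rng = np.random.default_rng(1)
X = rng.uniform(size=(30, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=30)

# Chosen settings (scikit-learn defaults): Matern kernel with nu=1.5,
# noise level alpha=1e-10, and no restarts of the marginal-likelihood optimizer.
for nu in (0.5, 1.5):
    gp = GaussianProcessRegressor(kernel=Matern(nu=nu), alpha=1e-10,
                                  n_restarts_optimizer=0, normalize_y=True)
    scores = cross_val_score(gp, X, y, cv=5, scoring="r2")
    print(f"nu = {nu}: mean R2 = {scores.mean():.2f}")
```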


Figure B1 Performance (R² coefficient, calculated with 5-fold cross-validation) of the GP emulator with Matérn kernel trained on PPE1 and PPE2 for different choices of hyperparameters: (a) for different values of ν, (b) for different values of α, and (c) for different values of n_restart.


Appendix C: Additional information on the generated PPEs

In this Appendix, we show additional data for the PPEs we generated in this work. Specifically, in Fig. C1, we show the sampled parameter values for PPE3 (red circles) and PPE4 (grey triangles), where signs of (slow) convergence in history matching are already visible after one iteration (with the distribution of the members of PPE4 being slightly shifted and narrower). Figure C2 shows the sampled parameter values for PPE5 (blue triangles), with the cyan square and red triangle marking the best-performing configurations reported in Table 7 in the main text.


Figure C1 Sampled parameter values for PPE3 (red circles) and PPE4 (grey triangles). For each panel, two parameters are plotted on the two axes (see Table 3). The two PPEs are generated with parameter set 𝒫p2. Signs of (slow) convergence of history matching are already visible after one iteration, with the distribution of the PPE4 members being slightly shifted and narrower. The plot extents include all PPE4 values but not all PPE3 values.


Figure C2 Sampled parameter values for PPE5 (blue triangles). For each panel, two parameters are plotted on the two axes (see Tables 3 and 4). The PPE is generated with parameter set 𝒫pd. Two selected PPE members corresponding to the best-performing configurations are highlighted (cyan square and red triangle).


Appendix D: Time series of the physics and dynamics metrics

In this Appendix, we show additional information complementing Fig. 9 in Sect. 3.2.1 in the main text. In Fig. D1, we show the yearly averages of the physics (top row, Fig. D1a–d) and dynamics (bottom row, Fig. D1e–h) output variables for the 30 runs of PPE5 corresponding to Fig. 9. These time series also clearly show the higher year-to-year variability of the dynamics outputs compared to the physics ones.


Figure D1 Time series (yearly averages) of the physics (top row, panels a–d) and dynamics (bottom row, panels e–h) output variables for 30 runs of PPE5 (each colour corresponds to one run). The values for the years 1980 and 1989 are connected with a dashed line to help the reader identify the runs.


Appendix E: Additional information on parameter-to-output maps

In this Appendix, we show additional information on the parameter-to-output maps discussed in Sect. 3.2.3. In Fig. E1 (Fig. E2, respectively), we show the parameter-to-output maps predicted with the GP emulators trained on PPE1 and PPE2 (PPE3 and PPE4, respectively), based on parameter set 𝒫p1 (𝒫p2, respectively). In Fig. E2o–u, we can see that parameter set 𝒫p2 does indeed allow for a higher (and closer to the observational values) global cloud cover compared to 𝒫p1 (Fig. E1).


Figure E1 Parameter-to-output maps predicted with the GP emulators trained on PPE1 and PPE2. Every column corresponds to one tuning parameter being changed (see the list in Table 3), and every row corresponds to an output metric. The parameters that are not changed are kept fixed at their best-performing value from PPE2 (marked with the magenta star in Figs. 2 and 3). The red-shaded areas in each plot denote the allowed output ranges from the observational data. The other coloured lines in each plot denote the emulator predictions (for the first row, dark and light blue denote the net longwave and shortwave radiation at TOA, respectively), with the corresponding uncertainty (1 standard deviation) represented as the shaded area.


Figure E2 Parameter-to-output maps predicted with the GP emulators trained on PPE3 and PPE4. Every column corresponds to one tuning parameter being changed (see the list in Table 3), and every row corresponds to an output variable. The parameters that are not changed are kept fixed at their best-performing value from PPE2 (marked with the magenta star in Figs. 2 and 3). The red-shaded areas in each plot denote the allowed output ranges from the observational data. The other coloured lines in each plot denote the emulator predictions (for the first row, dark and light blue denote the net longwave and shortwave radiation at TOA, respectively), with the corresponding uncertainty (1 standard deviation) represented as the shaded area.


Code availability

The code developed for this study is published at https://github.com/EyringMLClimateGroup/bonnet24gmd_automatic_tuning_atm (last access: 17 June 2025) and is preserved at https://doi.org/10.5281/zenodo.14267203 (Bonnet, 2024). The source code of the ICON model (ICON-A version 2.6.4) is available from https://code.mpimet.mpg.de/projects/iconpublic (ICON, 2015; Zängl et al., 2014).

Data availability

The reference observational datasets used in this work are available at https://github.com/EyringMLClimateGroup/bonnet24gmd_automatic_tuning_atm (last access: 17 June 2025) and at https://doi.org/10.5281/zenodo.14267203 (Bonnet, 2024).

Author contributions

PB developed the ML-based tuning approach, performed the PPE simulations with ICON-A, and led the analysis with the support of LP. VE formulated the research question and concept. All the authors discussed the methodology and findings. LP and PB wrote the paper with contributions from all of the authors.

Competing interests

The contact author has declared that none of the authors has any competing interests.

Disclaimer

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

Acknowledgements

We acknowledge the provision of post-processing scripts and the advice provided by Renate Brokopf from the Max Planck Institute for Meteorology. We would like to thank Axel Lauer for his valuable comments and suggestions that helped to improve the paper. We would like to extend our sincere gratitude to the two reviewers, Frédéric Hourdin and Qingyuan Yang, and the editor, Peter Caldwell, whose invaluable feedback and constructive comments significantly contributed to the improvement and quality of this work.

Financial support

This study was funded by the European Research Council (ERC) Synergy grant “Understanding and Modelling the Earth System with Machine Learning” (USMILE) as part of the Horizon 2020 research and innovation programme (grant agreement no. 855187), the Earth System Models for the Future (ESM2025) project as part of the European Union's Horizon 2020 research and innovation programme (grant agreement no. 101003536), and the Artificial Intelligence for enhanced representation of processes and extremes in Earth System Models (AI4PEX) project as part of the European Union's Horizon 2020 research and innovation programme (grant agreement no. 101137682). In addition, Veronika Eyring was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) through the Gottfried Wilhelm Leibniz Prize (reference no. EY 22/2-1). The contributions of Lorenzo Pastori and Mierk Schwabe were made possible by the DLR Quantum Computing Initiative and the Federal Ministry for Economic Affairs and Climate Action; https://qci.dlr.de/projects/klim-qml (last access: 8 August 2024). This work used resources of the Deutsches Klimarechenzentrum (DKRZ), granted by its Scientific Steering Committee (WLA) under project ID nos. 1179 (USMILE) and 1083 (Climate Informatics).

Review statement

This paper was edited by Peter Caldwell and reviewed by Qingyuan Yang and Frédéric Hourdin.

References

Baldwin, M. P., Ayarzagüena, B., Birner, T., Butchart, N., Butler, A. H., Charlton‐Perez, A. J., Domeisen, D. I. V., Garfinkel, C. I., Garny, H., Gerber, E. P., Hegglin, M. I., Langematz, U., and Pedatella, N. M.: Sudden Stratospheric Warmings, Rev. Geophys., 59, e2020RG000708, https://doi.org/10.1029/2020rg000708, 2021. a

Bonnet, P.: Paulinebonnet111/bonnet24_gmd_automatic_tuning_atm_paper: Automatic tuning code after 1st review (Version Dec2024), Zenodo [code], https://doi.org/10.5281/zenodo.14267203, 2024 (code also available at: https://github.com/EyringMLClimateGroup/bonnet24gmd_automatic_tuning_atm, last access: 17 June 2025). a, b

Cleary, E., Garbuno-Inigo, A., Lan, S., Schneider, T., and Stuart, A. M.: Calibrate, emulate, sample, J. Comput. Phys., 424, 109716, https://doi.org/10.1016/j.jcp.2020.109716, 2021. a, b

Couvreux, F., Hourdin, F., Williamson, D., Roehrig, R., Volodina, V., Villefranque, N., Rio, C., Audouin, O., Salter, J., Bazile, E., Brient, F., Favot, F., Honnert, R., Lefebvre, M.-P., Madeleine, J.-B., Rodier, Q., and Xu, W.: Process-Based Climate Model Development Harnessing Machine Learning: I. A Calibration Tool for Parameterization Improvement, Journal of Advances in Modeling Earth Systems, 13, https://doi.org/10.1029/2020ms002217, 2021. a, b

Crueger, T., Giorgetta, M. A., Brokopf, R., Esch, M., Fiedler, S., Hohenegger, C., Kornblueh, L., Mauritsen, T., Nam, C., Naumann, A. K., Peters, K., Rast, S., Roeckner, E., Sakradzija, M., Schmidt, H., Vial, J., Vogel, R., and Stevens, B.: ICON-A, The Atmosphere Component of the ICON Earth System Model: II. Model Evaluation, J. Adv. Model. Earth Syst., 10, 1638–1662, https://doi.org/10.1029/2017ms001233, 2018. a, b, c

Dee, D. P., Uppala, S. M., Simmons, A. J., Berrisford, P., Poli, P., Kobayashi, S., Andrae, U., Balmaseda, M. A., Balsamo, G., Bauer, P., Bechtold, P., Beljaars, A. C. M., van de Berg, L., Bidlot, J., Bormann, N., Delsol, C., Dragani, R., Fuentes, M., Geer, A. J., Haimberger, L., Healy, S. B., Hersbach, H., Hólm, E. V., Isaksen, L., Kållberg, P., Köhler, M., Matricardi, M., McNally, A. P., Monge‐Sanz, B. M., Morcrette, J., Park, B., Peubey, C., de Rosnay, P., Tavolato, C., Thépaut, J., and Vitart, F.: The ERA‐Interim reanalysis: configuration and performance of the data assimilation system, Q. J. Roy. Meteorol. Soc., 137, 553–597, https://doi.org/10.1002/qj.828, 2011.  a, b

Domeisen, D. I., Butler, A. H., Charlton‐Perez, A. J., Ayarzagüena, B., Baldwin, M. P., Dunn‐Sigouin, E., Furtado, J. C., Garfinkel, C. I., Hitchcock, P., Karpechko, A. Y., Kim, H., Knight, J., Lang, A. L., Lim, E., Marshall, A., Roff, G., Schwartz, C., Simpson, I. R., Son, S., and Taguchi, M.: The Role of the Stratosphere in Subseasonal to Seasonal Prediction: 1. Predictability of the Stratosphere, J. Geophys. Res.-Atmos., 125, e2019JD030920, https://doi.org/10.1029/2019jd030920, 2020a. a

Domeisen, D. I. V., Butler, A. H., Charlton‐Perez, A. J., Ayarzagüena, B., Baldwin, M. P., Dunn‐Sigouin, E., Furtado, J. C., Garfinkel, C. I., Hitchcock, P., Karpechko, A. Y., Kim, H., Knight, J., Lang, A. L., Lim, E., Marshall, A., Roff, G., Schwartz, C., Simpson, I. R., Son, S., and Taguchi, M.: The Role of the Stratosphere in Subseasonal to Seasonal Prediction: 2. Predictability Arising From Stratosphere‐Troposphere Coupling, J. Geophys. Res.-Atmos., 125, e2019JD030923, https://doi.org/10.1029/2019jd030923, 2020b. a

Dunbar, O. R. A., Garbuno-Inigo, A., Schneider, T., and Stuart, A. M.: Calibration and Uncertainty Quantification of Convective Parameters in an Idealized GCM, J. Adv. Model. Earth Syst., 13, e2020MS002454, https://doi.org/10.1029/2020ms002454, 2021. a

Fletcher, C. G., McNally, W., Virgin, J. G., and King, F.: Toward Efficient Calibration of Higher‐Resolution Earth System Models, J. Adv. Model. Earth Syst., 14, e2021MS002836, https://doi.org/10.1029/2021ms002836, 2022. a

Gelaro, R., McCarty, W., Suárez, M. J., Todling, R., Molod, A., Takacs, L., Randles, C. A., Darmenov, A., Bosilovich, M. G., Reichle, R., Wargan, K., Coy, L., Cullather, R., Draper, C., Akella, S., Buchard, V., Conaty, A., da Silva, A. M., Gu, W., Kim, G.-K., Koster, R., Lucchesi, R., Merkova, D., Nielsen, J. E., Partyka, G., Pawson, S., Putman, W., Rienecker, M., Schubert, S. D., Sienkiewicz, M., and Zhao, B.: The Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2), J. Climate, 30, 5419–5454, https://doi.org/10.1175/jcli-d-16-0758.1, 2017. a

Gentine, P., Eyring, V., and Beucler, T.: Deep Learning for the Parametrization of Subgrid Processes in Climate Models, in: Chap. 21, John Wiley & Sons, Ltd, Chichester, West Sussex, UK, 307–314, ISBN 9781119646181, https://doi.org/10.1002/9781119646181.ch21, 2021. a

Giorgetta, M. A., Brokopf, R., Crueger, T., Esch, M., Fiedler, S., Helmert, J., Hohenegger, C., Kornblueh, L., Köhler, M., Manzini, E., Mauritsen, T., Nam, C., Raddatz, T., Rast, S., Reinert, D., Sakradzija, M., Schmidt, H., Schneck, R., Schnur, R., Silvers, L., Wan, H., Zängl, G., and Stevens, B.: ICON-A, the Atmosphere Component of the ICON Earth System Model: I. Model Description, J. Adv. Model. Earth Syst., 10, 1613–1637, https://doi.org/10.1029/2017ms001242, 2018. a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, aa, ab, ac

Grisel, O., Mueller, A. L., Gramfort, A., Louppe, G., Prettenhofer, P., Fan, T. J., Blondel, M., Niculae, V., Nothman, J., Joly, A., Lemaitre, G., Vanderplas, J., Estève, L., Kumar, M., du Boisberranger, J., Qin, H., Hug, N., Varoquaux, N., … Woolam, C.: scikit-learn/scikit-learn: Scikit-learn 1.2.2 (1.2.2), Zenodo [code], https://doi.org/10.5281/zenodo.7711792, 2023. a, b, c

Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Horányi, A., Muñoz-Sabater, J., Nicolas, J., Peubey, C., Radu, R., Schepers, D., Simmons, A., Soci, C., Abdalla, S., Abellan, X., Balsamo, G., Bechtold, P., Biavati, G., Bidlot, J., Bonavita, M., Chiara, G. D., Dahlgren, P., Dee, D., Diamantakis, M., Dragani, R., Flemming, J., Forbes, R., Fuentes, M., Geer, A., Haimberger, L., Healy, S., Hogan, R. J., Hólm, E., Janisková, M., Keeley, S., Laloyaux, P., Lopez, P., Lupu, C., Radnoti, G., de Rosnay, P., Rozum, I., Vamborg, F., Villaume, S., and Thépaut, J.-N.: The ERA5 global reanalysis, Q. J. Roy. Meteorol. Soc., 146, 1999–2049, https://doi.org/10.1002/qj.3803, 2020. a, b

Hourdin, F., Mauritsen, T., Gettelman, A., Golaz, J.-C., Balaji, V., Duan, Q., Folini, D., Ji, D., Klocke, D., Qian, Y., Rauser, F., Rio, C., Tomassini, L., Watanabe, M., and Williamson, D.: The Art and Science of Climate Model Tuning, B. Am. Meteorol. Soc., 98, 589–602, https://doi.org/10.1175/bams-d-15-00135.1, 2017. a, b, c, d

Hourdin, F., Williamson, D., Rio, C., Couvreux, F., Roehrig, R., Villefranque, N., Musat, I., Fairhead, L., Diallo, F. B., and Volodina, V.: Process-Based Climate Model Development Harnessing Machine Learning: II. Model Calibration From Single Column to Global, J. Adv. Model. Earth Syst., 13, e2020MS002225, https://doi.org/10.1029/2020ms002225, 2021. a, b, c

Hourdin, F., Ferster, B., Deshayes, J., Mignot, J., Musat, I., and Williamson, D.: Toward machine-assisted tuning avoiding the underestimation of uncertainty in climate change projections, Sci. Adv., 9, eadf2758, https://doi.org/10.1126/sciadv.adf2758, 2023. a, b, c

ICON: ICON: Icosahedral Nonhydrostatic Weather and Climate Model, version 2.6.4, https://code.mpimet.mpg.de/projects/iconpublic (last access: 20 May 2022), 2015. a, b, c

Iturbide, M., Gutiérrez, J. M., Alves, L. M., Bedia, J., Cerezo-Mota, R., Cimadevilla, E., Cofiño, A. S., Luca, A. D., Faria, S. H., Gorodetskaya, I. V., Hauser, M., Herrera, S., Hennessy, K., Hewitt, H. T., Jones, R. G., Krakovska, S., Manzanas, R., Martínez-Castro, D., Narisma, G. T., Nurhati, I. S., Pinto, I., Seneviratne, S. I., van den Hurk, B., and Vera, C. S.: An update of IPCC climate reference regions for subcontinental analysis of climate model data: definition and aggregated datasets, Earth Syst. Sci. Data, 12, 2959–2970, https://doi.org/10.5194/essd-12-2959-2020, 2020. a, b

Nocedal, J. and Wright, S. J.: Numerical Optimization, Springer, New York, ISBN 9780387303031, https://doi.org/10.1007/978-0-387-40065-5, 2006. a

Jungclaus, J. H., Lorenz, S. J., Schmidt, H., Brovkin, V., Brüggemann, N., Chegini, F., Crüger, T., De-Vrese, P., Gayler, V., Giorgetta, M. A., Gutjahr, O., Haak, H., Hagemann, S., Hanke, M., Ilyina, T., Korn, P., Kröger, J., Linardakis, L., Mehlmann, C., Mikolajewicz, U., Müller, W. A., Nabel, J. E. M. S., Notz, D., Pohlmann, H., Putrasahan, D. A., Raddatz, T., Ramme, L., Redler, R., Reick, C. H., Riddick, T., Sam, T., Schneck, R., Schnur, R., Schupfner, M., Storch, J.-S., Wachsmann, F., Wieners, K.-H., Ziemen, F., Stevens, B., Marotzke, J., and Claussen, M.: The ICON Earth System Model Version 1.0, J. Adv. Model. Earth Syst., 14, e2021MS002813, https://doi.org/10.1029/2021ms002813, 2022. a

Karlsson, K.-G., Anttila, K., Trentmann, J., Stengel, M., Solodovnik, I., Meirink, J. F., Devasthale, A., Kothe, S., Jääskeläinen, E., Sedlar, J., Benas, N., van Zadelhoff, G.-J., Stein, D., Finkensieper, S., Håkansson, N., Hollmann, R., Kaiser, J., and Werscheck, M.: CLARA-A2.1: CM SAF cLoud, Albedo and surface RAdiation dataset from AVHRR data – Edition 2.1, EUMETSAT, https://doi.org/10.5676/EUM_SAF_CM/CLARA_AVHRR/V002_01, 2020. a, b

Kato, S., Loeb, N. G., Rose, F. G., Doelling, D. R., Rutan, D. A., Caldwell, T. E., Yu, L., and Weller, R. A.: Surface Irradiances Consistent with CERES-Derived Top-of-Atmosphere Shortwave and Longwave Irradiances, J. Climate, 26, 2719–2740, https://doi.org/10.1175/jcli-d-12-00436.1, 2013. a

Loeb, N. G., Wielicki, B. A., Doelling, D. R., Smith, G. L., Keyes, D. F., Kato, S., Manalo-Smith, N., and Wong, T.: Toward Optimal Closure of the Earth's Top-of-Atmosphere Radiation Budget, J. Climate, 22, 748–766, https://doi.org/10.1175/2008jcli2637.1, 2009. a

Loeppky, J. L., Sacks, J., and Welch, W. J.: Choosing the Sample Size of a Computer Experiment: A Practical Guide, Technometrics, 51, 366–376, https://doi.org/10.1198/tech.2009.08040, 2009. a, b

Mansfield, L. A. and Sheshadri, A.: Calibration and Uncertainty Quantification of a Gravity Wave Parameterization: A Case Study of the Quasi-Biennial Oscillation in an Intermediate Complexity Climate Model, J. Adv. Model. Earth Syst., 14, e2022MS003245, https://doi.org/10.1029/2022ms003245, 2022. a

Mauritsen, T., Stevens, B., Roeckner, E., Crueger, T., Esch, M., Giorgetta, M., Haak, H., Jungclaus, J., Klocke, D., Matei, D., Mikolajewicz, U., Notz, D., Pincus, R., Schmidt, H., and Tomassini, L.: Tuning the climate of a global model, J. Adv. Model. Earth Syst., 4, M00A01, https://doi.org/10.1029/2012ms000154, 2012. a, b

Mignot, J., Hourdin, F., Deshayes, J., Boucher, O., Gastineau, G., Musat, I., Vancoppenolle, M., Servonnat, J., Caubel, A., Chéruy, F., Denvil, S., Dufresne, J.-L., Ethé, C., Fairhead, L., Foujols, M.-A., Grandpeix, J.-Y., Levavasseur, G., Marti, O., Menary, M., Rio, C., Rousset, C., and Silvy, Y.: The Tuning Strategy of IPSL-CM6A-LR, J. Adv. Model. Earth Syst., 13, e2020MS002340, https://doi.org/10.1029/2020ms002340, 2021. a

NASA/LARC/SD/ASDC: CERES Energy Balanced and Filled (EBAF) TOA and Surface Monthly means data in netCDF Edition 4.1, https://doi.org/10.5067/TERRA-AQUA/CERES/EBAF_L3B.004.1, 2019. a

Rao, J., Garfinkel, C. I., White, I. P., and Schwartz, C.: The Southern Hemisphere Minor Sudden Stratospheric Warming in September 2019 and its Predictions in S2S Models, J. Geophys. Res.-Atmos., 125, e2020JD032723, https://doi.org/10.1029/2020jd032723, 2020. a

Rasmussen, C. E.: Gaussian processes for machine learning, MIT Press, ISBN 026218253X, https://gaussianprocess.org/gpml/chapters/RW.pdf (last access: 8 August 2024), 2006. a

Rasmussen, C. E. and Williams, C. K. I.: Gaussian Processes for Machine Learning, The MIT Press, ISBN 9780262256834, https://doi.org/10.7551/mitpress/3206.001.0001, 2005. a, b, c, d, e

Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M., and Tarantola, S.: Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index, Comput. Phys. Commun., 181, 259–270, https://doi.org/10.1016/j.cpc.2009.09.018, 2010. a, b

Schmidt, G. A., Bader, D., Donner, L. J., Elsaesser, G. S., Golaz, J.-C., Hannay, C., Molod, A., Neale, R. B., and Saha, S.: Practice and philosophy of climate model tuning across six US modeling centers, Geosci. Model Dev., 10, 3207–3223, https://doi.org/10.5194/gmd-10-3207-2017, 2017.  a

Stengel, M., Stapelberg, S., Sus, O., Schlundt, C., Poulsen, C., Thomas, G., Christensen, M., Carbajal Henken, C., Preusker, R., Fischer, J., Devasthale, A., Willén, U., Karlsson, K.-G., McGarragh, G. R., Proud, S., Povey, A. C., Grainger, R. G., Meirink, J. F., Feofilov, A., Bennartz, R., Bojanowski, J. S., and Hollmann, R.: Cloud property datasets retrieved from AVHRR, MODIS, AATSR and MERIS in the framework of the Cloud cci project, Earth Syst. Sci. Data, 9, 881–904, https://doi.org/10.5194/essd-9-881-2017, 2017. a, b

Tebaldi, C., Debeire, K., Eyring, V., Fischer, E., Fyfe, J., Friedlingstein, P., Knutti, R., Lowe, J., O'Neill, B., Sanderson, B., van Vuuren, D., Riahi, K., Meinshausen, M., Nicholls, Z., Tokarska, K. B., Hurtt, G., Kriegler, E., Lamarque, J.-F., Meehl, G., Moss, R., Bauer, S. E., Boucher, O., Brovkin, V., Byun, Y.-H., Dix, M., Gualdi, S., Guo, H., John, J. G., Kharin, S., Kim, Y., Koshiro, T., Ma, L., Olivié, D., Panickal, S., Qiao, F., Rong, X., Rosenbloom, N., Schupfner, M., Séférian, R., Sellar, A., Semmler, T., Shi, X., Song, Z., Steger, C., Stouffer, R., Swart, N., Tachiiri, K., Tang, Q., Tatebe, H., Voldoire, A., Volodin, E., Wyser, K., Xin, X., Yang, S., Yu, Y., and Ziehn, T.: Climate model projections from the Scenario Model Intercomparison Project (ScenarioMIP) of CMIP6, Earth Syst. Dynam., 12, 253–293, https://doi.org/10.5194/esd-12-253-2021, 2021. a

Tripathi, O. P., Baldwin, M., Charlton‐Perez, A., Charron, M., Eckermann, S. D., Gerber, E., Harrison, R. G., Jackson, D. R., Kim, B., Kuroda, Y., Lang, A., Mahmood, S., Mizuta, R., Roff, G., Sigmond, M., and Son, S.: The predictability of the extratropical stratosphere on monthly time‐scales and its impact on the skill of tropospheric forecasts, Q. J. Roy. Meteorol. Soc., 141, 987–1003, https://doi.org/10.1002/qj.2432, 2014. a

Watson-Parris, D., Williams, A., Deaconu, L., and Stier, P.: Model calibration using ESEm v1.1.0 – an open, scalable Earth system emulator, Geosci. Model Dev., 14, 7659–7672, https://doi.org/10.5194/gmd-14-7659-2021, 2021. a, b

Williamson, D., Goldstein, M., Allison, L., Blaker, A., Challenor, P., Jackson, L., and Yamazaki, K.: History matching for exploring and reducing climate model parameter space using observations and a large perturbed physics ensemble, Clim. Dynam., 41, 1703–1729, https://doi.org/10.1007/s00382-013-1896-4, 2013. a, b, c, d, e, f

Williamson, D. B., Blaker, A. T., and Sinha, B.: Tuning without over-tuning: parametric uncertainty quantification for the NEMO ocean model, Geosci. Model Dev., 10, 1789–1816, https://doi.org/10.5194/gmd-10-1789-2017, 2017. a, b, c, d

Zängl, G., Reinert, D., Rípodas, P., and Baldauf, M.: The ICON (ICOsahedral Non-hydrostatic) modelling framework of DWD and MPI-M: Description of the non-hydrostatic dynamical core, Q. J. Roy. Meteorol. Soc., 141, 563–579, https://doi.org/10.1002/qj.2378, 2014. a, b, c, d

Zhang, T., Li, L., Lin, Y., Xue, W., Xie, F., Xu, H., and Huang, X.: An automatic and effective parameter optimization method for model tuning, Geosci. Model Dev., 8, 3579–3591, https://doi.org/10.5194/gmd-8-3579-2015, 2015. a

Short summary
Tuning a climate model means adjusting uncertain parameters in the model to best match observations like the global radiation balance and cloud cover. This is usually done by running many simulations of the model with different settings, which can be time-consuming and relies heavily on expert knowledge. To make this process faster and more objective, we developed a machine learning emulator to create a large ensemble and apply a method called history matching to find the best settings.