Earth system models (ESMs) are the primary tools for investigating future Earth system states at timescales from decades to centuries, especially in response to anthropogenic greenhouse gas release. State-of-the-art ESMs can reproduce the observational global mean temperature anomalies of the last 150 years. Nevertheless, ESMs need further improvements, most importantly regarding (i) the large spread in their estimates of climate sensitivity, i.e., the temperature response to increases in atmospheric greenhouse gases; (ii) the modeled spatial patterns of key variables such as temperature and precipitation; (iii) their representation of extreme weather events; and (iv) their representation of multistable Earth system components and the ability to predict associated abrupt transitions. Here, we argue that making ESMs automatically differentiable has great potential to advance them, especially with respect to these key shortcomings. First, automatic differentiability would allow objective calibration of ESMs, i.e., the selection of optimal values with respect to a cost function for a large number of free parameters, which are currently tuned mostly manually. Second, recent advances in machine learning (ML) and in the number, accuracy, and resolution of observational data promise to be helpful with at least some of the above aspects because ML may be used to incorporate additional information from observations into ESMs. Automatic differentiability is an essential ingredient in the construction of such hybrid models, combining process-based ESMs with ML components. We document recent work showcasing the potential of automatic differentiation for a new generation of substantially improved, data-informed ESMs.

Comprehensive Earth system models (ESMs) are the key tools to model the dynamics of the Earth system and its climate and in particular to estimate the impacts of increasing atmospheric greenhouse gas concentrations in the context of anthropogenic climate change

With the recent advances in the number, accuracy, and resolution of observational data, it has been suggested that ESMs could benefit from more direct ways of including observation-based information, e.g., in the parameters of ESMs

ESMs couple general circulation models (GCMs) of the ocean and atmosphere with models of land surface processes, hydrology, ice, vegetation, atmosphere and ocean chemistry, and the carbon cycle. To our knowledge, there are currently no comprehensive, fully differentiable ESMs; only selected ESM components are differentiable. Most commonly, these are GCMs used for numerical weather prediction, as they typically rely on gradient-based data assimilation methods. These GCMs often achieve differentiability with manually derived adjoint models rather than via AD. However, considerable effort has also been invested in frameworks such as dolfin-adjoint for finite-element models

Differentiable programming enables several advantages for the development of ESMs that we will outline in this article. First, differentiable ESMs would allow substantial improvements regarding the systematic calibration of ESMs, i.e., finding optimal values for their

Parameter calibration is probably the most obvious benefit of differentiable programming for ESMs. We therefore first review the current state of ESM calibration before introducing differentiable programming and automatic differentiation. Thereafter, we discuss the benefits we see for differentiable ESMs and the challenges that have to be addressed in their development.

Comprehensive ESMs, such as those used for the projections of the Coupled Model Intercomparison Project (CMIP)

Differentiable programming is a paradigm that enables building parameterized models whose parameters can be optimized using gradient-based optimization

It is important to note that AD is neither numerical nor symbolic differentiation: it does not approximate derivatives numerically with finite differences, and it does not construct derivatives from analytic expressions as computer algebra systems do. Instead, AD computes the derivative of an evaluation of some function of a given model output, based on a non-standard execution of its code: the function evaluation is decomposed into an evaluation trace, or computational graph, that tracks every elementary operation performed. Ultimately, there is only a finite set of elementary operations, such as arithmetic operations or trigonometric functions, and the derivatives of these elementary operations are known to the AD system. By applying the chain rule of differentiation, the desired derivative can then be computed. AD systems can operate in two main modes: a forward mode, which traverses the computational graph from the given input of a function to its output, and a reverse mode, which goes from function output to input. Reverse-mode AD scales better with the input size, which is why it is typically preferred for optimization tasks, which usually have only a single output – a cost function – but many inputs (see, e.g.,
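These ideas can be made concrete in a few lines of code. The following sketch implements forward-mode AD with dual numbers in plain Python; the `Dual` class and the scalar setting are deliberate simplifications of what real AD systems do:

```python
import math

# Minimal forward-mode AD via dual numbers: every elementary
# operation propagates the pair (value, derivative) through the
# chain rule, mirroring how an AD system traverses the
# computational graph from input to output.
class Dual:
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def sin(x):
    # the derivative of the elementary operation sin is known to the AD system
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)

# derivative of f(x) = x * sin(x) at x = 2, seeded with der = 1
x = Dual(2.0, 1.0)
y = x * sin(x)
# y.der equals the analytic derivative sin(2) + 2*cos(2)
```

Reverse-mode AD would instead record the evaluation trace during the forward pass and propagate adjoints backwards through it, which pays off when there are many inputs but only a single scalar output.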

The defining feature of differentiable models is the efficient and automatic computation of gradients of functions of the model output with respect to (i) model parameters, (ii) initial conditions, or (iii) boundary conditions. Applying this paradigm to ESMs would enable gradient-based optimization of their parameters and the application of other methods that require gradient information. For example, suppose for simplicity that the dynamics of an ESM may be represented, after discretization on an appropriate spatial grid, by an ordinary differential equation of the form

Crucially, differentiable programming allows these derivatives to be computed for arbitrary choices of

Taken together, such approaches would directly move forward from the mostly applied manual and subjective parameter tuning to transparent, systematic, and objective parameter optimization. Moreover, automatic differentiability of ESMs would provide an essential prerequisite for the integration of data-driven methods, such as ANNs, resulting in hybrid ESMs

Aside from AD, another approach to differentiable models is to manually derive and implement an adjoint model, usually from a tangent linear model. This is especially common in GCMs that have been used for numerical weather prediction, as data assimilation schemes such as 4D-Var

Adjoint models of several ESM components have already been generated automatically with AD tools such as Transforms of Algorithms in Fortran (TAF)
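The idea behind an adjoint model can be illustrated with a toy time-stepping loop. The sketch below hand-derives the adjoint (reverse) pass of a forward-Euler discretization of dx/dt = -λx and computes the sensitivity of the final state to λ; the linear decay model and all numbers are illustrative stand-ins for an actual model component:

```python
# Hand-derived adjoint of a toy forward-Euler loop for the linear
# decay model dx/dt = -lam * x.  The parameter lam stands in for a
# tunable model parameter; the reverse pass accumulates the
# sensitivity d x_N / d lam, which is what an adjoint model provides.
def euler_forward(lam, x0, h, n):
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] + h * (-lam * xs[-1]))  # x_{k+1} = (1 - h*lam) * x_k
    return xs

def final_state_and_grad(lam, x0, h, n):
    xs = euler_forward(lam, x0, h, n)
    adj, g = 1.0, 0.0              # adjoint of x_N, accumulated d x_N / d lam
    for k in range(n - 1, -1, -1):
        g += adj * (-h * xs[k])    # explicit lam-dependence of step k
        adj *= 1.0 - h * lam       # d x_{k+1} / d x_k
    return xs[-1], g

lam, x0, h, n = 0.5, 1.0, 0.1, 100
xN, g = final_state_and_grad(lam, x0, h, n)
# analytic check: x_N = x0*(1-h*lam)^n and d x_N/d lam = -n*h*x0*(1-h*lam)^(n-1)
```

AD tools such as TAF automate exactly this kind of derivation, at the level of the full model source code rather than a single scalar recurrence.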

Differentiable programming enables gradient- and Hessian-based optimization of ESMs. Usually, gradients of a cost function
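As a minimal sketch of such a gradient-based calibration, consider tuning the feedback parameter of a zero-dimensional energy-balance model so that its equilibrium warming T = F/λ matches a synthetic observation; the gradient is written out analytically here, whereas in a differentiable ESM it would come from AD, and all numbers are assumptions for illustration:

```python
# Toy gradient-based calibration: tune the climate feedback
# parameter lam so that the equilibrium warming T = F / lam of a
# zero-dimensional energy-balance model matches a synthetic
# "observed" warming.  The analytic gradient stands in for what AD
# would deliver automatically.
F     = 3.7    # radiative forcing, W m^-2 (assumed)
T_obs = 3.0    # synthetic observed equilibrium warming, K

def cost(lam):
    return (F / lam - T_obs) ** 2

def grad(lam):
    return 2.0 * (F / lam - T_obs) * (-F / lam ** 2)   # dJ/dlam

lam, lr = 2.0, 0.05          # initial guess, learning rate
for _ in range(500):
    lam -= lr * grad(lam)
# lam converges to F / T_obs, the value minimizing the cost
```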

Fully automatically differentiable ESMs would enable the gradient-based optimization of all model parameters. Moreover, in the context of data assimilation, AD would strongly facilitate the search for optimal initial conditions for a model in question, given a set of incomplete observations. In the context of Earth system modeling, AD could lead to substantial advances mainly in three fields (Fig.

Gradient-based optimization as facilitated by AD would allow for transparent, systematic, and objective calibration of ESMs (see Sect.

Uncertainties in ESMs can stem either from the internal variability of the system (aleatoric uncertainty) or from our lack of knowledge of the modeled processes or of the data used to calibrate them (epistemic uncertainty). In climate science, the latter is also referred to as model uncertainty, consisting of structural uncertainty and parameter uncertainty. Assessing these two classes of epistemic uncertainty is crucial for understanding the model itself and its limitations, and it also increases the reproducibility of studies conducted with these models

Regarding parameter uncertainty, of particular interest are probability distributions of parameters of an ESM, given calibration data and hyperparameters; see, e.g.,
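One way such posterior distributions could be explored with AD-supplied gradients is Langevin sampling. The sketch below runs unadjusted Langevin dynamics on a stand-in one-dimensional Gaussian "posterior"; in a real application, `grad_log_post` would be the AD gradient of the log-posterior of an ESM parameter, and the missing Metropolis correction introduces a small step-size bias:

```python
import math
import random

# Gradient-based posterior sampling sketch: unadjusted Langevin
# dynamics driven by the gradient of the log-posterior -- the kind
# of gradient a differentiable ESM would supply via AD.  The
# "posterior" is a stand-in 1-D Gaussian with mean mu and standard
# deviation sd.
random.seed(42)
mu, sd = 1.2, 0.3

def grad_log_post(theta):
    return -(theta - mu) / sd ** 2

eps, theta, samples = 0.01, 0.0, []
for i in range(20000):
    theta += (0.5 * eps * grad_log_post(theta)
              + math.sqrt(eps) * random.gauss(0.0, 1.0))
    if i >= 2000:                     # discard burn-in
        samples.append(theta)

post_mean = sum(samples) / len(samples)
# post_mean approximates the posterior mean mu
```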

Gradient-based optimization is not limited to the intrinsic parameters of an ESM; it also allows for the integration of data-driven models. ANNs and other ML methods can be used either to accelerate ESMs, by replacing computationally costly process-based model components with ML-based emulators, or to learn previously unresolved influences from data.
It is also possible to combine ANNs with process-based physical equations of motion, e.g., via the universal differential equation framework

Many physical processes occur on scales too small to be explicitly resolved in ESMs, for example the formation of individual clouds. To nevertheless obtain a closed description of the dynamics, parameterizations of the processes operating below the grid scale are necessary. While training ANNs is computationally expensive, once trained, their execution is usually much cheaper than integrating the physical model component that they emulate.
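As a toy illustration of such an emulator, the following sketch trains a small fully connected network, with hand-written backpropagation, to emulate a stand-in "subgrid flux" sin(πx); the network size, data, and learning rate are arbitrary illustrative choices:

```python
import math
import random

# Toy ANN parameterization: a one-hidden-layer tanh network with
# hand-written backpropagation learns a stand-in "subgrid flux"
# sin(pi * x) from sampled training data.
random.seed(0)

xs = [i / 20.0 for i in range(-20, 21)]        # resolved-scale inputs
ys = [math.sin(math.pi * x) for x in xs]       # unresolved flux to learn

H = 16
w1 = [random.gauss(0.0, 0.5) for _ in range(H)]
b1 = [0.0] * H
w2 = [random.gauss(0.0, 0.5) for _ in range(H)]
b2 = 0.0

def forward(x):
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    return h, sum(w2[j] * h[j] for j in range(H)) + b2

def loss():
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

loss_start, lr = loss(), 0.05
for _ in range(2000):                          # full-batch gradient descent
    g1, gb1 = [0.0] * H, [0.0] * H
    g2, gb2 = [0.0] * H, 0.0
    for x, y in zip(xs, ys):
        h, out = forward(x)
        d = 2.0 * (out - y) / len(xs)          # dL/d(out)
        for j in range(H):
            g2[j] += d * h[j]
            dh = d * w2[j] * (1.0 - h[j] ** 2) # back through tanh
            g1[j] += dh * x
            gb1[j] += dh
        gb2 += d
    for j in range(H):
        w1[j] -= lr * g1[j]
        b1[j] -= lr * gb1[j]
        w2[j] -= lr * g2[j]
    b2 -= lr * gb2

loss_end = loss()
```

Once trained, evaluating `forward` is a handful of cheap arithmetic operations, regardless of how expensive the process it emulates would be to simulate explicitly.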

A growing number of processes in ESMs are not based on fundamentally known primitive equations of motion, such as the Navier–Stokes equations of fluid dynamics for the atmosphere and oceans. For example, vegetation models are typically not primarily based on primitive physical equations of the underlying processes but rather on effective empirical relationships and ecological paradigms. For suitable applications, many ESM components, such as those describing land surface and vegetation processes, ice sheets, the carbon cycle, or subgrid-scale processes in the ocean and atmosphere, could be replaced or augmented with data-driven ANN models. Even components that are based on known primitive equations of motion have to be discretized to finite spatial grids in practice so that they can be integrated numerically. The resulting parameterizations will necessarily introduce errors that can be attenuated by suitable data-driven and especially ML methods. For example,

Ideally, a hybrid ESM could combine both of these approaches (emulation and modeling of unresolved processes) and perform its final parameter optimization for both the physically motivated parameters and the ANN parameters at once, in the full hybrid model. Differentiable programming would enable such a procedure. Differentiable ESMs are thus prime candidates for strongly coupled neural ESMs in the terminology of

Essentially, ESMs are algorithms that integrate discretized versions of differential equations describing the dynamics of processes in the Earth system. As such, a comprehensive, hybrid, differentiable ESM could constitute a UDE

When replacing or augmenting parts of an ESM with ML methods such as ANNs, one has to train the ANN by minimizing a cost function

Existing applications of ML methods to subgrid parameterizations in ESMs generally follow a three-step procedure

1. Training data are generated from a reference model, for example a model of some subgrid-scale process that would be too costly to incorporate explicitly in the full ESM.

2. An ML model is trained to emulate the reference model or some part of it.

3. The trained ML component is integrated into the full ESM, resulting in a hybrid ESM.
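This three-step procedure can be sketched end to end with stand-ins: an "expensive" reference parameterization, a cheap offline-fitted surrogate (a truncated Taylor expansion takes the place of a trained ANN here), and a toy host model into which the surrogate is coupled:

```python
import math

# Sketch of the three-step offline procedure.  The reference
# parameterization, its surrogate, and the host model
# dT/dt = -flux(T) are all illustrative stand-ins.

# step 1: reference subgrid parameterization (training-data source)
def reference_flux(T):
    return 0.1 * math.tanh(3.0 * T)

# step 2: an offline-fitted surrogate stands in for a trained ANN
# (here simply the Taylor expansion of the reference, valid for small T)
def emulated_flux(T):
    return 0.1 * (3.0 * T - 9.0 * T ** 3)

# step 3: couple each into the host model and compare trajectories
def integrate(flux, T0=0.1, h=0.1, n=100):
    T = T0
    for _ in range(n):
        T = T + h * (-flux(T))
    return T

T_ref = integrate(reference_flux)
T_emu = integrate(emulated_flux)
# the hybrid (emulated) trajectory stays close to the reference one
```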

Recent work demonstrates the advantages of training the ML component in online mode using differentiable programming techniques.

An additional benefit of training a hybrid ESM in online mode is the ability to optimize not only with respect to specific processes, but also with respect to the overall model climate. An ML parameterization trained in offline mode will typically be trained to emulate the outputs of an existing process-based parameterization for a plausible range of inputs. However, even for models that perform very well offline, it is not known whether they will produce a realistic climate until they are coupled to the ESM after training. In contrast, an equivalent ML parameterization trained in online mode can be optimized not only with respect to the outputs of the parameterization itself, but also with respect to the reproduction of a realistic overall climate.

In a typical scenario for hybrid ESMs, e.g., an ANN-based subgrid parameterization, online learning can also lead to more stable and accurate solutions, as showcased by studies in fluid dynamics

While the benefits of differentiable ESMs are extremely promising, they come at a cost. Every AD system has certain limitations, and a capable AD system might not even exist in the programming language in which an existing ESM is written. Many state-of-the-art AD systems have been designed with ML workflows in mind, which usually consist of pure functions, and therefore offer only limited support for array mutation and in-place updates

Memory demand is a fundamental challenge when computing gradients of functions of trajectories of ESMs over many time steps; saving all intermediate steps needed to compute the gradient requires a prohibitively large amount of RAM. Checkpointing schemes therefore balance memory usage against the recomputation of intermediate steps. Different schemes, such as periodic checkpointing or Revolve, try to optimize this balance
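A periodic checkpointing scheme can be sketched for a scalar toy recurrence: the forward pass stores only every C-th state, and each segment is recomputed from its checkpoint during the reverse pass. Revolve would place checkpoints optimally, which this simple sketch does not attempt:

```python
import math

# Periodic checkpointing for the reverse pass through a toy scalar
# recurrence x_{k+1} = f(x_k): store every C-th state, recompute the
# rest segment by segment.  Memory drops from N+1 stored states to
# roughly N/C + C, at the price of one extra forward sweep.
def f(x):
    return x + 0.01 * math.sin(x)          # one model time step

def fprime(x):
    return 1.0 + 0.01 * math.cos(x)        # its derivative

N, C = 1000, 100                           # steps, checkpoint interval

def grad_full(x0):                         # stores every state
    xs = [x0]
    for _ in range(N):
        xs.append(f(xs[-1]))
    adj = 1.0
    for k in range(N - 1, -1, -1):
        adj *= fprime(xs[k])
    return adj

def grad_checkpointed(x0):                 # stores only checkpoints
    ckpts, x = [x0], x0
    for k in range(1, N + 1):
        x = f(x)
        if k % C == 0:
            ckpts.append(x)
    adj = 1.0
    for seg in range(len(ckpts) - 2, -1, -1):
        xs, x = [ckpts[seg]], ckpts[seg]   # recompute this segment
        steps = min(C, N - seg * C)
        for _ in range(steps):
            x = f(x)
            xs.append(x)
        for k in range(steps - 1, -1, -1):
            adj *= fprime(xs[k])
    return adj
```

Both variants yield the same gradient; only the memory/compute trade-off differs.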

ESMs utilize different discretization techniques and solvers: (pseudo-)spectral, finite-volume, finite-element, and other approaches can be used in the different components of ESMs. Differentiable modeling is possible for all of these approaches in principle. While this is relatively straightforward for spectral models, it has also been demonstrated for finite-volume and finite-element solvers

Solvers can also make use of AD during their forward computation, e.g., when solving the involved nonlinear equation systems. Some solvers also have to make use of slope or flux limiters to eliminate spurious oscillations close to discontinuities of the solution (see, e.g.,

Aside from technical challenges, a more fundamental problem to address is the chaotic nature of the processes represented in ESMs. Nearby trajectories quickly diverge from each other, which makes optimization based on gradients of functions of trajectories error prone if the practitioner is not aware of this. Often, gradients computed via AD as well as via iterative methods and adjoint sensitivity analysis are orders of magnitude too large because of ill-conditioned Jacobians and the resulting exponential error accumulation; see
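The effect is easy to reproduce with the chaotic logistic map: the chain-rule product of per-step Jacobians grows on average like exp(λ_L N) with Lyapunov exponent λ_L = ln 2, so gradients over long horizons explode:

```python
# Chain-rule gradient of the final state of the chaotic logistic map
# x_{k+1} = 4 x_k (1 - x_k) with respect to the initial condition.
def grad_wrt_x0(x0, n):
    x, g = x0, 1.0
    for _ in range(n):
        g *= 4.0 * (1.0 - 2.0 * x)   # per-step Jacobian d x_{k+1} / d x_k
        x = 4.0 * x * (1.0 - x)
    return g

g20 = abs(grad_wrt_x0(0.2, 20))
g60 = abs(grad_wrt_x0(0.2, 60))
# g60 exceeds g20 by many orders of magnitude: cost-function
# gradients over long trajectories of a chaotic model are dominated
# by this exponential growth and become useless for optimization.
```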

An additional challenge for differentiable ESMs is the inclusion of physical priors and conservation laws. While the parameters of a differentiable ESM may generally be varied freely during gradient-based optimization, it is nonetheless desirable that they should be constrained to values which lead to physically consistent model trajectories. This challenge is particularly acute for hybrid ESMs, in which some physical processes may be represented by ML model components with many optimizable parameters. Enforcing physical constraints is an essential step towards ensuring that hybrid ESMs, which are tuned to present-day climate and the historical record, will nonetheless generalize well to unseen future climates.

A number of approaches have been proposed to combine physical laws with ML. Physics-informed neural networks
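One simple way to enforce a linear conservation law exactly is a projection layer appended to the ML component: whatever the network predicts, the output is corrected so the constraint holds, and because the correction is linear it remains differentiable. The variable names and numbers below are illustrative:

```python
# Differentiable "projection layer" enforcing a linear conservation
# law exactly: the raw ML output is shifted by a constant so that
# the mass-weighted column sum matches a prescribed total.
def project_conserve(raw, weights, total):
    s = sum(w * r for w, r in zip(weights, raw))
    corr = (total - s) / sum(weights)
    return [r + corr for r in raw]

raw_heating = [0.3, -0.1, 0.2, 0.05]   # raw ML-predicted heating rates
masses      = [1.0, 2.0, 2.0, 1.0]     # layer masses (assumed)
fixed = project_conserve(raw_heating, masses, 0.0)
# the corrected profile conserves column-integrated heating exactly
```

Because the correction is linear in the raw output, gradients pass through it during training, so the hard constraint is fully compatible with gradient-based optimization.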

The ever-increasing availability of data, recent advances in AD systems, optimization methods, and data-driven modeling from ML create the opportunity to develop a new generation of ESMs that are automatically differentiable. With such models, long-standing challenges like systematic calibration, comprehensive sensitivity analyses, and uncertainty quantification can be tackled, and new ground can be broken with the incorporation of ML methods into the process-based core of ESMs.

Ideally, every single ESM component, including the couplers, would be differentiable, which will of course take a considerable amount of work to realize. Differentiable programming requires different programming languages and styles than have been common practice in ESM development. Automated code translation might assist this process in the future, as ML-based tools like ChatGPT

If almost all components of an ESM are differentiable, but one component is not, one might still be able to achieve a fully differentiable model through implicit differentiation, as recent advances also work towards automating implicit differentiation
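A sketch of the underlying idea: for a component defined implicitly by a fixed-point solve x = f(x, a), the implicit function theorem yields dx*/da = (∂f/∂a)/(1 − ∂f/∂x) at the solution, without differentiating through the solver iterations; the specific map tanh(a + b·x) is an illustrative stand-in:

```python
import math

# Implicit differentiation through an iterative fixed-point solver
# for x = tanh(a + b*x).  The derivative of the solution with
# respect to a follows from the implicit function theorem and never
# differentiates through the solver loop, so the (non-AD-ready)
# solver can remain a black box.
b = 0.5

def solve(a, iters=200):
    x = 0.0
    for _ in range(iters):                # plain fixed-point iteration
        x = math.tanh(a + b * x)
    return x

def dxda_implicit(a):
    x = solve(a)
    s = 1.0 - x * x                       # d tanh(u)/du = 1 - tanh(u)^2
    return s / (1.0 - b * s)              # implicit function theorem

a, eps = 0.3, 1e-6
fd = (solve(a + eps) - solve(a - eps)) / (2.0 * eps)   # finite-difference check
```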

For the calibration of ESMs, differentiable programming enables not only gradient-based optimization of all parameters together, but also more carefully chosen procedures in which expert knowledge is combined with the optimization of individual parameters. Similarly, the cost function used in the tuning process can easily be varied and experimented with. The ability to objectively optimize all parameters of a differentiable ESM does not, of course, imply that all parameters should be optimized. Rather, differentiable ESMs allow documenting, in a transparent manner, which parameters are calibrated to best reproduce a given feature of the Earth system.

Where previous studies had to use emulators of ESMs to showcase the potential of differentiable models, fully differentiable ESMs can harness this potential while maintaining the process-based core of these models. Differentiable ESMs also enable this process-based core to be supplemented with ML methods more easily. Deep learning has shown enormous potential, e.g., for subgrid parameterizations, for attenuating structural deficiencies, and for speeding up individual slow components by replacing them with ML-based emulators. Differentiable ESMs make this process easier. The possibility of online training of ML models within ESMs promises to lead to more accurate and stable solutions of the combined hybrid ESM.

Aside from this, differentiable ESMs also enable further studies on the sensitivity and stability of the Earth’s climate, which previously had to rely on gradient-free methods. For example, algorithms to construct response operators to further study how fluctuations, natural variability, and response to perturbations relate to each other

Differentiable ESMs are a crucial next step toward improved understanding of the Earth’s climate system, as they would be able to fully leverage increasing availability of high-quality observational data and to naturally incorporate techniques from ML to combine process understanding with data-driven learning.

MG led the preparation of the manuscript. All authors discussed the outline and contributed to writing the manuscript.

The contact author has declared that none of the authors has any competing interests.

No data sets were used in this article.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work received funding from the Volkswagen Foundation. NB acknowledges further funding from the European Union's Horizon 2020 Research and Innovation program under grant agreement no. 820970 and the Marie Sklodowska-Curie program under grant agreement no. 956170, as well as the Federal Ministry of Education and Research under grant no. 01LS2001A. This is TiPES contribution no. 175.

This research has been supported by the Horizon 2020 (TiPES (grant no. 820970)), the Horizon Europe Marie Sklodowska-Curie Actions (grant no. 956170), the Bundesministerium für Bildung und Forschung (grant no. 01LS2001A), and the Volkswagen Foundation. This work was supported by the Technical University of Munich (TUM) in the framework of the Open Access Publishing Program.

This paper was edited by David Ham and Rolf Sander, and reviewed by Samuel Hatfield and one anonymous referee.