Over the last couple of years, machine learning parameterizations have emerged as a potential way to improve the representation of subgrid processes in Earth system models (ESMs). So far, all studies were based on the same three-step approach: first a training dataset was created from a high-resolution simulation, then a machine learning algorithm was fitted to this dataset, before the trained algorithm was implemented in the ESM. The resulting online simulations were frequently plagued by instabilities and biases. Here, coupled online learning is proposed as a way to combat these issues. Coupled learning can be seen as a second training stage in which the pretrained machine learning parameterization, specifically a neural network, is run in parallel with a high-resolution simulation. The high-resolution simulation is kept in sync with the neural network-driven ESM through constant nudging. This enables the neural network to learn from the tendencies that the high-resolution simulation would produce if it experienced the states the neural network creates. The concept is illustrated using the Lorenz 96 model, where coupled learning is able to recover the “true” parameterizations. Further, detailed algorithms for the implementation of coupled learning in 3D cloud-resolving models and the super parameterization framework are presented. Finally, outstanding challenges and issues not resolved by this approach are discussed.

The representation of subgrid processes, especially clouds, is the main cause of uncertainty in climate projections and a large error source in weather predictions

Confusingly, even though the paper appears to have been published in 1995, most people refer to the model as the Lorenz 96 model.

Over the last couple of years, several attempts have been made to build ML subgrid parameterizations, all of which followed a similar approach (Fig.

Schematic overview of ML parameterization workflow with and without coupled online learning.

The three attempts differ in training data and ML algorithms used. In

In some SPCAM versions radiation is computed on the CRM grid.

, surface processes and the dynamics are computed on the GCM grid as usual. Compared to a global 3D CRM, SP is less realistic but has several conceptual and technical advantages. First, subgrid- and grid-scale processes are clearly separated, which makes it easy to define the parameterization task for a ML algorithm. Second, because the CRM lives in isolation, it exactly conserves certain quantities (e.g., energy and mass). A third, very practical advantage is that SP simulations are significantly cheaper than global 3D CRMs. In our study we trained a deep neural network to emulate the CRM tendencies. The offline validation scores were very encouraging

BB18 then fitted a neural network to the coarse-grained data, which produced good results in offline mode. In online mode, however, they also experienced instabilities.

The third online parameterization by

That is, at least to a good degree of approximation. Predictions of decision trees and therefore also random forests are averages over several training targets. Each target will perfectly obey constraints. Since the conservation constraints are likely nonlinear, an average does not necessarily keep this property but probably comes close.
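The averaging argument can be checked with a small numerical example. The sketch below is purely illustrative: the two "training targets" and both constraints are hypothetical stand-ins, not quantities from any of the cited studies. It shows that an average of targets preserves a linear constraint exactly, violates a nonlinear one, and that the violation becomes small when the averaged targets lie close together.

```python
import numpy as np

# Two hypothetical training targets that both satisfy
#   a linear constraint:    x + y = 1
#   a nonlinear constraint: x**2 + y**2 = 1
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
prediction = targets.mean(axis=0)  # what a tree leaf holding both targets predicts

linear_residual = prediction.sum() - 1.0            # remains zero
nonlinear_residual = (prediction ** 2).sum() - 1.0  # no longer zero

# Targets that lie close together on the unit circle violate the
# nonlinear constraint only slightly when averaged.
angles = np.array([0.0, 0.1])
close_targets = np.column_stack([np.cos(angles), np.sin(angles)])
close_residual = (close_targets.mean(axis=0) ** 2).sum() - 1.0

print(linear_residual, nonlinear_residual, close_residual)
```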

. Comparing the results of

Coupled online learning is essentially a second training step after the first offline training on a reference dataset. The basic idea of coupled learning is to run the low-resolution model with the machine learning parameterization (the ML-LR model) in parallel with the high-resolution (HR) model and to train the network every time step or every few time steps (3b in Fig.

A note on the terminology: I will use the terms HR (high-resolution) and LR (low-resolution) here when speaking about the general algorithm. When talking specifically about atmospheric science applications, I will use the more common terms CRM and GCM.

. The HR model is continuously nudged towards the LR model state, keeping the two simulations close to each other. How close the two runs are depends on the nudging timescale

The instability issues in previous studies can also be seen as a consequence of overfitting to the reference simulation used for training. Once the ML parameterization is coupled to the LR model it will create its own climate, which likely lies somewhat outside the training manifold. This can easily lead to problems because neural networks struggle to extrapolate beyond what they have seen during training. Coupled learning combats this problem by extending the training with HR targets for each state that the ML-LR model produces.
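How the nudging timescale controls the distance between the two runs can be illustrated with a scalar toy problem. This is a sketch under simplifying assumptions, not part of the algorithms in this paper: the LR state is held fixed at zero, the HR variable feels a constant hypothetical `drift` that pulls it away, and nudging is written as a linear relaxation term `(x_lr - x) / tau`. In this setting the steady-state offset equals `tau * drift`, so a shorter timescale keeps the runs closer together.

```python
def nudged_offset(tau, drift=1.0, dt=0.01, n_steps=2000):
    """Integrate dx/dt = drift + (x_lr - x) / tau with forward Euler.

    x stands in for an HR variable, x_lr = 0 for a (here fixed) LR state,
    and drift for whatever pulls the HR model away from the LR state.
    Returns the final distance between the HR and LR values.
    """
    x_lr, x = 0.0, 0.0
    for _ in range(n_steps):
        x += dt * (drift + (x_lr - x) / tau)
    return x - x_lr

# Shorter nudging timescale -> smaller steady-state offset (tau * drift).
for tau in (0.1, 1.0):
    print(tau, nudged_offset(tau))
```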

Evolution of a tracer

The algorithmic details of coupled learning differ depending on the exact model setup. The main contribution of this paper will be to describe coupled learning algorithms for the simple L96 model as well as global 3D HR models and SP models. To understand how coupled learning actually works it is helpful to draw diagrams for the evolution of a tracer

Note that “tendencies” are defined per unit of time, while “increments” are tendencies multiplied by a time step.

I will call this the

Typically, in a LR model time step the physics is run before the dynamics. But where the time step starts and ends is arbitrary, so the two can be switched without problems.

The L96 model

For animations of the L96 system, see

Blue dots are data points from a reference simulation with the real L96 parameters. The solid orange and green lines are the linear regression and neural network parameterization fitted to this data. The red dots are data points from the L96 simulations with “wrong” parameter values used for pretraining. The dashed lines are the parameterization fits for these wrong values, which serve as a starting point for the coupled learning experiments.

For parameterization research,

Two parameterizations will be considered: a linear regression and a neural network. The linear regression case is easily interpretable and helps to illustrate the learning procedure, while the neural network is a more realistic case.

The linear regression parameterization looks as follows:

Neural networks consist of one or multiple layers of linearly connected nodes, modified by nonlinear activation functions
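As a minimal sketch of this structure, the forward pass of a small fully connected network can be written in a few lines of NumPy. The layer sizes, the ReLU activation and the random initialization are assumptions chosen for illustration and are not necessarily those used in the experiments; `init_mlp` and `mlp_forward` are hypothetical helper names.

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Random weights and zero biases for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out)),
             np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Linear layers with a ReLU nonlinearity after every hidden layer."""
    for i, (W, bias) in enumerate(params):
        x = x @ W + bias            # linear connection between layers
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # nonlinear activation (ReLU)
    return x

# One input (a resolved variable) and one output (a subgrid tendency).
params = init_mlp([1, 32, 32, 1])
X = np.linspace(-5.0, 15.0, 7).reshape(-1, 1)
B = mlp_forward(params, X)
print(B.shape)
```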

For a great introduction to neural networks, see

To mimic the situation in a real climate model where the parameterization would first be pretrained

All experiments were done in a Jupyter notebook that can be launched via Binder from the GitHub repository at

Algorithm 1 outlines the workflow for coupled learning in the L96 framework. There are several hyper-parameters. First, we have the time steps

The experiments indicate that coupled learning works well in both cases (see Jupyter notebook; Fig.

Another hyper-parameter is the update frequency of the neural network

Finally, the nudging timescale

The same algorithm can be used to train much more complicated parameterizations such as a neural network (Fig.

The L96 model, while commonly used to test parameterization and data assimilation approaches, only represents a small fraction of the challenges that algorithms face in real GCMs. In particular, L96 does not exhibit any of the issues that require a coupled learning approach in the first place; an offline parameterization for the L96 model is stable and does not show major biases. Demonstrating the method with the L96 model therefore serves mostly as a sanity check. Having confirmed that coupled learning works in this simple framework gives us more confidence to apply it to more complex systems.
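The coupled learning loop for L96 with a linear regression parameterization can be sketched roughly as follows. This is not the notebook code and not a transcription of Algorithm 1. It assumes the standard two-level L96 equations, a linear relaxation form of the nudging term, plain gradient descent on the mean squared error, and illustrative values for all parameters (`K`, `J`, `h`, `F`, `c`, `b`, the time steps, `tau` and the learning rate). The starting values `a = 0`, `b0 = 0` merely stand in for a poorly pretrained fit, and convergence for these settings has not been verified against the notebook.

```python
import numpy as np

K, J = 8, 8                          # numbers of slow / fast variables (assumed)
h, F, c, b = 1.0, 10.0, 10.0, 10.0   # two-level L96 parameters (assumed)

def slow_terms(X):
    """Advection, damping and forcing of the slow variables."""
    return -np.roll(X, 1) * (np.roll(X, 2) - np.roll(X, -1)) - X + F

def subgrid(Y):
    """Coupling term felt by each X_k: the target of the parameterization."""
    return -(h * c / b) * Y.reshape(K, J).sum(axis=1)

def hr_rhs(S, X_lr, tau):
    """Two-level L96 tendencies with the X variables nudged toward X_lr."""
    X, Y = S[:K], S[K:]
    dX = slow_terms(X) + subgrid(Y) + (X_lr - X) / tau
    dY = (-c * b * np.roll(Y, -1) * (np.roll(Y, -2) - np.roll(Y, 1))
          - c * Y + (h * c / b) * np.repeat(X, J))
    return np.concatenate([dX, dY])

def rk4(f, s, dt):
    """One fourth-order Runge-Kutta step."""
    k1 = f(s)
    k2 = f(s + 0.5 * dt * k1)
    k3 = f(s + 0.5 * dt * k2)
    k4 = f(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def coupled_learning(n_steps=500, dt_lr=0.01, n_sub=10, tau=0.1,
                     learning_rate=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    X_lr = F * rng.random(K)
    S_hr = np.concatenate([X_lr, 0.1 * rng.standard_normal(K * J)])
    a, b0 = 0.0, 0.0   # starting guess for the parameterization B = a * X + b0
    for _ in range(n_steps):
        # 1. HR model: n_sub small steps, nudged toward the current LR state.
        B_hr = np.zeros(K)
        for _sub in range(n_sub):
            S_hr = rk4(lambda s: hr_rhs(s, X_lr, tau), S_hr, dt_lr / n_sub)
            B_hr += subgrid(S_hr[K:]) / n_sub
        # 2. One gradient-descent step on the mean squared error between the
        #    parameterization at the LR states and the HR subgrid tendency.
        err = a * X_lr + b0 - B_hr
        a -= learning_rate * 2.0 * np.mean(err * X_lr)
        b0 -= learning_rate * 2.0 * np.mean(err)
        # 3. ML-LR model: one step with the updated parameterization.
        X_lr = rk4(lambda x: slow_terms(x) + a * x + b0, X_lr, dt_lr)
    return a, b0, X_lr, S_hr

a, b0, X_lr, S_hr = coupled_learning()
print(f"learned slope a = {a:.3f}, intercept b0 = {b0:.3f}")
```

In this sketch the parameters are updated at every LR time step from the `K` current grid points; updating only every few steps from a larger collected batch would be a straightforward variation.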

In this section, I will outline how coupled learning algorithms can be applied to 3D CRMs and super-parameterized GCMs.

The 3D HR case is similar to the L96 setup (Algorithm 2). The key difference is that the scale separation is not as clearly defined as in L96 or SP. Instead, downscaling (coarse-graining) and upscaling are required to get the HR state on the LR model grid and, conversely, to apply the forcing term, which is computed on the LR model grid, in the HR model. Issues with this will be further discussed in Sect.

One major conceptual difference of the 3D HR case from SP (see below) lies in what is actually learned by the neural network during coupled learning. In SP, the CRM is purely responsible for clouds and turbulence, while a 3D HR model also evolves globally according to its own set of physics. What this means is that the neural network essentially learns a subgrid correction term that compensates for everything(!) missing from the LR model dynamics and non-ML physics in comparison to the HR model (

Evolution of a tracer

Similar to L96, SP has the advantage of a clean scale separation, which makes the parameterization learning task easier. It also provides a good framework for coupled learning since SP already has the LR model and the embedded CRMs running in parallel. Because the embedded CRMs do not have any large-scale dynamics on their own, the time step schematic in Fig.

In the three original ML parameterization studies, of the prognostic variables, only temperature and humidity were used in the input and output. This was done to reduce the complexity of the problem to the fewest prognostic variables necessary to produce a general circulation. In coupled learning, the variables used by the ML parameterization also have to be forced in the HR model. The HR model will typically have many more prognostic variables compared to the LR model (e.g., hydrometeors), but it is alright for those to evolve without forcing. In fact, this might be necessary since the HR and LR models might have different prognostic variables. This is the case in SP where only the LR model prognostic variables are forced during CRM integration. If the variables predicted by the neural network differ, for example temperature vs. moist static energy, an additional conversion step has to be added to the up- and downscaling described below.
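Such a conversion step could look like the following sketch for the temperature vs. moist static energy example. It assumes the common definition of moist static energy, h = c_p T + g z + L_v q, with approximate constants; the function names and sample values are hypothetical, and a real model would use its own thermodynamic constants and possibly include ice-phase terms.

```python
import numpy as np

CP = 1004.0   # specific heat of dry air at constant pressure [J kg-1 K-1], approximate
G = 9.81      # gravitational acceleration [m s-2]
LV = 2.5e6    # latent heat of vaporization [J kg-1], approximate

def moist_static_energy(T, q, z):
    """Moist static energy [J kg-1] from temperature [K], specific humidity
    [kg kg-1] and geopotential height [m]."""
    return CP * T + G * z + LV * q

def temperature_from_mse(mse, q, z):
    """Inverse conversion: recover temperature from moist static energy."""
    return (mse - G * z - LV * q) / CP

# Hypothetical column values at three levels.
T = np.array([288.0, 275.0, 250.0])
q = np.array([0.010, 0.005, 0.001])
z = np.array([0.0, 2000.0, 6000.0])

mse = moist_static_energy(T, q, z)
T_back = temperature_from_mse(mse, q, z)
print(np.max(np.abs(T_back - T)))
```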

So, theoretically, coupled learning should work fine even if only temperature and humidity are forced/predicted. However, there are reasons for going beyond this. First, it is likely that the network skill suffers from not having information about, e.g., cloud water. We saw this in RPG18 where the network was essentially unable to produce a shallow cloud heating signature in the subtropics. Second, to implement physical constraints it is necessary to add more variables in order to close the conservation budgets, which we will discuss now.

A major critique of machine learning and especially neural network parameterizations is that they do not obey physical constraints. However,

One downside of implementing physical constraints in

When using coarse-grained HR output as training data as in BB18, the residuals (Eq.

Another issue is how to convert 3D fields from the LR model to the HR grid and vice versa. I already mentioned downscaling or coarse-graining along with some issues in the context of discussing BB18. For coupled learning in the 3D HR setup (Algorithm 2) a downscaling algorithm

See

Depending on the setup, there are some daunting technical challenges for the implementation of coupled learning. SPCAM represents the easiest case because it already has the embedded CRMs running in parallel with the LR model with coupling. The key challenge here would be the implementation of the neural network forward and backward pass. We have already implemented the forward pass in RPG18 by hard-coding it in Fortran. This works but is error-prone, hard to debug and cumbersome. Backpropagation along with a modern gradient descent algorithm like Adam

See Noah Brenowitz's blog post at

CLIMA might be just that eventually; see

For the 3D HR setup, in addition to the neural network implementation and the up- and downscaling issues, coupled learning requires two models to be run in parallel communicating every few time steps. This potentially requires quite a lot of engineering. My guess is that a successful and relatively quick implementation of coupled learning requires extensive working knowledge of the atmospheric models used.

Running a HR model is expensive. Therefore, it is essential that the coupled learning algorithm is efficient enough to learn from a limited number of coupled HR simulations. To judge this, L96 is a bad toy model because it is so far removed from the actual problem. On the one hand, the parameterization task is exceedingly easy (one input, one output). On the other hand, it has 32 “LR” grid points, while a 2

Coupled learning is a potential method to combat some of the main obstacles in ML parameterization research: instabilities and tuning. In this paper my aim was to present the algorithms and challenges as clearly as possible and demonstrate the general feasibility in the L96 case. The next step will be to test coupled learning in a more realistic framework. Some open questions are as follows. How much weight should be given to new samples, particularly if the tendencies are substantially chaotic? Are the HR and ML-LR model guaranteed to converge? Will the

There are a number of problems with ML parameterizations that coupled learning cannot address. First and foremost for climate modeling is generalization, i.e., the ability of a neural network parameterization to perform well outside its training regime. Neural networks are essentially nonlinear regression algorithms and should not be expected to learn anything beyond what they have encountered during training

Another issue unsolved by coupled learning is stochasticity. Any deterministic ML model that minimizes a mean error will be unable to represent random fluctuations in the training dataset. This leads to smoothed out predictions. The case for stochastic parameterizations has been growing steadily,

Parametric approaches have been commonly used for postprocessing of NWP forecasts

Finally, high-resolution models might be better than coarse models, but they still are not the truth. Our best knowledge of the true behavior of the atmosphere comes from observations. The problem is that observations are intermittent in space and time and, in the case of remote sensing, indirect. So how to learn from such data?

Clouds are incredibly complex. No wonder then that we humans have such trouble shoving them into mathematical concepts. We need any assistance we can get. Could ML provide it? The jury is still out. First studies show that ML models are, in general, capable of representing subgrid tendencies, but the way towards actually improving weather and climate models poses several obstacles. Coupled online learning could be one potential solution out of many to overcome some of these obstacles.

All code (version 1.0) along with an interactive notebook is available at

The author declares that there is no conflict of interest.

I thank Chris Bretherton, Noah Brenowitz, Tapio Schneider, Sebastian Scher, David John Gagne, Tom Beucler, Mike Pritchard and Pierre Gentine for their valuable input.

This research has been supported by the German Research Foundation (grant no. SFB/TRR 165).

This work was supported by the German Research Foundation (DFG) and the Technical University of Munich (TUM) in the framework of the Open Access Publishing Program.

This paper was edited by David Topping and reviewed by two anonymous referees.