The modelling of small-scale processes is a major source of error in weather and climate models, hindering the accuracy of low-cost models which must approximate such processes through parameterization. Red noise is essential to many operational parameterization schemes, helping to model temporal correlations. We show how to build on the successes of red noise by combining the known benefits of stochasticity with machine learning. This is done using a recurrent neural network within a probabilistic framework (L96-RNN). Our model is competitive with, and often superior to, both a bespoke baseline and an existing probabilistic machine learning approach (a generative adversarial network, GAN) when applied to the Lorenz 96 atmospheric simulation. This is due to its superior ability to model temporal patterns compared to standard first-order autoregressive schemes. It also generalizes to unseen scenarios. We evaluate it across a number of metrics from the literature and also discuss the benefits of using the probabilistic metric of hold-out likelihood.

A major source of inaccuracy in climate models is “unresolved” processes. These occur at scales smaller than the resolution of the climate model (“sub-grid” scales) but still have key effects on the overall climate. In fact, most of the inter-model spread in how much global surface temperatures increase after

Red noise is a key feature in many parameterization schemes

We use probabilistic machine learning to propose a data-driven successor to red noise in parameterization. The theory explaining the prevalence of red noise in the climate was developed in an idealized model, within the framework of physics-based differential equations. Such approaches may be intractable outside the idealized case. Recently, there has been much work looking at uncovering relationships from data instead

Numerical models represent the state of the Earth system at time

Stochasticity can be included using the following form:

Introducing stochasticity through Eq. (

Hidden variables –

One example is the stochastically perturbed parameterization tendency (SPPT) scheme

There is no intrinsic reason why an AR1 process is the best way to deal with these correlations. It is simply a modelling choice.
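For reference, a standard AR1 red-noise process can be generated as below (a minimal NumPy sketch; the parameter values are illustrative, not those of any operational scheme):

```python
import numpy as np

def ar1_red_noise(n_steps, phi=0.9, sigma=1.0, seed=0):
    """Generate an AR1 ("red noise") series:
        e_t = phi * e_{t-1} + sigma * sqrt(1 - phi**2) * z_t,  z_t ~ N(0, 1).
    The sqrt(1 - phi**2) factor keeps the stationary variance at sigma**2."""
    rng = np.random.default_rng(seed)
    e = np.zeros(n_steps)
    for t in range(1, n_steps):
        e[t] = phi * e[t - 1] + sigma * np.sqrt(1.0 - phi**2) * rng.standard_normal()
    return e
```

The lag-1 autocorrelation of such a series is phi; larger phi gives “redder” noise with longer memory, but the correlation structure is fixed to geometric decay, which is exactly the modelling choice in question.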

Learning the parameters of a climate model, either the simple form (

Amongst ML-trained stochastic models

Recurrent neural networks (RNNs) are a popular ML tool for modelling temporally correlated data, eliminating the need for update functions (like that in Eq.
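To make the mechanism concrete, a single GRU update can be sketched in NumPy as below (a hypothetical minimal implementation for illustration; the architecture and weights of the model in this paper are not assumed here):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One GRU update h_t = GRU(x_t, h_{t-1}).
    p holds input weights W*, recurrent weights U*, and biases b* for the
    update gate (z), reset gate (r), and candidate state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])   # how much to update the state
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])   # how much of the past to use
    cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * cand                    # blend old state and candidate
```

Because h_t depends on the entire history x_1, …, x_t, the hidden state can carry temporal correlations far beyond the single-step memory of an AR1 process.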

Current parameterization work using ML to model temporal correlations has predominantly used deterministic approaches, including deterministic RNNs and echo state networks

Other off-the-shelf models are not obviously suited for the parameterization task. Transformers and attention-based models

The L96 set-up and baselines are presented in Sect.

We introduce the L96 model here and then present two L96 parameterization models from the literature which help clarify the above discussion and serve as our baselines.

We use the two-tier L96 model, a toy model for atmospheric circulation that is extensively used for stochastic parameterization studies
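For concreteness, the standard two-tier L96 tendencies can be sketched as follows (a NumPy sketch of the usual formulation with generic parameter names h, b, c, F; the default values shown are illustrative and not necessarily this paper's settings):

```python
import numpy as np

def l96_two_tier_tendencies(X, Y, F=20.0, h=1.0, b=10.0, c=10.0):
    """Tendencies of the two-tier Lorenz 96 model in its standard form:
        dX_k/dt = X_{k-1}(X_{k+1} - X_{k-2}) - X_k + F - (h c / b) sum_j Y_{j,k}
        dY_j/dt = -c b Y_{j+1}(Y_{j+2} - Y_{j-1}) - c Y_j + (h c / b) X_{k(j)}
    X: (K,) resolved variables; Y: (K, J) sub-grid variables (cyclic indices)."""
    K, J = Y.shape
    coupling = (h * c / b) * Y.sum(axis=1)
    dX = np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2)) - X + F - coupling
    Yf = Y.ravel()  # single cyclic index over all sub-grid variables
    dY = (-c * b * np.roll(Yf, -1) * (np.roll(Yf, -2) - np.roll(Yf, 1))
          - c * Yf + (h * c / b) * np.repeat(X, J))
    return dX, dY.reshape(K, J)
```

The coupling term is the quantity a parameterization must represent: a reduced model evolves only X and replaces the Y sum with a learned (or polynomial, or red-noise) approximation.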

The L96 model is useful for parameterization work as we can consider the

The L96 is also useful as it contains separate persistent dynamical regimes, which change with different values of

It is easy to use models with no memory, which do well on standard loss metrics but fail to capture this interesting regime behaviour. This is seen in the L96, where including temporally correlated (red) noise improves the statistics describing regime persistence and frequency of occurrence

The models below are stochastic. They use AR1 processes to model temporal correlations, but this need not be the case, as we show in Sect.

The GAN from

Our model, henceforth denoted as

The key insight is that our model allows more flexibility than the standard AR1 processes for expressing temporal relationships. This is shown by how Eq. (

Figure

Here, as well as in the baselines, the parameterization models are “spatially local”, meaning that the full

The L96-RNN is probabilistic and trained by maximizing the likelihood of sequences
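To make the objective concrete: under the usual autoregressive factorization log p(u_{1:T}) = Σ_t log p(u_t | u_{<t}), with Gaussian conditionals, the sequence log likelihood can be computed as below (a generic sketch; the conditional means and variances would come from the network, and the exact distributional form used by the L96-RNN is not assumed here):

```python
import numpy as np

def gaussian_sequence_log_lik(u, mu, sigma):
    """Log likelihood sum_t log N(u_t; mu_t, sigma_t**2), where mu_t and
    sigma_t are the model's one-step-ahead predictions given u_1..u_{t-1}."""
    u, mu, sigma = map(np.asarray, (u, mu, sigma))
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                        - 0.5 * ((u - mu) / sigma) ** 2))
```

Training maximizes this quantity (equivalently, minimizes the negative log likelihood) with respect to the network parameters.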

The L96-RNN is trained by maximizing this likelihood with respect to the parameters

From Eq. (

Truth data for training were created by running the full L96 model five separate times with the following values of

The L96-RNN was trained using truncated backpropagation through time on sequences of 700 time steps for 100 epochs with a batch size of 32 using Adam

This section analyses performance across a range of timescales. The first results are for

The models were evaluated in a weather forecast framework. A total of 745 initial conditions were randomly taken from the truth data, and for each initial condition an ensemble of 40 forecasts, each lasting 3.5 MTU, was generated. Figure

The spread is defined as
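For reference, textbook definitions of the ensemble-mean error and ensemble spread can be computed as below (generic definitions chosen for illustration; the exact formulas used here may differ):

```python
import numpy as np

def ensemble_error_and_spread(forecasts, truth):
    """RMSE of the ensemble mean, and mean ensemble spread.
    forecasts: (n_members, n_cases) array; truth: (n_cases,) array."""
    mean = forecasts.mean(axis=0)
    error = np.sqrt(np.mean((mean - truth) ** 2))            # ensemble-mean RMSE
    spread = np.sqrt(np.mean(forecasts.var(axis=0, ddof=1)))  # mean member spread
    return error, spread
```

In a perfectly reliable ensemble the spread matches the error on average, so a spread / error ratio near 1 is the target.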

As noted by

Error and spread for weather experiments. We expect a smaller error for a more accurate forecast model. The spread / error ratio would be close to 1 in a “perfectly reliable” forecast.

The ability to simulate the climate of the L96 was evaluated using the 50 000 MTU simulations. We use histograms to represent various climate-related distributions. For example, Fig.

Histograms of

We can quantitatively measure how well the histogram density estimates from each model match the histogram density estimate from the truth using the KL (Kullback–Leibler) divergence. The KL divergence has a direct equivalence to log likelihood: minimizing the KL divergence between a true distribution and a model's distribution is equivalent to maximizing the log likelihood of the true data under the model. We evaluate the KL divergence between
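The computation can be sketched as follows (a hypothetical implementation; the shared bin grid and the epsilon guard for empty bins are illustrative details, not the exact procedure used here):

```python
import numpy as np

def kl_from_histograms(samples_p, samples_q, bins):
    """KL(p || q) estimated from histogram density estimates on shared bins.
    A small epsilon guards against empty bins."""
    eps = 1e-12
    p, _ = np.histogram(samples_p, bins=bins, density=True)
    q, _ = np.histogram(samples_q, bins=bins, density=True)
    widths = np.diff(bins)
    p = p * widths  # convert densities to per-bin probabilities
    q = q * widths
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

A KL divergence of zero indicates identical histogram estimates; larger values indicate a worse match between the true and modelled distributions.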

Successful reproduction of the L96 climate would result in a model's histogram matching the truth model's. Qualitatively, from Fig.

KL divergence (goodness of fit) between (1) the truth and (2) the parameterized models, for the distributions shown in the noted figures. The smaller the KL divergence, the better the match between the true and modelled distributions. The best model in each case is shown in bold.

The L96 model used here displays two distinct regimes with separate dynamics. We use the approach from

Regime characteristics of the L96 for

The presence of two regimes is apparent in Fig.

Density plots for each regime for

The ability of the parameterized models to reproduce the “true” temporal correlations of the

The ability of the parameterized models to reproduce the “true” temporal correlations of the sub-grid forcing terms (

We desire our parameterized models to correctly capture the temporal behaviour of the L96 system. One way to assess this is by considering the temporal correlations (linear associations) using an autocorrelation plot. The autocorrelation plots of the generated
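Such a plot is built from the sample autocorrelation function, which can be computed as below (a minimal sketch; plotting and confidence bands are omitted):

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag:
        r_k = sum_t (x_t - xbar)(x_{t+k} - xbar) / sum_t (x_t - xbar)**2."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    n = len(x)
    return np.array([np.dot(x[: n - k], x[k:]) / denom for k in range(max_lag + 1)])
```

For an AR1 process the autocorrelation decays geometrically (r_k = phi**k), so departures from geometric decay in the truth data are exactly what a more flexible model can exploit.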

We set the L96 forcing to

Difference between the

In all the experiments below, the polynomial model's trajectories exploded (diverged to infinite values), so the polynomial is omitted. This is merely due to the specification of a third-order polynomial. It significantly deviates from its target values when

The

Density plots for each regime for

The models were also evaluated in the weather forecasting framework for

Error and spread for weather experiments for varying forcing values. This is done to assess generalization performance. The L96-RNN's spread is best matched to its error, unlike the GAN's.

We can also compare the polynomial and the L96-RNN's ability to capture temporal patterns by assessing their residuals. The residuals are the differences between the observed data and a model's predictions, so this can be considered an offline metric. For the proposed models, the residuals should be uncorrelated and resemble white noise. In Fig.

Autocorrelation plot of the residuals for the polynomial and the L96-RNN. Ideally, these would show no correlations (and match the black line). The L96-RNN's residual correlations decay more quickly than the polynomial's.

We wish our models to have a lower simulation cost than the full L96. This is measured by considering the number of floating-point operations per simulated time step (

In terms of training times, both ML models were trained using a 32 GB NVIDIA V100S GPU. The L96-RNN and GAN took 12 min and 30 h respectively.

We also evaluate using the likelihood (explained in Sect.

Here, hold-out sets for

The other metrics used above only capture snapshots of model performance. Likelihood is a composite measure which assesses a model's full joint distribution

Likelihood also captures information about more complex, joint distributions, removing the need to invent custom metrics to assess specific features. To illustrate, suppose we wish to assess a model's temporal associations (not just the linear temporal correlations): the likelihood already contains this information. Custom metrics could be devised for this purpose, but the likelihood is an off-the-shelf measure that is already available; we suggest it is wasteful not to use it.

Using a composite metric brings challenges, though. Poor performance in certain aspects may be overshadowed by good performance elsewhere. This can result in cases where increased hold-out likelihoods do not correspond to better sample quality, as noted in the ML literature

Likelihood's composite nature makes it a helpful diagnostic tool. As with the KL divergence, the further away a model is – in any manner – from the data in the hold-out set, the worse the likelihood will be. If a model has a poor likelihood despite performing well on a range of standard metrics, this suggests there are still deficiencies in the model which need investigating. For example, with the L96-RNN at

Log likelihood on hold-out data for different forcing values. The best model in each case is shown in bold.

We present an approach to replace red noise with a more flexible stochastic machine-learnt model for temporal patterns. Even though we used ML to model the deterministic part

Using physical knowledge to structure ML models can help with learning. The L96-RNN includes “physically relevant” features for the L96, particularly advection, in Eq. (

There are more sophisticated models which can be used for the L96 system. We noticed that in some cases the L96-RNN struggled even to overfit the training data, regardless of the L96-RNN complexity. This could be due to difficulties in learning the evolution of the hidden variables. Creating architectures that permit better hidden variables to be learnt (ones which model long-term correlations better) would give better models. Despite our use of GRU cells, which, along with the LSTM, were major advances in sequence modelling, we still faced issues with the modelling of long-term trends, as evidenced in Fig.

We trained the L96-RNN using the likelihood of the sequence

Learning from all the high-resolution data whilst still being computationally efficient at simulation time is an interesting idea. We cannot keep all the high-resolution variables in our schemes: otherwise we might as well run the high-resolution model outright, which is not possible as it is too slow for the desired timescales (the very reason this field of work exists). Yet keeping only the average, as in existing work which coarse-grains data, means many data are thrown away. Our preliminary investigations used the graphical model in Fig.

Model which allows learning from high-resolution data without requiring it at simulation time, where

Further work with GANs could result in better models. There may be aspects of realism which we are either unaware of or unable to quantify, making them difficult to include explicitly in our probability models. The GAN discriminator could learn which features constitute “realism”, and so the generator may learn to create sequences which contain them. However, we found that using the adversarial approaches from GANs and Wasserstein GANs

We have shown how to build on the benefits of red-noise models (such as AR1 ones) in parameterizations by using a probabilistic ML approach based on a recurrent neural network (RNN), which can be seen as a natural generalization of the classical autoregressive approaches. This is done in the idealized case of the two-level Lorenz 96 system, where the unresolved variables must be parameterized. Our L96-RNN outperforms the red-noise baselines (a polynomial model and a GAN) in a weather forecasting framework and has lower KL divergence scores for the probability density functions arising from long-range simulations. The L96-RNN also generalizes best to new forcing scenarios. These strong empirical results, along with a less correlated residual pattern, support the claim that the L96-RNN's benefit stems from better capturing temporal patterns.

This approach now needs testing on more complex systems – both to specifically improve on AR1 processes and to learn new, more flexible models of

Likelihood can often be fairly easily calculated, and where this is the case, we propose that the community also evaluate the hold-out likelihood for any devised probabilistic model (ML or otherwise). It is a useful debugging tool, assessing the full joint distribution of a model. It would also provide a consistent evaluation metric across the literature. Given the challenges relating to sample quality, likelihood should complement, not replace, existing metrics.

Finally, we have used demanding tests to show that ML models can generalize to unseen scenarios. We cannot hope for ML models to generalize to all settings. We do not expect this from our physics-based models either. But ML models

In all cases, the log likelihood of the sequence of

Now from Eq. (

For the polynomial model, the log likelihood of the

The graphical model is shown in Fig.

Graphical model for

We approximate the GAN's likelihood by calculating the likelihood of a model which behaves almost identically to the GAN (one with a small amount of white noise added), using importance sampling and the reparameterization method as in the variational autoencoder
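The estimator itself is standard. On a toy latent-variable model where the true marginal is known in closed form, it can be illustrated as below (a generic example of importance-sampled marginal likelihood; the model and proposal are hypothetical and are not the GAN's actual model):

```python
import numpy as np

def norm_logpdf(v, mu, sd):
    return -0.5 * np.log(2.0 * np.pi * sd**2) - 0.5 * ((v - mu) / sd) ** 2

def log_marginal_is(x, n_samples=100000, seed=0):
    """Importance-sampling estimate of log p(x) for the toy model
        z ~ N(0, 1),  x | z ~ N(z, 1),  proposal q(z) = N(x/2, 1),
    via log p(x) = log E_q[p(z) p(x|z) / q(z)], averaged with a
    log-sum-exp for numerical stability. True marginal: x ~ N(0, 2)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(x / 2.0, 1.0, n_samples)
    log_w = (norm_logpdf(z, 0.0, 1.0) + norm_logpdf(x, z, 1.0)
             - norm_logpdf(z, x / 2.0, 1.0))
    m = log_w.max()
    return float(m + np.log(np.mean(np.exp(log_w - m))))
```

With enough samples the estimate converges to the exact marginal log N(x; 0, 2); in practice, a learned importance sampler keeps the weight variance manageable.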

The graphical model of the GAN with a small amount of white noise added is shown in Fig.

For training purposes, the lower bound is used:

The terms in the numerator decompose as follows, using the independencies from the graphical model and associated equations:

The perfect importance sampling distribution would be proportional to

The training of the importance sampler is done by maximizing Eq. (

For random variables,

Now we will show how we arrive at Eq. (

From this we see

And from the conditional independencies in Eqs. (

It will sometimes not be possible to apply the approach based on changes of variables to more complex systems, for a few reasons. First, it requires the function

Further KL divergence results are provided to show how results vary with differing histogram bin sizes.

Table

KL divergence (goodness of fit) between (1) the truth and (2) the parameterized models, for the distributions shown in Fig.

KL divergence (goodness of fit) between (1) the truth and (2) the parameterized models, for the distributions shown in Fig.

All code used in this study (including the code required to create the data) is publicly available at

All authors supported the design of the study. RP created the models and conducted research. RP prepared the paper with contributions from all co-authors.

The contact author has declared that none of the authors has any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We would like to thank Pavel Perezhogin and our other anonymous reviewer for their feedback, which has improved the paper.

Raghul Parthipan was funded by the Engineering and Physical Sciences Research Council (grant number EP/S022961/1). Hannah M. Christensen was supported by the Natural Environment Research Council (grant number NE/P018238/1). J. Scott Hosking was supported by the British Antarctic Survey, Natural Environment Research Council (NERC) National Capability funding. Damon J. Wischik was funded by the University of Cambridge.

This paper was edited by Travis O'Brien and reviewed by Pavel Perezhogin and one anonymous referee.