Recently, there has been growing interest in the possibility of using neural networks for both weather forecasting and the generation of climate datasets. We use a bottom-up approach for assessing whether it should, in principle, be possible to do this. We use the relatively simple general circulation models (GCMs) PUMA and PLASIM as a simplified reality on which we train deep neural networks, which we then use for predicting the model weather at lead times of a few days. We specifically assess how the complexity of the climate model affects the neural network's forecast skill and how dependent the skill is on the length of the provided training period. Additionally, we show that using the neural networks to reproduce the climate of general circulation models including a seasonal cycle remains challenging – in contrast to earlier promising results on a model without seasonal cycle.

Synoptic weather forecasting (forecasting the weather at lead times of a few days up to 2 weeks) has for decades been dominated by computer models based on physical equations – the so-called numerical weather prediction (NWP) models. The quality of NWP forecasts has been steadily increasing since their inception

Recently, in addition to using machine learning to enhance numerical models, there have been ambitions to use it to tackle weather forecasting itself. The holy grail is to use machine learning, and especially “deep learning”, to completely replace NWP models, although opinions diverge on whether and when this will happen. Additionally, it is an appealing idea to use neural networks or deep learning to emulate very expensive general circulation models (GCMs) for climate research. Both these ideas have been tested with some success for simplified realities

Here, we build upon the approach from

To avoid confusion, we use the following naming conventions throughout the paper: “model” always refers to physical models (i.e., the climate models used in this study) and will never refer to a “machine learning model”. The neural networks are referred to by the term “network”.

To create long climate model runs of different complexity, we
used the Planet Simulator (PLASIM) intermediate-complexity GCM, and its dry dynamical core: the Portable University Model of the Atmosphere (PUMA)

The PUMA runs include neither an ocean nor orography and represent diabatic heating and cooling through Newtonian relaxation. The PLASIM runs
include orography, but no ocean model. The main difference between PLASIM and standard GCMs is that sub-systems of the Earth system other than the atmosphere (e.g., the ocean, sea ice and soil) are reduced to systems of low complexity. We used the default integration time step for all four model setups, namely 30 min for plasimt42, 20 min for plasimt21, and 60 min for pumat21 and pumat42. We regrid all fields to regular lat–long grids on pressure levels for analysis. An additional run was made with PUMA at resolution T21, but with the seasonal cycle switched off (eternal boreal winter). This is the same configuration as used in

In order to contextualize our results relative to previous studies, we also run a brief trial of our networks on ERA5

Ranking climate models according to their “complexity” is a nontrivial task, as it is very hard to define what complexity actually means in this context. We note that here we use the term loosely and do not refer to any of the very precise definitions of complexity that exist in various scientific fields (e.g.,

To circumvent this conundrum, we adopt a very pragmatic approach based solely on the output of the models and grounded in dynamical systems theory. We quantify model complexity in terms of the local dimension

The local dimension was computed for 38 years of each model
run, as well as for the ERA-Interim reanalysis on a

Averages of the local dimension

Neural networks are, in essence, a series of nonlinear functions whose weights are determined through training on data. Before the training, one has to decide the

Architecture of the neural network for the models with resolution T21 (left) and T42 (right). Each box describes a network layer, with the numbers indicating the dimension of (None, lat, long, level). “None” refers to the time dimension that is not relevant here but which we include in the schematic since it is part of the basic notation of the software library used. Figure based on Fig. S1 from
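The encoder–decoder structure sketched in the figure can be illustrated by tracing how the field dimensions change through pooling and upsampling stages. The numpy snippet below is a shape sketch only, not the trained network: the convolution layers are omitted, and the 2×2 pooling factor, the T21 field size (32 × 64 grid points) and the 40-channel state are assumptions for illustration.

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling over a (lat, long, channel) field."""
    nlat, nlon, nch = x.shape
    return x.reshape(nlat // 2, 2, nlon // 2, 2, nch).max(axis=(1, 3))

def upsample2x2(x):
    """Nearest-neighbour 2x upsampling, the inverse shape operation."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# T21-like state: 32 x 64 grid points, 4 variables on 10 levels -> 40 channels
state = np.random.rand(32, 64, 40)

encoded = max_pool2x2(max_pool2x2(state))    # (8, 16, 40): coarse representation
decoded = upsample2x2(upsample2x2(encoded))  # (32, 64, 40): back to full grid
```

In the actual networks, convolution layers sit between these resolution changes and learn the mapping from today's state to tomorrow's; the pooling/upsampling pair only controls at which spatial scales the filters operate.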

The last 10 % of the training samples are used for validation.
This allows us to monitor the training progress, control overfitting
(the situation where the network performs very well on the training data but very poorly on the test data) and potentially limit the maximum number of training epochs. As input to the neural networks we use four variables (

All networks are trained to make 1 d (1 day) forecasts. Longer forecasts are made by iteratively feeding back the forecast into the network. We did not train the network directly on longer lead times, based on the finding of
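The iterative procedure is simple to state in code. In the sketch below, `network_step` is a stand-in for the trained network's 1 d prediction; the toy damping map used here is purely for illustration.

```python
import numpy as np

def iterative_forecast(network_step, state0, n_days):
    """Produce an n-day forecast by repeatedly feeding the 1 d
    forecast back into the forecasting function."""
    states = [state0]
    for _ in range(n_days):
        states.append(network_step(states[-1]))
    return np.stack(states)  # shape: (n_days + 1, *state0.shape)

# toy stand-in for the trained network: damp the state by 10 % per day
trajectory = iterative_forecast(lambda x: 0.9 * x, np.ones((32, 64, 40)), 6)
```

Note that with this scheme any systematic error of the 1 d forecast compounds with lead time, which is why the quality of the single-step prediction is so critical.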

All network forecasts are validated against the underlying model the network was trained on (e.g., for the forecasts of the network trained on plasimt21, the “truth” is the evolution of the plasimt21 run). We specifically adopt two widely used forecast verification metrics, namely the root mean square error (RMSE)
and the anomaly correlation coefficient (ACC). The RMSE is defined
as
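The defining equation did not survive extraction; the standard form, consistent with the surrounding text (with $f_i$ the forecast and $o_i$ the “truth” of the underlying model run at grid point $i$, and $N$ the number of grid points), is

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(f_i - o_i\right)^2}.$$

This is the unweighted textbook definition; the paper's exact formula (e.g., any latitude weighting) may differ in detail.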

To compute a score over the whole period, the ACCs for all individual
forecasts are simply averaged:
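The ACC equations were likewise lost in extraction; in standard form, with primes denoting anomalies with respect to the climatology and the spatial sums running over all grid points,

$$\mathrm{ACC}_t = \frac{\sum_i f'_{i,t}\, o'_{i,t}}{\sqrt{\sum_i f'^{\,2}_{i,t}\,\sum_i o'^{\,2}_{i,t}}}, \qquad \overline{\mathrm{ACC}} = \frac{1}{T}\sum_{t=1}^{T}\mathrm{ACC}_t,$$

where $t$ indexes the individual forecasts and $T$ is their total number. This is the textbook definition and may differ in detail from the one originally given in the paper.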

The ACC ranges from −1 to 1, with 1 corresponding to a perfect forecast and 0 to a forecast with no skill relative to climatology.

For the evaluation of the brief ERA5 test in Sect. 3.6, we use the mean absolute error instead of RMSE to keep in line with

We start by analyzing the RMSE and ACC of two of the most important variables: 500 hPa geopotential height (hereafter zg500) and 800 hPa temperature (hereafter ta800). The first is one of the most commonly used validation metrics in NWP; the second is very close to temperature at 850 hPa, which is another commonly used validation metric. We focus on the networks trained with 100 years of training data, which is the same length as in

Root mean square error (RMSE) and anomaly
correlation coefficient (ACC) of 500 hPa geopotential height

We next turn our attention to the spatial characteristics of the forecast error. Figure

Maps of RMSE of the 6 d network forecasts, for the networks trained with 100 years for pumat21

A key issue is the extent to which the above results, and more generally the skill of the network forecasts, depend on the length of the training period used. Figure

Dependence of network forecast skill on the length of the training period. Shown are the root mean square error (RMSE) and anomaly correlation coefficient (ACC) of the network-forecast 500 hPa geopotential height for networks trained on different numbers of training years. Each line represents one model. The shading on the left side of the plots represents the 5th–95th percentile uncertainty range of the mean RMSE or ACC, estimated over networks trained on different training samples.

The trained networks are not limited to forecasting the model weather but can also be used to generate a climate run starting from an arbitrary state of the climate model. For this, we use the climate networks that also include the day of year as input (see Sect. 2). The question of whether the climate time series obtained from the network is stable is of special interest. In

Evolution of daily ta800 at a single grid point at 76

Thirty-year mean of normalized 500 hPa geopotential height for plasimt21

All our networks were trained to make 1 d forecasts. The influence of the seasonal cycle on atmospheric dynamics – and especially the influence of diabatic effects – may be very small for 1 d predictions. This could make it hard for the network to learn the influence of seasonality. To test this, we repeated the training of the climate network for plasimt21 but now training on 5 d forecasts. This improved the seasonality of ta800 at our example grid point (Fig. S19), but the spatial fields of the climate run still become unrealistic (Video S9).

The design of this study was to use an already established neural network architecture – namely one tuned on a very simple model – and apply it to more complex models. However, it is of interest to know how much tuning the network architecture to the more complex models might increase forecast skill. Therefore, the same tuning procedure as in

PLASIM, in contrast to PUMA, includes a hydrological cycle. This is represented by three additional 3-D state variables in the model (relative humidity, cloud liquid water content and cloud cover). To test the impact of including these variables in the neural networks, we retrained the network on 100 years of plasimt21 including these three variables (on 10 levels) as additional channels in both the input and output of the network. The network thus has 70 instead of 40 input and output channels. Quite surprisingly, including the hydrological cycle variables slightly deteriorated the forecasts of zg500 and ta800, in terms of both RMSE and ACC, except for the ACC of ta800 from lead times of 6 d onwards (Fig. S20).
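The channel bookkeeping can be made concrete with a small numpy sketch. The grid size and the random fields below are placeholders; the variable and level counts follow the text (4 base variables and 3 hydrological variables on 10 levels each).

```python
import numpy as np

nlat, nlon, nlev = 32, 64, 10  # T21-like grid, 10 levels (placeholder sizes)

# four base variables on 10 levels each -> 40 channels
base = [np.random.rand(nlat, nlon, nlev) for _ in range(4)]
x_base = np.concatenate(base, axis=-1)              # (32, 64, 40)

# relative humidity, cloud liquid water content and cloud cover
# add 3 x 10 = 30 further channels -> 70 in total
hydro = [np.random.rand(nlat, nlon, nlev) for _ in range(3)]
x_full = np.concatenate([x_base] + hydro, axis=-1)  # (32, 64, 70)
```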

Mean absolute error of 500 hPa geopotential height for networks trained on coarse-grained (T21) ERA5 data. The training was performed on different lead times. The

In order to put our results into context, we also trained our network architecture on coarse-grained (regridded to T21) ERA5

We have tested the use of neural networks for forecasting the “weather” in a range of simple climate models with different complexity. For this we have used a deep convolutional encoder–decoder architecture that

In the architecture used in this paper, lat–long grids are used to represent the fields. The convolution layers consist of a set of fixed filters that are moved across the domain. The same filter is therefore used close to the poles and close to the Equator, even though, since the Earth is a sphere, the area represented by a single grid point differs between these regions. Using spherical convolution layers as proposed in

include one or more locally connected layers in addition to the convolution layers. While they can exacerbate overfitting, locally connected layers could learn “local dynamics”.

include spatial dependence in height. In our architecture, all variables at all heights are treated as individual channels. One could, for example, group variables at one height together or use 3-D convolutions.

Our results further support the idea – already proposed by

One of the aims of this study was to assess whether it is possible to use a simplified reality – in this case a very simple GCM without seasonal cycle – to develop a method that also works on more complex GCMs. We showed that, for the problem of forecasting the model weather, this seems to be the case: the network architecture developed by

The second problem we addressed was using the trained networks to create climate runs.

The code developed and used for this study is available in the accompanying repository at

Here, we outline very briefly how the local dimension

First, we define the distance between the 2-D atmospheric fields at times

These are effectively logarithmic returns in phase space, corresponding to cases where the field

The local dimension is an instantaneous metric, and
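The estimation procedure outlined above can be condensed into a short sketch. It follows the general extreme-value recipe: exceedances of the negative log distance above a high quantile are approximately exponentially distributed, and the local dimension is the inverse of their mean excess. This is an illustration on toy data, not the exact implementation used for the model runs, and the 0.98 quantile is an assumed choice.

```python
import numpy as np

def local_dimension(fields, ref_index, quantile=0.98):
    """Estimate the local dimension of the state fields[ref_index]."""
    flat = fields.reshape(len(fields), -1)
    # Euclidean distances from the reference state to all other states
    dist = np.linalg.norm(flat - flat[ref_index], axis=1)
    g = -np.log(dist[np.arange(len(flat)) != ref_index])
    # exceedances of g above a high threshold are approximately
    # exponential, with rate equal to the local dimension
    threshold = np.quantile(g, quantile)
    excess = g[g > threshold] - threshold
    return 1.0 / excess.mean()

rng = np.random.default_rng(0)
states = rng.standard_normal((5000, 3))     # toy 3-D "phase space"
dim = local_dimension(states, ref_index=0)  # expected to be close to 3
```

For a GCM run, `fields` would instead hold the time series of a 2-D atmospheric field (one flattened grid per time step), and the estimate would be repeated with every state as the reference.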

The videos in the Supplement show the evolution of the GCM (upper panels) and the evolution of the climate network trained on the GCM (lower panels), both started from the same initial conditions. The left panels show zg500, the right panels wind speed at 300 hPa. Each video corresponds to one GCM. The videos can be accessed at

SS conceived the study and methodology, performed the analysis, produced the software and drafted the article. GM contributed to drafting and revising the article.

The authors declare that they have no conflict of interest.

We thank Peter Dueben for providing the ERA5 data. Sebastian Scher was funded by the Department of Meteorology (MISU) of Stockholm University. The computations were done on resources provided by the Swedish National Infrastructure for Computing (SNIC) at the High Performance Computing Center North (HPC2N) and National Supercomputer Centre (NSC).

This research has been supported by Vetenskapsrådet (grant no. 2016-03724).

This paper was edited by Samuel Remy and reviewed by Peter Düben and two anonymous referees.