There is growing interest in data-driven weather prediction (DDWP), e.g., using convolutional neural networks such as U-NET that are trained on data from models or reanalysis. Here, we propose three components, inspired by physics, to integrate with commonly used DDWP models in order to improve their forecast accuracy. These components are (1) a deep spatial transformer added to the latent space of U-NET to capture rotation and scaling transformations of the spatiotemporal data in the latent space, (2) a data-assimilation (DA) algorithm to ingest noisy observations and improve the initial conditions for subsequent forecasts, and (3) a multi-time-step algorithm, which combines forecasts from DDWP models with different time steps through DA, improving the accuracy of forecasts at short intervals. To show the benefit and feasibility of each component, we use geopotential height at 500 hPa (Z500) from ERA5 reanalysis and examine the short-term forecast accuracy of specific setups of the DDWP framework. Results show that the spatial-transformer-based U-NET (U-STN) clearly outperforms the U-NET, e.g., improving the forecast skill by 45 %. Using a sigma-point ensemble Kalman filter (SPEnKF) algorithm for DA and U-STN as the forward model, we show that stable, accurate DA cycles are achieved even with high observation noise. This DDWP+DA framework substantially benefits from large (

Motivated by improving weather and climate prediction, using machine learning (ML) for data-driven spatiotemporal forecasting of chaotic dynamical systems and turbulent flows has received substantial attention in recent years

List of acronyms and technical ML and DA terms along with their brief descriptions.

The increasing interest

Many of the DDWP models built so far are physics agnostic and learn the spatiotemporal evolution only from the training data, resulting sometimes in physically inconsistent predictions and an inability to capture key invariants and symmetries of the underlying dynamical system, particularly when the training set is small

Another approach – which has received less attention particularly in weather and climate modeling – is to enforce the appropriate symmetries, which are connected to conserved quantities through Noether's theorem

DA is an essential component of modern weather forecasting

In recent years, there has been a growing number of studies at the intersection of ML and DA

To provide proofs of concept for the DDWP model and the combined DDWP+DA framework, we use sub-daily 500 hPa geopotential height (Z500) from the ECMWF Reanalysis 5 (ERA5) dataset

Introducing the spatial-transformer-based U-NET that can capture rotational and scaling features in the latent space for DDWP modeling and showing the advantages of this architecture over a conventional encoder–decoder U-NET.

Introducing the DDWP+DA framework, which leads to stable DA cycles without the need for any localization or inflation by taking advantage of the large forecast ensembles produced in a data-driven fashion using the DDWP model.

Introducing a novel multi-time-step method for improving the DDWP+DA framework. This framework utilizes virtual observations produced using more accurate DDWP models that have longer time steps. This framework exploits the non-trivial dependence of the accuracy of autoregressive data-driven models on the time step size.

We use the ERA5 dataset from the WeatherBench repository

The DDWP models used in this paper are trained on Z500 data without access to any other atmospheric fields that might affect the atmosphere's spatiotemporal evolution. Once trained on past Z500 snapshots sampled at every
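The autoregressive use of such a trained model can be sketched as follows. This is a minimal illustration, not the paper's implementation: `step_fn` stands in for a trained network such as U-STN (here replaced by a trivial linear map so the sketch is self-contained), and the field shape is arbitrary.

```python
import numpy as np

def autoregressive_forecast(step_fn, z0, n_steps):
    """Roll a single-time-step data-driven model forward autoregressively.

    step_fn : maps the state at time t to the state at t + dt
              (stand-in for a trained DDWP network)
    z0      : initial Z500 snapshot (2D array)
    n_steps : number of dt-sized steps to take
    Returns the trajectory, including the initial condition.
    """
    traj = [z0]
    for _ in range(n_steps):
        # Each prediction is fed back in as the next input.
        traj.append(step_fn(traj[-1]))
    return np.stack(traj)

# Toy stand-in for a trained model: a slight damping of the field.
step = lambda z: 0.99 * z
z0 = np.random.randn(32, 64)
traj = autoregressive_forecast(step, z0, 4)
```

Because errors compound through this feedback loop, the accuracy of the rollout depends non-trivially on the time step size, a point the multi-time-step framework below exploits.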

A schematic of the architecture of U-STNx. The architecture captures rotation and scaling transformations between the input to the latent space and the decoded output owing to the spatial transformer module implemented through the affine transformation,

The network takes in an input snapshot of Z500,

The spatial transformer

We highlight that in this paper we are focusing on capturing effects
of translation, rotation, and scaling of the input field, because
those are the ones that we expect to matter the most for the synoptic
patterns on a 2D plane. Furthermore, here we focus on an architecture
with a transformer that acts only on the latent space. More complex
architectures, with transformations like Eq. (

Finally we point out that without the transformer module,

The decoding block is a series of deconvolution layers (convolution with zero-padded up-sampling) concatenated with the corresponding convolution outputs from the encoder part of the U-NET. The decoding blocks bring the latent space
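As a rough illustration of the transformer's grid generator and bilinear sampler, the following NumPy sketch applies a 2x3 affine matrix to a single 2D feature map. This is an assumption-laden simplification: in the actual architecture the module acts on the multi-channel latent space inside the network, and the matrix `theta` is predicted by a learned localization subnetwork rather than given by hand.

```python
import numpy as np

def affine_grid_sample(feat, theta):
    """Warp a 2D feature map by a 2x3 affine transform via bilinear sampling,
    mimicking a spatial transformer's grid generator + sampler.

    feat  : (H, W) feature map
    theta : (2, 3) affine matrix acting on normalized coords in [-1, 1]
    """
    H, W = feat.shape
    gy, gx = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([gx.ravel(), gy.ravel(), np.ones(H * W)])  # (3, H*W)
    sx, sy = theta @ coords                  # transformed sampling coords
    # Map normalized coords back to pixel indices and sample bilinearly
    # (out-of-range points are clamped to the border).
    px = (sx + 1) * (W - 1) / 2
    py = (sy + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(px).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(py).astype(int), 0, H - 2)
    wx = np.clip(px - x0, 0.0, 1.0)
    wy = np.clip(py - y0, 0.0, 1.0)
    out = ((1 - wy) * (1 - wx) * feat[y0, x0]
           + (1 - wy) * wx * feat[y0, x0 + 1]
           + wy * (1 - wx) * feat[y0 + 1, x0]
           + wy * wx * feat[y0 + 1, x0 + 1])
    return out.reshape(H, W)

# The identity transform leaves the feature map unchanged.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
```

Because both the affine map and the bilinear sampling are differentiable, the transform parameters can be trained end to end by backpropagation together with the rest of the network.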

The loss function

Note that the use of U-NET is inspired by the work of

Table presenting information on the U-STNx and U-NETx architectures with the optimal set of hyperparameters, obtained after extensive trial and error. Note that apart from the affine transformation and bilinear interpolation layer, the U-STNx and U-NETx architectures are identical. The networks have been implemented in TensorFlow and Keras.

For DA, we employ the SPEnKF algorithm, which, unlike the EnKF algorithm, does not use random perturbations to generate an ensemble but rather uses an unscented transformation
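The symmetric sigma points of the unscented transform can be sketched as below. This is an illustrative form only; scaling and weighting conventions vary across SPEnKF implementations, and `kappa` here is a hypothetical tuning parameter, not necessarily the one used in the paper.

```python
import numpy as np

def symmetric_sigma_points(mean, cov, kappa=0.0):
    """Generate the 2n + 1 symmetric sigma points of the unscented transform
    from a state mean and covariance, instead of random EnKF perturbations."""
    n = mean.size
    # Matrix square root of (n + kappa) * cov via Cholesky factorization.
    L = np.linalg.cholesky((n + kappa) * cov)
    pts = [mean]
    for i in range(n):
        pts.append(mean + L[:, i])   # positive branch along column i
        pts.append(mean - L[:, i])   # mirrored negative branch
    return np.array(pts)             # shape (2n + 1, n)

mu = np.array([1.0, 2.0])
P = np.array([[0.5, 0.1],
              [0.1, 0.3]])
S = symmetric_sigma_points(mu, P)
# With kappa = 0, weights of 0 (mean point) and 1/(2n) (each branch point)
# exactly reproduce the prescribed mean and covariance.
```

The deterministic construction is what makes the ensemble free of sampling noise: the weighted sigma points match the prescribed first two moments exactly, rather than only in expectation.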

In the DDWP+DA framework, shown schematically in
Fig.

The framework for a synergistic integration of a DA algorithm (SPEnKF) with a DDWP (U-STN1). Once the DDWP+DA framework is provided with a noisy

We start with a noisy initial condition

To perform the SPEnKF analysis, an ensemble of states at the 23rd hour of each DA cycle (one DA cycle being 24 h) is generated using a symmetric set of sigma points

Once the ensembles are generated via Eq. (
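For intuition, the analysis step that blends the forecast ensemble with a noisy observation can be sketched in the standard ensemble Kalman form below. This is a mean-update-only simplification: the SPEnKF in the paper uses sigma-point weights and also updates the covariance, whereas this sketch uses an unweighted ensemble average.

```python
import numpy as np

def kalman_analysis_mean(xb_ens, y, H, R):
    """One ensemble-based analysis step (mean update only).

    xb_ens : (m, n) forecast (background) ensemble, m members, n state vars
    y      : noisy observation vector
    H      : observation operator (maps state space to observation space)
    R      : observation-error covariance
    """
    m, _ = xb_ens.shape
    xb = xb_ens.mean(axis=0)
    A = xb_ens - xb                                  # ensemble anomalies
    B = A.T @ A / (m - 1)                            # background covariance
    K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)     # Kalman gain
    return xb + K @ (y - H @ xb)                     # analysis mean
```

In the limit of very accurate observations (small `R`), the gain approaches the identity on observed components and the analysis is pulled almost entirely to the observation; noisy observations are weighted down against the background.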

Finally, we remark that often with low ensemble sizes, the background covariance matrix,

First, we compare the performance of a U-STN and a conventional U-NET, where the only difference is the use of the spatial transformer module in the former. Using U-STN12 and U-NET12 as representatives of these architectures, Fig.

Anomaly correlation coefficient (ACC) calculated between Z500 anomalies of ERA5 and Z500 anomalies predicted using U-STN12 or U-NET12 from 30 noise-free, random initial conditions. The solid lines and the shading show the mean and the standard deviation over the 30 initial conditions.
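The ACC at a single lead time can be computed as follows, assuming the climatology has already been removed from both fields. This unweighted form is a sketch; verification scores for global fields often apply latitude weighting, which is omitted here.

```python
import numpy as np

def acc(pred_anom, true_anom):
    """Anomaly correlation coefficient between a predicted and a reference
    anomaly field (climatology already subtracted), over all grid points."""
    p = pred_anom.ravel()
    t = true_anom.ravel()
    # Centered pattern correlation: +1 is a perfect anomaly pattern match.
    return (p @ t) / np.sqrt((p @ p) * (t @ t))
```

An ACC of 1 indicates a perfect anomaly pattern; in NWP practice, forecasts are often deemed useful while ACC stays above roughly 0.6.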

To further see the source of this improvement, Fig.

Examples of the spatiotemporal evolution of Z500 predicted from a noise-free initial condition (

Overall, the results of Figs.

A benchmark for different DDWP models has been shown in

To analyze the performance of the DDWP+DA framework, we use U-STN1 as the DDWP model and SPEnKF as the DA algorithm, as described in Sect.

In Fig.

One last point to make here is that within each DA cycle, the maximum forecast accuracy is not when DA occurs but 3–4 h later (this is most clearly seen for the case with

Panel

One might wonder how the performance of the DDWP model (with or without DA) depends on

Panels

This dependence on

Based on the trends seen in Fig.

Schematic of the multi-time-step DDWP+DA framework. The U-STN12 model provides forecasts every 12 h, which are assimilated as virtual observations using SPEnKF into the U-STN1+SPEnKF framework that has a 24 h DA cycle for assimilating noisy observations. At the 12th hour of each cycle, the U-STN12 forecasts are more accurate than those from the U-STN1 model, enabling the framework to improve the prediction accuracy every 12th hour, thereby improving the initial condition used for the next forecasts before DA with noisy observations (every 24 h).
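The scheduling of the two assimilation events in this schematic can be sketched as below. All ingredients are stand-ins: `step1` and `step12` represent the U-STN1 and U-STN12 forward models, and `assimilate` represents the SPEnKF analysis (here a simple blend, purely for illustration).

```python
import numpy as np

def multi_time_step_cycle(z, step1, step12, assimilate, real_obs, n_hours):
    """Sketch of the multi-time-step DDWP+DA scheduling.

    Step the hourly model forward; every 12 h assimilate the 12 h model's
    forecast as a virtual observation; every 24 h assimilate the real
    (noisy) observation and re-initialize the 12 h model from the analysis.
    """
    z12 = z.copy()                       # state carried by the 12 h model
    for h in range(1, n_hours + 1):
        z = step1(z)                     # one hourly forward step
        if h % 12 == 0:
            z12 = step12(z12)            # one 12 h step of the coarser model
            z = assimilate(z, z12)       # virtual observation (U-STN12)
        if h % 24 == 0:
            z = assimilate(z, real_obs(h))   # noisy real observation
            z12 = z.copy()               # restart 12 h model from analysis
    return z
```

The key design choice is that the virtual observation at the 12th hour corrects the hourly model mid-cycle, so the error accumulated over 24 autoregressive steps never grows unchecked between real observations.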

Figure

Performance of the multi-time-step U-STNx+SPEnKF framework (with virtual observations at the 12th hour of every 24 h DA cycle) compared to that of the U-STN+SPEnKF framework for the case with

Same as Fig.

The multi-time-step framework with assimilated virtual observations introduced here improves the forecasts of short-term intervals by exploiting the non-trivial dependence of the accuracy of autoregressive, data-driven models on time step size. While hourly forecasts of Z500 may not be necessarily of practical interest, the framework can be applied in general to any state variable and can be particularly useful for multi-scale systems with a broad range of spatiotemporal scales. A similar idea was used in

In this paper, we propose three novel components for DDWP frameworks to improve their performance: (1) a deep spatial transformer in the latent space to encode the relative spatial relationships of features of the spatiotemporal data in the network architecture, (2) a stable and inexpensive ensemble-based DA algorithm to ingest noisy observations and correct the forecast trajectory, and (3) a multi-time-step algorithm, in which the accurate forecasts of a DDWP model that uses a larger time step are assimilated as virtual observations into the less accurate forecasts of a DDWP that uses a smaller time step, thus improving the accuracy of forecasts at short intervals.

To show the benefits of each component, we use downsampled Z500 data from ERA5 reanalysis and examine the short-term forecast accuracy of the DDWP framework. To summarize the findings, we present the following points.

As shown in Sect.

As shown in Sect.

As shown in Sect.

The DDWP models are currently not as accurate as operational NWP models

Finally, we point out that while here we focus on weather forecasting, the three components can be readily adopted for other parts of the Earth system, such as ocean and land, for which there is a rapid growth of data and need for forecast and assimilation

In this section, we show an example of prediction performance with T850 instead of Z500. In Fig.

In this section, we present Table

Performance of U-STN12 and U-NET12 on T850. Shading represents standard deviation over 30 initial conditions.

Comparison of U-STN12 with two WeatherBench models.

All codes used in this study are publicly available at

AC, MM, and KK designed the study. AC conducted research. AC and PH wrote the article. All authors analyzed and discussed the results. All authors contributed to writing and editing of the article.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We thank Jaideep Pathak, Rambod Mojgani, and Ebrahim Nabizadeh for helpful discussions. We would like to thank Tom Beucler, one anonymous referee and the editor, whose insightful comments, suggestions, and feedback have greatly improved the clarity of the article. This work was started at National Energy Research Scientific Computing Center (NERSC) as a part of Ashesh Chattopadhyay's internship in the summer of 2020, under the mentorship of Mustafa Mustafa and Karthik Kashinath, and continued as a part of his PhD work at Rice University, under the supervision of Pedram Hassanzadeh. This research used resources of NERSC, a U.S. Department of Energy Office of Science User Facility operated under contract no. DE-AC02-05CH11231. Ashesh Chattopadhyay and Pedram Hassanzadeh were supported by ONR grant N00014-20-1-2722 and NASA grant 80NSSC17K0266. Ashesh Chattopadhyay also thanks the Rice University Ken Kennedy Institute for a BP high-performance computing (HPC) graduate fellowship. Eviatar Bach was supported by the University of Maryland Flagship Fellowship and Ann G. Wylie Fellowship, as well as by Monsoon Mission II funding (grant IITMMMIIUNIVMARYLANDUSA2018INT1) provided by the Ministry of Earth Science, Government of India.

This research has been supported by the U.S. Department of Energy, Office of Science (grant no. DE-AC02-05CH11231), the Office of Naval Research (grant no. N00014-20-1-2722), and the National Aeronautics and Space Administration (grant no. 80NSSC17K0266).

This paper was edited by Xiaomeng Huang and reviewed by Tom Beucler and one anonymous referee.