Temperature forecasting by deep learning methods
- Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany
Abstract. Numerical weather prediction (NWP) models solve a system of partial differential equations based on physical laws to forecast the future state of the atmosphere. These models are deployed operationally, but they are computationally very expensive. Recently, the potential of deep neural networks to generate bespoke weather forecasts has been explored in a couple of scientific studies inspired by the success of video frame prediction models in computer vision. In this study, a simple recurrent neural network with convolutional filters, called ConvLSTM, and an advanced generative network, the Stochastic Adversarial Video Prediction (SAVP) model, are applied to create hourly forecasts of the 2 m temperature for the next 12 hours over Europe. We make use of 13 years of data from the ERA5 reanalysis, of which 11 years are utilized for training and one year each for validation and testing. We choose the 2 m temperature, total cloud cover, and the 850 hPa temperature as predictors and show that both models attain predictive skill by outperforming persistence forecasts. SAVP is superior to ConvLSTM in terms of several evaluation metrics, confirming previous results from computer vision that larger, more complex networks are better suited to learn complex features and to generate better predictions. The 12-hour forecasts of SAVP attain a mean squared error (MSE) of about 2.3 K², an anomaly correlation coefficient (ACC) larger than 0.85, a Structural Similarity Index (SSIM) of around 0.72, and a gradient ratio (rG) of about 0.82. The ConvLSTM yields a higher MSE (3.6 K²), a smaller ACC (0.80) and SSIM (0.65), but a slightly larger rG (0.84). The superior performance of SAVP in terms of MSE, ACC, and SSIM can be largely attributed to the generator. A sensitivity study shows that a larger weight of the GAN component in the SAVP loss leads to even better preservation of spatial variability at the cost of a somewhat increased MSE (2.5 K²). Including the 850 hPa temperature as an additional predictor enhances the forecast quality, and the model also benefits from a larger spatial domain. By contrast, adding the total cloud cover as a predictor or reducing the amount of training data to eight years has only small effects. Although the temperature forecasts obtained in this way are still less accurate than those of contemporary NWP models, this study demonstrates that sophisticated deep neural networks may achieve considerable forecast quality beyond the nowcasting range in a purely data-driven way.
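For concreteness, the evaluation metrics quoted in the abstract can be computed along the following lines. This is an illustrative sketch rather than the authors' code; in particular, the climatology field `clim` and the exact definition of the gradient ratio are assumptions.

```python
# Illustrative metric implementations (assumed, not taken from the paper's code).
import numpy as np
from skimage.metrics import structural_similarity

def mse(forecast, truth):
    return np.mean((forecast - truth) ** 2)

def acc(forecast, truth, clim):
    """Anomaly correlation coefficient w.r.t. an assumed climatology field."""
    fa, ta = forecast - clim, truth - clim
    return np.sum(fa * ta) / np.sqrt(np.sum(fa ** 2) * np.sum(ta ** 2))

def gradient_ratio(forecast, truth):
    """One plausible reading of r_G: mean spatial-gradient magnitude of the
    forecast relative to that of the ground truth (1 means variability preserved)."""
    grad_f = np.hypot(*np.gradient(forecast))
    grad_t = np.hypot(*np.gradient(truth))
    return grad_f.mean() / grad_t.mean()

def ssim(forecast, truth):
    return structural_similarity(forecast, truth,
                                 data_range=truth.max() - truth.min())
```

A gradient ratio below one indicates spatial smoothing, which is why rG is reported alongside MSE: a model can lower its MSE simply by blurring.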
Bing Gong et al.
Status: final response (author comments only)
-
RC1: 'Comment on gmd-2021-430', Anonymous Referee #1, 05 Apr 2022
The present paper explores the use of a stochastic adversarial video prediction model to forecast the two-meter temperature. While the paper is interesting to read and the conclusions seem valid, I do have several points that I think would have to be addressed before the paper could be considered for publication in GMD. In particular:
1) The present paper is a contribution to an increasing line of research on the application of deep-learning-based methodology to weather prediction, in particular the use of an existing video prediction model applied to weather prediction. This line of work has been of great interest when first showcased through the contributions of Weyn et al. and Dueben et al., just to name a few, but it does not feel that the present paper adds anything substantially new besides using a different architecture for the same problem. One main issue is that meteorological data is fundamentally different from generic video data in that it follows a well-defined system of partial differential equations. In this purely data-driven approach it seems that one has to be willing to throw away more than a hundred years of research on the understanding of these governing equations of hydro-thermodynamics just to be able to use an off-the-shelf video prediction architecture, which does not seem to come close to where traditional numerical methods can go today in terms of relevant forecast metrics. The question to ask is hence whether this is indeed the right approach going forward, or whether one should strive to combine data-driven approaches with the inductive bias provided by the fundamental laws of physics. There is a growing interest in physics-informed machine learning, which allows combining differential equations with data-driven machine learning (a schematic example of such a hybrid loss is given after these comments) and in a way seems more appropriate for the present problem at hand. If that were possible for the present model, then I think the paper would become much stronger and more suitable for what would actually be required for weather prediction.
2) The selection of features (cloud cover, 850 hPa temperature and two-meter temperature) seems slightly arbitrary. While the authors do provide some justification for the selection of these parameters, there are many more parameters that influence the evolution of the two-meter temperature. The authors then state "A more systematic variable selection process as is typical for data science studies is beyond the scope of this paper.", but I do not believe this is a justifiable statement here, because the authors do carry out a data science study in this paper. Again, if this were the first paper to be written on using a video prediction model for weather prediction, this point could be easily forgiven, but as there are many other papers out there that provide proof of concept that such models can predict the future weather to some degree, I think some more work needs to be done here to justify this feature selection, and to show which features have to be selected to get the best possible model results. In the machine learning literature it is customary to carry out ablation studies that showcase the importance of the various data/model components being used, and I think such a study would be beneficial here as well (a skeleton of such an ablation loop is sketched after these comments).
3) The baseline comparison model used is a standard convolutional LSTM model. This is the simplest possible model for video frame prediction and it is well known to perform rather poorly, as it exhibits an excessive amount of diffusion. Thus, beating this baseline is rather straightforward, so I wonder if the comparison of the authors' model to this simple model really yields a lot of information about the absolute strength of this model. It would be great to add a somewhat more state-of-the-art comparison model as well to be truly able to assess how good the stochastic adversarial video prediction model is for the present problem. Related to this, it would also be useful to add the performance metrics of traditional weather forecasting models as a point of comparison. Right now this information is only provided in the Discussion section, but it would be nice to show these metrics in the plots as well.
4) Owing to the interest in data-driven weather forecasting, a standard benchmark "WeatherBench" has been proposed to facilitate comparison with other deep learning based models. The present paper does not use this benchmark but rather investigates the model performance over Europe instead. This makes positioning this work within the wider literature rather challenging so I wonder if it would not be better to provide these results instead (or in addition) for the WeatherBench dataset as well. Again, this would facilitate comparison with other approaches that have been proposed for data-driven weather forecasting.
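Regarding point 1, the physics-informed idea can be made concrete by augmenting the data-driven loss with the residual of a governing equation. The sketch below is schematic: the toy advection equation, the tensor layout, the square grid spacing, and the weighting `lam` are all assumptions for illustration, not part of the reviewed manuscript.

```python
# Schematic physics-informed loss: data term plus a PDE-residual penalty.
import torch

def physics_informed_loss(pred, target, u, v, dt, dx, lam=0.1):
    """pred, target: (batch, time, H, W) temperature sequences;
    u, v: advecting wind on the interior (H-2, W-2) grid (assumed given);
    dt, dx: time step and (square) grid spacing."""
    data_loss = torch.mean((pred - target) ** 2)

    # Finite-difference residual of a toy advection equation
    # dT/dt + u * dT/dx + v * dT/dy = 0, evaluated at interior points.
    dTdt = (pred[:, 1:] - pred[:, :-1]) / dt
    dTdx = (pred[:, :-1, :, 2:] - pred[:, :-1, :, :-2]) / (2 * dx)
    dTdy = (pred[:, :-1, 2:, :] - pred[:, :-1, :-2, :]) / (2 * dx)
    residual = (dTdt[:, :, 1:-1, 1:-1]
                + u * dTdx[:, :, 1:-1, :]
                + v * dTdy[:, :, :, 1:-1])
    return data_loss + lam * torch.mean(residual ** 2)
```

Whether such a soft constraint actually improves skill at ERA5 resolution is, of course, exactly the open question raised above.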
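And for point 2, the requested ablation study amounts to retraining the model on every predictor subset and comparing validation scores. In the skeleton below, `train_model` and `evaluate` are hypothetical placeholders, not functions from the authors' repository.

```python
# Skeleton of a predictor-ablation study; train_model/evaluate are hypothetical.
from itertools import combinations

PREDICTORS = ["t2m", "t850", "tcc"]  # 2 m temp., 850 hPa temp., total cloud cover

results = {}
for k in range(1, len(PREDICTORS) + 1):
    for subset in combinations(PREDICTORS, k):
        model = train_model(predictors=list(subset))      # hypothetical helper
        results[subset] = evaluate(model, metric="mse")   # hypothetical helper

for subset, score in sorted(results.items(), key=lambda kv: kv[1]):
    print(", ".join(subset), f"-> validation MSE = {score:.2f} K^2")
```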
In summary, while the present paper is interesting to read, I do believe there isn't a sufficient amount of novelty yet to warrant publication in GMD in its present form. The main contribution of picking a video prediction model and applying it to weather forecasting has been made several times in the recent literature, so this does not feel novel enough anymore unless other open aspects of data-driven weather prediction are investigated in addition. These could be, as indicated above, a combination with differential-equation-based models, a more thorough investigation of which parameters are responsible for the success of the proposed model, or beating other existing approaches for the exact same problem domain, just to name a few.
-
AC2: 'Reply on RC1', BING GONG, 13 May 2022
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2021-430/gmd-2021-430-AC2-supplement.pdf
-
CEC1: 'Comment on gmd-2021-430', Juan Antonio Añel, 21 Apr 2022
Dear authors,
After checking your manuscript, it has come to our attention that it does not comply with our Code and Data Policy.
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have archived your code on a server of Fz-Juelich. However, this is not a suitable repository, as you can read in our policy. Therefore, you must move your code and data to one of the accepted repositories. Accordingly, you must include in a potential revised version of your manuscript a modified 'Code and Data Availability' section with the DOI of the code. Moreover, you must reply as soon as possible to this comment with the link to the new repository, so that it is available for the peer-review process, as it should be.
Also, I have tried to access the training samples of your experiments from the link provided in your GitLab, and I have not been able to get them. When moving your data to the new repository, check that all the samples used in your work are available and correctly identified.
Moreover, I see that you provide several Jupyter notebooks. When moving them, please, try to avoid using paths that could point to the local filesystems of your research centre.
Finally, the hyperlink in the Code Availability Section of your manuscript is broken. Currently, it points to "https://gitlab.jsc.fz-510/". When you update the repository to the new one, be careful that the hyperlink is correct.
Juan A. Añel
Geosci. Model Dev. Executive Editor
-
AC1: 'Reply on CEC1', BING GONG, 21 Apr 2022
Dear Editor,
Thank you so much for your suggestion.
The exact version of the model used to produce the results in this paper is archived on Zenodo (https://doi.org/10.5281/zenodo.6308774).
The current version of the model described in the paper is also available from the project website: https://gitlab.jsc.fz-juelich.de/esde/machine-learning/ambs/-/tree/GMD1 under the MIT license.
Since the raw dataset is quite large (more than a terabyte), the README file of our code repository describes how to access the full ERA5 dataset from the ECMWF MARS archive. However, we have prepared a small dataset with one year of data to run the script, which can be downloaded from the following link: http://doi.org/10.23728/b2share.744bbb4e6ee84a09ad368e8d16713118
In addition, we recommend following the guidance in the README file to run the code rather than using the Jupyter notebooks, since these were only used during our development phase.
Please let me know if you have further requests.
Many thanks!
Best,
Bing
-
RC2: 'Comment on gmd-2021-430', Anonymous Referee #2, 05 May 2022
This is a review of the paper "Temperature forecasting by deep learning methods" by Bing Gong, Michael Langguth, et al.
This paper describes the use of an existing generative adversarial neural network architecture and approach for video prediction, SAVP, to the problem of predicting the evolution of the temperature field over central Europe. While the results are not yet competitive with operational weather forecasts, the input data is relatively coarse and very few predictors are used, so this is not surprising. The authors perform several ablation studies to identify the main contributions to the model's strength, although I have some minor criticisms about the details of some of these. Nevertheless, I believe the paper represents an interesting extension to the existing literature, within the area of purely data-driven approaches to weather forecasting.
I have three main comments about the authors' approach:
1) The authors use a generative model, with an explicit sampling step, which allows them to generate multiple forecasts for a single input. However, the authors do not seem to explore this aspect at all, apart from a brief mention of probabilistic prediction in the conclusion section. There are a large number of ensemble verification metrics available to assess the calibration of the generated ensemble, i.e., to see to what extent the ground truth is interchangeable with a generated ensemble member. Some simpler ones include spread-skill plots and ensemble rank histograms (a minimal sketch of both follows these comments). This may be a route that the authors do not wish to pursue yet and leave for future work, but it might be interesting to at least have some idea of how different the generated sequences can be for the same given input data, even for one or two case studies.
2) Regarding predictors, am I right in thinking that the network is given no direct information indicating where in the diurnal cycle it is starting to forecast from? E.g. time of day and day of year, or total incoming solar radiation, etc.? I.e. it has to infer this from the patterns seen in the first 12 hours' data? If so, this seems like a strange choice, and one might imagine the model occasionally becoming confused by unusual temperature variations in the first 12 hours. Are there any signs that something like this happens? More generally, if you look at some of the worst predictions (e.g. by average MSE over the 12 hours), is there anything interesting about the failure modes, which may hint at extra predictors to use? I imagine the authors may wish to use a much larger set of meteorological variables in future work! (A common way to supply diurnal and seasonal information as extra input channels is sketched after these comments.)
3) Regarding the experiment that varied the domain size, I understand the authors believe that the varying domain (which the metrics are computed over) is a major contributor to the difference in scores -- the larger domains have larger proportions of water, which leads to lower MSE, etc. As a result, I don't feel this part of the paper contributes much insight in its current form. Can I suggest that the evaluation is performed on the same physical domain each time, e.g. the 72 x 44 central region? I.e., when the larger domains are being used, they are cropped to the central 72 x 44 region before the various metrics are calculated (see the cropping sketch after these comments). In this way, the comparison is fairer, and the effect of 'larger context' can be isolated from the varying evaluation domain.
For similar reasons, I am somewhat skeptical of the 'sensitivity to number of years of training data' result, since (if I understand correctly) the evaluation is performed on three different years. These themselves may be more or less difficult to predict. If it is feasible to re-run this part of the work to avoid evaluating on different years, this would seem like a good idea. If not, I suggest they at least add a corresponding caveat to the results discussion!
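To make point 1 concrete, here is a minimal numpy sketch of the two ensemble diagnostics mentioned there; the array names and shapes (`ens`: members x cases, `obs`: cases) are assumptions.

```python
# Minimal ensemble-calibration diagnostics (illustrative only).
import numpy as np

def rank_histogram(ens, obs):
    """ens: (n_members, n_cases), obs: (n_cases,).
    For a calibrated ensemble the ranks of obs are uniformly distributed."""
    ranks = np.sum(ens < obs[None, :], axis=0)  # rank of obs among members
    return np.bincount(ranks, minlength=ens.shape[0] + 1)

def spread_skill(ens, obs):
    """Compare ensemble spread with the RMSE of the ensemble mean;
    for a calibrated ensemble the two should be of similar magnitude."""
    spread = ens.std(axis=0).mean()
    rmse = np.sqrt(np.mean((ens.mean(axis=0) - obs) ** 2))
    return spread, rmse
```

A flat rank histogram (rather than a U- or dome-shaped one) would indicate that the ground truth is statistically interchangeable with the generated members.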
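For point 2, one common remedy (assumed here, not something the manuscript describes) is to append cyclic encodings of the time of day and day of year as extra, spatially constant input channels:

```python
# Cyclic time encodings as additional input channels (illustrative).
import numpy as np

def time_channels(hour_of_day, day_of_year, shape):
    """Return four constant 2-D channels encoding diurnal and seasonal phase."""
    h = 2 * np.pi * hour_of_day / 24.0
    d = 2 * np.pi * day_of_year / 365.25
    return np.stack([np.full(shape, val) for val in
                     (np.sin(h), np.cos(h), np.sin(d), np.cos(d))])
```

These channels would simply be concatenated to the predictor fields at each input time step, so the network no longer has to infer the diurnal phase from the temperature patterns alone.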
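And for point 3, the fixed-domain evaluation could look as follows: every forecast/ground-truth pair is cropped to the same central 72 x 44 window before any metric is computed, regardless of the domain size used for training. This is a sketch assuming a (..., lat, lon) array layout:

```python
# Crop all domains to a common central 72 x 44 evaluation window (illustrative).
def central_crop(field, height=44, width=72):
    """field: array of shape (..., H, W) with H >= height and W >= width."""
    H, W = field.shape[-2:]
    top, left = (H - height) // 2, (W - width) // 2
    return field[..., top:top + height, left:left + width]

# e.g. mse(central_crop(forecast), central_crop(truth)) for every domain size
```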
Minor comments:
1) What is the ConvLSTM model trained on? I couldn't spot this easily in the text. Is it just trained to minimise MSE (i.e., L^2 error)? (A typical MSE-trained setup is sketched after these minor comments.)
2) I believe the original ConvLSTM paper is normally cited as Shi et al. (2015), not Xingjian et al. (2015)?
3) In Figure 5 (and similar figures), I assume the three lines for each model correspond to the three different datasets (evaluation/training years) used? This could be made a bit clearer, e.g. in the caption.
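On minor comment 1: a typical MSE-trained ConvLSTM baseline is shown below in Keras. The layer sizes, the 44 x 72 grid, and the three input channels are assumptions for illustration, not details confirmed from the manuscript.

```python
# A typical ConvLSTM baseline trained on pixel-wise MSE (assumed setup).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.ConvLSTM2D(64, kernel_size=5, padding="same",
                               return_sequences=True,
                               input_shape=(None, 44, 72, 3)),  # (time, lat, lon, channels)
    tf.keras.layers.Conv3D(1, kernel_size=1),  # linear map back to one output field
])
model.compile(optimizer="adam", loss="mse")  # i.e., a plain L^2 training objective
```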
Finally, here are a few small typos/grammatical mistakes, etc., that I spotted:
Line 17: as additional predictor -> as an additional predictor
Line 206: of a 24 time steps -> of 24 time steps
Line 207: This results into about -> This results in about
Line 235: which encodes -> which encode
Line 238: no comma needed after 'both'
Line 257: condinoned -> conditioned
Line 258: missing Z after 'latent space'
Line 467: for a 12-hour forecasts, is attained -> for a 12-hour forecast is attained
Line 468: higher spatial solutions -> higher spatial resolution
Line 480: repeated word 'motivate'
Line 485: deep neural can -> deep neural networks can
Line 496: into -> in
Line 518: as list in -> as listed in
Line 521: ration -> ratio
Line 525: I + J -> I x J
Line 533: and each of the day -> and each hour of the day
Line 560: I think 'disposal' should be something else, but I am not sure what?
-
AC3: 'Reply on RC2', BING GONG, 13 May 2022
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2021-430/gmd-2021-430-AC3-supplement.pdf