Data-driven Global Subseasonal Forecast Model (GSFM v1.0) for intraseasonal oscillation components
Abstract. As a challenge in the construction of a “seamless forecast” system, improving the prediction skill of subseasonal forecasts is a key issue for meteorologists. In view of the evolution characteristics of numerical models and recent deep learning models for subseasonal forecasts, as forecast time increases the skillful part of the forecast increasingly resides in the intraseasonal low-frequency components, which are essential to changes in the general circulation on the subseasonal timescale as well as to persistent extreme weather. In this paper, the Global Subseasonal Forecast Model (GSFM v1.0) first extracted the intraseasonal oscillation (ISO) components of atmospheric signals and used an improved deep learning model (SE-ResNet) to train and predict the ISO components of geopotential height at 500 hPa (Z500) and temperature at 850 hPa (T850). The results show that the 10–30 day prediction performance of this model is better than that of a model trained directly on the original data. Compared with other models/methods, the SE-ResNet model has a good ability to depict the subseasonal evolution of the ISO components of Z500 and T850. In particular, although CFSv2 has better prediction performance within the first 10 days, the SE-ResNet model is substantially superior to CFSv2 over days 10–30, especially in the middle and high latitudes. The SE-ResNet model is also better at predicting planetary waves with wavenumbers 3–8, which accounts for the difference in prediction performance between the models in extratropical areas. A case study shows that the SE-ResNet model depicts the phase change and propagation characteristics of planetary waves well. Thus, the application of data-driven subseasonal forecasts of atmospheric ISO components may shed light on improving the skill of subseasonal forecasts.
Withdrawal notice
This preprint has been withdrawn.
Interactive discussion
Status: closed
RC1: 'Comment on gmd-2022-146', Chiem van Straaten, 08 Aug 2022
General impression
This paper presents a deep learning framework for sub-seasonal forecasting. The novelty relative to Rasp et al (2020) is the self-attention mechanism. An additional innovation is the extraction of intra-seasonal oscillation components. This extraction eases the learning task, as the only predictable atmospheric motions at the sub-seasonal lead time are the low-frequency components. The authors compare their framework to a version without self-attention and to numerical forecasts. The comparison is detailed, with multiple scores and a case study. Possibilities for clarification do however exist. This could increase the trustworthiness of the results.
Overall comment:
A big finding of the study is that the deep learning models outperform the CFSv2 model beyond 10 days. But currently the comparison lacks information, rendering it un-trustworthy. The authors say they use the WeatherBench dataset but CFS is not part of the benchmark dataset. How was it obtained, and especially, how was it processed?
In the verification you make use of the ISO components (e.g. fig 4a), which requires the removal of the seasonal cycle. Was the seasonal cycle for CFS estimated from model reforecasts or something else? Choices like that influence results (see for instance Manrique et al 2020). Also, how was lead time treated in your CFS-processing? At a lead time of 1 day, for instance, right after initialization, it is impossible to subtract a 15 day rolling average of the previous days, which is step 2 in your ISO procedure (see section 2.1). Because you often verify the ISO components of circulation, it seems unfair if CFS is the only unfiltered model in that comparison. Please clarify.
Your figures contain a hint that the processing is unfair. Performance of the CFS model is heavily curved: it dips below that of the climatology benchmark and only later levels off (Figs. 4a, 6a, 8a). Such behavior is unexpected for a numerical model. The numerical skill in z500 should directly level off at the climatology (Figure 4a, Buizza & Leutbecher 2015).
Specific comments
L42: Is Mayer et al 2021 the best reference for the chaotic nature of the atmosphere? Why not original work like that of Ed Lorenz?
L62: Unclear in what way predictability relates to spatio-temporal scale. I know myself that larger / low-frequency is more predictable, but perhaps good to explicitly state this. Also you can refer to Buizza and Leutbecher (2015).
L89-90: Explain why residual connections give the network this ability.
L115 Vague: “the contributions of evolution among different factors to the forecast may be different.” Please clarify.
Section 2.1: Even with the reference to Hsu et al (2015) this is too short a description: “Remove other ISO signals”. What is ‘other’ in this context? Also it would be good to explicitly mention the data you are filtering. My suggestion is to move the data-description at the end of section 2.2. (weatherbench) to here. Then you can also clarify if you do the filtering on a gridpoint basis or not. Also please mention the train/validation/test splits that you make.
L138-145: Introduce ResNet and what a residual block is and refer to the original paper (He et al., 2015), just like you do for the self-attention mechanism.
L162-165: I miss a mention of the domain.
L 203: here you start the discussion of forecasts. You mention that there are two types of predictions, one driven by the original data and one driven by the ISO components. This seems to concern the types of input, not the target against which it is trained (for both forecast models the task seems to be to predict unfiltered Z500/T850). Do I understand this correctly? Please outline the two variations already in the methods.
L 215: Statement: “Furthermore, in this case, the Z500 values predicted by the ISO components are closer to the ERA5 ISO components, with a mean RMSE of 575.96 (m2 s-2)”. It is obvious that the one driven by filtered information will indeed be closer to that information than the one without access to the filtered information. But it is confusing because in the sentences above this one, you discuss scores against unfiltered ERA5, and this also seems to be the content of Figure 2. Perhaps make clear that here you discuss scoring against ISO components, and that that is not shown in Figure 2. Add something like “(not shown)”.
Figure 3: From the title of panel a and b and the y labels of c and d I understand the figure presents both scores against ISO components and scores against unfiltered ERA5. Is that correct? Perhaps introduce these two variations of scoring in section 2.3, formula 1, where you would say that ti,j,k is either the filtered or unfiltered component. Or… if you always score against the ISO component (which is not clear to me), then say that ti,j,k is the ISO component.
L307-316: You present the scores stratified against latitude. The conclusion that differences are small in the tropics is not surprising. Z500 in the tropics is a nearly constant field with hardly any variability.
Textual comments
L14-15 “Forecast results tend to become intraseasonal low-frequency components”. Unclear what is meant with this. At large forecast times, model output can of course still be high frequency. Do you mean that “as forecast time increases, the only skillful forecasts are those of low-frequency components”?
L24 CFSv2 acronym mentioned without definition.
L27-28 Planetary wave numbers 3-8.
L144-145 But “with” zero?
L177: More commonly known as “anomaly correlation coefficient”
Figure 7: mention in figure caption that z500 is in contours, and t2m in shading.
Additional literature referred to:
Manrique-Suñén, A., Gonzalez-Reviriego, N., Torralba, V., Cortesi, N., and Doblas-Reyes, F. J.: Choices in the Verification of S2S Forecasts and Their Implications for Climate Services, Monthly Weather Review, 148, 3995-4008, 2020.
Buizza, R. and Leutbecher, M.: The forecast skill horizon, Quarterly Journal of the Royal Meteorological Society, 141, 3366-3382, 2015.
Citation: https://doi.org/10.5194/gmd-2022-146-RC1
AC2: 'Reply on RC1', Dingan Huang, 22 Aug 2022
General impression
This paper presents a deep learning framework for sub-seasonal forecasting. The novelty relative to Rasp et al (2020) is the self-attention mechanism. An additional innovation is the extraction of intra-seasonal oscillation components. This extraction eases the learning task, as the only predictable atmospheric motions at the sub-seasonal lead time are the low-frequency components. The authors compare their framework to a version without self-attention and to numerical forecasts. The comparison is detailed, with multiple scores and a case study. Possibilities for clarification do however exist. This could increase the trustworthiness of the results.
Response: Thanks a lot for your encouraging and suggestive comments. We have improved our manuscript as you suggested. Please see the detailed responses as follows.
Overall comment:
1.A big finding of the study is that the deep learning models outperform the CFSv2 model beyond 10 days. But currently the comparison lacks information, rendering it un-trustworthy. The authors say they use the WeatherBench dataset but CFS is not part of the benchmark dataset. How was it obtained, and especially, how was it processed?
Response: Thank you for your constructive comment. The data used in this paper are listed as follows. (1) Model training data are provided by the WeatherBench challenge. A detailed description can be found in the study of Rasp et al. (2020), and the latest dataset can be obtained at https://github.com/pangeo-data/WeatherBench. The dataset mainly contains ERA5 data from 1979 to 2018, and the horizontal resolution of the dataset used in this paper is 5.625°×5.625°. (2) The forecast results of Z500 and T850 in the CFSv2 model dataset for the next 30 days. This dataset was downloaded from https://www.ncei.noaa.gov and contains data for 2000-2018.
The original CFSv2 forecast data we downloaded cover the global area with a resolution of 1°×1°. The lead times we used range from 1 to 30 days. We interpolated the CFSv2 data onto the same grid points as the ERA5 data used in this paper (5.625°×5.625°). Accordingly, we have added a more detailed description of how we obtained and processed the CFSv2 dataset as well as the ERA5 dataset in Sec. 2.1 (data and method) of our new MS.
Dataset citation:
Saha, S., Moorthi, S., Wu, X., Wang, J., Nadiga, S., Tripp, P., Behringer, D., Hou, Y., Chuang, H., Iredell, M., Ek, M., Meng, J., Yang, R., Mendez, M. P., van den Dool, H., Zhang, Q., Wang, W., Chen, M., and Becker, E.: The NCEP Climate Forecast System Version 2, Journal of Climate, 27(6), 2185-2208, https://doi.org/10.1175/JCLI-D-12-00823.1, 2014.
Hersbach, H., Bell, B., Berrisford, P., et al.: The ERA5 global reanalysis, Quarterly Journal of the Royal Meteorological Society, 146(730), 2020.
2.In the verification you make use of the ISO components (e.g. fig 4a), which requires the removal of the seasonal cycle. Was the seasonal cycle for CFS estimated from model reforecasts or something else? Choices like that influence results (see for instance Manrique et al 2020).
Also, how was lead time treated in your CFS-processing? At a lead time of 1 day, for instance, right after initialization, it is impossible to subtract a 15 day rolling average of the previous days, which is step 2 in your ISO procedure (see section 2.1). Because you often verify the ISO components of circulation, it seems unfair if CFS is the only unfiltered model in that comparison. Please clarify.
Response: Thank you for your critical suggestion. The filtering method includes three steps. In the first step, we subtract the 90-day low-pass filtered component so as to remove the slow-varying seasonal cycle.
Before this step, the daily anomalies used to calculate the 90-day low-pass filtered component are obtained by subtracting ERA5 climate values from the corresponding variables. The second step is to remove the interannual and interdecadal anomalies by subtracting the last 15-day running mean. And the last step is to remove the synoptic scale components by taking a 5-day running mean.
Therefore, this method aims to extract the 10-30 day ISO component. For example, taking the time series of Z500 at an arbitrary grid point (2.8125°N, 180°) for days 181-300 of 2017 from the CFS data, the filtered time series displays a clear intraseasonal feature and is highly consistent with the one derived from a 10-30 day Lanczos filter, with a correlation coefficient of 0.72 (Fig. S1a, see the supplemental figure file).
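For illustration, a minimal sketch of this three-step filter at a single grid point is given below. It assumes pandas rolling means, uses a 90-day running mean as a stand-in for the 90-day low-pass filter, and is not the exact code used in the MS:

```python
import pandas as pd

def iso_filter(series: pd.Series, clim: pd.Series) -> pd.Series:
    """Sketch of the 10-30 day ISO extraction described above.
    series, clim: daily values and daily climatology at one grid point."""
    anom = series - clim                                         # anomalies w.r.t. climate values
    low = anom.rolling(90, center=True, min_periods=1).mean()    # 90-day low-pass (running-mean stand-in)
    step1 = anom - low                                           # step 1: remove slow seasonal cycle
    step2 = step1 - step1.rolling(15, min_periods=1).mean()      # step 2: subtract trailing 15-day mean
    return step2.rolling(5, center=True, min_periods=1).mean()   # step 3: 5-day mean removes synoptic scale
```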
Considering that the choice of seasonal cycle from different datasets may influence the results (e.g. Manrique et al 2020), we also show the filtered results using 90-day low-pass-filtered climate values (2000-2010) calculated from both the ERA5 and CFS data in Fig. S1b. It can be seen that, when filtering the CFS data, the filtered time series calculated using either ERA5 or CFS climate values are highly consistent. Furthermore, we also quantified the overall difference between filtering the CFS data with CFS climate values and with ERA5 climate values (Fig. S1c). The differences consistently remain at a low level, approximately 2.45 m2 s-2. However, due to the limited period available in the CFS data, the climate values used in the filtering process in this paper are calculated from ERA5 data from 1981 to 2010. To better clarify this, we have added statements in our new MS.
In the filtering step, we first divide the whole dataset into 30 sub-datasets according to the lead time (+1 to +30 days). In each sub-dataset, we place forecasts issued on different dates but sharing the same lead time together in time sequence. For example, for the part with a lead time of 10 days, the data include (1) the forecast issued on 2017.1.1 for 2017.1.11, (2) the forecast issued on 2017.1.2 for 2017.1.12, and so on, giving a time series from 2017.1.11 to 2018.12.31. The filtering step is carried out in each part separately. For the filtering method we used, please refer to the corresponding section. Some data from 2016 are also used in the filtering step so that we obtain a result for every day in 2017-2018. Accordingly, we have added a detailed description of how we process the lead time in the CFSv2 dataset in Sec. 2.1 (data and method) of our new MS.
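As an illustration, a minimal sketch of this regrouping is shown below; the container name and dates are illustrative, and the resulting daily series is what the ISO filter sketched above is applied to:

```python
import pandas as pd

def lead_time_series(forecasts, lead_days, start="2017-01-01", end="2018-12-31"):
    """Collect all forecasts sharing one lead time into a daily series ordered by valid date.
    forecasts is assumed to map (init_date, lead_days) -> gridded field."""
    valid_dates = pd.date_range(start, end, freq="D")
    return {d: forecasts[(d - pd.Timedelta(days=lead_days), lead_days)] for d in valid_dates}
```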
3.Your figures contain a hint that the processing is unfair. Performance of the CFS model is heavily curved: it dips below that of the climatology benchmark and only later levels off (fig, 4a, 6a, 8a). Such behavior is unexpected for a numerical model. The numerical skill in z500 should directly level off at the climatology (Figure 4a, Buizza & Leutbecher 2015).
Response: Thank you for your critical comment. The climatology benchmark shown in this paper is calculated from the ERA5 data. Since ERA5 is a state-of-the-art reanalysis dataset (Hersbach et al. 2020), we consider it the ground truth in this paper. If we use the forecast-model-derived climatology, as in Buizza and Leutbecher (2015), and then compare the forecast results with the ERA5 data, we can see that the prediction skill of CFS gradually approaches the climatology of the CFS model as the forecast time increases (Fig. S2). However, the skill of the ERA5 climatology benchmark is clearly better than that of the CFS climatology forecast (the RMSE is 577.62 m2 s-2 vs 598.88 m2 s-2 and the ACC is 70.01% vs 67.48%). That is because we use the ERA5 reanalysis as the ground truth in this paper, as mentioned above. Accordingly, we have added the CFS climatology forecasting result and its description in our new MS.
Specific comments
4.L42: Is Mayer et al 2021 the best reference for the chaotic nature of the atmosphere? Why not original work like that of Ed Lorenz?
Response: Ed Lorenz’s work ‘Deterministic nonperiodic flow’ has been added to the reference.
Lorenz, E. N.: Deterministic nonperiodic flow, Journal of the Atmospheric Science, 20, 130-141, https://doi.org/10.1007/978-0-387-21830-4_2, 1963.
5.L62: Unclear in what way predictability relates to spatio-temporal scale. I know myself that larger / low-frequency is more predictable, but perhaps good to explicitly state this. Also you can refer to Buizza and Leutbecher (2015).
Response: What we want to express here is that weather systems with different spatio-temporal scales often have different predictability. For example, for strong convective weather such as thunderstorms, hail and tornadoes, the upper limit of predictability is several hours. For synoptic systems, the predictability can reach up to two weeks, and for planetary-scale systems it is much longer. We state this in our new MS and also refer to Buizza and Leutbecher (2015).
6.L89-90: Explain why residual connections give the network this ability.
Response: ResNet is developed from convolutional networks to solve the degradation problem that can occur when a network is too deep. According to Rasp et al. (2021), the use of multiple residual layers allows the data features of the previous layer to be retained while the network continues to dig deeper into the data relationships, so that the network can memorize previous information in the process of extracting information. We have added this statement in our new MS.
7.L115 Vague: “the contributions of evolution among different factors to the forecast may be different.” Please clarify.
Response: Thank you for your suggestive comment. Since factors from different levels enter the network, it is natural to expect that the contribution made by different factors to the final result might differ, as some meteorological elements are more tightly related to each other than others. In fact, according to Rasp et al. (2021), when predicting T850, the factor that contributes the most to the final result is Z250, significantly more than the other factors. That is why we decided to use the SE block to automatically select the more important factors. Accordingly, we have clarified this in our new MS.
8.Section 2.1: Even with the reference to Hsu et al (2015) this is too short a description: “Remove other ISO signals”. What is ‘other’ in this context? Also it would be good to explicitly mention the data you are filtering. My suggestion is to move the data-description at the end of section 2.2. (weatherbench) to here. Then you can also clarify if you do the filtering on a gridpoint basis or not. Also please mention the train/validation/test splits that you make.
Response: In this filtering step, the other ISO signals are the interannual and interdecadal anomalies. The data subject to filtering include all the factors used in training the SE-ResNet and in evaluating the final results, including the ERA5 dataset and the CFSv2 forecast data. We perform the filtering on a gridpoint basis. The training set includes ERA5 data for 1980-2015, the validation set 2016, and the test set 2017-2018. We have added a more detailed description of the filtering process and the train/validation/test split in the data and method section of our new MS.
9.L138-145: Introduce ResNet and what a residual block is and refer to the original paper (He et al., 2015), just like you do for the self-attention mechanism.
Response: The original ResNet is developed from the VGG net. The original ResNet introduced in He et al. (2015) contains 34 convolution layers and ends with a global average pooling layer and a 1000-way fully connected layer with softmax. Shortcuts are inserted between convolution layers and can be used directly when the input and output have the same dimensions. The ResNet used in Rasp et al. (2021) is further developed from He's model and contains 17 residual blocks. In each residual block, there are two convolution blocks, each defined as a 2D convolution layer, an activation layer, a batch normalization layer and a dropout layer. We have introduced this in our new MS.
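For illustration, a minimal PyTorch-style sketch of such a residual block is given below; the channel count, kernel size and dropout rate are illustrative assumptions, and this is not the implementation used in the MS (which follows Rasp et al., 2021):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolution blocks (conv -> activation -> batch norm -> dropout) plus a shortcut."""
    def __init__(self, channels=128, kernel_size=3, dropout=0.1):
        super().__init__()
        def conv_block():
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2),
                nn.LeakyReLU(),
                nn.BatchNorm2d(channels),
                nn.Dropout(dropout),
            )
        self.block1, self.block2 = conv_block(), conv_block()

    def forward(self, x):
        # The identity shortcut retains earlier-layer features alongside the newly extracted ones.
        return x + self.block2(self.block1(x))
```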
10.L162-165: I miss a mention of the domain.
Response: The data that we use includes the global area with a resolution of 5.625°*5.625°.
11.L 203: here you start the discussion of forecasts. You mention that there are two types of predictions, one driven by the original data and one driven by the ISO components. This seems to concern the types of input, not the target against which it is trained (for both forecast models the task seems to be to predict unfiltered Z500/T850). Do I understand this correctly? Please outline the two variations already in the methods.
Response: In the two experiments that we conducted, the first used the original (unfiltered) data as input and its output is also unfiltered, while the second used the filtered data as input and its output is also filtered. When comparing the final results of these two experiments, we first extracted the ISO component from the outputs of the first experiment (unfiltered) and compared it with the outputs of the second experiment (filtered), as shown in Fig. 3c and 3d; they show a consistent low-frequency spatial distribution. Furthermore, since the skill of the ISO predictions from the filtered-input model is better than that of the unfiltered-input model in this case, the following results in this paper mainly show the filtered-input model. To clarify the two variations, we have added the processing details in Sec. 3.1 of our new MS.
12.L 215: Statement: “Furthermore, in this case, the Z500 values predicted by the ISO components are closer to the ERA5 ISO components, with a mean RMSE of 575.96 (m2 s-2)”. It is obvious that the one driven by filtered information will indeed be closer to that information than the one without access to the filtered information. But it is confusing because in the sentences above this one, you discuss scores against unfiltered ERA5, and this seems also be the content of Figure 2. Perhaps make clear that here you discuss scoring against ISO components, and that that is not shown in Figure 2. Add something like “(not shown)”.
Response: In this experiment, we are actually comparing the output of the model against the filtered ERA5 data, i.e. the ISO components extracted from the ERA5 dataset according to the method introduced in the data and method section.
13.Figure 3: From the title of panel a and b and the y labels of c and d I understand the figure presents both scores against ISO components and scores against unfiltered ERA5. Is that correct? Perhaps introduce these two variations of scoring in section 2.3, formula 1, where you would say that ti,j,k is either the filtered or unfiltered component. Or… if you always score against the ISO component (which is not clear to me), then say that ti,j,k is the ISO component.
Response: We would like to clarify here that all the comparisons in this article, except the one shown in Figure 2a, are made against the filtered ERA5 data (the ISO component). As for the formula in Section 2, we have revised it according to your suggestion in our new MS.
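For illustration, a minimal sketch of such a score is given below; it assumes a latitude-weighted RMSE in the WeatherBench style with the filtered ERA5 ISO component as the target, and the cos(lat) weighting is an assumption rather than necessarily the exact Eq. (1) of the paper:

```python
import numpy as np

def weighted_rmse(forecast, truth, lats):
    """forecast, truth: arrays shaped (time, lat, lon); lats: latitudes in degrees."""
    w = np.cos(np.deg2rad(lats))
    w = w / w.mean()                                    # normalize the latitude weights
    return np.sqrt(((forecast - truth) ** 2 * w[None, :, None]).mean())
```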
14.L307-316: You present the scores stratified against latitude. The conclusion that differences are small in the tropics is not surprising. Z500 in the tropics is a nearly constant field with hardly any variability.
Response: Yes, but we still mention this conclusion because we want to emphasize that the major difference comes from the extratropical region. Moreover, in Figure 6 we show the prediction results for the Z500 planetary waves to further explain why the differences between the two models appear mainly in the middle and high latitudes.
Textual comments
15.L14-15 “Forecast results tend to become intraseasonal low-frequency components”. Unclear what is meant with this. At large forecast times, model output can of course still be high frequency. Do you mean that “as forecast time increases, the only skillful forecasts are those of low-frequency components”?
Response: Thank you for your suggestive comment. That is indeed what we want to express. Theoretically, our model has the ability to output the high-frequency components, but this would be meaningless as the lead time of the model is already far beyond the predictability of those components. Accordingly, we have revised this in our new MS.
16.L24 CFSv2 acronym mentioned without definition.
Response: The full name of CFSv2 is Climate Forecast System Version 2. This has been added where it first appears in our new MS.
17.L27-28 Planetary wave numbers 3-8.
Response: Thank you for your suggestions. This mistake has been modified.
18.L144-145 But “with” zero?
Response: This sentence should be “All convolutions are padded periodically in the longitudinal direction with zero padding in the latitude direction.” It has been corrected in our new MS.
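For illustration, a minimal sketch of this padding scheme is given below (PyTorch-style, with an illustrative pad width; not the implementation used in the MS, which follows the periodic convolutions of Rasp et al., 2020):

```python
import torch.nn.functional as F

def pad_lon_lat(x, pad=1):
    """x: tensor shaped (batch, channels, lat, lon)."""
    x = F.pad(x, (pad, pad, 0, 0), mode="circular")                # periodic padding in longitude
    return F.pad(x, (0, 0, pad, pad), mode="constant", value=0.0)  # zero padding in latitude
```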
19.L177: More commonly known as “anomaly correlation coefficient”
Response: Thank you for your advice. The name for ACC has been modified.
20.Figure 7: mention in figure caption that z500 is in contours, and t2m in shading.
Response: Thank you for pointing out our mistakes. We have updated the figure title.
CEC1: 'Comment on gmd-2022-146', Juan Antonio Añel, 15 Aug 2022
Dear authors,
After checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
The Zenodo repository for the code is the same as for the data. In this way, your manuscript lacks the publication of code necessary to consider it for publication. Therefore, please, publish the code used in the manuscript in one of the appropriate repositories, and reply to this comment with the relevant information (link and DOI) as soon as possible, as it should be available for the Discussions stage.
In this way, moreover, you must include in a potential reviewed version of your manuscript the modified 'Code and Data Availability' section, with the DOI of the repository.
Also, remember to include a license for your code in the repository. If you do not include a license, the code continues to be your property and can not be used by others, despite any statement on being free to use. Therefore, when uploading the model's code to the repository, you could want to choose a free software/open-source (FLOSS) license. We recommend the GPLv3. You only need to include the file 'https://www.gnu.org/licenses/gpl-3.0.txt' as LICENSE.txt with your code. Also, you can choose other options that Zenodo provides: GPLv2, Apache License, MIT License, etc.
Please, be aware that failing to comply with this request could result in the rejection of your manuscript for publication.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/gmd-2022-146-CEC1
AC1: 'Reply on CEC1', Dingan Huang, 19 Aug 2022
Dear Editor,
We are grateful for your comments on our manuscript. The Zenodo repository for the code was the same as for the data because we had placed the code and data in the same Zenodo repository. To make a clearer distinction between where the two are stored, we have separated the code and data and re-uploaded them to Zenodo. In addition, a license has also been added to the new repository.
The scripts for training the ResNet and SE-ResNet model, and constructing figures are available in the following Zenodo repository: https://zenodo.org/record/7009166 (Lu et al., 2022a). And the data for training the models and the prediction of the models are archived at https://zenodo.org/record/7009111 (Lu et al., 2022b).
Thank you and best regards.
Yours sincerely,
Chuhan Lu, Dingan Huang, Yichen Shen, Fei Xin
Citation: https://doi.org/10.5194/gmd-2022-146-AC1
RC2: 'Comment on gmd-2022-146', Omar Jamil, 22 Aug 2022
General comments:
It is good to see more work being done in machine learning applications for weather and climate. This work aims to build on the work done by Rasp et al (2020) by applying SE-Resnet instead of Resnet. The focus here is on learning the intra-seasonal oscillation components. The physical model being compared to is CFSv2.
General scientific comments:
I have some concerns about the methodology being employed in this paper. One of the first things lacking in the paper is a detailed description of data processing. With data-driven methods it is really important to be transparent about how the data were processed and applied. The reader is referred to the WeatherBench paper, but the authors of this paper must outline what steps they took to ensure their machine learning models were trained properly, e.g. training-validation split, data standardisation/normalisation, what were the inputs for the neural network, how were the inputs structured? All these details are not only important for understanding the performance of the model, but also for reproducibility.
As the field of machine learning is an empirical science, it would be good to see the reasons behind why certain architectures were chosen, e.g. the number of residual blocks was set at 17 with two convolutional blocks -- why is this the best setup?
I am also concerned by the point-to-point prediction setup with 30 models for different lead times. This suggests the models are not generalising and are being optimised for a very specific subset of the problem. I would suggest asking the question, what is the real-world application of this? Training a separate model for each day of lead time is not efficient and suggests the models are not learning any of the underlying physical processes and therefore require retraining every time there is an improvement in the training data.
There are extensive comparisons between CFSv2, Resnet, SE-Resnet, climatology error and accuracy, which could benefit from having a table to make them easier to interpret, but some of the differences are so small that I would question their statistical significance. The main objective of the paper appears to be to show how deep learning can improve on the physical model's sub-seasonal forecast. However, for that comparison to be fair, it is important to show how well CFSv2 does under different setups and against other physical models. My question would be why choose CFSv2 for this comparison? Is that the best model there is for these lead times? Also, the comparisons between SE-Resnet and Resnet do not appear to be statistically significant. In order to understand how different parameters impact the ML model's performance, a Pareto front type analysis is required. For understanding the impact of random noise on the ML model, cross-validation should be used to show the difference between two different architectures is consistent and statistically significant.
Specific comments:
Figures 2, 7, A1 need to be bigger because in their current size it is almost impossible to see the details being discussed. Perhaps this was just formatting issue for the pre-print.
Line 246: 1.01% lower RMSE is a very small difference. It would be good to see statistical significance. Please see above my suggestions of Pareto front and cross-validation.
Lines 252-254: There is some odd phrasing which talks about 50% of deep learning model forecasts being below climatology, but for CFSv2 50% of forecasts being above climatology. I am sure the authors did not mean to suggest this, but it currently reads as if CFSv2 and the deep learning models are the same, just phrased in favour of the deep learning models.
Line 265: 0.75% RMSE improvement. Statistical significance?
Line 272: It would be good to see the RMSE values for other models quoted too. A table of comparison would be really useful for at-a-glance interpretation.
Line 288, 291: SE-Resnet and Resnet are so close that there appears to be no difference between them. What is the main benefit of SE-Resnet for this application?
Line 315: SE-Resnet and CFSv2 are very close, so what are the improvements being brought about by SE-Resnet?
Citation: https://doi.org/10.5194/gmd-2022-146-RC2
AC3: 'Reply on RC2', Dingan Huang, 13 Sep 2022
General comments:
It is good to see more work being done in machine learning applications for weather and climate. This work aims to build on the work done by Rasp et al (2020) by applying SE-Resnet instead of Resnet. The focus here is on learning the intra-seasonal oscillations components. The physical model being compared to is CFSv2.
Response: Thanks a lot for your encouraging and suggestive comments. We have improved our manuscript as you suggested. Please see details in the revised version of the MS and the responses as follows.
General scientific comments:
1. I have some concerns about the methodology being employed in this paper. One of the first things lacking in the paper is a detailed description of data processing. With data-driven methods it is really important to be transparent about how the data were processed and applied. The reader is referred to the WeatherBench paper, but the authors of this paper must outline what steps they took to ensure their machine learning models were trained properly, e.g. training-validation split, data standardisation/normalisation, what were the inputs for the neural network, how were the inputs structured? All these details are not only important for understanding the performance of the model, but also for reproducibility.
Response: Thank you for your suggestions. The data used in this paper are listed as follows. (1) Model training data are provided by the WeatherBench challenge. A detailed description can be found in the study of Rasp et al. (2020), and the latest dataset can be obtained at https://github.com/pangeo-data/WeatherBench. The dataset mainly contains ERA5 data from 1979 to 2018, and the horizontal resolution of the dataset used in this paper is 5.625°×5.625°. (2) The forecast results of Z500 and T850 in the CFSv2 model dataset for the next 30 days. This dataset was downloaded from https://www.ncei.noaa.gov and contains data for 2000-2018.
The original CFSv2 forecast data we downloaded cover the global area with a resolution of 1°×1°. The lead times we used range from 1 to 30 days. We interpolated the CFSv2 data onto the same grid points as the ERA5 data used in this paper (5.625°×5.625°).
The inputs are geopotential, temperature, zonal and meridional wind at seven vertical levels (50, 250, 500, 600, 700, 850 and 925 hPa), 2-meter temperature, and finally three constant fields: the land-sea mask, orography and the latitude at each grid point. All fields were normalized by subtracting the mean and dividing by the standard deviation. In the training process, we first extract a batch of data from the dataset and stack the different variables into an 8 (batch size) × 32 (latitudinal grid number) × 64 (longitudinal grid number) × 32 (4×7+4) array.
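As an illustration, a minimal sketch of this input preparation is given below; the function and variable names are illustrative and this is not the training code itself:

```python
import numpy as np

def standardize(field, mean, std):
    """Normalize one field by subtracting its mean and dividing by its standard deviation."""
    return (field - mean) / std

def stack_inputs(fields):
    """fields: list of 32 standardized arrays (4 variables x 7 levels, plus T2m and 3 constants),
    each shaped (batch, 32, 64) on the 5.625 deg grid; returns a (batch, 32, 64, 32) array."""
    assert len(fields) == 32
    return np.stack(fields, axis=-1)
```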
The training set includes ERA5 data in 1980-2015, the validation set in 2016 and test set in 2017-2018.
Accordingly, we have added a more detailed description of how we obtained and processed the CFSv2 dataset as well as the ERA5 dataset in Sec. 2.1 (data and method) of our new MS.
2. As the field of machine learning is an empirical science, it would be good to see the reasons behind why certain architectures were chosen e.g. number of residual blocks was set at 17 with two convolutional blocks -- why is this the best setup?
Response: Thank you for your precious advice. The forecast model used in this paper is developed based on the ResNet model designed by Rasp et al. (2021). In their study, the ResNet structure was found to perform well for the prediction of Z500 and T850, so we used the same residual block structure with two convolutional blocks. As for the number of residual blocks, we tested networks with different numbers of blocks and found that, compared with the 19 residual blocks of the original ResNet model in Rasp et al. (2020), reducing the number of residual blocks can achieve comparable or better prediction performance, so the number of residual blocks was set to 17 in our first MS. However, for the continuous forecast we determined the number of residual blocks of the SE-ResNet model to be 25 through experiments. The comparison of the networks' performance with different numbers of blocks is shown in the figure below.
In addition, since factors from different levels enter the network, it is natural to expect that the contribution made by different factors to the final result might differ, as some meteorological elements are more tightly related to each other than others. In fact, according to Rasp et al. (2021), when predicting T850, the factor that contributes the most to the final result is Z250, significantly more than the other factors, which means more attention should be paid to this kind of factor. That is why we decided to use the SE block to automatically select the more important factors.
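For illustration, a minimal PyTorch-style sketch of a squeeze-and-excitation (SE) block is given below; the reduction ratio is an illustrative assumption and this is not the implementation used in the MS:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Learn per-channel weights so more informative input factors (e.g. Z250 for T850) are emphasized."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)             # global average over lat/lon per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                   # channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                        # reweight the channels/factors
```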
3. I am also concerned by point-to-point prediction setup with 30 models for different lead times. This suggests the models are not generalising and being optimised for a very specific subset of the problem. I would suggest asking the question, what is the real-world application of this? Training a separate model for each day of lead time is not efficient and suggests the models are not learning any of the underlying physical processes and therefore require retraining every time there is an improvement in the training model data.
Response: Thank you for your critical advice. In order to learn the underlying physical processes, we have changed the original direct model into a continuous model (both defined in Rasp et al. (2021)). We tested these two types of models and found that the performance of the new model is comparable with that of the old models. In particular, the average RMSE for 10-30 days of the direct model is 552.48 m2 s-2 (2.17 K) and the average RMSE for 10-30 days of the continuous model is 558.68 m2 s-2 (2.19 K).
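For illustration, a minimal sketch of our reading of the continuous setup of Rasp et al. (2021) is given below: instead of training one network per lead time, the lead time is supplied as an extra input channel. This is a schematic assumption with illustrative names, not the authors' code:

```python
import numpy as np

def add_lead_time_channel(x, lead_days, max_lead=30):
    """x: (batch, lat, lon, channels); append a constant channel encoding the normalized lead time."""
    lead = np.full(x.shape[:-1] + (1,), lead_days / max_lead, dtype=x.dtype)
    return np.concatenate([x, lead], axis=-1)
```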
4. There are extensive comparisons between CFSv2, Resnet, SE-Resnet, Climatology error and accuracy, which could benefit from having a table to make it easier to interpret, but some of the differences are so small that I would question its statistical significance. The main objective of the paper appears to be to show how deep learning can improve on the physical model's sub-seasonal forecast. However, for that comparison to be fair, it is important to show how well CFSv2 does under different setups and against other physical models. My question would be why choose CFSv2 for this comparison? Is that best model there is for these lead times? Also, the comparisons between SE-Resnet and Resnet do not appear to be statistically significant. In order to understand how different parameters impact ML model's performance, a Pareto front type analysis is required. For understanding the impact of random noise on the ML model, cross-validation should be used to show the difference between two different architectures is consistent and statistically significant.
Response: The CFSv2 model we chose is widely used around the world for extended-range forecasting and represents the top ability in weather forecasting. Also, considering that we are extracting the low-frequency component as the predictand, the filtering method prefers input data that form a continuous time series. We therefore also chose the CMA, UKMO, KMA and NCEP model outputs from the S2S (sub-seasonal to seasonal prediction project) database at ECMWF, as they are the only four models that provide daily extended-range predictions consistent with our models as well as CFSv2. You may note that the forecasts we added all show weaker performance than CFSv2. That is probably because, according to Saha et al. (2013), the final result of the CFSv2 operational forecast is actually an ensemble mean of 16 runs, whereas the forecasts we added later contain only one control run. That is the reason why the gap between the CFSv2 forecast and the other forecasts is that wide (Fig. S2). This also indicates the superiority of CFSv2.
As for the question of how different parameters impact the model's performance, we carried out experiments training SE-ResNet and ResNet with different numbers of residual blocks. As shown in Fig. S1 above, we can see how the number of residual blocks influences the performance of the models by comparing the average RMSE these models achieve in forecasting Z500 and T850 10-30 days ahead. Finally, we chose 25 blocks, which is the best choice due to the lowest RMSE for both Z500 and T850.
We also used different test sets to verify the structural differences between the SE-ResNet and ResNet models (Tab. S1) for cross-validation. The results show that the prediction skill of SE-ResNet is consistently better than that of ResNet as the test set changes. In particular, the average gap for Z500 is 4.03 m2 s-2 and the average gap for T850 is 0.01 K. The table shown below not only indicates that the addition of the SE module can improve the ResNet model to a certain extent, but also shows that the superiority is steady: for all the test subsets, SE-ResNet outperforms ResNet. Based on these facts, we can say that the SE blocks significantly improve the performance of the network.
Specific comments:
5. Figures 2, 7, A1 need to be bigger because in their current size it is almost impossible to see the details being discussed. Perhaps this was just formatting issue for the pre-print.
Response: Thank you for your suggestions. We have modified the size of these pictures.
6. Line 246: 1.01% lower RMSE is a very small difference. It would be good to see statistical significance. Please see above my suggestions of Pareto front and cross-validation.
Response: To verify the difference between them, we conducted a significance test on the average prediction skill of the ResNet and SE-ResNet models. The results show that the difference between the two models passes the significance test (α = 0.05) within 10-16 days of the continuous forecast (Fig. S3). However, with the extension of the forecast time, the prediction skill of both models approaches the climatology prediction and the difference gradually shrinks, which leads to a smaller average improvement of the SE-ResNet model within 10-30 days.
7. Lines 252-254: There is some odd phrasing which talks about 50% of deep learning models forecast being below climatology, but for CFSv2 50% of forecasts are above climatology. I am sure the authors did not mean to suggest this, but it currently reads as if CFSv2 and deep learning models are the same, but phrased in favour of deep learning models.
Response: Thank you for your advice. What we want to express is that 75% of the samples predicted by the deep learning model are below the climatological forecast in days 10-15, and more than 50% of the predicted samples remain below the climatological forecast after that, whereas for the CFSv2 model more than 50% of the samples are above the climatological forecast from days 16-20 onwards. So the deep learning models outperform CFSv2. We have added the new description in our new MS.
8. Line 265: 0.75% RMSE improvement. Statistical significance?
Response: Thank you for your suggestion. This question is the same one posed on line 246, please refer to the answer above.
9. Line 272: It would be good to see the RMSE values for other models quoted too. A table of comparison would be really useful for at-a-glance interpretation.
Response: Thank you for your advice. We have also provided a table version of the RMSE comparisons. In addition, we have added four new forecast models to the comparison: the CMA model, the UKMO model, the KMA model and the NCEP model. Please refer to Tab. S2 in our new MS.
10. Line 288, 291: SE-Resnet and Resnet are so close that there appears to be no difference between them. What is the main benefit of SE-Resnet for this application?
Response: Although the improvement of SE-ResNet over ResNet in RMSE is only slight and passes the significance test (α = 0.05) only within 10-16 days of the continuous forecast, we believe that the SE block plays an indispensable role here. After introducing this block into the network, we are able to determine the importance of the input variables, so that the model's simulation of the occurrence and development of the weather system is more realistic, as shown in Fig. 7 of the MS.
11. Line 315: SE-Resnet and CFSv2 are very close, so what are the improvements being brought about by SE-Resnet?
Response: The final results of the two networks are close, but there is actually a significant improvement in SE-ResNet. We performed statistical tests on the RMSE of the models and found that the differences pass the test (α = 0.05) within 10-30 days; the RMSE of SE-ResNet is 558.68 m2 s-2 (2.19 K) while that of CFSv2 is 577.65 m2 s-2 (2.29 K), which shows that a significant improvement exists.
Interactive discussion
Status: closed
-
RC1: 'Comment on gmd-2022-146', Chiem van Straaten, 08 Aug 2022
General impression
This paper presents a deep learning framework for sub-seasonal forecasting. The novelty relative to Rasp et al (2020) is the self-attention mechanism. An additional innovation is the extraction of intra-seasonal oscillation components. This extraction eases the learning task, as the only predictable atmospheric motions at the sub-seasonal lead time are the low-frequency components. The authors compare their framework to a version without self-attention and to numerical forecasts. The comparison is detailed, with multiple scores and a case study. Possibilities for clarification do however exist. This could increase the trustworthiness of the results.
Overall comment:
A big finding of the study is that the deep learning models outperform the CFSv2 model beyond 10 days. But currently the comparison lacks information, rendering it un-trustworthy. The authors say they use the WeatherBench dataset but CFS is not part of the benchmark dataset. How was it obtained, and especially, how was it processed?
In the verification you make use of the ISO components (e.g. fig 4a), which requires the removal of the seasonal cycle. Was the seasonal cycle for CFS estimated from model reforecasts or something else? Choices like that influence results (see for instance Manrique et al 2020). Also, how was lead time treated in your CFS-processing? At a lead time of 1 days for instance, right after initialization, it is impossible to subtract a 15 day rolling average of the previous days, which is step 2 in your ISO procedure (see section 2.1). Because you often verify the ISO components of circulation, it seems unfair if CFS is the only unfiltered model in that comparison. Please clarify.
Your figures contain a hint that the processing is unfair. Performance of the CFS model is heavily curved: it dips below that of the climatology benchmark and only later levels off (fig, 4a, 6a, 8a). Such behavior is unexpected for a numerical model. The numerical skill in z500 should directly level off at the climatology (Figure 4a, Buizza & Leutbecher 2015).Specific comments
L42: Is Mayer et al 2021 the best reference for the chaotic nature of the atmosphere? Why not original work like that of Ed Lorenz?
L62: Unclear in what way predictability relates to spatio-temporal scale. I know myself that larger / low-frequency is more predictable, but perhaps good to explicitly state this. Also you can refer to Buizza and Leutbecher (2015).
L89-90: Explain why residual connections give the network this ability.
L115 Vague: “the contributions of evolution among different factors to the forecast may be different.” Please clarify.
Section 2.1: Even with the reference to Hsu et al (2015) this is too short a description: “Remove other ISO signals”. What is ‘other’ in this context? Also it would be good to explicitly mention the data you are filtering. My suggestion is to move the data-description at the end of section 2.2. (weatherbench) to here. Then you can also clarify if you do the filtering on a gridpoint basis or not. Also please mention the train/validation/test splits that you make.
L138-145: Introduce ResNet and what a residual block is and refer to the original paper (He et al., 2015), just like you do for the self-attention mechanism.
L162-165: I miss a mention of the domain.
L 203: here you start the discussion of forecasts. You mention that there are two types of predictions, one driven by the original data and one driven by the ISO components. This seems to concern the types of input, not the target against which it is trained (for both forecast models the task seems to be to predict unfiltered Z500/T850). Do I understand this correctly? Please outline the two variations already in the methods.
L 215: Statement: “Furthermore, in this case, the Z500 values predicted by the ISO components are closer to the ERA5 ISO components, with a mean RMSE of 575.96 (m 2 s -2 )”. It is obvious that the one driven by filtered information will indeed be closer to that information than the one without access to the filtered information. But it is confusing because in the sentences above this one, you discuss scores against unfiltered ERA5, and this seems also be the content of Figure 2. Perhaps make clear that here you discuss scoring against ISO components, and that that is not shown in Figure 2. Add something like “(not shown)”.
Figure 3: From the title of panel a and b and the y labels of c and d I understand the figure presents both scores against ISO components and scores against unfiltered ERA5. Is that correct? Perhaps introduce these two variations of scoring in section 2.3, formula 1, where you would say that ti,j,k is either the filtered or unfiltered component. Or… if you always score against the ISO component (which is not clear to me), then say that ti,j,k is the ISO component.
L307-316: You present the scores stratified against latitude. The conclusion that differences are small in the tropics is not surprising. Z500 in the tropics is a nearly constant field with hardly any variability.
Textual comments
L14-15 “Forecast results tend to become intraseasonal low-frequency components”. Unclear what is meant with this. At large forecast times, model output can of course still be high frequency. Do you mean that “as forecast time increases, the only skillful forecasts are those of low-frequency components”?
L24 CFSv2 acronym mentioned without definition.
L27-28 Planetary wave numbers 3-8.
L144-145 But “with” zero?
L177: More commonly known as “anomaly correlation coefficient”
Figure 7: mention in figure caption that z500 is in contours, and t2m in shading.
Additional literature referred to:
Manrique-Suñén, A.; Gonzalez-Reviriego, N.; Torralba, V.; Cortesi, N. & Doblas-Reyes, F. J. Choices in the Verification of S2S Forecasts and Their Implications for Climate Services Monthly Weather Review, American Meteorological Society, 2020, 148, 3995 - 4008
Buizza, R. & Leutbecher, M. The forecast skill horizon Quarterly Journal of the Royal Meteorological Society, Wiley Online Library, 2015, 141, 3366-3382
Citation: https://doi.org/10.5194/gmd-2022-146-RC1 -
AC2: 'Reply on RC1', Dingan Huang, 22 Aug 2022
General impression
This paper presents a deep learning framework for sub-seasonal forecasting. The novelty relative to Rasp et al (2020) is the self-attention mechanism. An additional innovation is the extraction of intra-seasonal oscillation components. This extraction eases the learning task, as the only predictable atmospheric motions at the sub-seasonal lead time are the low-frequency components. The authors compare their framework to a version without self-attention and to numerical forecasts. The comparison is detailed, with multiple scores and a case study. Possibilities for clarification do however exist. This could increase the trustworthiness of the results.
Response: Thanks a lot for your encouraging and suggestive comments. We have improved our manuscript as you suggested. Please see details the responses as follows.
Overall comment:
1.A big finding of the study is that the deep learning models outperform the CFSv2 model beyond 10 days. But currently the comparison lacks information, rendering it un-trustworthy. The authors say they use the WeatherBench dataset but CFS is not part of the benchmark dataset. How was it obtained, and especially, how was it processed?
Response: Thank you for your constructive comment. The data used in this paper are listed as follow. (1) Model training data are provided by the WeatherBench challenge. A detailed description can be found in studies of Rasp et al. (2020), and you can get the latest data set on https://github.com/pangeo-data/WeatherBench. The data set mainly contains ERA5 data from 1979 to 2018, and the horizontal resolution of the data set used in this paper is 5.625°×5.625°. (2) The forecast results of Z500 and T850 in the CFSv2 model data set for the next 30 days. This dataset was downloaded from https://www.ncei.noaa.gov containing data in 2000-2018.
The original CFSv2 forecast data we download covers the global area with a resolution of 1°×1°. The lead time we used ranges from 1-30 days. We interpolated the CFSv2 data into the same grid points as ERA5 data in this paper (5.625°×5.625°). Accordingly, we have added the description of about how we get and process the CFSv2 dataset as well as ERA5 data set in more detail in the Sec. 2.1 (data and method) of our new MS.
Dataset citation:
Saha, S., Moorthi, S., Wu, X., Wang, J., Nadiga, S., Tripp, P., Behringer, D., Hou, Y., Chuang, H., Iredell, M., Ek, M., Meng, J., Yang, R., Mendez, M. P., van den Dool, H., Zhang, Q., Wang, W., Chen, M., and Becker, E.: The NCEP Climate Forecast System Version 2, Journal of Climate, 27(6), 2185-2208, https://doi.org/10.1175/JCLI-D-12-00823.1, 2014.
Hersbach H, Bell B, Berrisford P, et al. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 2020, 146(730).
2.In the verification you make use of the ISO components (e.g. fig 4a), which requires the removal of the seasonal cycle. Was the seasonal cycle for CFS estimated from model reforecasts or something else? Choices like that influence results (see for instance Manrique et al 2020).
Also, how was lead time treated in your CFS-processing? At a lead time of 1 days for instance, right after initialization, it is impossible to subtract a 15 day rolling average of the previous days, which is step 2 in your ISO procedure (see section 2.1). Because you often verify the ISO components of circulation, it seems unfair if CFS is the only unfiltered model in that comparison. Please clarify.
Response: Thank you for your critical suggestion. The filtering method includes three steps. In the first step, we subtract the 90-day low-pass filtered component so as to remove the slow-varying seasonal cycle.
Before this step, the daily anomalies used to calculate the 90-day low-pass filtered component are obtained by subtracting ERA5 climate values from the corresponding variables. The second step is to remove the interannual and interdecadal anomalies by subtracting the last 15-day running mean. And the last step is to remove the synoptic scale components by taking a 5-day running mean.
Therefore, this method is try to take the 10-30 day ISO component. For example, taking the time series of Z500 at an arbitrary grid point (2.8125°N,180°) of 181-300 days in 2017 from CFS data, the filter time series display clear intraseasonal feature and is well consistent with the one derived from 10-30 day Lanczos filtering method, and the correlation coefficient is 0.72 (Fig. S1a, see the supplemental figure file).
Considering the choices of seasonal cycle from different dataset may influence the results (e.g. Manrique et al 2020), we also show the filtered results using 90-day low-pass-filtered climate values (2000-2010) calculated from both ERA5 and CFS data in Fig. S1b, respectively. It can be found that, when filtering CFS data, the filter time series calculated using ERA5 or CFS data show highly consistency. Furthermore, we also demonstrated the overall difference in filtering CFS data using both CFS and ERA5 climate values and quantified the difference of between the two filtering results (Fig. S1c). The differences consistently remain at a low level, with a value of 2.45 m2 s-2 approximately. However, due to the limited time available in CFS data, the climate data used in the filtering process in this paper are calculated from ERA5 data from 1981 to 2010. To better clarify this, we have added statements in our new MS.
In the filtering step, we firstly divide the whole dataset into 30 sub-datasets according to the lead time (+1~+30day). In each sub-dataset, we place forecasts of different dates with the same lead time together following the time sequence. For example, for the part with lead time of 10 days, the data contained includes (1) 2017.1.1 forecasting 2017.1.11 (2) 2017.1.2 forecasting 2017.1.12 and so on, then we got the time series from 2017.1.11 to 2018.12.31. The filtering step is carried out in each part separately. For the filtering method we used, please refer to the corresponding section. Some of the data in 2016 is also used in the filtering part so that we can get a result for every day in 2017-2018. Accordingly, we have added the description of about how we process the lead time in CFSv2 dataset in detail in the Sec. 2.1 (data and method) of our new MS.
3.Your figures contain a hint that the processing is unfair. Performance of the CFS model is heavily curved: it dips below that of the climatology benchmark and only later levels off (Figs. 4a, 6a, 8a). Such behavior is unexpected for a numerical model. The numerical skill in z500 should directly level off at the climatology (Figure 4a, Buizza & Leutbecher 2015).
Response: Thank you for your critical comment. The climatology benchmark shown in this paper is calculated from ERA5 data. Since ERA5 is a state-of-the-art reanalysis dataset (Hersbach et al. 2020), we consider it the ground truth in this paper. If we use the forecast-model-derived climatology, as in Buizza and Leutbecher (2015), and compare the forecast against the ERA5 data, the prediction skill of CFS gradually approaches the climatology of the CFS model as the forecast time increases (Fig. S2). However, the skill of the ERA5 climatology benchmark is clearly better than that of the CFS climatology forecast (RMSE of 577.62 m2 s-2 vs 598.88 m2 s-2 and ACC of 70.01% vs 67.48%). That is because we use the ERA5 reanalysis as the ground truth, as mentioned above. Accordingly, we have added the CFS climatology forecast result and its description to our new MS.
Specific comments
4.L42: Is Mayer et al 2021 the best reference for the chaotic nature of the atmosphere? Why not original work like that of Ed Lorenz?
Response: Ed Lorenz's work 'Deterministic nonperiodic flow' has been added to the references.
Lorenz, E. N.: Deterministic nonperiodic flow, Journal of the Atmospheric Sciences, 20, 130-141, https://doi.org/10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2, 1963.
5.L62: Unclear in what way predictability relates to spatio-temporal scale. I know myself that larger / low-frequency is more predictable, but perhaps good to explicitly state this. Also you can refer to Buizza and Leutbecher (2015).
Response: What we want to express here is that weather systems with different spatio-temporal scales often have different predictability. For example, for strong convective weather such as thunderstorms, hail and tornadoes, the upper limit of predictability is several hours. For synoptic systems, the predictability can reach up to two weeks, and for planetary-scale systems it is much longer. We state this in our new MS and also refer to Buizza and Leutbecher (2015).
6.L89-90: Explain why residual connections give the network this ability.
Response: ResNet was developed from convolutional networks to solve the degradation problem that can occur when a network becomes too deep. According to Rasp et al. (2021), the residual connections allow each layer to retain the features of the previous layer while continuing to extract deeper relationships in the data, so the network can remember earlier information during the process of extracting new information. We have added this statement in our new MS.
7.L115 Vague: “the contributions of evolution among different factors to the forecast may be different.” Please clarify.
Response: Thank you for your suggestive comment. Since factors at different levels enter the network, it is natural to expect that the contribution of each factor to the final result differs, because some meteorological elements are more tightly related to each other than others. In fact, according to Rasp et al. (2021), when predicting T850 the factor that contributes the most to the final result is Z250, significantly more than the other factors. That is why we decided to use the SE block to automatically select the more important factors. Accordingly, we have clarified this in our new MS.
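To illustrate how an SE block can weight the input factors (channels), a minimal tf.keras sketch follows; the reduction ratio of 16 and the layer sizes are assumptions and need not match the configuration in the MS.

    from tensorflow.keras import layers

    def se_block(x, ratio=16):
        # squeeze-and-excitation: learn one weight per input channel (i.e. per factor)
        channels = x.shape[-1]
        s = layers.GlobalAveragePooling2D()(x)                        # squeeze: one value per channel
        s = layers.Dense(max(channels // ratio, 1), activation="relu")(s)
        s = layers.Dense(channels, activation="sigmoid")(s)           # excitation: channel weights in (0, 1)
        s = layers.Reshape((1, 1, channels))(s)
        return layers.Multiply()([x, s])                              # reweight each channel of the input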
8.Section 2.1: Even with the reference to Hsu et al (2015) this is too short a description: “Remove other ISO signals”. What is ‘other’ in this context? Also it would be good to explicitly mention the data you are filtering. My suggestion is to move the data-description at the end of section 2.2. (weatherbench) to here. Then you can also clarify if you do the filtering on a gridpoint basis or not. Also please mention the train/validation/test splits that you make.
Response: In this filtering step, the other ISO signals are the interannual and interdecadal anomalies. The data subject to filtering include all the factors used in training the SE-ResNet and in evaluating the final results, i.e. the ERA5 dataset and the CFSv2 forecast data. The filtering is done on a gridpoint basis. The training set includes ERA5 data for 1980-2015, the validation set 2016, and the test set 2017-2018. We have added a more detailed description of the filtering process and the train/validation/test split to the data and methods section of our new MS.
9.L138-145: Introduce ResNet and what a residual block is and refer to the original paper (He et al., 2015), just like you do for the self-attention mechanism.
Response: The original ResNet was developed from the VGG network. The original ResNet introduced in He et al. (2015) contains 34 convolutional layers and ends with a global average pooling layer and a 1000-way fully connected layer with softmax. Shortcuts are inserted between convolutional layers and can be used directly when the input and output have the same dimensions. The ResNet used in Rasp et al. (2021) is further developed from He's model and contains 17 residual blocks. Each residual block contains two convolution blocks, each defined as a 2D convolution layer, an activation layer, a batch normalization layer and a dropout layer. We have introduced this in our new MS.
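A rough tf.keras sketch of one such residual block (two convolution blocks plus a skip connection) is given below; the kernel size, activation and dropout rate are assumed values, and the periodic longitudinal padding used in the paper is replaced by ordinary 'same' padding for brevity.

    from tensorflow.keras import layers

    def conv_block(x, filters, kernel=3, dropout=0.1):
        # one convolution block: Conv2D -> activation -> batch norm -> dropout
        x = layers.Conv2D(filters, kernel, padding="same")(x)   # the paper pads periodically in longitude, zero in latitude
        x = layers.LeakyReLU()(x)
        x = layers.BatchNormalization()(x)
        return layers.Dropout(dropout)(x)

    def residual_block(x, filters):
        # two convolution blocks plus a skip connection (assumes x already has `filters` channels)
        y = conv_block(x, filters)
        y = conv_block(y, filters)
        return layers.Add()([x, y])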
10.L162-165: I miss a mention of the domain.
Response: The data we use cover the global domain with a resolution of 5.625°×5.625°.
11.L 203: here you start the discussion of forecasts. You mention that there are two types of predictions, one driven by the original data and one driven by the ISO components. This seems to concern the types of input, not the target against which it is trained (for both forecast models the task seems to be to predict unfiltered Z500/T850). Do I understand this correctly? Please outline the two variations already in the methods.
Response: In the two experiments we carried out, the first used the original (unfiltered) data as input and its output is also unfiltered, while the second used the filtered data as input and its output is also filtered. When comparing the final results of the two experiments, we first extracted the ISO component from the outputs of the first (unfiltered) experiment and compared it with the outputs of the second (filtered) experiment, as shown in Fig. 3c and 3d; they have a consistent low-frequency spatial distribution. Furthermore, since the skill of the ISO predictions from the filtered-input model is better than that of the unfiltered-input model in this case, the following results in this paper mainly show the filtered-input model. To clarify the two variations, we have added a detailed description of this procedure in Sec. 3.1 of our new MS.
12.L 215: Statement: “Furthermore, in this case, the Z500 values predicted by the ISO components are closer to the ERA5 ISO components, with a mean RMSE of 575.96 (m2 s-2)”. It is obvious that the one driven by filtered information will indeed be closer to that information than the one without access to the filtered information. But it is confusing because in the sentences above this one, you discuss scores against unfiltered ERA5, and this seems also be the content of Figure 2. Perhaps make clear that here you discuss scoring against ISO components, and that that is not shown in Figure 2. Add something like “(not shown)”.
Response: In this experiment, we are actually comparing the model output against the filtered ERA5 data, i.e. the ISO components extracted from the ERA5 dataset according to the method introduced in the data and methods section.
13.Figure 3: From the title of panel a and b and the y labels of c and d I understand the figure presents both scores against ISO components and scores against unfiltered ERA5. Is that correct? Perhaps introduce these two variations of scoring in section 2.3, formula 1, where you would say that ti,j,k is either the filtered or unfiltered component. Or… if you always score against the ISO component (which is not clear to me), then say that ti,j,k is the ISO component.
Response: We would like to clarify that all the comparisons in this article, except the one shown in Figure 2a, are made against the filtered ERA5 data (the ISO component). As for the formula in Part 2, we have revised it according to your suggestion in our new MS.
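For reference, a latitude-weighted RMSE of the kind used in WeatherBench, in which the target t_{i,j,k} is taken as either the unfiltered or the ISO-filtered ERA5 field, can be written as below; this is an illustrative rendering and may differ in detail from the revised formula in the MS.

    \mathrm{RMSE} = \frac{1}{N_{\mathrm{forecasts}}} \sum_{i}
    \sqrt{\frac{1}{N_{\mathrm{lat}} N_{\mathrm{lon}}}
    \sum_{j=1}^{N_{\mathrm{lat}}} \sum_{k=1}^{N_{\mathrm{lon}}}
    L(j)\,\left(f_{i,j,k} - t_{i,j,k}\right)^{2}},
    \qquad
    L(j) = \frac{\cos\!\left(\mathrm{lat}_{j}\right)}
    {\frac{1}{N_{\mathrm{lat}}} \sum_{j'} \cos\!\left(\mathrm{lat}_{j'}\right)}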
14.L307-316: You present the scores stratified against latitude. The conclusion that differences are small in the tropics is not surprising. Z500 in the tropics is a nearly constant field with hardly any variability.
Response: Yes, but we still mention this conclusion because we want to emphasize that the major difference comes from the extratropical region. Moreover, in Figure 6 we show the prediction results for the Z500 planetary waves to further explain why the differences between the two models are found mainly in the mid and high latitudes.
Textual comments
15.L14-15 “Forecast results tend to become intraseasonal low-frequency components”. Unclear what is meant with this. At large forecast times, model output can of course still be high frequency. Do you mean that “as forecast time increases, the only skillful forecasts are those of low-frequency components”?
Response: Thank you for your suggestive comment. That is indeed what we want to express. In theory, our model is able to output the high-frequency components, but doing so would be meaningless because the lead time of the model is already far beyond the predictability of those components. Accordingly, we have revised this in our new MS.
16.L24 CFSv2 acronym mentioned without definition.
Response: The full name of CFSv2 is Climate Forecast System Version 2. This has been added where the acronym first appears in our new MS.
17.L27-28 Planetary wave numbers 3-8.
Response: Thank you for your suggestion. This mistake has been corrected.
18.L144-145 But “with” zero?
Response: This sentence should be “All convolutions are padded periodically in the longitudinal direction with zero padding in the latitude direction.” It has been corrected in our new MS.
19.L177: More commonly known as “anomaly correlation coefficient”
Response: Thank you for your advice. The name for ACC has been modified.
20.Figure 7: mention in figure caption that z500 is in contours, and t2m in shading.
Response: Thank you for pointing out our mistake. We have updated the figure caption.
-
CEC1: 'Comment on gmd-2022-146', Juan Antonio Añel, 15 Aug 2022
Dear authors,
After checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
The Zenodo repository for the code is the same as for the data. In this way, your manuscript lacks the publication of code necessary to consider it for publication. Therefore, please, publish the code used in the manuscript in one of the appropriate repositories, and reply to this comment with the relevant information (link and DOI) as soon as possible, as it should be available for the Discussions stage.
In this way, moreover, you must include in a potential reviewed version of your manuscript the modified 'Code and Data Availability' section, with the DOI of the repository.
Also, remember to include a license for your code in the repository. If you do not include a license, the code remains your property and cannot be used by others, despite any statement about being free to use. Therefore, when uploading the model's code to the repository, you may want to choose a free software/open-source (FLOSS) license. We recommend the GPLv3. You only need to include the file 'https://www.gnu.org/licenses/gpl-3.0.txt' as LICENSE.txt with your code. Also, you can choose other options that Zenodo provides: GPLv2, Apache License, MIT License, etc.
Please, be aware that failing to comply with this request could result in the rejection of your manuscript for publication.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/gmd-2022-146-CEC1
-
AC1: 'Reply on CEC1', Dingan Huang, 19 Aug 2022
Dear Editor,
We are grateful for your comments on our manuscript. The Zenodo repository for the code was the same as that for the data because we had put the code and data in the same repository. To make a clearer distinction between where the two are stored, we have separated the code and data and re-uploaded them to Zenodo as two repositories. In addition, a license has been added to the new repository.
The scripts for training the ResNet and SE-ResNet models and for constructing the figures are available in the following Zenodo repository: https://zenodo.org/record/7009166 (Lu et al., 2022a). The data for training the models and the model predictions are archived at https://zenodo.org/record/7009111 (Lu et al., 2022b).
Thank you and best regards.
Yours sincerely,
Chuhan Lu, Dingan Huang, Yichen Shen, Fei Xin
Citation: https://doi.org/10.5194/gmd-2022-146-AC1
-
RC2: 'Comment on gmd-2022-146', Omar Jamil, 22 Aug 2022
General comments:
It is good to see more work being done in machine learning applications for weather and climate. This work aims to build on the work done by Rasp et al (2020) by applying SE-Resnet instead of Resnet. The focus here is on learning the intra-seasonal oscillation components. The physical model being compared to is CFSv2.
General scientific comments:
I have some concerns about the methodology being employed in this paper. One of the first things lacking in the paper is a detailed description of data processing. With data-driven methods it is really important to be transparent about how the data were processed and applied. The reader is referred to the WeatherBench paper, but the authors of this paper must outline what steps they took to ensure their machine learning models were trained properly, e.g. training-validation split, data standardisation/normalisation, what were the inputs for the neural network, how were the inputs structured? All these details are not only important for understanding the performance of the model, but also for reproducibility.
As the field of machine learning is an empirical science, it would be good to see the reasons behind why certain architectures were chosen, e.g. the number of residual blocks was set at 17 with two convolutional blocks -- why is this the best setup?
I am also concerned by the point-to-point prediction setup with 30 models for different lead times. This suggests the models are not generalising and are being optimised for a very specific subset of the problem. I would suggest asking the question, what is the real-world application of this? Training a separate model for each day of lead time is not efficient and suggests the models are not learning any of the underlying physical processes and therefore require retraining every time there is an improvement in the training model data.
There are extensive comparisons between CFSv2, Resnet, SE-Resnet, Climatology error and accuracy, which could benefit from having a table to make it easier to interpret, but some of the differences are so small that I would question their statistical significance. The main objective of the paper appears to be to show how deep learning can improve on the physical model's sub-seasonal forecast. However, for that comparison to be fair, it is important to show how well CFSv2 does under different setups and against other physical models. My question would be why choose CFSv2 for this comparison? Is that the best model there is for these lead times? Also, the comparisons between SE-Resnet and Resnet do not appear to be statistically significant. In order to understand how different parameters impact the ML model's performance, a Pareto front type analysis is required. For understanding the impact of random noise on the ML model, cross-validation should be used to show the difference between two different architectures is consistent and statistically significant.
Specific comments:
Figures 2, 7, A1 need to be bigger because in their current size it is almost impossible to see the details being discussed. Perhaps this was just formatting issue for the pre-print.
Line 246: 1.01% lower RMSE is a very small difference. It would be good to see statistical significance. Please see above my suggestions of Pareto front and cross-validation.
Lines 252-254: There is some odd phrasing which talks about 50% of the deep learning models' forecasts being below climatology, but for CFSv2 50% of forecasts are above climatology. I am sure the authors did not mean to suggest this, but it currently reads as if CFSv2 and the deep learning models are the same, but phrased in favour of the deep learning models.
Line 265: 0.75% RMSE improvement. Statistical significance?
Line 272: It would be good to see the RMSE values for other models quoted too. A table of comparison would be really useful for at-a-glance interpretation.
Line 288, 291: SE-Resnet and Resnet are so close that there appears to be no difference between them. What is the main benefit of SE-Resnet for this application?
Line 315: SE-Resnet and CFSv2 are very close, so what are the improvements being brought about by SE-Resnet?
Citation: https://doi.org/10.5194/gmd-2022-146-RC2
-
AC3: 'Reply on RC2', Dingan Huang, 13 Sep 2022
General comments:
It is good to see more work being done in machine learning applications for weather and climate. This work aims to build on the work done by Rasp et al (2020) by applying SE-Resnet instead of Resnet. The focus here is on learning the intra-seasonal oscillations components. The physical model being compared to is CFSv2.
Response: Thanks a lot for your encouraging and suggestive comments. We have improved our manuscript as you suggested. Please see details in the revised version of the MS and the responses as follows.
General scientific comments:
1. I have some concerns about the methodology being employed in this paper. One of the first things lacking in the paper is a detailed description of data processing. With data-driven methods it is really important to be transparent about how the data were processed and applied. The reader is referred to the WeatherBench paper, but the authors of this paper must outline what steps they took to ensure their machine learning models were trained properly, e.g. training-validation split, data standardisation/normalisation, what were the inputs for the neural network, how were the inputs structured? All these details are not only important for understanding the performance of the model, but also for reproducibility.
Response: Thank you for your suggestions. The data used in this paper are listed as follows. (1) The model training data are provided by the WeatherBench challenge. A detailed description can be found in Rasp et al. (2020), and the latest dataset is available at https://github.com/pangeo-data/WeatherBench. The dataset mainly contains ERA5 data from 1979 to 2018, and the horizontal resolution used in this paper is 5.625°×5.625°. (2) The CFSv2 forecasts of Z500 and T850 for the next 30 days. This dataset was downloaded from https://www.ncei.noaa.gov and covers 2000-2018.
The original CFSv2 forecast data we downloaded cover the global domain with a resolution of 1°×1°. The lead times we used range from 1 to 30 days. We interpolated the CFSv2 data onto the same grid as the ERA5 data used in this paper (5.625°×5.625°).
The inputs are geopotential, temperature, and zonal and meridional wind at seven vertical levels (50, 250, 500, 600, 700, 850 and 925 hPa), 2-meter temperature, and finally three constant fields: the land-sea mask, orography and the latitude at each grid point. All fields were normalized by subtracting the mean and dividing by the standard deviation. In the training process, we first extract a batch of data from the dataset and stack the different variables into an 8 (batch size) × 32 (latitudinal grid points) × 64 (longitudinal grid points) × 32 (4×7+4 channels) array.
The training set includes ERA5 data for 1980-2015, the validation set 2016, and the test set 2017-2018.
Accordingly, we have added a more detailed description of how we obtained and processed the CFSv2 and ERA5 datasets to Sec. 2.1 (data and methods) of our new MS.
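A rough sketch of the normalization and stacking into the 8 × 32 × 64 × 32 batch described above follows; the function name and the list-of-fields layout are assumptions made for illustration only.

    import numpy as np

    def normalize_and_stack(fields, means, stds):
        # fields: list of 32 arrays of shape (batch, 32, 64): the 4 variables at 7 levels
        #         plus T2m and the 3 constant fields (4*7 + 4 = 32 channels)
        # means, stds: per-field statistics computed on the training period
        channels = [(f - m) / s for f, m, s in zip(fields, means, stds)]
        return np.stack(channels, axis=-1)   # -> (batch, 32, 64, 32) array fed to the network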
2. As the field of machine learning is an empirical science, it would be good to see the reasons behind why certain architectures were chosen e.g. number of residual blocks was set at 17 with two convolutional blocks -- why is this the best setup?
Response: Thank you for your valuable advice. The forecast model used in this paper is developed from the ResNet model designed by Rasp et al. (2021). In their study, the ResNet structure was found to perform well for the prediction of Z500 and T850, so we used the same residual block structure with two convolutional blocks. As for the number of residual blocks, we tested networks with different numbers of blocks and found that, compared with the 19 residual blocks of the original ResNet model in Rasp et al. (2020), reducing the number of residual blocks gives comparable or better prediction skill, so the number of residual blocks was set to 17 in our first MS. However, for the continuous forecast we determined the number of residual blocks of the SE-ResNet model to be 25 through experiments. The comparison of the networks' performance with different numbers of blocks is shown in the figure below.
In addition, since factors at different levels enter the network, it is natural to expect that the contribution of each factor to the final result differs, because some meteorological elements are more tightly related to each other than others. In fact, according to Rasp et al. (2021), when predicting T850 the factor that contributes the most to the final result is Z250, significantly more than the other factors, which means more attention should be paid to such factors. That is why we decided to use the SE block to automatically select the more important factors.
3. I am also concerned by point-to-point prediction setup with 30 models for different lead times. This suggests the models are not generalising and being optimised for a very specific subset of the problem. I would suggest asking the question, what is the real-world application of this? Training a separate model for each day of lead time is not efficient and suggests the models are not learning any of the underlying physical processes and therefore require retraining every time there is an improvement in the training model data.
Response: Thank you for your critical advice. In order to learn the underlying physical processes, we have changed the original direct model into a continuous model (both defined in Rasp et al. (2021)). We have tested the two types of models and found that the performance of the new model is comparable to that of the old models. In particular, the average 10-30 day RMSE of the direct model is 552.48 m2 s-2 (2.17 K) and that of the continuous model is 558.68 m2 s-2 (2.19 K).
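One possible reading of the continuous setup, following the definitions in Rasp et al. (2021), is that the lead time is supplied to the network as an additional normalized input channel so that a single model covers all lead times; the sketch below illustrates this interpretation and is an assumption, not the authors' code.

    import numpy as np

    def add_lead_time_channel(x, lead_days, max_lead=30):
        # x: (batch, nlat, nlon, nchannels) input fields; lead_days: lead time of each sample in days
        batch, nlat, nlon, _ = x.shape
        lt = (np.asarray(lead_days, dtype=np.float32) / max_lead).reshape(batch, 1, 1, 1)
        lt = np.broadcast_to(lt, (batch, nlat, nlon, 1))
        return np.concatenate([x, lt], axis=-1)   # one network then serves all lead times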
4. There are extensive comparisons between CFSv2, Resnet, SE-Resnet, Climatology error and accuracy, which could benefit from having a table to make it easier to interpret, but some of the differences are so small that I would question its statistical significance. The main objective of the paper appears to be to show how deep learning can improve on the physical model's sub-seasonal forecast. However, for that comparison to be fair, it is important to show how well CFSv2 does under different setups and against other physical models. My question would be why choose CFSv2 for this comparison? Is that best model there is for these lead times? Also, the comparisons between SE-Resnet and Resnet do not appear to be statistically significant. In order to understand how different parameters impact ML model's performance, a Pareto front type analysis is required. For understanding the impact of random noise on the ML model, cross-validation should be used to show the difference between two different architectures is consistent and statistically significant.
Response: The CFSv2 model we chose is widely used around the world for extended-range forecasts and represents the top level of ability in weather forecasting. Also, since we extract the low-frequency component as the predictand, the filtering method prefers input data that form a continuous time series. We therefore also chose the CMA, UKMO, KMA and NCEP model outputs from the S2S (sub-seasonal to seasonal prediction project) database hosted at ECMWF; they are the only four models that provide daily extended-range predictions consistent with our models as well as with CFSv2. You may note that the forecasts we added all show weaker performance than CFSv2. That is perhaps because, according to Saha et al. (2013), the final result of the CFSv2 operational forecast is actually an ensemble mean of 16 runs, whereas the forecasts we added only contain one control run. That is why the gap between the CFSv2 forecast and the other forecasts is so wide (Fig. S2). This also shows the superiority of CFSv2.
As for the question of how different parameters impact the model's performance, we carried out experiments by training SE-ResNet and ResNet with different numbers of residual blocks. As shown in Fig. S1 above, we can see how the number of residual blocks influences the performance of the models by comparing the average RMSE of these models in forecasting Z500 and T850 10-30 days ahead. Finally, we chose 25 blocks, which is the best choice because it gives the lowest RMSE for both Z500 and T850.
We also used different test sets to verify the structural differences between the SE-ResNet and ResNet models (Tab. S1) as a form of cross-validation. The results show that the prediction skill of SE-ResNet is consistently better than that of ResNet as the test set changes. In particular, the average gap for Z500 is 4.03 m2 s-2 and the average gap for T850 is 0.01 K. The table below not only indicates that the addition of the SE module can improve the ResNet model to a certain extent, but also shows that the improvement is steady, as SE-ResNet outperforms ResNet for all test subsets. Based on these facts, we can say that the SE blocks significantly improve the performance of the network.
Specific comments:
5. Figures 2, 7, A1 need to be bigger because in their current size it is almost impossible to see the details being discussed. Perhaps this was just formatting issue for the pre-print.
Response: Thank you for your suggestions. We have increased the size of these figures.
6. Line 246: 1.01% lower RMSE is a very small difference. It would be good to see statistical significance. Please see above my suggestions of Pareto front and cross-validation.
Response: To verify the difference between them, we conducted a significance test on the mean prediction errors of the ResNet and SE-ResNet models. The results show that the difference between the two models passes the significance test (α = 0.05) within 10-16 days of the continuous forecast (Fig. S3). However, as the forecast time extends, the prediction skill of both models approaches the climatology prediction and the difference gradually decreases, which leads to a smaller average improvement of the SE-ResNet model within 10-30 days.
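A test of this kind (comparing the two models' errors at a given lead time at the 0.05 level) could be sketched as a paired t-test, as below; the specific test used in the revision is not stated here, so this is only an assumed illustration.

    import numpy as np
    from scipy import stats

    def se_resnet_better(err_resnet, err_se_resnet, alpha=0.05):
        # err_*: per-forecast errors (e.g. RMSE per initialization date) at one lead time
        t_stat, p_value = stats.ttest_rel(err_resnet, err_se_resnet)
        return (p_value < alpha) and (np.mean(err_se_resnet) < np.mean(err_resnet))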
7. Lines 252-254: There is some odd phrasing which talks about 50% of deep learning models forecast being below climatology, but for CFSv2 50% of forecasts are above climatology. I am sure the authors did not mean to suggest this, but it currently reads as if CFSv2 and deep learning models are the same, but phrased in favour of deep learning models.
Response: Thank you for your advice. What we want to express is that 75% of the samples predicted by the deep learning model are below the climatological forecast in 10-15 days, and more than 50% of the predicted samples remain below the climatological forecast after that, while more than 50% of the CFSv2 predictions are above the climatological forecast in 16-20 days and beyond. So the deep learning models outperform CFSv2. We have added the new description in our new MS.
8. Line 265: 0.75% RMSE improvement. Statistical significance?
Response: Thank you for your suggestion. This question is the same as the one raised for Line 246; please refer to the answer above.
9. Line 272: It would be good to see the RMSE values for other models quoted too. A table of comparison would be really useful for at-a-glance interpretation.
Response: Thank you for your advice. We have also provided a table version of the RMSE comparisons, and we have added four more forecast models to the comparison: the CMA, UKMO, KMA and NCEP models. Please refer to Tab. S2 in our new MS.
10. Line 288, 291: SE-Resnet and Resnet are so close that there appears to be no difference between them. What is the main benefit of SE-Resnet for this application?
Response: Although there is only a slight improvement of SE-ResNet over ResNet in RMSE, and it only passes the significance test (α = 0.05) within 10-16 days of the continuous forecast, we believe that the SE block plays an indispensable role here. After introducing this block into the network, the model can determine the importance of the input variables, so its simulation of the occurrence and development of the weather system is more realistic, as shown in Fig. 7 of the MS.
11. Line 315: SE-Resnet and CFSv2 are very close, so what are the improvements being brought about by SE-Resnet?
Response: The final results of the two networks are close, but there are actually significant improvements in SE-ResNet. We performed statistical tests on the RMSE of the models and found that the difference passes the test (α = 0.05) within 10-30 days; the RMSE of SE-ResNet is 558.68 m2 s-2 (2.19 K) while the RMSE of CFSv2 is 577.65 m2 s-2 (2.29 K), which shows that a significant improvement exists.