the Creative Commons Attribution 4.0 License.
Adapting a deep convolutional RNN model with imbalanced regression loss for improved spatio-temporal forecasting of extreme wind speed events in the short to medium range
Daan R. Scheepens
Kateřina Hlaváčková-Schindler
Claudia Plant
Download
- Final revised paper (published on 10 Jan 2023)
- Preprint (discussion started on 12 Jul 2022)
- Supplement to the preprint
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2022-599', Anonymous Referee #1, 14 Jul 2022
The authors investigate ConvLSTM-based models for wind speed prediction at lead times up to 12 hours, motivated by energy applications and with a focus on Europe. The central contribution of the paper is an investigation of different types of loss functions with the aim of improving predictions of extreme events.
Overall the paper is well-written and easy to follow. While it presents a new and interesting perspective on the training of LSTM models, there are several key issues in my view. Most importantly, the investigation and evaluation of forecasts of extremes should be better motivated and connected to the intended applications, and the description of the technical details of the model needs to be improved. These and additional comments are detailed below.
Major comments
- Perhaps the most important issue to me is that the investigation of extremes, and the specific approach taken in the paper, should be motivated better and be connected better to the application:
- For example, the focus is on the relative rarity at each coordinate (page 7), rather than on exceedances of thresholds, even though the introduction motivates the work via the shutting off of turbines at certain thresholds. While I understand that considering relative rarity makes the modeling easier, it should be better explained why this would be practically relevant for wind energy applications. More generally, what is the aim of the proposed models? Should they work for all outcomes, but be better at predicting extremes? Or should they specifically focus on extremes, but not care so much about non-extremes? Relevant literature from statistics on evaluation (e.g. Brehmer and Strokorb, 2019, DOI: 10.1214/19-EJS1622, and references therein) could be consulted as a starting point for discussions on this aspect.
- What is the potential application, in practice, of the forecasts as they are produced by the models in the paper? While improving predictions for extremes, the quality for non-extremes will likely get worse. From an application perspective, what would be specific economic situations of users that would motivate the use of the proposed models?
- Regarding 2., a more practically useful approach to me seems to be a model that predicts probabilities of the exceedance of critical thresholds for shutting off turbines. This would be directly connected to different kinds of potential losses, and probabilities allow for optimal decision making in applications.
- All considered models are relatively complex (ConvLSTM models accounting for the spatial structure). To be able to compare models in a fair way, simpler benchmark models should be considered as well. In particular, a grid-point-wise standard NN model that uses previous time steps locally at the single grid point of interest, or similarly motivated local, per-grid-point LSTM models, should be included in the comparison to be able to evaluate whether the temporal and spatial aspects of the proposed models are truly relevant.
- The model description in Section 2.2 is rather short and does not include all relevant details to independently replicate the work. For example, which fields are exactly used as inputs (only wind speed? which previous time steps?), what does "12 hour input and 12 hour prediction" mean (page 8, line 209)? Are the predictions made hourly? Which inputs are exactly used at each time step?
Some more details are provided in Section 2.2.3, but it does not become clear how you selected the hyperparameters (optimization algorithm, learning rate, ...). Did you try different values and how robust are the results in terms of these choices?
Minor comments
- The last sentence of the abstract should be moved to the acknowledgements section.
- The literature review should also refer to probabilistic predictions, which are of critical importance for extremes. For example, there has been a lot of recent work on probabilistic energy forecasting, in particular in the domain of physical-statistical hybrids. With regard to post-processing NWP models, the recent work of Phipps et al. (2022, https://doi.org/10.1002/we.2736) appears to be relevant in the context of the discussion on page 3, lines 68-71.
- page 4, line 107-114: The discussion of data-driven weather forecasting seems to not be relevant for the remainder of the paper, as the applications consider rather different time scales and variables.
- page 6, line 156: Why did you use 1000 hPa wind fields instead of the surface wind fields? Wouldn't this be a more relevant target for energy applications?
- page 7, line 185f: Doesn't the normalization performed here implicitly assume Gaussianity? As an alternative approach, it would have been possible to simply select the per-grid-point quantiles from the climatology of historic observations in the training set.
- There seem to be a lot of options for choosing the weight function in equation (3): Have you tested any alternatives and performed comparisons? How much do the results depend on the specifics of the definition of this weight function?
- page 12: Ferro and Stephenson (2011) argue that SEDI should only be applied to calibrated forecasts to guarantee convergence to a meaningful limit for rare events in the sense that the number of predicted events equals the number of observed events. Is this the case for the models considered here?
- page 12: Would it in principle also be possible to incorporate the SEDI loss in the model estimation, similar to what was recently proposed for the FSS in Lagerquist and Ebert-Uphoff (https://arxiv.org/abs/2203.11141)? This again relates to the question of what the actual goal is here for the models.
- Table 2+3: Are the results averaged over lead times?
- page 15, line 342: Why does the ensemble not include all 5 models, i.e. also the MAE and MSE based ones? This 5-model ensemble should be added to the comparison.
- page 20, line 409f: The shuffling procedure does not become completely clear here: Do you shuffle full fields, or also simply grid points within a field?
- page 20 / Figure A1: Why is the evaluation here based on RMSE, rather than SEDI scores as before?
Citation: https://doi.org/10.5194/egusphere-2022-599-RC1
AC1: 'Reply on RC1', Daan Scheepens, 28 Sep 2022
Dear Reviewer,
Thank you for your review and suggestions for improving the clarity of the manuscript. We proceed to answer your questions to the best of our ability. The changes and clarifications will be added to the final manuscript.
Major comments:
- We will address the points about model application and motivation better in the introduction. The aim of the paper is to investigate how the spatio-temporal predictions of a deep learning forecasting model may be improved for the extremes through manipulation of the loss function. Indeed, it is likely that any improvements for the extremes go hand-in-hand with predictions deteriorating for non-extremes. We will attempt to show more clearly in the results whatever trade-offs there are in this regard. From an application perspective, such a trade-off would nevertheless be very attractive for improving a model's efficacy as a warning system: its ability to distinguish different levels of extremes would be critical, while its ability to distinguish different levels of non-extremes would be irrelevant.
Furthermore, the focus on relative rarity in our definition of an extreme event will be better motivated in the methodology. What we mentioned perhaps too briefly in our manuscript is that the primary reason for this choice stems from the fact that extreme winds in the absolute sense (e.g. exceeding 25 m/s) occur exclusively off-shore in a very localised region and are thus absent from the majority of coordinates in the ERA5 reanalysis data (where typically only max. wind speeds of 8-10 m/s are present). Defining extreme events instead in terms of their relative rarity at each coordinate allows us to investigate forecasting improvements of extreme events more generally, by looking at improvements on the tails of the respective distributions, regardless of what absolute values these tails actually attain. Demonstrating that forecasting performance on the tails can be improved in this more general context, by adapting the loss function, can be readily translated to other cases where the tails of the distributions denote actual hazardous events. This paper thus serves to indicate how the loss function may best be adapted to improve forecasting performance on the tails, with the assumption that the tails typically denote extreme events of some form. We will modify the paragraph in question in the methodology accordingly.
- It is true that probabilistic output is typically preferred by users, but we would argue that, in the same way that deterministic NWP forecasts are commonly aggregated into probabilities by utilising large ensembles, the deterministic forecasts of the ConvLSTM regression model can be aggregated into probabilities, e.g. with different ensemble members trained on different subsets of the training set. We will mention this possibility in the discussion.
- We certainly understand the reviewer's concern. However, rather than providing benchmark comparisons with other models, this point has made it apparent that we need to state the aim of the paper more clearly in the introduction. The aim of the paper is not to compare our adaptation of the ConvLSTM with other state-of-the-art models. Rather, it is to investigate how the performance of a popular spatio-temporal deep learning model (the ConvLSTM) changes for various thresholds of extreme events when utilising two types of loss functions proposed in the literature on imbalanced regression. For literature on the improvements of the ConvLSTM over other state-of-the-art models, or over simpler non-convolutional or non-recurrent models, the reader can be referred to Shi et al. (2015) or Shi et al. (2017), which we will mention in the introduction.
- We will clarify this point in the methodology. The 1000 hPa wind speed is the only variable used. The model takes in 12 consecutive hours of wind speed data over the 64x64 grid, comprising a tensor of size 12x64x64. This tensor is encoded by the encoding network into a hidden state and decoded by the decoding network into an output tensor of size 12x64x64, comprising the forecast of the subsequent 12 hours. The temporal correlations between consecutive hours are taken into account implicitly by the convolutional LSTM layers. For the exact details of how the convolutional LSTM modules achieve this, we refer the reader to Shi et al. (2015).
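For concreteness, the 12-hours-in/12-hours-out setup described above amounts to sliding-window slicing of the hourly field stack. The sketch below is a minimal illustration with random stand-in data; the sample stride and exact windowing used in the paper are assumptions here, not taken from the manuscript:

```python
import numpy as np

def make_samples(fields, n_in=12, n_out=12, stride=1):
    """Slice an hourly (T, H, W) field stack into (input, target) pairs:
    n_in consecutive hours in, the following n_out hours as the target."""
    X, Y = [], []
    for t in range(0, fields.shape[0] - n_in - n_out + 1, stride):
        X.append(fields[t:t + n_in])                  # (12, H, W) input tensor
        Y.append(fields[t + n_in:t + n_in + n_out])   # (12, H, W) target tensor
    return np.stack(X), np.stack(Y)

# 48 hours of wind speed over a 64x64 grid (random stand-in data).
fields = np.random.default_rng(0).random((48, 64, 64))
X, Y = make_samples(fields)
print(X.shape, Y.shape)  # (25, 12, 64, 64) (25, 12, 64, 64)
```

Each input tensor would then be passed through the encoder-decoder ConvLSTM to produce the corresponding 12x64x64 forecast.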
Minor comments:
- Noted.
- Thank you for the comment. Indeed, this was missing from the literature review, as the focus on probabilistic forecasting was added after the introduction had already been drafted. We have updated the introduction accordingly and added the following paragraph:
"Furthermore, a lot of work has been done in recent years on probabilistic weather forecasting and many postprocessing methods have been proposed to improve probabilistic forecasts. Postprocessing is typically applied to ensemble weather or energy forecasts and attempts to correct biases exhibited by the system and improve overall performance (see e.g. Phipps et al. (2022)), but has been explored to a lesser degree in the context of extreme event prediction. One approach to postprocess ensemble forecasts for extreme events is to utilise extreme-value theory, a review of which can be found in Friederichs et al. (2018). The authors propose separately postprocessing toward the tail distribution and formulate a postprocessing approach for the spatial prediction of wind gusts. Other authors have explored the potential of ML in this context. Ji et al. (2022), for example, investigate two DL-based postprocessing approaches for ensemble precipitation forecasts and compare these against the censored and shifted gamma distribution-based ensemble model output statistics (CSG EMOS) method. The authors report significant improvements of the DL-based approaches over the CSG EMOS and the raw ensemble, particularly for extreme precipitation events. Ashkboos et al. (2022) introduce a 10-ensemble dataset of several atmospheric variables for ML-based postprocessing purposes and compare a set of baselines in their ability to correct forecasts, including extreme events. Alessandrini et al. (2019), on the other hand, demonstrate improved predictions on the right tail of the forecast distribution of analog ensemble (AnEn) wind speed forecasts using a novel bias-correction method based on linear regression analysis, while Williams et al. (2014) show that flexible bias-correction schemes can be incorporated into standard postprocessing methods, yielding considerable improvements in skill when forecasting extreme events."
- The review of data-driven weather forecasting models will be removed from the introduction as we have come to see that this is, indeed, not the focus of the manuscript.
- For this study, five different pressure levels, including the diagnostic 10 m wind fields, were initially investigated. With the currently implemented hub heights of wind turbines in Austria of 100 to 135 m a.g.l., the 1000 hPa fields are more appropriate (corresponding to ca. 100-130 m in Eastern Austria, the main wind energy region). Furthermore, reanalysis products typically interpolate and output data at pressure levels rather than height levels, which also motivates this choice. We have modified the methodology accordingly.
- Thank you for pointing this out. We have decided to repeat the experiments utilising a Yeo-Johnson power transform (Yeo and Johnson, 2000) before the zero-mean, unit-variance normalisation in order to make the distributions more Gaussian-like.
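As a rough illustration of the revised pipeline, the sketch below applies the Yeo-Johnson power transform for a fixed λ followed by the zero-mean, unit-variance normalisation. In practice λ would be fitted by maximum likelihood (e.g. via scipy.stats.yeojohnson or scikit-learn's PowerTransformer), typically per grid point; the fixed λ and the gamma-distributed stand-in data here are assumptions for illustration only:

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform (Yeo and Johnson, 2000) for a fixed lambda."""
    y = np.empty_like(x, dtype=float)
    pos, neg = x >= 0, x < 0
    if lmbda != 0:
        y[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        y[pos] = np.log1p(x[pos])
    if lmbda != 2:
        y[neg] = -(((-x[neg] + 1.0) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda))
    else:
        y[neg] = -np.log1p(-x[neg])
    return y

def standardise(x):
    """Zero-mean, unit-variance normalisation, applied after the power transform."""
    return (x - x.mean()) / x.std()

# Right-skewed stand-in sample, made more Gaussian-like before standardisation.
x = np.random.default_rng(1).gamma(2.0, 2.0, size=10_000)
z = standardise(yeo_johnson(x, lmbda=0.5))
```

Note that λ = 1 leaves the data unchanged, which provides a quick sanity check of the transform.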
- Noted. We will include another, linear, weighting method in the comparison and will include the SERA loss with three different sets of control-points in order to provide a more complete comparison.
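To make the contrast between the weighting flavours concrete, here is a toy sketch of how inverse and linear weights might enter a weighted MAE. The exact weight functions of Eq. (3) in the paper may differ; the two forms below are illustrative assumptions only:

```python
import numpy as np

def weighted_mae(pred, obs, weights):
    """MAE with per-grid-point weights; uniform weights recover the plain MAE."""
    return np.sum(weights * np.abs(pred - obs)) / np.sum(weights)

def inverse_weights(freq, eps=1e-3):
    """Inverse-frequency weighting: rare target values (low estimated
    frequency) receive large weights; eps avoids division by zero."""
    return 1.0 / (freq + eps)

def linear_weights(relevance):
    """Linear weighting: the weight grows linearly with a relevance
    score in [0, 1], so the most extreme values count double."""
    return 1.0 + relevance
```

With uniform weights the weighted MAE reduces to the ordinary MAE, which gives a quick sanity check when swapping one weighting scheme for another.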
- Thank you - this was indeed overlooked. The final results will be calibrated as recommended in Ferro and Stephenson (2011).
- In principle it would be possible to incorporate the SEDI or another loss function into the model; this, however, was out of scope for this work. Machine learning methods, as well as statistical methods, tend to smooth the forecasts and underestimate especially the tails. The idea was to implement a loss function which is able to account for that, sharpen the forecasts, and get the intensities in the right order. This is essential not only for wind energy applications (planning of feed-in, curtailment, etc.) but also for e.g. tourism (winter sports), transportation, and forestry. We added the following to the discussion:
"Another possible extension of this work would be implementing either the SEDI or the FSS as a loss function (see e.g. Lagerquist and Ebert-Uphoff, 2022) or even combining the ConvLSTM with a so-called physics-aware loss function (see e.g. Schweri et al., 2021; Cuomo et al., 2022)."
- Rather than averaged, the results are aggregated over all lead times. This will be clarified.
- Noted.
- Full fields. We will make sure to clarify this.
- Because Fig. A1 is a comparison of the continuous-valued wind speed fields, it requires a continuous score such as the RMSE. The SEDI is a categorical score that can only be used with discrete data considering events and non-events, i.e. after applying a threshold to the continuous wind speed fields. We will make sure to clarify this.
Citation: https://doi.org/10.5194/egusphere-2022-599-AC1
RC2: 'Comment on egusphere-2022-599', Anonymous Referee #2, 07 Aug 2022
This paper compares weighting methods for addressing the imbalance problem posed by the prediction of rare events in atmospheric data. The weighting methods are reasonable and the computational experiments are appropriate. However, the paper does not go into sufficient detail to give insights into whether one weighting scheme is intrinsically more suited than another. In addition, probability-based schemes are presumably subject to uncertainties which are not appropriately discussed.
Major comments
The comparison of WMAE and WMSE (Method 1) with SERA (Method 2) does not seem apples-to-apples. Method 1 still examines all the data points, while Method 2 discards datapoints that are not of interest. In this sense, it seems unfair to compare these two techniques. This issue is not a fatal flaw of the manuscript but should be mentioned and used to qualify the statements that compare both methods.
If balancing data points is important, then errors in the balancing scheme may also be critical. In particular, since the probabilities of rare events are more difficult to estimate than those of high-probability events (higher relative error), how do these errors affect the conclusions of the manuscript? Ideally, these errors should be addressed in the training experiments. If that is not possible, the manuscript should explicitly discuss this caveat.
The manuscript focuses on describing the results of the experiments rather than explaining the different behaviors observed with the different losses. Could the authors provide insights into why the SERA loss performs better or worse? These explanations would help the readers generalize the present findings. At the moment, it is unclear what would happen if the relevance function were different. What is the effect of the SERA threshold? What if the SERA loss included all the datapoints rather than only a few? At the least, these details should be discussed; at best, additional experiments would be useful.
The authors go to great lengths to explain which DL architectures are suited for extreme event prediction, while this is not the main focus of the manuscript. It would be preferable to expand on the loss balancing mechanisms for extreme event prediction. Here are a few possible references to discuss.
For capturing the tails of the PDF, log transformation has been proposed ("Sequential sampling strategy for extreme event statistics in nonlinear dynamical systems"), as has rebalancing the dataset itself, which is useful for classification tasks ("A study of the behavior of several methods for balancing machine learning training data") and regression tasks ("Uniform-in-Phase-Space Data Selection with Iterative Normalizing Flows").
It is unclear how the ensemble is constructed. Did the authors average the predictions? Are all models identically weighted?
The title is not descriptive enough and may not help many readers. The title should explicitly reflect that the prediction of extreme events is addressed by balancing the loss.
Minor comments
P1 L5: "It has become […] challenges". This sentence belongs in the introduction, not the abstract, and could use a citation.
P3 L54: No need for quotes around Weibull
P3 L80: “Given to the model” is colloquial. Please use something like “are processed by the model”
P3 L84: Remove “excellent”
L 251: citation missing for Pytorch
Sec. 2.3: "Validation" would be better; "verification" may carry a different meaning in computational science.
Rephrase the paragraph around L265 and possibly try to be more quantitative. What does a "messy" forecast mean?
Table 2: what happens if the architectures become even more complex?
Since the SERA loss depends on a threshold, please indicate what threshold was used as a subscript of `SERA` when results are displayed.
Figure A1 is actually very informative and could be used in the main text. It would be useful if the authors could also show the integral time scale on that graph.
The discussion around frequency bias could benefit from an equation that showcases how the frequency bias is calculated.
L 446: Did you mean “successor” instead of “predecessor”?
Citation: https://doi.org/10.5194/egusphere-2022-599-RC2
AC2: 'Reply on RC2', Daan Scheepens, 28 Sep 2022
Dear Reviewer,
Thank you for your review and suggestions for improving the clarity of the manuscript. We proceed to answer your questions to the best of our ability. The changes and clarifications will be added to the final manuscript.
Major comments:
- There seems to be a slight misunderstanding here. The integral in Eq. 5 goes over thresholds t in [0,1], where t = 0 takes into account all datapoints; at a threshold t = a, only those points with relevance >= a are included. At increasingly high thresholds, increasingly many points are indeed discarded from the computation, but they are not absent from the final integral. We will make sure, however, to clarify this subtlety in the methodology. We will also make sure to clarify the differences between the SERA and the re-weighting of the MAE or MSE, but do wish to highlight their common goal, which is to increase the importance of the tails in the loss function. For this reason we do not see the comparison of the two as in any way unsound or unfair. Indeed, we would argue that the fact that these two methods attempt to achieve the same goal by different means is exactly what makes the comparison of interest in the first place.
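The threshold integral described above can be approximated numerically. The sketch below is a minimal illustration of the SERA idea, not the paper's exact implementation (the threshold grid, the relevance values, and the trapezoidal integration scheme are assumptions):

```python
import numpy as np

def sera(pred, obs, relevance, n_steps=101):
    """Approximate SERA: integrate, over thresholds t in [0, 1], the squared
    error restricted to points whose relevance phi(y) >= t.  At t = 0 all
    points contribute; higher thresholds progressively drop low-relevance
    points, but those points still enter the integral via lower thresholds."""
    ts = np.linspace(0.0, 1.0, n_steps)
    sq_err = (pred - obs) ** 2
    ser = np.array([sq_err[relevance >= t].sum() for t in ts])
    dt = ts[1] - ts[0]
    # Trapezoidal rule over the threshold grid.
    return float(np.sum((ser[:-1] + ser[1:]) * dt / 2.0))
```

When every point has relevance 1, all points survive every threshold and SERA reduces to the total squared error, which makes the "nothing is absent from the final integral" point explicit.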
- We are not sure whether we fully understand this point, as the model that we investigate outputs deterministic predictions, not probabilities. We do, however, acknowledge that model errors tend to increase for the distributional tails due to larger absolute values. Having said that, for a prediction to be correct the SEDI requires only that a prediction-observation pair both surpass their respective thresholds t_p and t_o, regardless of by how much t_p and t_o were in fact exceeded. In terms of discrete extreme event prediction, the continuous errors are thus irrelevant. We have decided, however, to include in the results a continuous score (such as the RMSE) between the continuous prediction and observation fields, including its variation between the same set of thresholds used for the SEDI, to show how the continuous errors differ between the different models, i.e. loss functions.
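As a small illustration of this property, the sketch below computes the SEDI from thresholded fields via the hit rate H and false alarm rate F, following the formula of Ferro and Stephenson (2011); the toy thresholds and data in any usage are, of course, assumptions:

```python
import numpy as np

def sedi(pred, obs, t_p, t_o):
    """Symmetric extremal dependence index from thresholded fields.  A hit
    only requires both prediction and observation to exceed their respective
    thresholds; the margin of exceedance plays no role."""
    p = pred >= t_p
    o = obs >= t_o
    hits = np.sum(p & o)
    misses = np.sum(~p & o)
    false_alarms = np.sum(p & ~o)
    correct_negs = np.sum(~p & ~o)
    H = hits / (hits + misses)                        # hit rate
    F = false_alarms / (false_alarms + correct_negs)  # false alarm rate
    num = np.log(F) - np.log(H) - np.log(1 - F) + np.log(1 - H)
    den = np.log(F) + np.log(H) + np.log(1 - F) + np.log(1 - H)
    return num / den
```

The score lies in [-1, 1], with positive values indicating skill; it is undefined for degenerate contingency tables (H or F equal to 0 or 1), which is one reason calibration matters for its interpretation.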
- Noted. We have decided to include another, linear, weighting method in addition to the inverse weighting, and will provide results of the SERA loss with its lower control-point set to either the 90th, 75th or 50th percentile while keeping its upper control-point fixed at the 99th percentile. We will also go to greater lengths to compare and contrast the results of the different loss functions in order to provide a more complete picture.
- Noted. The literature review on DL methods will be removed from the introduction, as, indeed, this is not the main focus of the manuscript. Instead, additions will be made to the review of extreme event predictions and another paragraph will be added in which the aim and the motivation of the paper are more clearly stated.
- The ensemble was constructed by averaging the individual predictions, equally weighted. We will make sure to clarify this.
- Noted. We propose changing the title as follows: "Adapting a deep convolutional RNN model with imbalanced regression loss for spatio-temporal forecasting of wind speed extremes in the short-to-medium range".
Minor comments will all be incorporated.
Citation: https://doi.org/10.5194/egusphere-2022-599-AC2