Adapting a deep convolutional RNN model with imbalanced regression loss for improved spatio-temporal forecasting of extreme wind speed events in the short-to-medium range

Abstract. The number of wind farms and the wind power production in Europe, both on- and off-shore, have increased rapidly in the past years. To ensure grid stability, on-time (re)scheduling of maintenance tasks and to mitigate fees in energy trading, accurate predictions of wind speed and wind power are needed. Furthermore, accurate predictions of extreme wind events are of high importance to wind farm operators, as timely knowledge of these can both prevent damages and offer economic preparedness. This work explores the possibility of adapting a deep convolutional recurrent neural network (RNN) based regression model for the spatio-temporal prediction of extreme wind speed events in the short-to-medium range (12 hour lead-time in one hour intervals) through the manipulation of the loss function. To this end, a multi-layered convolutional long short-term memory (ConvLSTM) network is adapted with a variety of imbalanced regression loss functions that have been proposed in the literature: inversely weighted, linearly weighted and squared error-relevance area (SERA) loss. Forecast performance is investigated for various intensity thresholds of extreme events and a comparison is made with the commonly used mean squared error (MSE) and mean absolute error (MAE) loss. The results indicate that the inverse weighting method most effectively shifts the forecast distribution towards the extreme tail, thereby increasing the number of forecasted events in the extreme ranges, considerably boosting the hit rate and reducing the root mean squared error (RMSE) in those ranges. The results also show, however, that such improvements are invariably accompanied by a trade-off, primarily in terms of increased overcasting and false alarm ratios, which increase both with lead-time and intensity threshold.

1 Introduction

Global warming demands ever more urgently that electricity generation be shifted away from fossil fuels and towards renewable energy sources. Although global demand for fossil fuels is not yet showing signs of decreasing, renewables are on the rise.
In 2021, more than half of the growth in global electricity supply was provided by renewables, while the share of renewables in global electricity generation reached close to 30 %, having steadily risen over the past decades (IEA, 2021). Possessing the largest market share among the renewables, wind energy has managed to establish itself as a mature, reliable and efficient technology for electricity production and is expected to maintain rapid growth in the coming years (Fyrippis et al., 2010; Huang et al., 2015). Thanks to continued advancements in on- and offshore wind energy technology and the associated continued reduction in costs, wind power capacity could grow from having met 1.8 % of global electricity demand in 2009 to meeting roughly 20 % of demand in 2030 (Darwish and Al-Dabbagh, 2020). Indeed, many countries have already demonstrated that hybrid electric systems with large contributions of wind energy can operate reliably. For example, as early as 2010, Denmark, Portugal, Spain and Ireland managed to supply between 10 and 20 % of annual electricity demand with wind energy (Wiser et al., 2011) and the numbers have only risen since.
One of the main challenges to the deployment of wind energy, however, is its inherent variability and lower level of predictability compared with other types of power plants (Lei et al., 2009; Chen and Yu, 2014; Li et al., 2018). Hybrid electric systems that incorporate a substantial amount of wind power therefore require some degree of flexibility from other generators in the system in order to maintain the right supply/demand balance and thus ensure grid stability (Wiser et al., 2011).
Failing to manage this variability leads to scheduling errors which impact grid reliability and market-based ancillary service costs (Kavasseri and Seetharaman, 2009), while potentially causing energy transportation issues in the distribution network (Salcedo-Sanz et al., 2009) and increased risks of power cuts. This is where wind speed forecasting can play a significant role. Incorporating high-quality wind speed forecasts, and, in turn, wind power forecasts, into electric system operations gives the system more time to prepare for large fluctuations and can thereby help mitigate the aforementioned issues (Wiser et al., 2011). The variability in the short-range, particularly over the time scale of one to six hours, is found to pose the most significant operational challenges (Wiser et al., 2011; Li et al., 2018). The development of accurate wind speed forecasts in the short-range has thus become increasingly important.
Short-term wind speed prediction is not just a key element in the successful management of hybrid electric power systems. Alessandrini et al. (2019), on the other hand, demonstrate improved predictions on the right tail of the forecast distribution of analog ensemble (AnEn) wind speed forecasts using a novel bias-correction method based on linear regression analysis, while Williams et al. (2014) show that flexible bias-correction schemes can be incorporated into standard postprocessing methods, yielding considerable improvements in skill when forecasting extreme events.
As the data-driven forecasting model, this paper investigates an adaptation of a deep convolutional LSTM (ConvLSTM) regression model, as proposed by Shi et al. (2015) for precipitation nowcasting in the range of 0-6 hours. The capability of deep ANNs to automatically and effectively learn hierarchical feature representations from raw input data has made DL models particularly attractive in the area of spatio-temporal sequence forecasting, where complex spatial and temporal correlations are typically present in the data (Shi and Yeung, 2018; Amato et al., 2020; Wang et al., 2020). The ConvLSTM is an example of a ConvRNN model, which forms a synthesis of a convolutional neural network (CNN) and a recurrent neural network (RNN).
CNNs are a class of feedforward artificial neural networks, used primarily for data mining tasks involving spatial data, and have gained a lot of attention in the areas of computer vision and natural language processing (Ghosh et al., 2020), while RNNs are known for their powerful ability to effectively model temporal dependencies (Shi et al., 2015). By utilising the strengths of the CNN to capture spatial correlations and the RNN to capture temporal correlations in the data, ConvRNN models have demonstrated very promising forecasting ability in the spatio-temporal setting, outperforming both non-recurrent convolutional models, as well as non-convolutional LSTM models (Shi et al., 2015, 2017). As a multi-layered ConvRNN model, the deep ConvLSTM thus has the potential to effectively model the complex dynamics of the spatio-temporal wind speed forecasting problem.

In this paper an adaptation of a deep ConvLSTM regression model is applied to the task of extreme wind speed prediction.
The model is adapted with different types of imbalanced regression loss and their efficacy in improving predictions on the tails of the local wind speed distributions at each coordinate is compared. As such, this paper attempts to shed light on how the loss function of a deep learning model may be best adapted to improve forecasting performance on the distributional tails. Such improvement has practical relevance to wind energy applications where obtaining accurate predictions of extreme events is of high value. The focus lies on the 1000 hPa pressure level, which typically varies between 100 and 130 m above ground level, corresponding to the most common hub heights in the eastern, flat part of Austria (the main wind energy region). Results for the surface wind speed and other pressure levels are not shown in this study. The U and V components of the horizontal wind velocity (in m s⁻¹) were taken at 1000 hPa from the ERA5 hourly data on pressure levels from 1979 to present to calculate the scalar wind speed.
The data was collected with a temporal resolution of one hour between 01 January 1979 and 01 January 2021 (42 years) on a spatial grid over central Europe. Of these data, the last two years between 2019-2021 were held out as a test set. The eight years between 2011-2019 were used for training and validation in the first part of the experiment, using 4-fold cross validation (with six years training and two years validation data) to determine the optimal model architecture for each of the investigated loss functions. In the second part of the experiment the optimal model architectures were then trained and validated on the entire 40 years of data between 1979-2019, using the eight years between 2011-2019 as validation.
This region was selected for its geographical variation, as it includes both land and sea regions as well as flat and mountainous areas, while the region is furthermore divided into different climatic regions, such as the Pannonian climate region in Eastern Austria and the Alpine climate region covering the Austrian Alpine range. The interplay between these features can result in highly complex wind dynamics, which is where the application of deep learning is expected to be particularly promising. Moreover, the fine spatial resolution of 0.25° is expected to be critical to capturing the complex fine-scale dynamics of a variable like low-level wind, and thus to improving forecasting ability. The resolution also marks an important step forward for data-driven models to be truly competitive with state-of-the-art numerical weather prediction models, which are run at ≈ 0.1° resolution (Pathak et al., 2022).
A visualisation of the data is provided in Fig. 1. The figure shows on the right an example time slice and on the left wind speed time series of three arbitrary locations over the duration of one month, including the climatological means at these locations. Evidently, the local climatological means (and, by extension, the local wind speed distributions) vary substantially throughout the region, with striking differences in magnitude between the off-shore and on-shore regions. To highlight these spatial differences, Fig. 2 shows the maximum, mean and standard deviation of the wind speed over the region, which reveal a sharp division of the statistics along the underlying coastlines of the region. Indeed, extreme winds (e.g. larger than 25 m s⁻¹) seem to occur almost exclusively off-shore. If there were, in fact, stronger winds present over this region of mainland Europe between 1979 and 2021, then they have not been captured by the hourly ERA5 reanalysis.
Thus, rather than defining extreme winds in terms of their absolute severity, extreme winds are here defined in terms of their relative rarity at each coordinate. This definition focuses the forecasting problem on the tails of the respective distributions at each coordinate, which ensures that the forecasting of extremes is conducted over the entire region, rather than only locally over some particularly dominant area. By selecting a distributional percentile (e.g. the 99th percentile), extreme winds are then defined as those wind speeds surpassing the percentile threshold of the wind speed sample distribution at the respective coordinate, i.e. wind speeds that are, indeed, rare at that coordinate (although not necessarily severe or hazardous in an absolute sense). For the remainder of this paper, the term 'pth percentile threshold' always refers to the pth percentile at each coordinate of the target observation field. The above approach allows us to investigate forecasting improvements of extreme events more generally by looking at improvements on the tails of the respective distributions (in terms of percentiles), regardless of the absolute values of the tails. Any improvements on the tails that result from the loss function modifications investigated in this paper can be swiftly translated to other cases where the tails of the distributions denote actual hazardous events.
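To make the per-coordinate definition concrete, the following minimal numpy sketch estimates local percentile thresholds from a training sample and flags exceedances. The function and variable names are our own illustrations, not from the paper's code.

```python
import numpy as np

def local_thresholds(train_wind, p=99.0):
    """Per-coordinate percentile thresholds from the training sample.

    train_wind: array of shape (time, ny, nx) of wind speeds.
    Returns an (ny, nx) field of local p-th percentile thresholds.
    """
    return np.percentile(train_wind, p, axis=0)

def extreme_mask(wind, thresholds):
    """Binary field marking where wind exceeds the local threshold."""
    return wind > thresholds  # broadcasts over the leading time axis

# Toy example: synthetic data with a systematically windier 'off-shore' half
rng = np.random.default_rng(0)
w = rng.gamma(shape=2.0, scale=3.0, size=(1000, 8, 8))
w[:, :, :4] *= 2.0                      # left half is windier
thr = local_thresholds(w, p=99.0)
events = extreme_mask(w, thr)

# By construction, roughly 1 % of samples at every coordinate are 'extreme',
# regardless of the local wind climate
print(events.mean(axis=0).round(2))
```

Because the threshold is local, the event frequency is (approximately) uniform over the grid, which is exactly what shifts the problem from absolute severity to relative rarity.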
Finally, the data were preprocessed at each coordinate using a Yeo-Johnson power transform (Yeo and Johnson, 2000) to make the local wind speed distributions more Gaussian-like and were subsequently standardised locally using zero-mean, unit-variance normalisation. The optimal parameter for stabilising variance and minimising skewness in the power transform was estimated through maximum likelihood.
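As a sketch of this preprocessing step, the snippet below applies SciPy's Yeo-Johnson transform (`scipy.stats.yeojohnson`, which fits the λ parameter by maximum likelihood) followed by local standardisation. The helper names and loop structure are illustrative assumptions; the paper's actual implementation may differ.

```python
import numpy as np
from scipy.stats import yeojohnson

def preprocess_per_coordinate(wind):
    """Yeo-Johnson transform + zero-mean/unit-variance scaling per grid point.

    wind: (time, ny, nx). Returns the transformed data and the fitted
    (lambda, mean, std) parameters needed to transform new data.
    """
    t, ny, nx = wind.shape
    out = np.empty(wind.shape, dtype=float)
    params = np.empty((ny, nx, 3))
    for j in range(ny):
        for i in range(nx):
            z, lmbda = yeojohnson(wind[:, j, i])   # lambda fitted by MLE
            mu, sigma = z.mean(), z.std()
            out[:, j, i] = (z - mu) / sigma
            params[j, i] = (lmbda, mu, sigma)
    return out, params

rng = np.random.default_rng(1)
w = rng.gamma(2.0, 3.0, size=(500, 4, 4))   # right-skewed, like wind speed
z, params = preprocess_per_coordinate(w)
```

Storing the per-coordinate parameters is important: the same (λ, μ, σ) fitted on training data must be reused when transforming validation and test data.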

Model Description
The model implemented for the task of spatio-temporal forecasting of wind speed is an adaptation of the convolutional long short-term memory (ConvLSTM) network, as proposed by Shi et al. (2015) for precipitation nowcasting. The deep ConvLSTM model architecture is adopted with an encoding-forecasting network structure (common for spatio-temporal sequence forecasting), where both the encoding and forecasting networks consist of several stacked ConvLSTM layers.
As depicted in Fig. 3, the encoding ConvLSTM network compresses the input into a hidden state tensor and the forecasting ConvLSTM network unfolds this hidden state into the final prediction. The model is implemented as a multi-frame forecasting model, with 12 hour input and 12 hour prediction. This means that the model takes in tensors of size (12 × 64 × 64) as input, consisting of the previous 12 hours of wind speed over the (64 × 64) grid, which are then encoded all together through various hidden states of the encoding network and decoded through the decoding network into a subsequent 12-hour prediction tensor of size (12 × 64 × 64). The model was implemented and trained using Pytorch (Paszke et al., 2019), the code of which can be found under: https: These implementation and parameter choices were selected a priori based on the work of Shi et al. (2015) and Shi et al. (2017). Model performance may certainly be improved by performing a thorough hyper-parameter optimisation, but that is not the focus of this paper. The focus is set, instead, on the different loss functions proposed in the literature for spatio-temporal imbalanced regression using deep learning, comparing these in terms of their improvement in the prediction of spatio-temporal wind speed extremes using the ConvLSTM model.

Weighted Loss
In order to combat the effects of data imbalance on the imbalanced regression problem, the ConvLSTM model is adapted with two different types of loss functions that have been proposed for imbalanced regression problems. The first of these is the relatively simple weighted loss, which consists of assigning a weight w(y) to each target value y according to its rarity. As weighting functions, both an inverse weighting function and a simple linear weighting function are investigated.

The inverse weighting function computes the weights in proportion to the inverse of the data distribution for each target, as suggested by Yang et al. (2021). For a continuous target distribution, this typically implies discretising the distribution into intervals (see e.g. Shi et al., 2017), where all predictions within an interval are weighted by the same weight. Due to our definition of extreme events in terms of local percentile thresholds, we proceed to discretise the target distribution into intervals spanning the percentage of the distribution between percentile p and 100. For a set of increasing percentiles P = {pk}, all targets pk ≤ y < pk+1 are then weighted proportionally to the inverse of the percentage between pk and 100, i.e. w(y) ∝ 1/(100 − pk). We utilise a range of integer percentiles P = {pk | k ∈ [50, 99]} and normalise the weights such that the interval between percentiles 50 and 51 is given unit weight. As such, weights increase inversely from 1 up to a weight of 50 (given to target values y99 ≤ y ≤ y100). All values smaller than the 50th percentile (p50) are also given unit weight. This results in the weighting function shown in Eq. 2, which is also presented graphically in Fig. 4.
The linear weighting function is constructed analogously, as shown in Eq. 3. Target values y < p50 are similarly given unit weight, while weights for target values pk ≤ y < pk+1 increase linearly from 1 to 50 for percentiles k ∈ [50, 99].
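The two weighting schemes above can be sketched as follows for a single coordinate, with percentile edges estimated from a training sample. This is an illustrative numpy reconstruction of the weighting rules, not the authors' code; the weighted loss itself is shown as a weighted MSE.

```python
import numpy as np

def make_weight_fn(train_sample, kind="inverse"):
    """Build w(y) for one coordinate from its training sample.

    Targets below the local median get unit weight; targets in
    [p_k, p_{k+1}) with k = 50..99 get weight 50/(100 - k) ('inverse')
    or k - 49 ('linear'), so both schemes rise from 1 to 50.
    """
    ks = np.arange(50, 100)
    edges = np.percentile(train_sample, ks)        # y_50 .. y_99
    if kind == "inverse":
        w_per_bin = 50.0 / (100.0 - ks)            # 1, ..., 50
    else:                                          # linear
        w_per_bin = ks - 49.0                      # 1, ..., 50

    def w(y):
        # bin index; -1 means the target lies below the local median
        idx = np.searchsorted(edges, y, side="right") - 1
        return np.where(idx < 0, 1.0, w_per_bin[np.clip(idx, 0, 49)])

    return w

def weighted_mse(pred, target, w):
    """Weighted MSE: mean of w(y) * (yhat - y)^2."""
    return np.mean(w(target) * (pred - target) ** 2)

rng = np.random.default_rng(2)
sample = rng.gamma(2.0, 3.0, size=10000)
w_inv = make_weight_fn(sample, "inverse")
w_lin = make_weight_fn(sample, "linear")
```

Both schemes agree at the endpoints (weight 1 at the median band, weight 50 above p99) but the inverse scheme concentrates almost all of its extra weight on the last few percentiles, whereas the linear scheme spreads it evenly, which is consistent with the differing biases reported later in the paper.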

As a second approach to combating data imbalance, the squared error-relevance area (SERA) loss is investigated, as proposed by Ribeiro and Moniz (2020). The SERA loss is based on the concept of a relevance function ϕ : Y → [0, 1], which maps the target variable domain Y onto a [0, 1] scale of relevance. The relevance function ϕ is determined through a cubic Hermite polynomial interpolation of a set of 'control-points'. The set of control-points S = {⟨yk, ϕ(yk), ϕ′(yk)⟩}, k = 1, …, s, are user-defined points where the relevance may be specified, which are typically local minima or maxima of relevance and thus all have derivative ϕ′(yk) = 0 (Ribeiro and Moniz, 2020).
In this implementation the method is applied on a per-coordinate basis and the local 99th percentile (p99) at each coordinate is fixed as the point of maximum relevance, while the point of minimum relevance is varied between the 90th percentile (p90), the 75th percentile (p75) and the 50th percentile (p50), in order to get a better idea of how this choice affects forecasting performance. The interpolation in all cases is carried out according to Ribeiro and Moniz (2020). Defining Dt as the subset of data pairs for which the relevance of the target value is greater than or equal to a cut-off t, i.e. Dt = {⟨xi, yi⟩ ∈ D | ϕ(yi) ≥ t}, the squared error-relevance SERt of the model with respect to the cut-off t is computed as SERt = Σi∈Dt (ŷi − yi)², where ŷi and yi are the ith prediction and target values, respectively. The curve obtained by plotting SERt against t is monotonically decreasing (Ribeiro and Moniz, 2020) and provides an overview of how the magnitudes of the prediction errors change on subsets comprising varying degrees of relevant samples (t = 0 representing all samples and t = 1 representing only the most relevant samples). Finally, the squared error-relevance area (SERA) is defined as the area under the SERt curve: SERA = ∫₀¹ SERt dt. The smaller the area under the curve, the better the model. We note that assigning uniform relevance values to all data points recovers the MSE loss. We also note that, regardless of the choice of relevance function, the SERA loss utilises all given samples in its computation, as the integral starts at t = 0 and SERt=0 denotes all samples with relevance values greater than or equal to 0, i.e. all samples.
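A minimal numpy sketch of the SERA computation for a single coordinate might look as follows. The ramp-shaped relevance function (cubic Hermite segment with zero derivatives at both control-points, clipped outside) and the trapezoidal approximation of the integral are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def relevance_fn(y_min, y_max):
    """Cubic Hermite relevance ramp: phi = 0 at y_min, phi = 1 at y_max,
    zero derivative at both control-points, clipped outside [y_min, y_max]."""
    def phi(y):
        s = np.clip((np.asarray(y, float) - y_min) / (y_max - y_min), 0.0, 1.0)
        return 3 * s**2 - 2 * s**3   # Hermite basis with phi' = 0 at both ends
    return phi

def sera(pred, target, phi, n_grid=101):
    """Area under the SER_t curve, approximated with the trapezoidal rule."""
    rel = phi(target)
    sq_err = (pred - target) ** 2
    ts = np.linspace(0.0, 1.0, n_grid)
    ser = np.array([sq_err[rel >= t].sum() for t in ts])
    dt = ts[1] - ts[0]
    return float(np.sum((ser[:-1] + ser[1:]) * 0.5 * dt))

rng = np.random.default_rng(3)
y = rng.gamma(2.0, 3.0, size=2000)
# e.g. relevance ramp from the local p90 (minimum) up to the local p99 (maximum)
phi = relevance_fn(np.percentile(y, 90), np.percentile(y, 99))
yhat = y + rng.normal(0, 1, size=y.shape)
loss = sera(yhat, y, phi)
```

Note that `sera` reduces to (a multiple of) the sum of squared errors when the relevance is uniformly 1, mirroring the remark in the text that uniform relevance recovers the MSE-type loss.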

Forecast Evaluation
In order to evaluate the predictions of the ConvLSTM against observation, the model's hit rate (H = a/(a + c)), false alarm ratio (FAR = b/(a + b)), threat score (TS = a/(a + b + c)) and frequency bias (B = (a + b)/(a + c)) are investigated, where a denotes the number of hits, b the number of false alarms, c the number of missed hits and d the number of correct negatives obtained by the model.
The hit rate, false alarm ratio and threat score are routinely used by the UK Met Office to evaluate warnings (Hogan and Mason, 2012) and have also been used by Shi et al. (2015) to evaluate the ConvLSTM model for precipitation nowcasting, while the frequency bias provides valuable information on whether the model tends towards overforecasting or underforecasting.
These scores are computed for a set of intensity thresholds corresponding to the local 50th, 75th, 90th, 95th, 99th and 99.9th percentiles of the observed sample distributions at each coordinate, which are computed using the training set. In order to obtain an aggregated result over all forecasts made by a model, the elements in the 2 × 2 contingency table are aggregated over all forecasts and the scores are computed subsequently from the aggregated contingency table.
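The aggregation and scoring described above can be sketched as follows; this is a numpy illustration with hypothetical variable names, using strict exceedance of the local thresholds to binarise forecasts and observations.

```python
import numpy as np

def aggregate_contingency(fcst, obs, thresholds):
    """Aggregate 2x2 contingency counts over all forecasts.

    fcst, obs: (n, ny, nx) continuous fields; thresholds: (ny, nx) local
    intensity thresholds. Returns hits a, false alarms b, misses c and
    correct negatives d.
    """
    f = fcst > thresholds
    o = obs > thresholds
    a = np.sum(f & o)        # event forecast, event observed
    b = np.sum(f & ~o)       # event forecast, no event observed
    c = np.sum(~f & o)       # no event forecast, event observed
    d = np.sum(~f & ~o)      # no event forecast, no event observed
    return a, b, c, d

def scores(a, b, c, d):
    """Hit rate H, false alarm ratio FAR, threat score TS, frequency bias B."""
    return {
        "H": a / (a + c),
        "FAR": b / (a + b),
        "TS": a / (a + b + c),
        "B": (a + b) / (a + c),
    }

# Tiny worked example: one forecast over a 2x2 grid, unit thresholds
fcst = np.array([[[2.0, 2.0], [0.0, 2.0]]])
obs = np.array([[[2.0, 0.0], [0.0, 2.0]]])
thr = np.ones((2, 2))
a, b, c, d = aggregate_contingency(fcst, obs, thr)
s = scores(a, b, c, d)
```

Aggregating the counts first and computing the scores once, as done here, matches the paper's procedure and avoids the instability of averaging per-forecast ratios with near-zero denominators.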
Since the above categorical scores work on the basis of a forecast being correct as long as it surpasses the same threshold t as the observed event, they are able to give an indication of the frequency of errors while being unable to give an indication of the magnitude of the errors between forecast and observation. In order to include in the analysis a comparison of error magnitudes, the root-mean-square error (RMSE) between (continuous-valued) predictions and observations is utilised.

In the next section the results obtained from combining the multi-layered ConvLSTM network with the various loss functions are presented. The optimal number of layers for each model is determined from the minimum validation loss obtained by the networks, as averaged over the 4-fold cross validation process (conducted over the eight years of data between 2011-2019).
The optimal models are then re-trained using the entire 40 years of data between 1979-2019 (using the eight years between 2011-2019 as validation) and their results are compared on the held-out test set comprising the two years between 2019-2021.

Table 2. Minimum validation loss as obtained by the ConvLSTM network with the number of layers ranging from 2-5 (denoted in brackets) and trained with the various different loss functions. Values are presented as the mean ± one standard deviation from the 4-fold cross validation.

Validation-loss
The lowest minimum validation loss reached, and thus the optimal network architecture, is emphasised in boldface for each loss function.
Where multiple architectures obtained the same minimum validation loss, the simpler architecture is preferred.
Table 3. Comparison of the hit rate (H), false alarm ratio (FAR), threat score (TS) and frequency bias (B) of the ConvLSTM network trained with the various different loss functions. Scores are presented for wind forecasts f and observations o exceeding local intensity thresholds varying between the 50th (p50) and 99.9th (p99.9) percentiles, aggregated over lead-time. The optimal number of network layers used for each loss function is given in brackets after the name of the loss function. The persistence forecast is included in the table for reference. For each intensity threshold, the best scores are emphasised in boldface (where applicable).

Comparison over intensity thresholds
The networks were then retrained on the entire dataset with the corresponding optimal number of network layers (henceforth indicated in brackets after the name of the respective loss function with which the network was trained). Table 3 shows that the imbalanced regression losses generally result in significant increases in the hit rate as compared with the standard MAE or MSE loss, indicating that more of the true occurrences of the events were captured by the model. Any improvement in the hit rate is, however, accompanied by an increase in the false alarm ratio. This suggests that, in order to capture more of the events, the models are invariably producing more false alarms. This behaviour is particularly pronounced where there is substantial overcasting, i.e. a frequency bias substantially greater than 1: improvements in hit rate come at the cost of an increased false alarm ratio and an increased frequency bias.
Compared to the weighted loss functions, the SERA loss offers something of an extreme case, allowing hit rates to be boosted spectacularly but at a considerable loss of forecasting performance (as judged by reduced threat scores, increased overcasting and increased false alarms). The first control-point of the SERA loss does, however, offer a way to mitigate this behaviour.
For example, reducing this control-point from p90 to p75 and then to p50 (while keeping the second control-point fixed at p99) shows a striking reduction in frequency bias, false alarms and hit rates for intensity thresholds between p90-p99, while threat scores, and thus overall forecasting performance, generally improve. Similarly to Table 3, Table 4 shows the root-mean-square error (RMSE) obtained from the continuous forecasts and observations of the different models. Unlike Table 3, the RMSE is computed between all pairs of forecast and observation values (f, o) where the observation values lie between two percentiles p1 and p2, i.e. p1 ≤ o < p2. Again, the persistence forecast is included for reference.
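A conditional RMSE of this kind can be sketched as follows, with local percentile bands estimated from a training sample. The function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def conditional_rmse(fcst, obs, train_sample, p1, p2):
    """RMSE over pairs (f, o) whose observation lies in the local
    [p1-th, p2-th) percentile band, with thresholds estimated per
    coordinate from training data.

    fcst, obs: (n, ny, nx); train_sample: (m, ny, nx).
    """
    lo = np.percentile(train_sample, p1, axis=0)   # (ny, nx) lower edges
    hi = np.percentile(train_sample, p2, axis=0)   # (ny, nx) upper edges
    mask = (obs >= lo) & (obs < hi)                # condition on the observation
    return float(np.sqrt(np.mean((fcst[mask] - obs[mask]) ** 2)))

# Toy check: a forecast with a constant +1 bias has conditional RMSE 1
rng = np.random.default_rng(5)
train = rng.gamma(2.0, 3.0, size=(1000, 4, 4))
obs = rng.gamma(2.0, 3.0, size=(100, 4, 4))
fcst = obs + 1.0
err = conditional_rmse(fcst, obs, train, 50, 99)
```

Conditioning on the observation (rather than the forecast) is what makes the score comparable across models: every model is evaluated on the same subset of observed events in each percentile band.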

The table shows that the imbalanced regression losses tend to result in lower RMSE scores than the standard MAE and MSE loss for increasingly rare observation values. It is interesting to note that, while the SERA trained models generally appear to result in heavy overcasting and highly inflated false alarm rates (Table 3), the RMSE scores suggest that increasing the first control-point of the SERA loss shifts the domain of minimal RMSE towards the higher percentiles.
Between the inverse and linear weighting methods, the RMSE scores echo the interpretations from Table 3, with the linear method appearing more adept (lower RMSE) around the central percentiles and the inverse method more adept towards the higher percentiles.

Temporal assessment
The performance of the models is investigated further in Fig. 6. The label 'persistence' refers to the persistence forecast.
Although the inverse weighting, the linear weighting and the SERA loss each provide a different way to balance forecasting performance towards improved hit rates, they achieve this aim with varying success, with the highest hit rates between p95-p99.9 achieved by the SERA losses, followed by the inversely weighted losses and lastly the linearly weighted losses. While all imbalanced regression loss functions appear to shift more predictions towards the right tail of the target distribution (cf. Table 3), they evidently conserve the shape of the target distribution to varying degrees, with the SERA loss and the linear weighting resulting in rather large distortions and heavy overcasting on the right tail. In fact, the SERA loss shifts predictions towards the right tail of the target distribution with such severity that this results in an additional peak on the right side of the forecast distribution, the peak evidently shifted further towards the right tail as the primary control-point varies from p50 to p90.

Permutation tests
In this section, some more insight is given into the predictions made by the ConvLSTM network by discussing feature importance. In order to determine the importance of each of the 12 input frames that are used by the ConvLSTM to make its prediction, a permutation test was carried out on the input data. For each input frame at time T (T = −11 to 0), all input frames from the test set were randomly shuffled (full fields) at time T, essentially nullifying the information flow from this input frame. Then the model predictions from these permuted inputs were obtained and a skill score S (in %) was computed between the RMSE of the original prediction and target (RMSE_org) and the permuted prediction and target (RMSE_perm), i.e. S = (1 − RMSE_org/RMSE_perm) · 100. A score of 0 % indicates no change in RMSE, a score of 100 % indicates maximum increase in RMSE and negative scores indicate a decrease in RMSE due to the permuted inputs. Not only does this offer insight into the importance that each input frame carries in the ultimate prediction, but it also helps to ensure that the model is, in fact, basing its predictions on the information flow between consecutive input frames rather than simply resorting to forecasting climatology. Figure 8 presents the RMSE skill scores as aggregated over the test set, obtained by the different models. The figure shows that scores for all models get particularly large as the permuted input frame T approaches 0 hours. This shows clearly that the input frames closest to the start of the prediction window tend to bear more importance in the predictions (with the exception of a slight drop (for the MAE loss) or stagnation (for the MSE loss) at T = −2 and T = −1). The imbalanced regression losses, however, show an additional jump in RMSE skill scores peaking around ca. T = −8 and T = −7, whereafter scores fall considerably before gradually climbing again to peak at T = 0.
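The permutation test can be sketched as below; the `model` callable, the frame indexing and the toy persistence model are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def permutation_skill(model, inputs, targets, frame, rng):
    """RMSE skill score (%) for shuffling one input frame across the test set.

    inputs: (n, 12, ny, nx); frame indexes the lead-in hour (0 = T-11,
    ..., 11 = T0). 'model' maps inputs to predictions shaped like targets.
    """
    def rmse(a, b):
        return np.sqrt(np.mean((a - b) ** 2))

    rmse_org = rmse(model(inputs), targets)
    permuted = inputs.copy()
    # shuffle the full fields of this frame across samples
    permuted[:, frame] = permuted[rng.permutation(len(inputs)), frame]
    rmse_perm = rmse(model(permuted), targets)
    return (1.0 - rmse_org / rmse_perm) * 100.0  # 0 %: no change; > 0: frame mattered

# Toy model that only uses the last input frame (a persistence-like model)
rng = np.random.default_rng(4)
x = rng.normal(size=(200, 12, 4, 4))
y = x[:, -1] + rng.normal(0, 0.1, size=(200, 4, 4))
model = lambda inp: inp[:, -1]

s_last = permutation_skill(model, x, y, frame=11, rng=rng)
s_first = permutation_skill(model, x, y, frame=0, rng=rng)
```

For this toy model, shuffling an unused frame leaves the RMSE unchanged (score exactly 0 %), while shuffling the frame the model actually relies on inflates the RMSE and yields a large positive score, which is the behaviour the paper reads off Fig. 8.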
Interestingly, with these earlier frames in the inputs bearing more importance in the predictions, it appears that the models trained with the various imbalanced regression losses have learned to utilise more of the long-term information flow in the inputs to improve the forecasting of the extremes.

Forecast examples
Finally, Fig. 9, Fig. 10 and Fig. 11 present visualisations of three selected example forecasts made by the different ConvLSTM models investigated in this paper, which serve to highlight their respective strengths and weaknesses. In each figure, the first row from the top displays the 12 input frames, the second row displays the succeeding 12 target frames and the following rows display the 12 predicted frames of the various models. T refers to the index of the frame (in hours), with T = 0 denoting the last input frame and T = +12 denoting the final target and prediction frames. Rather than showing the raw predictions, the predictions are categorised into binary events using local percentile intensity thresholds. In this fashion, the figures show precisely where the different types of events were predicted and where not.

Figure 9. An example forecast from the ConvLSTM network trained with the various different loss functions. The first row from the top displays the 12 input frames, the second row the succeeding 12 target frames and the following rows the 12 predicted frames of the models. T refers to the index of the frame (in hours), with T = 0 denoting the last input frame and T = +12 denoting the final target and prediction frames. Rather than showing the raw predictions, the predictions are categorised into binary events using percentile intensity thresholds.

Figure 10. As Fig. 9.
All three examples show a target observation of an intensification of extreme winds, each resulting in a patch of 99th percentile events from ca. T = +8 onward. In each case, the standard MAE and MSE loss either forecast the intensification to some degree but largely fail to capture the 99th percentile events (Fig. 9 and Fig. 11) or they fail to forecast the event completely (Fig. 10). In comparison, the inversely weighted losses (W-MAE_inv and W-MSE_inv) show a much improved ability to forecast the right intensification and the right degree of extreme events, with the W-MAE_inv performing clearly better in Fig. 11 than the W-MSE_inv. From the forecasts of the linearly weighted losses, the heavy frequency bias on lower percentile events (seen in Table 3, Fig. 6 and Fig. 7) can be easily distinguished, although some 95th and 99th percentile events are captured. Between the SERA_p90, SERA_p75 and SERA_p50 models, the examples clearly reflect the heavily inflated frequency bias towards the higher percentiles, with bias increasing more towards the 99th percentile as the primary control-point varies from p50 to p90 (in line with the behaviour discussed in Fig. 7).

Figure 11. As Fig. 9.

Discussion
The results presented in this paper indicate that the multi-layered convolutional long short-term memory (ConvLSTM) network can be adapted to the task of spatio-temporal forecasting of extreme wind events through the manipulation of the loss function. Interestingly, it appears that networks trained with imbalanced regression loss may be utilising more information flow from long-term dynamics than the baseline models trained with the standard MAE and MSE loss.
The results indicate that hit rates and RMSE scores can be greatly improved for extreme events up until the 99th percentile threshold, whereafter hit rates drop considerably and cease to surpass persistence scores. Table 3 and Fig. 6 demonstrate clearly, however, that improvements in hit rate are accompanied by proportionate increases in frequency bias and false alarm ratios. When this trade-off is particularly extreme, as in the case of the SERA loss with the here-investigated control-points, not only do threat scores begin to suffer considerably, but the model also loses its viability as a reliable forecasting model, with massively inflated false alarm ratios. Lowering the primary control-point from the 90th percentile (p90) to the 50th percentile (p50), however, limits this behaviour for extreme events between the 90-99th percentiles (see Fig. 6).
The linear weighting method, instead, shows minimal improvement in hit rate over the standard MAE and MSE at intensity thresholds above p90, as it increases the forecasting bias mostly closer to the median rather than the tails (see Fig. 7), which means that it does not appear to put enough relative weight on the extreme tails. It should be noted, however, that the linear weighting method was tested here with only one slope, and other slopes may yield better results. Shi et al. (2017), for example, utilised a linear weighting method for precipitation nowcasting using a trajectory gated recurrent unit (TrajGRU) network and reported improved performance at higher rain-rate thresholds as compared with the standard MSE and MAE (based on the threat score and the Heidke skill score).
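Both weighting schemes amount to multiplying each sample's error by a weight derived from the target value. A minimal sketch of a linearly and an inversely weighted MAE on scalar targets; the slope and the histogram-based frequency estimate are illustrative choices, not the paper's exact settings:

```python
def weighted_mae(y_true, y_pred, weight_fn):
    """MAE with a per-sample weight computed from the target value."""
    n = len(y_true)
    return sum(weight_fn(y) * abs(y - p) for y, p in zip(y_true, y_pred)) / n

def linear_weight(y, slope=1.0):
    """Linear weighting: weight grows linearly with the (standardised) target."""
    return 1.0 + slope * max(y, 0.0)

def inverse_weight_fn(train_targets, n_bins=10):
    """Inverse weighting: weight each target by the reciprocal of its
    empirical frequency, estimated from a coarse histogram."""
    lo, hi = min(train_targets), max(train_targets)
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for y in train_targets:
        counts[min(int((y - lo) / width), n_bins - 1)] += 1
    def weight(y):
        b = min(max(int((y - lo) / width), 0), n_bins - 1)
        # normalised so a uniform distribution would give weight 1 everywhere
        return len(train_targets) / (counts[b] * n_bins) if counts[b] else 1.0
    return weight
```

With inverse weighting, rare high-wind targets fall into sparsely populated histogram bins and therefore receive large weights, which is consistent with the shift of the forecast distribution towards the extreme tail observed above.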
Between the three types of imbalanced regression loss investigated in this work, the inverse weighting method appears to strike the best balance between improved hit rate on the one hand and increased frequency bias and false alarm ratio on the other. Not only does the inverse weighting method sample the target distribution more accurately (see Fig. 7), but frequency bias and false alarm ratios are substantially less inflated than those of the SERA losses over all percentile thresholds between p75 and p99, while hit rates nevertheless remain substantially improved over the baselines.

This discussion will finish by mentioning a number of possible extensions of this work. One disadvantage of utilising the entirety of the available data in this work is that many of the input-target samples containing extreme winds are samples where extreme winds are present in both the input and the target. Examples where no extremes are present in the input but the target shows onsets of extremes are disproportionately rare in the data, although they clearly represent a more interesting problem for early-warning systems. Improving our model as an early-warning system for onsets of extreme winds may thus be achieved by focusing model-learning on precisely such training samples, rather than employing all available samples. To this end, it could be worthwhile to change the model into a nowcasting model, which would entail reducing the lead-time of the model to below 6 hours while increasing the temporal and spatial resolutions of the data, possibly by utilising more precise ground data than raster data from satellites, as recommended by Amato et al. (2020).
This work may, furthermore, be extended by taking a multi-variate approach to wind speed forecasting, whereby other atmospheric variables are included in the input of the model, an approach that is already being pursued in the community (see, e.g., Racah et al., 2017; Marndi et al., 2020; Xie et al., 2021). Marndi et al. (2020) suggest the utilisation of temperature, humidity and pressure, as Cadenas et al. (2016) have found these to be significantly more important than other atmospheric variables for the task of wind forecasting. Xie et al. (2021) use these same three variables, as well as the 1-hour minimum and maximum temperature, while Racah et al. (2017) use a much larger set of 16 atmospheric variables, albeit for the classification of large-scale extreme weather events and not for the regression of wind speed. It may also be worthwhile to consider other atmospheric variables such as the convective available potential energy (CAPE) and the deep-layer wind shear (DLS), due to their strong correlation with severe convective storm activity such as the occurrence of thunderstorms and supercells (see, e.g., Rädler et al., 2015; Tsonevsky et al., 2018; Chavas and Dawson II, 2021). Another possible extension would be to implement categorical scores directly into the loss function (see, e.g., Lagerquist and Ebert-Uphoff, 2022), or even to combine the ConvLSTM with a so-called physics-aware loss function (see, e.g., Schweri et al., 2021; Cuomo et al., 2022).
Finally, it should be noted that while the ConvLSTM has proven itself to be highly effective at modelling complex spatio-temporal patterns, other models have since been proposed as promising improvements to the ConvLSTM for the task of spatio-temporal sequence forecasting. Most notably, the PredRNN and its successor PredRNN++, proposed by Wang et al. (2017) and Wang et al. (2018), respectively, have been demonstrated to be superior to the ConvLSTM for the task of video frame prediction by maintaining a global memory state rather than constraining memory states to each ConvLSTM module individually. Other alternative approaches include the usage of functional neural networks (FNNs) (see Rao et al., 2020) or generative adversarial networks (GANs) (see Gao et al., 2020). Such models may be of interest to the meteorological community pursuing data-driven, spatio-temporal forecasting.

Conclusions
In this paper, a deep learning approach to the task of spatio-temporal prediction of wind speed extremes was explored and, in particular, the role of the loss function was investigated. To this end, a multi-layered convolutional long short-term memory (ConvLSTM) network was adapted to the task of spatio-temporal imbalanced regression by training the model with a number of different imbalanced regression loss functions proposed in the literature: inversely weighted loss, linearly weighted loss and squared error-relevance area (SERA) loss. The models were trained and tested on reanalysis wind speed data from the European Centre for Medium-Range Weather Forecasts (ECMWF) at 1000 hPa, providing multi-frame forecasts of horizontal near-surface wind speed over Europe with a 12 hour lead-time in one hour intervals, using the preceding 12 hours as input.
By standardising the data based on the local wind speed distributions at each coordinate, the definition of an extreme event was focused on its relative rarity rather than its absolute severity, with extreme winds thus considered in terms of their local distributional percentile.
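The per-coordinate standardisation described above can be sketched as follows; representing the grid as a dict of per-cell time series and using a nearest-rank quantile rule are illustrative simplifications, not the paper's exact implementation:

```python
import math
from statistics import mean, pstdev

def standardise_local(series_by_cell):
    """Standardise each grid cell's wind-speed series by that cell's own
    mean and standard deviation, so 'extreme' means locally rare rather
    than absolutely severe."""
    out = {}
    for cell, series in series_by_cell.items():
        mu, sigma = mean(series), pstdev(series)
        out[cell] = [(v - mu) / sigma for v in series]
    return out

def local_percentile_threshold(series, q):
    """Empirical q-quantile of one cell's series (nearest-rank rule);
    values above this threshold count as local extremes at level q."""
    s = sorted(series)
    idx = max(math.ceil(q * len(s)) - 1, 0)
    return s[min(idx, len(s) - 1)]
```

Under this convention, a moderate wind speed at a typically calm coordinate can qualify as a 99th percentile event, while the same speed at a windy coastal cell would not.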
The model forecasts were analysed and compared with a variety of scores, over various lead-times and intensity thresholds.
After determining the optimal number of network layers for the ConvLSTM trained with the various loss functions, an extensive comparison was made between the different loss functions and two baseline models trained with either mean absolute error (MAE) or mean squared error (MSE) loss. The results show that the imbalanced regression loss functions investigated in this paper can be used to substantially improve hit rates and RMSE scores over the baseline models, albeit at the cost of increased frequency bias and false alarm ratios. The SERA loss provides an extreme case of this behaviour, typically at the additional cost of reductions in threat score, although the results depend heavily on the loss function's so-called control-points. The linear weighting method shows some ability to boost hit rates while keeping frequency bias and false alarm ratio comparatively low, although the utility of the method is lost for extreme events beyond the 90th percentile intensity threshold, with predictions heavily biased towards the median of the distributions rather than the right tail. Inverse weighting is concluded to strike the best trade-off between improved hit rates and sustained threat scores versus increased frequency bias and false alarm ratio, across various thresholds of extreme events up to the 99th percentile intensity threshold, with the weighted MAE loss scoring slightly better than the weighted MSE loss. The inverse weighting method, furthermore, results in a better sampling of the target distribution as compared with the linear weighting or the SERA loss. Out of the different imbalanced regression loss functions investigated in this work, the inverse weighting loss is thus concluded to be the most effective at adapting the ConvLSTM to the task of imbalanced spatio-temporal regression and its application to the forecasting of extreme wind events in the short-to-medium range.
With these results, it is hoped that this work provides a valuable contribution to the area of deep learning for spatio-temporal imbalanced regression and its application to wind energy forecasting research.

Code and data availability. The current version of the model is available at the project repository on GitHub: https://github.com/dscheepens/Deep-RNN-for-extreme-wind-speed-prediction under the MIT license. The exact version of the model used to produce the results in this paper is archived on Zenodo (DOI: 10.5281/zenodo.7369015), as are scripts to run the model and produce the plots for all the simulations presented in this paper. The data used in this paper can be downloaded from the Copernicus Climate Change Service Climate Data Store (CDS) of the ECMWF (see Hersbach et al., 2018), where the reanalysis data of the U and V components of the horizontal wind velocity were taken at 1000 hPa from the 'ERA5 hourly data on pressure levels from 1979 to present' dataset, between the years 1979-2021 (42 years) and between 40-56°N and 3-19°E. Scalar wind speed was obtained by computing the square root of the sum of the squares of the two wind velocity components. Scripts to generate the data as such are available in the project repository.
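The scalar wind speed derivation mentioned above is simply the Euclidean norm of the two velocity components; as a one-line sketch:

```python
import math

def wind_speed(u, v):
    """Scalar horizontal wind speed from the U and V velocity components."""
    return math.hypot(u, v)  # equivalent to sqrt(u**2 + v**2)
```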
to further research on renewable energy and meteorologically induced extreme events. Special thanks goes to Markus Dabernig and Aitor Atencia (Zentralanstalt für Meteorologie und Geodynamik, Vienna) who provided invaluable and extensive feedback to the paper at its final stages.