HIDRA2: deep-learning ensemble sea level and storm tide forecasting in the presence of seiches – the case of Northern Adriatic

. We propose a new deep-learning architecture HIDRA2 for sea level and storm tide modeling, which is extremely fast to train and apply, and outperforms both our previous network design HIDRA1 and two state-of-the-art numerical ocean models (a NEMO engine with sea level data assimilation and a SCHISM ocean modeling system), over all sea level bins and all forecast lead times. The architecture of HIDRA2 employs novel atmospheric, tidal and SSH feature encoders, as well as a novel feature fusion and SSH regression block. HIDRA2 was trained on surface wind and pressure fields from a single member of 5 ECMWF atmospheric ensemble and on Koper tide gauge observations. An extensive ablation study was performed to estimate individual importances of input encoders and data streams. Compared to HIDRA1, the overall mean absolute forecast error is reduced by 13 % , while on storm events it is lower by even a larger margin of 25 % . Consistent superior performance over HIDRA1 as well as over general circulation models is observed in both tails of the sea level distribution: low tail forecasting is relevant for marine traffic scheduling to ports of northern Adriatic while high tail accuracy helps coastal flood response. 10 Power spectrum analysis indicates that HIDRA2 most accurately represents the energy density peak centered on the ground state sea surface eigenmode (seiche) and comes close second to SCHISM in the energy band of the first excited eigenmode. To assign model errors to specific frequency bands covering diurnal and semi-diurnal tides and two lowest basin seiches, spectral decomposition of sea levels during several historic storms is performed. HIDRA2 accurately predicts amplitudes and temporal phases of the Adriatic basin seiches, which

The problem of sea level forecasting on the Northern Adriatic Shelf (see Fig. 1 for the shelf location and depth) is two-fold: 25 (i) high sea levels lead to severe flooding of densely populated coastal towns, while (ii) low sea levels may effectively inhibit large marine cargo due to very shallow depths (often below 15 meters) of marine waterways on the shelf and especially in the Gulf of Trieste. Reliable forecasting of both tails, high and low, of sea level distribution is therefore imperative for services like civil safety and cargo scheduling activities in local ports.
In this paper we will adhere to the terminology, proposed in (Gregory et al., 2019): (i) the term sea level will denote total 40 time-varying local water depth at the tide gauge in Koper, (ii) the term sea surface height is the height of sea level above (or below) the reference ellipsoid, (iii) the term storm surge denotes the elevation or depression of the sea surface with respect to the predicted tide during a storm, and (iv) the term storm tide is the sea surface height, elevated during a storm by a storm surge.
The key difficulty of sea level forecasting in the Adriatic basin arises from high sensitivity of total sea level to the phase lag 45 between the gravitationally generated tides (independent from meteorological forcing) and meteorologically generated basin seiches (independent from gravitational forcing). This sensitivity can translate reliable atmospheric forecasts with very limited errors in timing and trajectory of the cyclone into substantial errors in the sea level forecast. Probabilistic ensemble forecasting of sea level envelopes with error variance estimation (Žust et al., 2021;Ferrarin et al., 2020;Bernier and Thompson, 2015;Mel and Lionello, 2014) was therefore explored to tackle this drawback. However, ensemble sea level forecasting is numerically 50 expensive, requires specialized expensive hardware and introduces delays in prediction. To avoid the high numeric cost of ensemble sea level forecasting, computationally efficient machine-learning-based ensemble models have recently been explored (Žust et al., 2021). While these models require a substantial amount of training data to learn the complex relations for reliable predictions, the inference is numerically cheap. For example, single-point Koper sea level predictions from the neural network HIDRA1 ensemble (Žust et al., 2021) are a million times faster than the full-basin operational NEMO ocean (Madec, 2016) at 55 Slovenian Environment Agency. It is true that HIDRA1 computes prediction for a single variable in a single point, while ocean models compute 4-dimensional evolution of a broad set of oceanic properties but in the operational environment. Faster model prediction times however come with immediate benefits for downstream warning issuing and civil rescue operations.
Machine learning has thus been explored by several research groups for single-point sea level forecasting. The early approaches (Imani et al., 2018) were based on classic machine learning models such as support vector machines (Vapnik, 1999) 60 with radial basis function kernels. In their work, Pashova and Popova (2011) and Karimi et al. (2013) utilized shallow fully connected neural networks, but due to simplistic network architectures that did not utilize the numerical atmospheric forecast, they could only report the desired accuracy for short temporal horizons. Ishida et al. (2020) used long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks together with several atmospheric variables to improve one-hour prediction into the future but did not expand the prediction horizon. Braakmann-Folgmann et al. (2017) predicted further in time version, HIDRA2 presents a novel architecture with new atmospheric, tidal and SSH feature encoders, as well as a novel feature fusion and SSH regression block. An additional conceptual novelty is that HIDRA2 predicts the full SSH rather than the residual (i.e., the difference between SSH and astronomic tide) as is the case for HIDRA1. The new model extracts relevant information from different spatial locations in the atmosphere signal and predicts the SSH with a three-days horizon at an unprecedented accuracy, outperforming HIDRA1 as well as two state-of-the-art ocean models.

80
The paper is organized as follows. Section 2 introduces sea level and atmospheric model data and performance measures used in our analysis. Section 3 details the new HIDRA2 architecture and the numerical ocean model setup, used as the performance baseline. Section 4 reports the analysis of the HIDRA2 architecture (including an extensive ablation study) and provides detailed quantitative as well as qualitative comparisons with the state-of-the-art numerical ocean models. Conclusions and outlook are drawn in Sect. 5.  (Pérez Gómez et al., 2022). Observations are obtained in 10-minute intervals, using both a 90 float-type sensor and an additional radar sea level sensor, and they undergo subsequent quality control at ARSO (Pérez Gómez et al., 2022). Hourly data points are extracted to get the signal used in HIDRA2 training and evaluation.
The tidal signal in the sea level is independent of atmospheric processes and can be computed by tidal analysis and prediction models. The tidal contribution to Koper SSH considered in this study is estimated from hourly instantaneous SSH values in one-year segments using the UTIDE Tidal Analysis package for Python (Codiga, 2011). The total sea level series is then 95 decomposed into a tidal and a residual part, where we define the sea level residual as the arithmetic difference between the total sea level and the tidal sea level. According to the ARSO operational protocol, the SSH is classified as a flood if it is higher than 300 cm. Floods thus constitute 0.41 % of all training data.

Atmospheric training data
Atmospheric input used for HIDRA2 training was retrieved from the European Centre for Medium-Range Weather Forecasts 100 (ECMWF) Ensemble Prediction System (Leutbecher and Palmer, 2007). The ECMWF ensemble forecasts come as a global atmospheric ensemble of 50 members with a 0.125 • arc degree spatial resolution and a 3-hour temporal resolution. The training dataset used in this study consists of (i) 10-meter winds and (ii) mean sea level air pressure from a single fixed (42nd) atmospheric ensemble member during the period 2006-2018. Number 42 was chosen randomly to the extent that it is a tribute to the ultimate answer from the Hitchhiker's Guide to the Galaxy (Adams, 1979). Of course, over multi-year time intervals, 105 Figure 2. HIDRA2 input domain and dataset. The leftmost panel depicts ECMWF grid (white dots) and Koper tide gauge location (red circle). Three panels on the right depict snapshots of ECMWF atmospheric fields used during training. this member is completely statistically equivalent to random use of any other member of the ECMWF ensemble prediction system. In other words, we could use any other ensemble member -or choose a different random member each run -without substantially affecting the results. All ECMWF input fields were standardized and cropped to the Adriatic basin, represented by a 57 × 73 spatial grid (see Fig. 2). The forecasts were linearly interpolated to hourly timesteps to match the SSH temporal resolution. To simplify the training protocol, a single atmospheric sequence is constructed by concatenating the first 24 hours 110 of daily consecutive ECMWF forecasts into the final atmospheric sequence used in training. HIDRA2 does not require explicit annotation of whether a location point belongs to land or sea, thus land masks are not generated.

Evaluation data
The evaluation input dataset for both HIDRAs and NEMO is disjoint from the training dataset (years 2006-2018) and consists of ECMWF atmospheric predictions and Koper sea levels between 01 June 2019 and 31 December 2020. This period was 115 chosen due to challenging conditions and unusually high incidence of floods. We use the ECMWF daily predictions, each containing 50 ensemble members with three-days prediction lead time. The data are standardized and the dimensionality of the atmospheric data is reduced in the same fashion as described in the Sect. 2.2, except that for inference, the full (i.e. containing all ensemble members) ECMWF three-day forecast is presented to the model. The floods represent 1.1 % of the test dataset.

120
Standard measures, i.e., the mean absolute error (MAE), the root mean square error (RMSE) and the model bias are used to evaluate prediction performance. To reflect the practical suitability, we additionally calculate the prediction accuracy as a ratio between the predictions which are within 10 cm from the ground truth and all predictions. This 10 cm threshold reflects an acceptable deviation from the ground truth and was determined through discussion with the operational forecasting service at the prediction performance at these critical rare events.
To further probe the flood event prediction capabilities, we make use of the standard performance measures from detection literature: precision Pr , recall Re and the F1 measure F1 . Firstly, we need to define the flood event and then define the notion of the event being detected. Both of these have been defined in discussion with operational forecasters at ARSO. The anchor (i.e., temporal point) of a flood event is defined as the time of the local maximum in an SSH sequence above 300 cm. If the 130 predicted flood event anchor is within a 3 h margin (before or after) from the nearest ground truth flood event anchor, it is considered a true positive TP , otherwise it is a false positive FP . A flood event in the ground truth is considered a false negative FN if there is no matching flood event anchor in the predicted SSH. Like in the accuracy definition, the tolerance of 10 cm is applied, meaning that predictions below 300 cm are also considered as TP when they appear within the margin of 10 cm, and that false positives with ground truth within 10 cm are ignored. The precision and recall are then calculated as while the F1 measure that summarizes the detection performance, i.e., is defined as the harmonic mean between precision and recall.

HIDRA2
The proposed HIDRA2 is the second generation of a deep neural model for sea surface height prediction, with HIDRA1 (Žust et al., 2021) being the first. The new architecture is shown in Fig. 3. The input data is encoded by three encoders: the wind and pressure sequences for the past 72 h are processed and merged by the Atmospheric encoder (Sect. 3.1.1), the tidal signal for the future 72 h is encoded by the Tidal encoder, and the sea surface height measurements coupled with the tide for the past The Atmospheric encoder is composed of two stages. In the first stage (shown in Fig. 4), the sequences of the wind and pressure images are independently processed by their respective encoding blocks, which use the same architecture. The wind 155 image sequence of 96 h (the past 24 h and future 72 h) is divided into 24 groups of four consecutive hours, which are processed independently. The spatial and temporal dimension of each group is reduced by a learnable 2D convolutional layer with a 3 × 3 kernel, stride 2 and 64 output channels 1 . A ReLU activation and Dropout layers are applied, followed by a convolutional layer with 512 4 × 5 kernels, which are by size equal to the input, meaning that convolution is essentially a dot product between each kernel and the input. The operation yields a higher value if kernel is similar to the input, so we refer to it as a prototype 160 matching layer. It extracts features from different spatial positions, thus producing a 512-dimensional feature vector per group, i.e., 24 temporal vectors of size 512. The same processing architecture is applied to the pressure image sequence to produce 24 vectors of size 512. The two outputs are then concatenated to form a mixed set of 24 wind-pressure features of size 1024.
The second stage of the Atmospheric encoder ( Fig. 5) extracts the temporal atmospheric features by considering the consecutive wind-pressure features extracted by the first stage. A 1D convolutional layer with the kernel temporal dimension size 165 5 and with 256 output channels 2 is applied, entangling the information from temporal segments equivalent to 20 h in length.
Note that because we are using a convolutional layer instead of the fully connected layer, the number of learnable parameters of the entire Atmospheric encoder is independent of the forecast horizon. Each of the obtained 20 features 3 is then independently Figure 4. The first stage of the Atmospheric encoder. The input are four consecutive hours with two wind channels (this case), or pressure.
3 × 3 convolution is applied, followed by a prototype matching layer, outputting a single vector of size 512. Note that 24 independent passes are performed in parallel for the entire atmospheric input sequence. The variables k and n denote the kernel size and the number of output channels, respectively. processed by a network containing two blocks of residual connections, each involving 1D convolution with kernel temporal dimension size 1 (i.e., 1 × 256 kernels), a SELU activation (Klambauer et al., 2017) and a dropout layer. Finally, each of the 170 obtained twenty 256-dimensional features are convolved by 32 1 × 256 kernels to reduce their dimensionality to 20 × 32.

Tidal and SSH encoders
Both the tidal and SSH encoders use the same architecture, the only difference is in the size of the encoders' input. Figure 6 depicts the SSH encoder, which takes as the input the past 72 h SSH measurements and the tide concatenated into a 72 × 2 input tensor and processes them in a similar fashion to the second stage of the Atmospheric encoder: the input is processed 175 by convolution with 256 3 × 2 kernels, which is followed by two consecutive residual blocks, a subsampling max-pooling layer and a final convolutional layer to reduce the dimensionality of features to 17 × 16. The Tidal encoder follows the same architecture, the only difference is that the input is the tide forecast for the next 72 h. Figure 6. The SSH encoder encodes a concatenation of the past SSH and tide by a 1D convolution, followed by two blocks with residual connections, max-pooling temporal reduction and convolution-based feature reduction. The variables k and n denote the kernel size and the number of output channels, respectively.

Fusion-regression block
Atmospheric, Tide and SSH encoders produce temporal features of different importances and sizes. To account for that, the

The network training
HIDRA2 is trained end-to-end using mean squared error (MSE) loss between the predictions and the ground truth. We train the model using the AdamW optimizer (Loshchilov and Hutter, 2017) with standard parameter values (learning rate 1e−4, and the running average damping parameters β 1 = 0.9 and β 2 = 0.999), and apply the cosine annealing (Loshchilov and Hutter,195 2016) learning schedule to gradually reduce the learning rate during training to 1e−5. The training batch size is set to 512 data samples, and the model is trained for 40 epochs. Training takes approximately 1.5 hours on a single computer with NVIDIA Geforce RTX 2080 TI graphics card, while the inference of a single 72 h prediction for one member of the atmospheric ensemble takes only 4 ms. While there are many differences between HIDRA2 and HIDRA1, we summarize only the major conceptual ones for a clearer exposition of the contributions. HIDRA1 uses wind, pressure and 2 m temperature from ECMWF predictions, while our preliminary study showed that the new HIDRA2 architecture does not benefit from the temperature, thus only wind and pressure are considered. HIDRA1 concatenates all atmospheric inputs at a timestep and encodes them by Resnet (He et al., 2016) blocks.
While Resnet excels in computer vision tasks that rely on high-level semantic feature abstraction, we argue that tailored shal-205 lower encoders are more appropriate for the extraction of meaningful atmospheric patterns. HIDRA2 thus separately encodes the wind and pressure by shallow encoders, which apply spatial pattern features extraction, and then mixes the features from the two atmospheric variables by extracting temporal patterns. While this allows HIDRA2 to extract multiple spatial patterns in the data, only a single set of spatial weights is used to fuse the atmospheric features at a given time-step in HIDRA1, consequently reducing its expressive power. HIDRA1 first averages four-hour atmospheric input data to temporally subsample the input, 210 while HIDRA2 considers per-hour inputs and learns the appropriate spatio-temporal subsampling to maximize its predictive power. Another advantage of HIDRA2 is that it encodes the SSH input and mixes it with the atmospheric features early in the network to create a domain context feature vector before the final regression, while HIDRA1 considers only the atmospheric data for the context vector. Finally, HIDRA1 predicts the SSH residual (i.e., the difference between SSH and the astronomic tide), while HIDRA2 directly predicts the full SSH.

Ocean models
In this section we briefly describe two different numerical ocean modeling setups used for benchmarking HIDRA2. The two setups differ in several important respects. One is based on NEMO ocean engine (Madec, 2016) and the other on SCHISM (Zhang et al., 2016) modeling environment. NEMO setup is described in more detail in Sect. 3.2.1 and it is a forecasting setup.
SCHISM setup is described in Sect. 3.2.2 and it is a reanalysis setup (Toomey et al., 2022). For brevity we will refer to the two

NEMO ocean model
The Copernicus Marine Environment Monitoring Service (CMEMS) product MEDSEA_ANALYSISFORECAST_PHY_006-_013 (see Clementi et al. (2021)), was used as one of two numerical baselines for HIDRA2. This product is based on a Mediterranean basin configuration of the NEMO ocean model (Madec, 2016), and provides daily ocean forecasts for sea surface height 225 above the geoid, temperature, salinity, circulation and mixed layer depth. The model domain spans the entire Mediterranean basin with a (1/24) • resolution and has 141 unevenly spaced vertical levels. The model solutions are operationally constrained to near-real-time observations using a 3D variational assimilation scheme of temperature, salinity and along-track satellite sea level anomaly observations. The atmospheric forcing to the CMEMS model is provided by the ECMWF. Further details about the modeling setup can be found in (Clementi et al., 2021). In this study, an SSH timeseries at the closest point to Koper tide 230 gauge was extracted from the Mediterranean ocean model forecast.

SCHISM ocean model
A barotropic setup of SCHISM storm surge and wind-wave modeling environment (Toomey et al., 2022) was used as a second numerical baseline for HIDRA2. In this study, a single SSH timeseries from SCHISM reanalysis (Toomey et al., 2022) at the closest point to Koper tide gauge was extracted and used for comparisons to HIDRA models. SCHISM runs on an unstructured 235 mesh covering the entire Mediterranean basin and extending into the Atlantic ocean in the west. Its lateral boundary is forced by an equilibrium inverted barometer ocean response to atmospheric pressure, while its surface forcing consists of ERA5 surface fields. SCHISM unstructured grid allows for very high coastal resolutions, reaching some 200 m close to the coast.

Ocean model offset adjustment
Both NEMO and SCHISM sea levels, denoted here jointly as y model , at any given location reflect departures from the local 240 geoid and hence do not represent the absolute local depth of the water. The latter is furthermore also driven by low-frequency processes on the scales of many weeks or months which are often difficult to capture for regional basin models on synoptic timescales. Prior to benchmarking, model results therefore have to be offset-adjusted to obtain total sea levels (required by port authorities and civil rescue) as follows. A time-averaged model (NEMO or SCHISM) SSH offset ϵ n on the n-th hour of the forecast day is defined as where y kp (t) is the observed Koper sea level.
Each day, the value of ϵ 12 is subtracted from y model (t) to ensure a zero bias for the first 12 h of the model day. Note that, despite this adjustment, the complete 72-hour modeled time-series may still exhibit a non-zero bias. Similar offset adjustment is not required for HIDRAs, since they predict the full SSH and learn to appropriately adjust for the offset automatically. for predicting floods, while incurring only a small drop in the overall performance.
A possible explanation of this somewhat surprising behavior could perhaps be related to nonlinear interactions between tides and storm surges: both tides and storm surges modify local water depth which impacts their own barotropic wave propagation speeds and topographic amplifications, which ultimately define the onset time and the amplitude of any coastal flood in Koper.
Such interactions are non-existent during calm conditions but they do play a role during stormy periods (Ferrarin et al., 2022).

270
Perhaps HIDRA2 is able to anticipate certain aspects of nonlinear tide-surge couplings. This explanation is also consistent with the fact, detailed in Section 4.1.2, that among all atmospherically driven models the de-tided version HIDRA2 res shows the worst performance during storm tide events, while versions incorporating tides come closest to HIDRA2 (see Fig. 8).

Ablation study: the importance of encoders and input data
An ablation study was executed to evaluate the importance of individual encoders and input data types. To estimate encoder served as an independent validation set. Note that regardless of the encoder input, HIDRA2 always receives unencoded SSH data directly into its fusion-regression block (bottom dataflow branch in Fig. 3).

280
The following encoder ablations were performed: 1. Removal of the Atmospheric encoder (HIDRA2 \atmE ). Network HIDRA2 \atmE obtained no atmospheric input data but it did receive SSH and tidal data.
2. Removal of the Tidal encoder (HIDRA2 \tidE ). Network HIDRA2 \tidE obtained no tidal input data for tidal encoding but it did receive SSH and tidal data through the SSH encoder and atmospheric data through the Atmospheric encoder.

285
3. Removal of the SSH encoder (HIDRA2 \sshE ). Network HIDRA2 \sshE received atmospheric and tidal data through Atmospheric and Tidal encoders, but it did not receive any SSH input via the SSH encoder.
The results in the Table 1 show that MAE increases with each modification, particularly during storm events. Removal of the Atmospheric encoder results in the most significant performance drop, indicating that the atmospheric features convey by far the most relevant predictive information. A significant performance drop is observed as well when removing the Tidal encoder.

290
The SSH encoder has the smallest impact on overall performance, yet still importantly contributes to the prediction accuracy during storms.
Two further ablations were then performed regarding the data types of the sea level input data (the SSH and the tide, see Fig. 3) which are considered in the SSH encoder. We retained HIDRA2 with all three of its encoders but provided the SSH encoder with limited sea level input: 295 1. Removal of the tidal input to the SSH encoder (HIDRA2 \tidI ). In this case the SSH encoder received as input only total sea level.
2. Removal of the SSH input (HIDRA2 \sshI ). In this case the SSH encoder received as input only tidal sea level. The results in Table 1 show that the removal of each leads to a consistent but moderate increase of the errors overall.
However, the errors increase substantially during storms, indicating the importance of using both types of inputs. 300 We observe a similar situation when removing the atmospheric and SSH/tide feature re-calibration in the Fusion-regression block (HIDRA2 \norm ). Results in Table 1 indicate that feature normalization does not affect performance in normal conditions, but it substantially contributes to the prediction accuracy of storm tides. A closer inspection of HIDRA2 \norm showed that the scale of the tidal features is four times larger than the scale of the atmospheric features. Inclusion of the re-calibration blocks, however, remedies this by making the scales of all features (atmospheric, SSH and tidal) approximately the same.  Table 1) are the best, Fig. 8 indicates that for low sea levels, HIDRA2 res exhibits slightly lower errors. HIDRA2, however, performs substantially better in the flooding regime above 300 cm. This further substantiates our final choice of the HIDRA2 architecture.

310
HIDRA2 is compared with HIDRA1 (Žust et al., 2021), which is currently the state-of-the-art in machine-learning SSH prediction (Sonnewald et al., 2021), and with state-of-the-art numerical ocean modeling setups NEMO (Madec, 2016) and SCHISM (Toomey et al., 2022). The methods are evaluated on an independent time window (  We next analyzed how the prediction lead time affects the prediction errors. Figure 10 shows the MAE scores with respect to the prediction lead time for the values between 1 h and 72 h. The MAE of the prediction gradually increases with the lead time

Spectral analysis
To investigate the spectral properties of the modeled and observed SSH timeseries, we computed spectral densities of the HIDRA2, HIDRA1, NEMO and SCHISM predictions. Unless otherwise stated, all time-series analyzed in this section were obtained by concatenating (in time) the first 24 hours of each daily HIDRA2, HIDRA1 and NEMO three-day forecast. Spectral densities (shown in Fig. 11) were then computed as absolute values of a 1D Fast Fourier Transform of the respective series 350 over a fixed frequency domain of (1 h) −1 − (72 h) −1 . Figure 11 indicates that all methods adequately represent the tidal dynamics in Koper. The energy content around the two lowest basin eigenmodes is, however, more discriminatory: NEMO (Fig. 11, left panel) clearly underestimates the spectral density both around the ground state seiche (at 21.5 h period) and around the first excited state (10.9 h period). Similar behavior was noticed in our previous work with an independent configuration of NEMO (Žust et al., 2021). SCHISM, on the 355 Figure 10. MAE score of HIDRA, NEMO and SCHISM models with regard to prediction lead time (between 1 h and 72 h). Performance of all models was evaluated on a 01 June 2019-31 December 2020 dataset, which is completely independent from the training data. other hand, overestimates the energy in the ground state seiche band, but reproduces the first excited state energy very well ( Fig. 11, right panel). HIDRA1 underestimates the energy of this part of the signal as well, but nevertheless does a bit better by packing more energy density in these two bands. Predictions of HIDRA2 are clearly the closest to the observations in the ground state seiche band, but come close second to SCHISM around the 10.9 h period.
It appears that HIDRA2 is capable of generating a seiche-like behavior in its predictions. Spectral density, however, discards 360 the temporal component of the signal, and adequate spectral density in the (21.5 h) −1 and (10.9 h) −1 frequency bands says little about whether Adriatic seiches are generated by HIDRA2 at the appropriate times, namely during the storms. To inspect this aspect of HIDRA2 behavior, we now proceed to analyze the predictions during several historic storm tides.  Historic Adriatic storm tide events are used to qualitatively compare the HIDRA2 performance with the state-of-the-art.

365
Storm tides in question occurred during November and December 2019 and were of historic proportions by any criterion. The Slovenian coast was flooded over ten times in a single month and sea levels in Venice were among the highest ever observed.
Furthermore, the events in November 2019 turned out to be difficult to model due to the formation of a transient and very localized low pressure over the Gulf of Venice, which went unresolved in most models (Cavaleri et al., 2020). These events, along with those from December 2019, therefore represent a highly challenging benchmark for any atmospheric model and   None of the models successfully predicted the first and highest sea level peak on 12 November 2019, but HIDRA2, NEMO and SCHISM all give a better forecast than HIDRA1 whose mean sea level does not even surpass the flooding threshold. As noted in Cavaleri et al. (2020), this peak was difficult to forecast due to the delicate timing between the peak of winds and the peak 375 of the full moon tide, combined with the formation of an unresolved local pressure disturbance over the west coast of Northern Adriatic. Relative timing of these influences turned out to be a sine qua non for a successful prediction -neither the winds nor pressure were, in themselves, in any way extraordinary. It is further shown in Cavaleri et al. (2020) that this particular storm tide could have been up to 25 cm higher had this scenario evolved 12 hours earlier when tidal peaks were themselves higher.
The peak on 13 November is slightly better predicted by maximum members of both HIDRAs than by NEMO or SCHISM, 380 with HIDRA2 exhibiting a somewhat lower forecast spread than HIDRA1. Apart from this peak, all models captured the sea level variability quite well, which is in itself an implicit testament to the high skills of ECMWF atmospheric products.
Floods of December 2019 are another example of HIDRA2 superior performance over HIDRA1 and both ocean models in Koper. SSH observations and predictions in Koper during this period are depicted in Fig. 13  vations and exhibits a substantially lower forecasting spread than the HIDRA1 ensemble. Low forecasting spread is acceptable when in conjunction with a well-behaved ensemble mean. In this case, the HIDRA2 ensemble mean is in excellent agreement with the observations. The same could be said for HIDRA1, albeit to a lesser degree. NEMO, however, completely misses the first two peaks between 15 and 17 December, slightly underestimates (like HIDRA1) the highest peak on 23 December, and overall underestimates the minimum-maximum range of the sea level variations, corresponding to poorly predicted ebb levels 390 after 23 December. SCHISM predicts the first two peaks but underestimates the peaks after 23 December. The vertical sea level range is much better captured by both HIDRAs, especially by HIDRA2. This result is consistent with our demonstration that HIDRA2 exhibits the lowest error in both the high and the low tail of sea level distributions (Fig. 11).
To inspect the behavior of the ensemble forecast spread, three timeseries were created from daily (72 h long) forecasts during evaluation time window between 01 June 2019 and 31 December 2020. The first timeseries was constructed by concatenating 395 each first day (i.e., 1-24 h of forecast) from each of the daily forecasts, thus containing predictions with lead times of 1-24 h on each respective day in the evaluation time window. The second and the third timeseries were constructed by concatenating 25-48 h (49-72 h) of forecast on each respective day in the evaluation time window. All three timeseries for the December is growing with forecast lead time as well. As we draw closer to a particular flooding event, the forecast spread drops, indicating 400 an increased prediction certainty.

Spectral decomposition of forecasts during storms
To investigate the performance in geophysically relevant energy bands, we band-pass filtered the observed and the predicted SSH signals in energy bands, centered around four important periods: semi-diurnal tide (12 h period), diurnal tide (24 h period), fundamental basin along-axis eigenmode (21.5 h period) and first excited along-axis eigenmode (10.9 h period).

405
Although incomplete, this SSH decomposition allows qualitative estimation of the excitation intensity of the basin eigenmodes during a particular storm, and also helps to qualitatively assign forecasting errors to specific frequency bands. However, since the amplitudes of filtered signals in Fig. 15 directly depend on the filter bandwidths, they should not be interpreted as direct contributions to the sea level due to respective geophysical phenomena (i.e., two tidal signals, two eigenmodes). They should rather be read strictly as an additional insight into the model performance within a specific band with reference to 410 filtered observations in the same band.
We applied a fifth-order Butterworth band-pass filter with the sampling rate of (1 h) −1 . Low and high cutoff frequencies, which define the semi-diurnal filtering band ∆ω 12 , were set to ∆ω 12 = [(12. Identical analysis and related figures for the SCHISM model are available in the supplementary material to this paper. They illustrate that SCHISM exhibits very solid performance in the seiche energy bands.
All models exhibit an underestimation of the amplitude but are otherwise in phase with the observations in the ∆ω 12 band.

420
In ∆ω 24 , NEMO seems to be performing very well, with HIDRA2 slightly underestimating the range of the signal in this band.
In the band ∆ω 21.5 NEMO is again closest to filtered observations while both HIDRA models overpredict the vertical range of the observed signal. Band ∆ω 10.9 is underpredicted in all models, but seems best (or rather least poorly) resolved by HIDRAs, with NEMO additionally exhibiting a substantial phase shift in the signal.
In any case, since both tidal bands and the ground state seiche are reliably predicted by all models, the reason for the 425 forecasting errors must lie in the higher frequency bands with periods below 10.9 hours. This seems consistent with the occurrence of highly transient and localized low pressure over Venice mentioned in Cavaleri et al. (2020) and will be the subject of further research.
Similar remarks can be made regarding the December 2019 coastal flooding, depicted in Fig. 16. This event marked a suboptimal performance of NEMO, which is systematically underestimating SSH peaks and the overall vertical range of the SSH 430 variability during this time window (Fig. 16,   The second and third panels in Fig. 16 demonstrate that all models are reliable in the diurnal tidal band ∆ω 24 but that HIDRA2 overestimates the signal in ∆ω 12 . Since the overall performance of HIDRA2 is the best of all three models, it is 435 unclear whether overshoots in ∆ω 12 could be interpreted as compensations for the underestimations in the nearby ∆ω 10.9 band. The bottom two panels in Fig. 16, however, indicate that part of the modeling errors stem from their underestimation of the basin seiches. In the ∆ω 21.5 band, the HIDRA2 predictions most closely resemble the observations, followed by HIDRA1 and then NEMO (which is most severely underestimating this part of the signal). HIDRA2 is also the most reliable method in ∆ω 10.9 -but 440 it nevertheless systematically underestimates the observations. HIDRA1 and NEMO performances are significantly worse,  reaching one-half of the amplitude of HIDRA2 and one-third that of the observations. Poor performances of HIDRA1 and NEMO in ∆ω 21.5 and ∆ω 10.9 bands are simply another reflection of the fact depicted in Fig. 11, namely that both of these models struggle to generate an appropriate amount of energy in the bands around free oscillation eigenmodes.

445
This study presents a deep-learning based sea level model HIDRA2, suitable for operational sea level ensemble modeling due to its speed and accuracy. This work is a conceptual continuation of our previous attempt at sea level forecasting (Žust et al., 2021) and represents a substantial advancement over the first version (HIDRA1), setting a new state-of-the-art in machine learning SSH forecasting. The new architecture is validated by extensive ablation studies. The performance is benchmarked against the current state-of-the-art Mediterranean forecasting setup of NEMO ocean model (available as part of Copernicus 450 Marine Service) and against a multi-decadal reanalysis run of the SCHISM model (Toomey et al., 2022) on an unstructured grid with very high coastal resolution. We demonstrate that HIDRA2 outperforms HIDRA1 as well as numerical ocean models across all sea level bins. We further show that HIDRA2 very accurately represents the energy contents in the bands around relevant geophysical periods (diurnal and semi-diurnal tides, and the lowest two free oscillation basin eigenmodes).
Performance is analyzed during several historic storms. Spectral decomposition of the total sea level signal into bands 455 centered around tides and basin seiches is carried out to assign modeling errors to specific energy bands of the predicted sea levels. HIDRA2 consistently exhibits high skill in exciting the ground state Adriatic basin seiche at the appropriate time and with the appropriate phase and amplitude.
HIDRA2 is a good example of how the entanglement of deep-learning and geophysics may lead to reliable and numerically cheap models, which are able to mimic complex physical phenomena on the level of the best numerical physical models. Nev-460 ertheless, several extensions could be additionally explored. One possible extension is data ingestion from several tide gauges along the Adriatic coast and verification of whether the prediction accuracy at individual locations improves in such a multipoint prediction setup. Another extension is the inclusion of real-time in-situ measurements such as synoptic observations, satellite scatterometer and wind measurements. It would be interesting to migrate HIDRA2 to other Mediterranean locations or other semi-enclosed basins like the Baltic Sea, the Red Sea or the Chesapeake Bay to investigate its generalization properties.

465
These will be objects of our future research.
Code and data availability. Implementation of HIDRA2 and the code to train and evaluate the model is available in the Git repository https://github.com/rusmarko/HIDRA2 (last access: 9 November 2022). We also include HIDRA2 weights pretrained on 2006-2018 and predictions for all 50 ensembles on June 2019-December 2020. The persistent version of HIDRA2 source code is available at https:// doi.org/10.5281/zenodo.7307365 (Rus et al., 2022a). Training and evaluation of the model were performed on the datasets available at