HIDRA 1.0: Deep-Learning-Based Ensemble Sea Level Forecasting in the Northern Adriatic

Interactions between atmospheric forcing, topographic constraints to air and water flow, and resonant character of the basin make sea level modeling in Adriatic a challenging problem. In this study we present an ensemble deep-neuralnetwork-based sea level forecasting method HIDRA, which outperforms our setup of the general ocean circulation model ensemble (NEMO v3.6) for all forecast lead times and at a minuscule fraction of the numerical cost (order of 2× 10−6). HIDRA exhibits larger bias but lower RMSE than our setup of NEMO over most of the residual sea level bins. It introduces a 5 trainable atmospheric spatial encoder and employs fusion of atmospheric and sea level features into a self-contained network which enables discriminative feature learning. HIDRA architecture building blocks are experimentally analyzed in detail and compared to alternative approaches. Results show the importance of sea level input for forecast lead times below 24 h and the importance of atmospheric input for longer lead times. The best performance is achieved by considering the input as the total sea level, split into disjoint sets of tidal and residual signals. This enables HIDRA to optimize the prediction fidelity with respect to 10 atmospheric forcing while compensating for the errors in the tidal model. HIDRA is trained and analysed on a ten-year (20062016) timeseries of atmospheric surface fields from a single member of ECMWF atmospheric ensemble. In the testing phase, both HIDRA and NEMO ensemble systems are forced by the ECMWF atmospheric ensemble. Their performance is evaluated on a one-year (2019) hourly time series from tide gauge in Koper (Slovenia). Spectral and continuous wavelet analysis of the forecasts at the semi-diurnal frequency (12 h)−1 and at the ground-state basin seiche frequency (21.5 h)−1 is performed. The 15 energy at the basin seiche in the HIDRA forecast is close to the observed, while our setup of NEMO underestimates it. Analyses of the January 2015 and November 2019 storm surges indicate that HIDRA has learned to mimic the timing and amplitude of basin seiches.

atmospheric cyclones which manifest themselves as substantial air pressure lows and related winds over the basin. Reliable 25 and timely sea level forecasting is thus a crucial element of early warning systems for mitigating the flooding consequences.
Adriatic Sea has an elongated basin with northwest-southeast orientation, lies in the Northern Central Mediterranean, and connects to the eastern Mediterranean basin through the Otranto strait at its southern end (see Figure 1). The basin lies embedded between the Alps (to the north), the Apennines (to the west) and Dinaric Alps (to the east) -it spans a 800 km long and 200 km wide area. The ridges significantly influence the basin circulation through topographic control of the air flow, especially 30 during wind events of Bora (northeasterly wind) and Scirocco (southeasterly wind). Scirocco, predominantly directed along the basin long axis, is among the main drivers of the Adriatic storm surges. Northern Adriatic shelf is closed at its northern end and is the shallowest part of the Adriatic basin. Storm surges in this part are consequently most pronounced, causing substantial coastal flooding, inundation and erosion .
The elongated basing results in seiches with a fundamental period of 21.5 h (and first excited mode period of 10.9 h) 35 (Cerovecki et al., 1997;Medvedev et al., 2020). Diurnal and semi-diurnal tides enter the Adriatic through the Otranto Strait and force the basin close to its eigen-periods (Medvedev et al., 2020). Both seiches and tides are resonantly and topographically amplified in the shallow northern Adriatic. In Northern Adriatic, crest-to-trough range of tidal sea level movement can reach up to 1.5 m (and up to 10 cm/s in current speed (Cosoli et al., 2013)). In absence of strong winds, this signal dominates the sea level dynamics. During storm surges however, tides are generally too weak to be dominant but remain a crucial part of the total 40 sea level signal. Moreover, since Adriatic seiches decay on scales of several days, seiches and tides may continue to reinforce (or diminish) each other for several days.
The resonant character and high sensitivity to temporal phase lag between tides and seiches make Adriatic sea levels prediction a challenge for deterministic numerical ocean models. Even modest modeling errors in the timing, intensity or trajectory of an atmospheric cyclone, often lead to substantial modeling errors in the predicted sea level. Introduction of atmospheric 45 numerical forecast ensembles has thus enabled implementation of operational sea surface height forecast systems that yield probabilistic forecasts along with error variance estimation, which show promise world-wide, see i.e. Bernier and Thompson, 2015;Mel and Lionello, 2014;Bertotti et al., 2011).
These systems, however, often involve a high computational cost, usually requiring running tens of runs of basin scale numerical ocean models (each forced by a different member of the atmospheric ensemble) each day (or even several times 50 per day). Ensemble numerical modeling is therefore prohibitively demanding for many operational or civil rescue services that lack access to dedicated high-performance computing facilities.
Machine-learning-based ensemble modelling offers a possible solution to the challenges described above. Even though training a machine-learning model may involve substantial amount of training data and computational resources, the subsequent forecasting -even ensemble forecasting -is numerically cheap enough to be executed in some cases instantaneously on a 55 standard personal computer. Early approaches employ classic machine learning methods (Imani et al., 2018) or shallow fullyconnected neural networks (Pashova and Popova, 2011;Karimi et al., 2013) for daily or hourly sea level forecasting. The reported results show promise in sea level prediction, but fall short with simplistic architectures that ignore the atmospheric forcing. Ishida et al. (2020) attempt to improve the dynamics of hourly scale sea level forecasts by using a long short-term memory (LSTM) network, which are well established methods for sequence modelling and time-series prediction. The network 60 considers several atmospheric variables (wind speed and direction, sea level pressure, air temperature) as well as relative positions of the Sun and the Moon and annual global air temperatures as the input. Empirically, the short-term sea level prediction of one hour in the future is improved compared to older approaches. While predictions farther into the future could be achieved by iterative auto-regression, the errors would likely exponentially increase. A longer prediction horizon is considered by Braakmann-Folgmann et al. (2017), who apply a combination of recurrent neural networks (LSTMs) and 65 convolutional neural networks to model monthly spatial sea level time series. This is one of the first works that considers both spatial and temporal aspects of the input data, however, predictions are made at a much coarser temporal resolution than that considered in this paper. A higher temporal resolution is considered by Hieronymus et al. (2019), which is the most closely related work to ours. Autoregressive neural networks are used to model the sea level time-series with addition of atmospheric forcing reduced by empirical orthogonal function (EOF) decomposition. For northern Adriatic, Venice lagoon specifically, 70 artificial neural networks have already been shown to be useful at modelling numerical model errors in sea level prediction (Bajo and Umgiesser, 2010).
In this work we propose HIDRA -a HIgh-performance Deep tidal Residual estimation method using Atmospheric data.
HIDRA is a novel deep learning architecture which employs tidal and atmospheric forcing contributions for accurate sea level predictions. The model is trained end-to-end with discriminative feature extraction as part of the learning to maximize the 75 forecast accuracy and to compensate for the inaccuracies of the astronomical tide estimates. HIDRA is benchmarked against the operational setup of NEMO v3.6 general circulation model engine (Madec, 2008), which is run daily as part of the National Hydrological Forecasting Service at the Slovenian Environment Agency (ARSO). For brevity, we refer to this particular setup (see the Code and data availability Section for a detailed configuration namelist) as the NEMO model.
The remainder of the paper is structured as follows. The input sea level and atmospheric data, and the datasets used in 80 this study are described in Section 2. The HIDRA architecture, our NEMO ocean model setup and the ensemble structure are presented in Section 3. Section 4 reports a detailed analysis of the HIDRA architecture and empirical comparison to the NEMO system on a challenging setup. Conclusions are drawn in Section 5.  Figure 2 for location), which is operated by the Slovenian Environment Agency (ARSO). The measurements are acquired by a bottommounted pressure gauge in ten minute intervals, which are subsequently quality controlled at ARSO. The tidal part of the sea level is independent from atmospheric forcing and can be approximated by tidal models. We analyzed the tidal contribution to Koper SSH using the Tidal Analysis Program for Python TAPPY (Cera, 2011). Tidal contribution is estimated from a 20-year 90 hourly time-series of Koper SSH for the period between January 1995 and September 2014. Tidal constituents are then used to estimate past and future tidal values. The residual sea level is defined in this paper as the arithmetic difference between the total and the tidal sea level.

Atmospheric Data
Atmospheric data is obtained from Ensemble Prediction System (EPS) of the European Centre for Medium-Range Weather 95 Forecasts (ECMWF). The data comes as an ensemble of fifty integrations of global atmospheric models (Leutbecher and Palmer, 2007). Ensemble forecasts have a 0.125 • arc degree spatial (zonal and meridional) resolution and a 3-hour temporal resolution. In this study, the following forecast fields were subset to the Adriatic basin, represented by a 73 × 57 spatial grid (see Figure 2): (i) 10-meter zonal and meridional winds, (ii) mean sea level pressure and (iii) air temperature at 2 metres. The forecasts were linearly interpolated to hourly timesteps to match the SSH temporal resolution. Atmospheric fields over land and 100 sea are treated in the same manner, i.e. while HIDRA does receive an explicit spatial encoding of atmospheric fields (Section 3.1.1), it does not employ a land/sea mask. Direct wind influence on the ocean is exerted through vertical momentum transfer or wind stress. Configurations with both raw wind and wind stress were tested to obtain optimal neural network configuration. Whenever wind stress was used, turbulent momentum transfer (wind drag) coefficient was computed using the Large and Pond parametrization (Large and Pond, 1981).

105
Note that the purpose of wind stress parametrization in this study is not to most concisely represent the vertical momentum flux at ocean surface (which would require more complex schemes), but merely to introduce to the neural network the nonlinear wind stress dependence on the wind to assist its learning process.

Evaluation datasets
Atmospheric and sea level data described in previous sections were used to create a dataset for years 2006-2016. The first 80% 110 of the dataset is used for training (70%) and validation (10%), while the last 20% (September 2014 -December 2016) is used for testing. The data is standardised and global average pooling is used to reduce the dimensionality of the atmospheric dataspatial dimension of the data in samples is reduced in half, from 73 × 57 points to 37 × 29 points, and the temporal dimension of atmospheric data is reduced by a factor of 4. Sea level data retains 1-hour resolution. An additional test-only dataset for the year 2019 was constructed in the same manner and was used for comparison of our setup of NEMO and HIDRA prediction 115 performance.
Oversampling is applied to the training data to improve the prediction accuracy of the rare storm surge events. The training dataset is split into two subsets by thresholding the residuals at 40 cm. Storm surges (residual > 40 cm) represent approximately two percent of the data. During training the samples are randomly sampled from each of the two subsets with equal probability. corresponds to the four atmospheric surface input fields: two components of the wind stress, mean sea level pressure and air temperature at 2 m (see Figure 2).

Numerical Models
To account for the causal relation between past atmospheric and tidal forcing in the basin and future sea surface heights, HIDRA considers the forcing data over a range of past and future timesteps. In particular, for prediction starting at time t 0 , HIDRA takes as the input atmospheric tensors I t for the interval t ∈ [t 0 − T min + 1, t 0 + T max ] and the tidal and the residual The HIDRA architecture is summarized in Figure 3. Atmospheric tensors from all considered time-steps are individually encoded by an atmospheric spatial encoder (ASE) module (Section 3.1.1) and fused by the temporal encoder block (Section 3.1.2) based on the temporal attention mechanism into an atmospheric feature vector. The resulting vector is concatenated with the past tidal and residual measurements. This is followed by a residual regression block (Section 3.1.3) to generate the final 140 residual predictionsr t along with their uncertainties σ t .

Atmospheric Spatial Encoder
The atmospheric spatial encoder (ASE) encodes the spatially-represented atmospheric data into features, fine-tuned for the task of sea level prediction, i.e., the atmospheric tensor I t ∈ R 29×37×4 for time-step t is encoded into a feature vector f t ∈ R 256×1 .
The ASE architecture (shown in Figure 3a) follows design principles of the ResNet20 v2 convolutional neural network (He 145 et al., 2016), which has already demonstrated remarkable performance in image processing tasks. ASE is composed of 22 convolutional layers. Spatial dependence in feature extraction in enforced by concatenating the atmospheric tensor I t with a    spatial encoding tensor s xy ∈ R 29×37×2 , which contains the x and y coordinates (scaled between 0 and 1) of each pixel in the tensor. The augmented tensor is then processed by a convolutional layer with 16 filters of 3 × 3 size, followed by three ResNet20 v2 stages, a spatial pooling layer and a time-dependent spatial attention layer (Figure 3b).

150
Each ResNet stage consists of two residual (bottleneck) blocks, where each block contains three convolutional layers (i.e., 1 × 1, 3 × 3 and 1 × 1) and a residual connection, which sums the block input with its output. To match the number of output features in the residual connection, the first residual block in each stage uses an additional 1 × 1 convolutional projection.
Spatial feature downsampling (DS) is applied by a stride of length 2 in the first convolutional layer of the second and third stage to increase the effective receptive field of neurons. A ReLU activation layer is appended to each convolutional layer, 155 except from the first one, are pre-pended by a batch-normalization layer to stabilize the learning.
The output of the last residual block is spatially reduced by half with an average pooling layer, resulting in a feature tensor F t of size 5 × 4 × 256. Finally, a time-dependent spatial attention layer produces the final feature vector f t ∈ R 1×256 , which is a weighted sum of spatial positions is a slice of the feature tensor at time t at spatial coordinates (i, j), w (i,j) t is the respective spatial weight and ReLU(·) is the ReLU activation function. Note that the weight matrices are temporally dependent, which allows them to focus on different parts of the atmospheric feature maps over time. With the exception of the spatial attention layer, all weights of the ASE network are temporally-independent and are thus shared between all atmospheric input tensors.

165
The ASE encodes input sequence of atmospheric tensors into a sequence of atmospheric features. These are stacked into an atmospheric feature matrix F ∈ R 256×T∆ , where T ∆ is the number of time-steps in the atmospheric input tensor I t , and compressed into a single feature vector f ∈ R 256×1 by a weighted summation where w is a temporal weights vector which serves as a temporal attention mechanism ( Figure 3c) and adjusts the contributions 170 of different past time-steps to maximize the prediction performance.
The input tidal and residual sequences, each a T min ×1 vector, are concatenated with the encoded atmospheric feature vector (2) into the combined temporally-encoded atmospheric and surface height feature vector. This vector is passed to the residual regression block (Section 3.1.3) for the final prediction.

175
Probabilistic regression is employed to enable predicting the most likely residual values along with their uncertainties. For each time-step, the mean and variance of a Gaussian probability density function are predicted. The residual regression block output are thus two sequencesr andσ, each a 1 × T max vector, where T max is the prediction horizon. The residual regression block ( Figure 3) is composed of three dense fully-connected layers, each consisting of 256 units, followed by a fully-connected layer that maps into two 1 × T max vectors for the meansr and standard deviationsσ. A soft-plus function (Glorot et al., 2011) 180 is applied to the standard deviation vector to ensure positive values.

Network Training
A loss function that takes into account the probabilistic outputs is designed for training HIDRA end-to-end. In particular, a log-likelihood of the ground-truth data (i.e., the residuals series r t0 = {r t0+δ } δ=0:Tmax ), is computed under the sequence of predicted Gaussians (Figure 3,r t0 andσ t0 ) for all training sequences starting at different times t 0 ∈ T . The loss is thus defined where r are the sets of training samples, whiler andσ are the corresponding HIDRA predictions and their error estimates, respectively.
The network and experiments are implemented using Google's machine learning library TensorFlow (Abadi et al., 2015). 190 We use the ADAM optimizer with learning rate of 0.001, β 1 = 0.9 and β 2 = 0.999 to train the model. The training batch size is set to 64 data samples. The models are trained for 30 epochs, each epoch consisting of 1000 training steps (batches). A computer with a NVIDIA GeForce GTX 980 graphics card was used for model training and evaluation.

NEMO Ocean Model
General circulation model NEMO v3.6 (Madec, 2008) is used as a baseline for comparison with HIDRA. Detailed configuration 195 namelist of our particular setup described below is available in the supplementary material in the paper.
Adriatic NEMO model used in this study is set up on a regular longitude-latitude grid (648×504 cells) with a 1 • /72 arcdegree horizontal resolution and 31 vertical partial step z * -levels. The model domain spans 12-21 • E and 39-46 • N (see Figure   2). In all regions shallower than 2 m, a 2 m depth is enforced. Baroclinic timestep was set to 120 s. Barotropic timestep is adjusted to meet Courant-Friedrichs-Lewy stability condition. This operational suite runs every day at Slovenian Environment Rivers are modeled as discharge of fresh water at the respective river location as described in Ličer et al. (2016). Flather 205 boundary condition determines barotropic dynamics at the lateral open boundary, while Flow Relaxation Scheme (Engedahl, 1995) is applied for baroclinic dynamics and tracers. Lateral momentum boundary condition at the coast is free-slip. Bottom boundary layer is logarithmic with nonlinear bottom friction. Lateral diffusion is governed by Laplacian operators for tracers and dynamics, both operating over geopotential surfaces. Generic Length Scale k-scheme is used for vertical diffusion. Surface wave mixing is parametrized using Craig and Banner formulation (Craig and Banner, 1994). The full NEMO configuration 210 namelist is provided as supplementary material (Žust et al., 2021).
NEMO was run in this study without tidal forcing and predicts the residual sea level for the entire Adriatic basin with the forecast period set to 72 h (as in HIDRA). In the ensemble simulations, only eight out of fifty ECMWF ensemble members were used as forcing to our NEMO circulation model due to computational constraints. These eight ensemble members were selected from ECMWF ensemble based on the wind strength each member exhibits in the central Adriatic: ECMWF ensemble 215 members were ordered by wind strength in the central Adriatic and then a subset of members was made from the strongest to the weakest member in steps of 6 (i.e. integer part of 50/8). This generates eight possible forcing scenarios in the Adriatic basin while conserving the wind forecast spread of the reduced ensemble. For the i-th (i = 1, . . . , 8) NEMO ensemble member run, residual sea level forecast for Koper is extracted from the NEMO basin prediction as a single time-series. This is then added to the tidal time series, obtained via tidal analysis from observations, to obtain the total modeled sea level, which we 220 denote as y nemo (i, t).
Each member of the NEMO ensemble sea level forecast for Koper is further corrected for bias. This is necessary to compensate for the fact that NEMO sea level reflects departures from a local geoid and does not represent the absolute local depth of the water, which is also driven by low-frequency processes (like planetary waves in air pressure), which cannot be reflected in the 72 hour run in a regional basin. To obtain the absolute sea level needed by port and civil rescue authorities, the NEMO sea 225 level predictions have to be adjusted to the Koper tide gauge observations.
On the n-th hour of the forecast day, the model bias with respect to Koper tide gauge observations y kp (t) can is estimated as The i-th ensemble NEMO prediction time-series y nemo (i, t) is then shifted by n (i) so that the bias of the first n-hours of the 230 y nemo (i, 1 < t < n) with respect to observations is zero. Complete forecast time-series y nemo (i, t) will of course still exhibit a non-zero bias. This procedure is applied every hour as new observations from Koper tide gauge arrive. Note that, unlike NEMO ensemble, HIDRA ensemble does not need any such bias correction, because it already contains local tide gauge information through its sea level input in the day prior to the forecast and learns to adjust for the possible bias.
For operational reasons, first daily NEMO sea level ensemble run usually becomes available between 11 00 and 12 00 235 UTC at the earliest. This means that the earliest bias correction of each day generally takes into account the first 12 hours of tide gauge observations of that day. Raw NEMO time-series from i-th ensemble member y nemo (i, t) therefore gets shifted by 12h (i) to produce the first bias-corrected forecast of that day, i.e., with 12h (i) defined in (4). The bias-corrected NEMO ensemble mean, ensemble maximum and ensemble minimum time-240 series is then constructed from 12-hour bias-corrected ensemble members at each forecast timestep in an identical fashion as with HIDRA -see (6)-(8). The corresponding time-series are denoted as y bc12h (t), y max bc12h (t) and y min bc12h (t).

Ensemble Statistics
As mentioned in Section 2.2, a total of fifty ECMWF atmospheric ensemble members are available daily. This results in an ensemble of n ens = 50 sea level forecasts by HIDRA and an ensemble of n ens = 8 sea level forecasts by NEMO. The ensemble 245 mean time-series is defined as the average over all ensemble members predictions where y(i, t) is the i-th member. Similarly, the ensemble prediction envelope, i.e., the per-time-step minimum and maximum sequence, is defined as In the interest of clarity, only ensemble means, maximums and minimums, as defined above, are analyzed in the following (rather than individual ensemble members).

Results and Discussion
Predictions from HIDRA and NEMO are discussed in two sections. Section 4.1 analyzes the influence of atmospheric and sea

Influence of the Historic Horizon
Increasing the length of the historic horizon defined by the parameter T min (see Section 3.1) increases the HIDRA execution time due to a substantially increased number of parameters in the input layer. We therefore analyzed the influence of the HIDRA historic horizon length, to find the best trade-off between the model forecast accuracy and the execution time.

Contribution of Atmospheric and Sea Level Inputs
HIDRA uses two input sources: the atmospheric data and the sea level history. In this study we analyze the individual contributions of both input sources. The full HIDRA model (using both atmosphere and sea level data input, denoted as HIDRA 0 ) is compared with two single-input-source models: (i) HIDRA Ai , using only atmospheric inputs, and (ii) HIDRA SLi , using only sea level inputs. In both setups, the network branch responsible for processing the ignored input source is removed. Results are 280 presented in Table 2 and Figure 5. Table 2 indicates that HIDRA exhibits best performance when using both atmospheric and sea level input and outperforms both single-source variants by a large margin. This holds overall and also within limited time windows during storm surge events (defined as the timestamps where the residual is larger than 40 cm). Removing each source individually leads to a significant performance drop. The RMSE increases by 77% when using only the atmospheric input data (HIDRA Ai ), while a  essential for accurate prediction. Note also that both input data sources have a similar overall contribution to the prediction accuracy.
Performance analysis over prediction lead time (Figure 5a) reveals further insights. HIDRA Ai is very consistent across the entire prediction interval. In fact, the error slightly decreases over time (8% decrease in RMSE over the interval of 72 h).

290
HIDRA SLi , on the other hand, is a much better sea level predictor in the short-term, but the error increases rapidly over the prediction interval (by 400% over the interval of 72h). Thus, while the sea level data is important for short-term predictions, the atmospheric data is more informative for predictions further into the future.
Grey dashed line in Figure 5 depicts forecast errors of using the astronomical tide values as the surface height predictor.
Since tidal forecast is done independently of prediction lead time, its RMSE over prediction time window is simply a root 295 mean square error of the tidal model, plotted as a horizontal line. Bias of the reference tidal model with regard to specific residual bin is, by definition, the negated value of the residual itself, while its RMSE is, again by definition, simply the absolute value of the residual itself. Whenever HIDRA exhibits lower biases or RMSEs than tidal reference model, they are essentially predicting more accurately than a tidal model would. Note that this is the case for all prediction lead times and practically all residual values.

300
On large residual values that correspond to storm surges (Figure 5b), HIDRA Ai outperforms HIDRA SLi , confirming that atmospheric data is essential for accurate storm surge prediction. Both models achieve similar performance on small residual values, and both perform worse than the full model.

Influence of Sea Level Input Type
HIDRA considers the total sea level information split into the tide and the residual provided as separate input time series, and 305 predicts the residual which is added to the tidal signal to predict the full surface height. To analyze the contribution of different sea level input types, two additional variants were considered: (i) HIDRA res considered only the residual as the input to predict the future residuals and (ii) HIDRA sl considered a single total sea level input and predicted the total sea level output.
Results are shown in Table 3. Sea-level-only model (HIDRA sl ) performs significantly worse than reference HIDRA 0 with a 35% increase in RMSE, which speaks in favor of residual prediction over the total sea level. The residuals-only model 310 HIDRA res also performs slightly worse than the full model, causing an 8% RMSE increase, indicating that tidal information as an additional input provides useful context for improved prediction accuracy.
Performance analysis over different prediction lead times (Figure 6a) shows that the sea-level-only model HIDRA sl makes much larger errors (41% increase compared to HIDRA 0 ) when predicting far into the future (prediction lead time is high), which suggests that the network has trouble predicting the tidal component that far into the future using only the data from the 315 last 24 hours. Comparing predictions over different residual values (Figure 6b) shows a similar situation. HIDRA sl performs substantially worse (40-50% larger RMSE than HIDRA 0 ) for small residual values. This is the range at which the tidal model typically is most accurate and thus provides sufficient information for accurate predictions. Note also that although the sea level only model HIDRA sl does not use tidal information, its prediction errors still follow a similar pattern of increasing

Influence of the Atmospheric Encoder
The role of trainable discriminative atmospheric encoder ASE is analyzed by replacing it with a reconstructive embedding proposed in Hieronymus et al. (2019). A principal component analysis is applied to the atmospheric input to compute a low- that of HIDRA 0 . However, the difference increases on the less frequent conditions with high residuals (i.e. surges) in which the EOF-based version results in a 5% RMSE increase compared to ASE-based version. This supports the choice of using end-to-end learned feature encoder as opposed to a hand-crafted one.

Influence of Temporal Encoders
Temporal encoder with temporal attention weights (Section 3.1.2) plays an important part in HIDRA. Two additional variants Results are reported in Table 5 and Figure 7. Overall, HIDRA 0 performs on par with the more complex TCN-based HIDRA TCN (RMSE of HIDRA TCN is 3% larger), while LSTM-based HIDRA LSTM performs worse (RMSE is 14% larger than HIDRA 0 ). HIDRA 0 outperforms both TCN and LSTM-based versions by a solid margin on the storm surge events (31% and 32% RMSE increase, respectively). Furthermore, temporal weights of HIDRA 0 use very few parameters compared with 345 the other variants (see Table 6) -TCN and LSTM-based variations increase the total model size (including other network layers) by 50% and 150% respectively. shown. The prediction errors using the astronomic tide are presented for reference.

Influence of the Wind Input Type and Wind-Pressure Redundancy
Bora and Scirocco characteristics in the Adriatic basin are often determined through an interplay of geostrophic, orographic and other influences (Pasarić et al., 2007;Grisogono and Belušić, 2009). At other times however, non-geostrophic effects may 350 play a lesser role and the wind field is largely determined by the pressure field. To investigate potential information redundancy between the wind and pressure inputs, two HIDRA variants were trained: one which did not use the wind input and another which used the wind, but not the pressure. Results in Table 7 show that removing either wind or air pressure input leads to an approximately 9% increase of RMSE. HIDRA seems to compensate for potential redundancy in the inputs and capitalizes on the fact that wind in the basin is, in the last instance, not entirely pressure driven. In any case using both inputs is preferred. 355 We proceed to inspect the impact of Large and Pond parametrization (Large and Pond, 1981) which might oversimplify the wind stress dependence on the wind. To this end we consider another variant of HIDRA, which uses raw wind instead of wind stress. Results in Table 7 show that, overall, the performance between the two wind-input variants is indistinguishable.
However, on storm surges, using raw wind reduces the RMSE by approximately 1 cm when compared to the setup which uses wind stress. It appears that HIDRA is capable of extracting the information important for sea level prediction during storm 360 surges also directly from the raw wind.

Comparisons between HIDRA and NEMO
In regional ocean modeling setups, tides are often implemented as open boundary conditions and are treated as an external part of model's sea level response. In HIDRA on the other hand, tides enter the model as information that gets inextricably linked into its residual (i.e. non-tidal part of) prediction: results of Section 4.1.3 show that including tides improves the prediction of  Table 8) Figure 10. Power spectrum of NEMO y bc12h (t) ensemble forecast mean time-series (blue line), HIDRA ensemble forecast mean time-series y H (t) (orange line) and tide gauge observations time-series y kp (t) (black line) for year 2019. Turquoise symbols denote spectral peaks due to respective tidal constituent. E0 denotes the spectral peak due to fundamental Adriatic seiche with a period of 21.5 h, also marked with the vertical turquoise dotted line.

Specific Storm Surge Events
We now proceed to investigate the total sea level time-series predicted by NEMO and HIDRA during specific storm surges by analyzing the total sea level ensemble envelopes from a 12-hour bias corrected NEMO y bc12h (t) and HIDRA y H (t). In addition, continuous wavelet transforms (CWT, see e.g. Mallat (2009)) over time windows containing the specific storm surges 400 are computed. This allows comparison of the excitation level of specific harmonic contributions to the total sea level during each particular storm surge event. The analysis is focused on the semi-diurnal tidal signal (with periods around 12 h) and on the fundamental seiche period (with a period of 21.5 h). A Morlet wavelet was used for the CWT convolution computation. We first discuss the historic storm surge flooding event from mid-November 2019. Atmospheric conditions during November 2019 were not remarkable in themselves. Mean sea level pressures were moving between 990-1000 mbar, while Scirocco 405 speeds in Northern Adriatic were measured to be around 10 ms −1 . However, an unfortunate coinciding of a general low pressure, high neap tide and seiche-inducing high-frequency forcing due to a local pressure low caused one of the worst Northern Adriatic floods in history (Cavaleri et al., 2020). These multifaceted circumstances made forecasting of these floods, using a lower resolution forcing such as ECMWF ensemble, challenging (Cavaleri et al., 2020). This problem is at least partly reflected in the November 2019 forecasts, presented in Figure 11.

410
Panel A in Figure 11 depicts NEMO y bc12h (t) mean time-series, together with the y min bc12h (t)−y max bc12h (t) ensemble envelope, while panel B depicts HIDRA predictions. Both NEMO and HIDRA mean time-series seem to underestimate the storm surge peak values. Both ensembles, however, exhibit high forecast spread, with maximums often adequately representing the observed peaks. The first peak on 12th November 2019 is missed by HIDRA, but the subsequent dynamics is better represented in HIDRA than in NEMO. In particular, HIDRA does not exhibit a substantial false positive on 15th November and also over- Note that HIDRA excites the seiche immediately after the sea level peak on 12th November. NEMO, of course, cannot do this 420 since the seiche period is 21.5 h. Panel D of Figure 11 show that, like NEMO, HIDRA resolves well the low frequency tidal variability between spring and neap tides.
We now move to an event from late January and early February 2015, which turned out to be quite problematic for NEMO to forecast, while HIDRA behaved much better. During this period, Adriatic was impacted by several days of low pressures (990-1000 mbar) and moderate Scirocco (with speeds 8-12 ms −1 ). These conditions led to a series of moderate storm surges 425 in the Northern Adriatic, as shown in Figure 12.
NEMO ensemble, depicted in Panel A of Figure 12, performs particularly poorly during this time window. While it did predict the first surge on 30th January, the following peaks were underestimated and the crest-to-trough sea level range of NEMO is overall unsatisfactory throughout the time-window. Since the tidal part of the NEMO signal is appropriate ( Figure   12, Panel D), the reason for poor forecast seems to lie in insufficient excitation of the fundamental basin seiche, which is 430 drastically underestimated in our setup of NEMO ( Figure 12, Panel E). HIDRA yields a more accurate overall forecast in this case (Figure 12, panel B), and does not underestimate the seiche signal at the (21.5 h) −1 frequency as much as NEMO.
Semi-diurnal tidal signal is reasonably well represented in both models (panel D of Figure 12). These performances of NEMO and HIDRA are consistent with the power spectrum in Figure 10.  In this study, we presented HIDRA, a novel deep learning network for sea level modeling in complex environments like the Adriatic. We describe key HIDRA architecture blocks and discuss several aspects of how both HIDRA architecture and its input influence its performance. HIDRA compares favorably to the current operational NEMO setup of the National Hydrological Forecasting Service at Slovenian Environment Agency. While further tuning of the operational NEMO setup at the Agency is also under way (with the aim of improving its forecasting skill), results presented in this study nevertheless indicate that 440 HIDRA is an appropriate candidate for the Agency's operational pipeline. Preliminary tests (not reported in this study) indicate that HIDRA also generalizes well to other geographical locations.
Last but not least, numerical cost of both setups is vastly different. NEMO ensemble runs require dedicated HPC facilities, while the HIDRA ensemble forecast can be executed on a personal computer (even without a dedicated GPU) and exhibits an extremely low energy footprint. A single HIDRA run for our requirements takes less than half a CPU second per ensemble 445 member, while a full basin NEMO ensemble requires tens of CPU hours per ensemble member -a speedup in order of 0.5×10 6 times.
We believe the presented results are a promising first step. In our future work we plan to focus on improving the performance of both HIDRA and NEMO in the tails of the sea level distributions as well as explore other environmental input streams and architectural designs to further reduce the prediction errors with increasing forecast horizon. We hope this study builds a strong 450 case in favor of machine learning capabilities with carefully designed architectures to discern sea level dynamics in regional basins and will inspire other groups to consider similar solutions.
Author contributions. LŽ and MK designed and implemented HIDRA. ML provided the physics-related background for HIDRA architecture. AF obtained and preprocessed ECMWF ensemble data. AF and ML implemented NEMO ensemble. ML, LŽ and MK analyzed the results and wrote the paper. All authors contributed to the final version of the manuscript.