A Bayesian data assimilation framework for lake 3D hydrodynamic models with a physics-preserving particle filtering method using SPUX-MITgcm v1

We present a Bayesian inference framework for a three-dimensional hydrodynamic model of Lake Geneva with stochastic weather forcing and high-frequency observational datasets. This is achieved by coupling a Bayesian inference package, SPUX, with a hydrodynamics package, MITgcm, into a single framework, SPUX-MITgcm. To mitigate uncertainty in the atmospheric forcing, we use a smoothed particle Markov chain Monte Carlo method, in which the intermediate model state posteriors are resampled in accordance with their respective observational likelihoods. To improve the uncertainty quantification in the particle filter, we develop a bi-directional long short-term memory (BiLSTM) neural network to estimate lake skin temperature from a history of hydrodynamic bulk temperature predictions and atmospheric data. This study analyzes the benefits and costs of such a state-of-the-art, computationally expensive calibration and assimilation method for lakes.


Introduction
Lake management is a constantly evolving tradeoff between different conflicts of interest. The most obvious is that lakes are easily accessible sources of drinking water but are also the place into which wastewater is ultimately discharged. Lake stakeholders traditionally evaluate the evolution of lakes from in situ observations. While still not widely used for this purpose, previous studies have clearly shown the benefit of one- and three-dimensional hydrodynamic models for projecting different scenarios for the short- or long-term future (Gaudard et al., 2019; Soulignac et al., 2019; Vinnå et al., 2021).
While a number of dedicated monitoring projects already exist for several large lakes, operational, fully three-dimensional (3D) models are quite sparse. The most notable is the NOAA Great Lakes Operational Forecast System (GLOFS; Chu et al., 2011; Anderson et al., 2018), which provides comprehensive predictions (water temperature, velocity and level, and ice cover) for all of the Laurentian Great Lakes. Over 25 years, the forecasting service has been continuously improved with better and more sophisticated models. Currently, data assimilation is used for calibration only (Anderson et al., 2018), but there is research toward making it part of the operational mode as well (Ye et al., 2020). Another platform is Meteolakes, which provides short-term water temperature and velocity forecasts for the Geneva, Biel, Zurich, and Greifen lakes in Switzerland (Baracchini, 2019; Baracchini et al., 2020a, b). The platform uses an ensemble Kalman filter to assimilate remotely sensed lake surface water temperature (LSWT), which reduced the mean temperature prediction error by half. An additional benefit of the ensemble filter was a better prediction of mesoscale physical processes such as gyres and upwellings (Baracchini et al., 2020a). However, due to the limitations of the assimilation scheme, only a fraction (≈ 3.7 %) of the available LSWT images were used.
The purpose of this study is to investigate a novel approach to the data assimilation of highly heterogeneous data using Bayesian inference techniques applied to a 3D hydrodynamic model of Lake Geneva. The model relies on the ensemble affine invariant sampler (EMCEE; Goodman and Weare, 2010; Šukys and Bacci, 2021) to calibrate the distributions of the physical model parameters. EMCEE is a Markov chain Monte Carlo sampler that is particularly well suited for nonlinear parameter distributions. The advantage of this approach over standard inference methods is that it provides a more informative and accurate parameter estimation, albeit at a higher computational expense. To increase the confidence in the sampling algorithm, we used a particle filter method that provides trajectories consistent with the hydrodynamic model (Andrieu et al., 2010; Šukys and Bacci, 2021). In particular, as an alternative to the better-known Kalman filter and 4D-Var algorithms, the trajectories themselves are resampled based on their respective observational likelihoods, with the more probable realizations stochastically forming the basis for sequential predictions. Importantly, the filter does not modify the model states (they are only deleted or replicated), and therefore the predictions do not exhibit the shocks generated by some data assimilation models.
A proper assimilation of remotely sensed lake surface water temperature (LSWT) requires an estimation of the water surface temperature from the hydrodynamic predictions. Thus, we deploy a bi-directional long short-term memory (BiLSTM) neural network to estimate the skin temperature of the lake and to quantify its uncertainty. The network relies on a 27 h history of hydrodynamic model bulk temperature and atmospheric predictions as inputs for the conversion. The neural network was trained using 16 months of data (the year 2018 and January–April 2020), together with MeteoSwiss COSMO-1 atmospheric model reanalysis and Meteolakes water bulk temperature predictions.
We present the openly available SPUX-MITgcm framework, which integrates the Bayesian inference algorithms of the SPUX package with the hydrodynamics of the MITgcm code (Adcroft et al., 1997) and the trained BiLSTM network. To the best of our knowledge, the data assimilation and particle filtering approach that we propose in this paper has not previously been tested for fully three-dimensional models, due to the relatively high computational cost of the model parameter posterior estimation and the lack of supporting software. We investigate the viability of this approach and analyze the performance of the individual components. The results demonstrate that, while our methodology improves model performance, the framework requires further improvements to become usable for practical applications.

Data and numerical model
In this section, we describe the available data, the hydrodynamic model, and our data assimilation approach. As data and software reproducibility are essential for more open and accessible research, in the Supplement, we provide documentation on accessing and running the numerical model, which enables a full replication of the results (over a short period of time) that we present in this paper.

Study site
Lake Geneva is the largest freshwater lake in western Europe and is located on the border between France and Switzerland, covering an area of approximately 580 km², with an average depth of 154 m. Spanning 73 km along its longest axis, and with a maximum width of 14 km, the lake consists of a wider and deeper main portion in the east and a narrower and shallower portion in the west. The water level and the discharge rate into the Rhône river are managed by a dam at the western end of the lake. Lake Geneva is predominantly vertically stratified in density due to temperature, although complete mixing does occur every few years. The mountainous nature of the region significantly affects the wind patterns over the lake, with northeast and southwest being the prevalent directions. These wind patterns, along with seasonal variability in light penetration depth, significantly affect the thermal structure of the lake (Bouffard and Lemmin, 2013; Bouffard et al., 2018). As the mean water residence time in the lake is 10 years, the primary factor driving the lake's dynamics is the atmospheric forcing.

Hydrodynamic model
We simulate the hydrodynamics of Lake Geneva using the MITgcm package (Adcroft et al., 1997; tag "checkpoint67q"), which uses the finite volume method to solve the incompressible Navier-Stokes equations under the Boussinesq approximation. Alternative package options were FVCOM (Finite Volume Community Ocean Model; Chen et al., 2006), used by the Great Lakes Environmental Research Laboratory (GLERL), and Delft3D-FLOW (Deltares, 2013), used by Meteolakes. We use a hydrostatic formulation combined with a third-order direct space-time flux limiter advection scheme (Prather, 1986) and a nonlinear equation of state. The simulations are performed on a Cartesian grid (z coordinate system), with 1 km horizontal resolution and 50 vertical layers that gradually increase in thickness from 1 m at the surface to 21 m in the deepest portion of the lake. We chose a time step of 60 s. While larger time steps were still numerically stable, a smaller value was helpful in reducing vertical temperature over-diffusion into the deep layers. To improve the accuracy of the topography and reduce spurious artifacts near the bottom, shaved cells are allowed. We apply free-slip boundary conditions to both the horizontal and vertical boundaries and use the non-dimensional bottom drag coefficient from Bouffard and Lemmin (2013) to enable energy dissipation. On the surface, we use an implicit free-surface formulation.
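A vertical grid of this kind can be illustrated with a short sketch. The exact layer spacing is not stated in the text, so a geometric stretching from 1 m at the surface to 21 m at the bottom over 50 layers is assumed here purely for illustration; the spacing used in the actual model may differ.

```python
import numpy as np

# Hypothetical vertical grid: 50 layers whose thickness grows geometrically
# from 1 m at the surface to 21 m at depth (illustrative assumption only).
n_layers, dz_top, dz_bottom = 50, 1.0, 21.0
ratio = (dz_bottom / dz_top) ** (1.0 / (n_layers - 1))
dz = dz_top * ratio ** np.arange(n_layers)  # layer thicknesses, top to bottom
```

Such a vector of thicknesses corresponds to the `delZ` specification of an MITgcm z-coordinate grid.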
We model the vertical mixing processes using the nonlocal K-profile parameterization (KPP) scheme (Large et al., 1994), which is commonly used in oceanography. Small background vertical diffusivity and viscosity parameters are included to ensure stability (see Table 1). For the equivalent parameters on the lateral scales, we manually tuned the eddy viscosity and diffusivity by minimizing the difference between model predictions and in situ temperature profiles at the Buchillon station. While it would be preferable to infer these parameters, our data assimilation scheme found them difficult to identify (see the Supplement).
Surface forcing inputs were derived from the MeteoSwiss COSMO-E numerical weather prediction model, whose predictions are made at 2.2 km resolution. While the COSMO-E model generates an ensemble of 21 predictions, previous hydrodynamic models of Lake Geneva used only the ensemble mean and spread (Baracchini et al., 2020b; Cimatoribus et al., 2018). In Sect. 2.4.2, we describe a data assimilation approach that makes use of the individual ensemble members, which represent the span of weather dynamics more accurately. This approach has the additional advantage of not requiring the estimation of the spatiotemporal noise parameters that Baracchini et al. (2020b) used to add stochasticity to their model. Air pressure, air temperature, wind velocity, longwave radiation, relative humidity, and cloud coverage are used to determine the input fields in accordance with Fink et al. (2014), where the wind drag coefficients of Wüest and Lorke (2003) are used to improve surface stress coupling at low wind speeds. We also include the inflow and outflow of the Rhône river, using the volume flow and temperature measured a few kilometers upstream at the Porte du Scex station (Swiss Federal Office for the Environment, FOEN). As smaller tributaries and precipitation/evaporation are not taken into consideration in the model, the water level in the model is manually adjusted to the measured values from the St. Prex station.
Correct transfer of heat and energy from the atmosphere is an essential component of a well-performing hydrodynamic model, especially in summer. In this regard, the bulk transfer coefficient of latent heat (the Dalton number) is a significant parameter that, in several studies (Verburg and Antenucci, 2010; Baracchini, 2019; Rahaghi et al., 2018), has been shown to be larger than the default values used in ocean simulations. Therefore, we seek to infer this parameter. In addition, to more realistically accommodate fluctuations in water transparency, we estimate a spatially uniform Secchi depth, using the photosynthetically active radiation (PAR) data from the LéXPLORE moorings (Wüest et al., 2021) to determine the attenuation rate of shortwave energy in the water column. In the future, given the spatial variability in Lake Geneva (Soulignac et al., 2019), a better approach might be to use remote sensing data to estimate the Secchi depth for different lake locations. Finally, we use the albedo formula of Cogley (1979) to account for seasonal changes in surface reflectivity. The Secchi measurements and albedo values are visualized in the Supplement.

Observational datasets
A particular advantage of the Bayesian framework is the natural ability to handle multiple sources of data with their respective uncertainties. For Lake Geneva, the observations are either in the form of an in situ measurement or remotely sensed surface temperature.

In situ
In situ datasets used in the simulations are summarized in Table 2, and their locations are displayed in Fig. 1. To make the data assimilation process more manageable, the data from LéXPLORE and FOEN have been subsampled to an hourly rate from the original intervals of 5 s and 10 min, respectively. The vertical resolution of the LéXPLORE dataset was also reduced to match the model discretization levels. Finally, due to the coarse horizontal resolution of the model, only the magnitude of the velocity was considered, as a means of calibrating the kinetic energy of the lake.

Remotely sensed temperature
A processing chain from the University of Bern enables the extraction of LSWT images at a resolution of 1 km from the orbital Advanced Very High Resolution Radiometer (AVHRR). Typically, around 10 images of Lake Geneva are generated per day, of which around two are deemed usable by the retrieval process. The quality of an individual snapshot is affected by a number of factors, such as the zenith angle, cloudiness, or sensor errors (Riffler et al., 2015; Kilpatrick et al., 2001). In Lieberherr and Wunderle (2018), a system of assigning a quality flag (QF) to the different satellite measurement conditions was developed. An analysis based on in situ lake data provided an estimate of the uncertainties and biases, ranging from 1.3 °C for QF 6 to 1.5 °C for QF 1. However, we believe that a more accurate uncertainty model can be established, and therefore, in Sect. 2.4.3, we detail an alternative approach based on machine learning that uses a history of model predictions and weather conditions to generate a bulk-to-skin estimate together with the associated uncertainty.

Data assimilation
Lakes, akin to the atmosphere, are highly volatile and sensitive systems. For hydrodynamic models, this means that small perturbations in model states can significantly alter the resulting trajectories. Due to the multiple sources of uncertainty present in numerical discretization models, most importantly the uncertainty in the forcing terms, such trajectory deviations are ultimately unavoidable. The remedy comes in the form of data assimilation (DA), a framework for providing trajectory corrections based on actual observations.
In discussing sources of uncertainty, a distinction should be made between systematic and random errors (Lahoz and Schneider, 2014). Systematic errors (or bias) arise due to the discrepancy between the model and the actual underlying physics it tries to represent. They can be reduced by certain techniques, such as parameter calibration, but typically cannot be eliminated entirely. Random errors, on the other hand, correspond to inherent uncertainties in the input (weather forcing and LSWT are quite noisy, for example). The DA framework we propose in this paper is designed to deal with these two types of error separately: a sampler to reduce systematic errors (parameter inference) and a particle filter to mitigate the stochasticity from the weather forcing.

Choice of DA model
A variety of DA algorithms have been proposed and deployed in operational settings, offering different performance-to-optimality tradeoffs. The two primary subfields are variational and sequential methods, with some allowing bias correction (Lahoz and Schneider, 2014). In the variational approach, an objective function describing the discrepancy between the model and the observations is minimized. The 4D-Var is the most popular form and offers the capability to transfer information from observed to unobserved regions for nonlinear models. Some of the drawbacks of this approach are the relatively large numerical costs (for both the model and its uncertainty), reduced flexibility, and the complex implementation of time-dependent parameters (Lahoz and Schneider, 2014; Baracchini et al., 2020b). Among the sequential methods, Kalman filters (KFs), which iteratively evolve a forecast together with error covariance matrices, provide an optimal approach for linear systems with Gaussian errors. As the true KF is expensive, due to the cost of computing the covariance matrix in model space, local model linearity is frequently assumed. For high-dimensional systems, however, this can result in significant errors in the state and covariance estimation.
Significant performance enhancement is enabled through ensemble methods, where a collection of model states M_j is propagated forward in time, and the resulting trajectories enable the estimation of the covariance. The trajectories differ as a result of varying stochasticity in the initial conditions or forcing terms. The ensemble KF (EnKF) is an efficient and highly popular sequential scheme, providing greater stability and an easier covariance estimation (Evensen, 1994). The EnKF has been successfully applied to hydrodynamic forecasting of Lake Geneva by Baracchini et al. (2020a), with a 54 % reduction in temperature error in comparison to an unassimilated model. In recent years, the local ensemble transform KF (LETKF) has been tested in a number of weather prediction frameworks with encouraging results (Gustafsson et al., 2018). The assimilation techniques introduced above have the limitation of assuming that uncertainties and model states are Gaussian and that the model is linear (Lahoz and Schneider, 2014). While this assumption is perfectly reasonable for many applications, non-Gaussian observational error and parameter distributions can be problematic for such methods, and these limitations become increasingly restrictive for higher-resolution models (Gustafsson et al., 2018; van Leeuwen et al., 2019). In our model, we employ a particle Markov chain Monte Carlo (MCMC) method, which is highly suitable for nonlinear problems (Andrieu et al., 2010; Šukys and Bacci, 2021). The MCMC algorithm is used to infer selected hydrodynamic model parameters (see Sect. 2.4.2), with an ensemble affine invariant sampler (EMCEE) providing the parameter acceptance/rejection criterion. A visualization of the process is shown in Fig. 2. The EMCEE sampler is particularly effective for poorly scaled distributions that become well conditioned under affine transformations and can be significantly faster than standard MCMC approaches on highly skewed distributions.
A more thorough discussion of the advantages and disadvantages is given in Goodman and Weare (2010). Here, we only provide a brief overview of the mechanism. The EMCEE algorithm initializes an ensemble of Markov chains (walkers), {X_i}, drawn from a prior probability distribution, π(x), and split into two subsets, S_1 and S_2. Their marginal likelihood is estimated as explained in the paragraph below. First, we update all the walkers X_j from S_1 using the stretch formula X_j,new = X_k + Z(X_j − X_k), where X_k is a randomly chosen walker from S_2, and Z is a scaling variable. Each proposed update X_j,new is accepted in accordance with a Metropolis acceptance probability. The next step is to update the S_2 walkers using the updated S_1 elements. We continue alternately evolving the walkers from S_1 and S_2 until a suitable convergence criterion is met.
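The stretch update described above can be sketched as follows. This is a minimal illustration, not the SPUX implementation: `log_prob` is a hypothetical user-supplied log-posterior function, and the stretch scale a = 2 is the customary default from Goodman and Weare (2010).

```python
import numpy as np

def stretch_move(walkers_s1, walkers_s2, log_prob, a=2.0, rng=None):
    """One stretch update of subset S1 against S2 (illustrative sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, dim = walkers_s1.shape
    updated = walkers_s1.copy()
    for j in range(n):
        # Draw the scaling variable Z from g(z) ∝ 1/sqrt(z) on [1/a, a].
        z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
        # Pick a random complementary walker X_k from S2.
        x_k = walkers_s2[rng.integers(len(walkers_s2))]
        proposal = x_k + z * (walkers_s1[j] - x_k)
        # Metropolis acceptance with the dimension-dependent z^(dim-1) factor.
        log_ratio = (dim - 1) * np.log(z) + log_prob(proposal) - log_prob(walkers_s1[j])
        if np.log(rng.random()) < log_ratio:
            updated[j] = proposal
    return updated
```

In a full sampler, this update would be applied to S_1 and S_2 in alternation until convergence.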
For each EMCEE parameter set, a particle filter (PF) is deployed to address the stochasticity of the weather predictions. The PF, implemented in Šukys and Bacci (2021), works as follows. For each parameter α_i^(k), we initialize m model states, M_j, and simulate until an observation is reached at time t = t_1 (see Fig. 2). At this point, the model simulations are paused, and all particles are resampled (bootstrapped) according to their observational likelihoods. Thus, certain model states will be deleted and replaced by some other state from a different trajectory. The models are propagated using this mechanism until all the data are assimilated. Such a resampling algorithm significantly increases the algorithmic and implementation complexity, due to the required destruction and replication of existing particles. However, it also provides an efficient way of sampling intermediate posterior model states. To the authors' best knowledge, this is the first application of such a filtering algorithm to a fully three-dimensional model. A particular benefit of this approach is that, with proper model parameters, the stochasticity from the atmosphere is sufficient to generate trajectories that manage to track the observational data. This is in contrast to the nonphysical correction vectors in many other DA schemes that are necessary to nudge trajectories toward the data. Aside from potentially causing instabilities, these latter approaches decrease the confidence in the fidelity of the underlying model, as the correction mechanism potentially also corrects a model deficiency. At the same time, as no model is perfect, the SPUX PF offers limited capability to handle the biases that the sampler does not eliminate. In addition, the strict nature of our PF (model states cannot be modified) significantly limits its performance, and thus it cannot be expected to outperform the abovementioned established alternatives.

Implementation using SPUX and design of numerical experiments
For data assimilation and particle filtering, we use the SPUX package (Šukys and Bacci, 2021), a modular framework for parallel Bayesian inference with a user-friendly programming interface. The 3D hydrodynamics package, MITgcm, was modified to accept a Secchi depth value for every simulation hour and was built as a shared library to enable interfacing with SPUX using the ctypes package. The ctypes approach, as opposed to launching the model as a subprocess, provided noticeably faster and more stable performance. During calibration, EMCEE was configured to run with 16 chains distributed over 8 parallel workers, with 10 particles per filter. This required a total of 89 parallel workers (note that nine of these workers are managers that assign tasks but do not run simulations themselves). The simulations were run at the Swiss National Supercomputing Center (CSCS) over a period of approximately 3 months (≈ 10.5 h to predict 11 months using the PF and ≈ 21 h for a full EMCEE sampler iteration).

Figure 2. A simplified visualization of the data assimilation framework that we use in this paper. The EMCEE sampler proposes 2n walkers (also called chains in SPUX), whose marginal likelihood is then evaluated using the particle filter (PF). The PF resamples trajectories via deletion/replication with respect to their dataset and observational error model.

As described in Sect. 2.2, we chose two hydrodynamic model parameters to infer. In both cases, we can establish an interval (a, b) that contains the optimal value, but we otherwise assume a uniform distribution, U, within it. The first parameter is the Smagorinsky harmonic viscosity coefficient C_smag, to which we assign a prior distribution U(2, 4), consistent with Griffies and Hallberg (2000). To the second parameter, the Dalton number C_D, we assign the prior U(0.045, 0.06), based on a preliminary sensitivity analysis. For Dalton values outside the prior, heat exchange with the atmosphere was not modeled satisfactorily. For example, the default MITgcm value C_D = 0.0346 results in a significant temperature underprediction during the summer months.
Observational data spanning 15 January–15 December 2019 are used for the calibration and data assimilation (DA) run. Attempts to calibrate the model parameters using a shorter timeframe generated posteriors that provided suboptimal performance for the whole year (see the Supplement for further discussion) and were therefore discarded. To analyze the effectiveness of the model predictions, a control run (CR) is made without filtering, using the parameters C_D = 0.045 and C_smag = 2, as the baseline.

Bulk-to-skin conversion using LSTM
As the AVHRR operates in the infrared portion of the spectrum, it effectively measures the skin temperature in the top few millimeters of the lake. In the bulk region immediately below this surface layer, the temperature can be significantly different due to a number of factors, such as wind and solar radiation (Wick et al., 1996;Alappattu et al., 2017). As the hydrodynamic model generates bulk temperature predictions, a bulk-to-skin function (forward operator) is necessary for the observation space error model. Existing estimates based on oceanographical studies (e.g., Alappattu et al., 2017) do not directly translate to lake research due to the differences in typical weather conditions such as frequent low wind conditions over lakes (Bouffard and Wüest, 2019). For lakes, an accurate bulk-to-skin parametrization would incorporate effects due to convectively driven surface turbulence (Wilson et al., 2013) and more accurate implementation of the air-water gas exchange, especially in the presence of surfactants (Bouffard and Wüest, 2019). While such a parametrization might be possible, it would be technically difficult to implement, and therefore, we attempt a different approach.
Instead, we implement a neural network using bi-directional long short-term memory (BiLSTM) blocks that uses a 27 h history of 18 input features to make a skin temperature prediction. In total, 16 of the features come from the means and respective spreads of the MeteoSwiss weather predictions (air temperature, cloud cover fraction, wind velocity, relative humidity, precipitation, and shortwave and longwave radiation). The last two features are the hydrodynamic model temperature predictions and the hour of the day. The model was trained using data from 2018 and 2020, with the bulk water temperature predictions extracted from the Meteolakes model (Baracchini et al., 2020b). Data from 2019 were held out from training for benchmarking purposes. A schematic of a BiLSTM block and the neural network is shown in Fig. 3. Input to the neural network is provided as a (27 time step, 18 channel) tensor. An initial, fully connected layer maps these 18 channels to 32 channels. The output is then sequentially passed through three separate BiLSTM cell blocks. Finally, the output of the last BiLSTM block is linearly mapped to a two-channel output, which corresponds to temporal predictions of the skin temperature and their predictive log variances. In our experiments, LSTMs were crucial for exploiting historical patterns in the input features when predicting the skin temperature. In a recent work, an extension of LSTMs to a spatially dependent model was presented by Stalder et al. (2021).
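An architecture of this shape might be sketched in PyTorch roughly as follows. This is a hedged reconstruction from the description above, not the authors' code: the class name, the dropout placement between blocks, and the hidden size of 32 per direction are assumptions.

```python
import torch
import torch.nn as nn

class SkinTempNet(nn.Module):
    """Sketch of a bulk-to-skin BiLSTM network (layer details assumed)."""

    def __init__(self, n_features=18, hidden=32, p_drop=0.3):
        super().__init__()
        self.embed = nn.Linear(n_features, hidden)   # 18 -> 32 channels
        # Three stacked BiLSTM blocks; each doubles the channel count (2 directions).
        self.lstms = nn.ModuleList([
            nn.LSTM(hidden if i == 0 else 2 * hidden, hidden,
                    bidirectional=True, batch_first=True)
            for i in range(3)
        ])
        self.drop = nn.Dropout(p_drop)  # kept active at test time for MC dropout
        self.head = nn.Linear(2 * hidden, 2)  # (skin temperature, log variance)

    def forward(self, x):  # x: (batch, 27, 18)
        h = self.embed(x)
        for lstm in self.lstms:
            h, _ = lstm(self.drop(h))
        return self.head(h)  # (batch, 27, 2)
```

A forward pass on a `(batch, 27, 18)` input tensor yields a `(batch, 27, 2)` output: per-hour skin temperature estimates and their predictive log variances.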
For the particle filter, the uncertainty quantification of the predicted skin temperature is also necessary. Accordingly, the BiLSTM blocks in our neural network randomly disable 30 % of the LSTM units to induce stochasticity. This allows our model to implement additional methods to quantify epistemic and aleatoric uncertainty (Kendall and Gal, 2017). The Monte Carlo dropout approximates predictions from an ensemble, which can be used to quantify the epistemic uncertainty. On the other hand, using the negative log likelihood of a normal distribution as the objective function in training allows the BiLSTM to also estimate the predicted variance, which can be used to quantify the aleatoric uncertainty. Specifically, for a given input of 18 feature observations over a 27 h history, the neural network predicts the corresponding 27 h skin temperature estimations, as well as their estimated logarithmic variances. During the optimization phase of the neural network, the negative log likelihood of a normal distribution from Kendall and Gal (2017) is used, as follows:

L(θ) = (1/N) Σ_{i=1}^{N} [ (1/2) exp(−s_i) (y_i − ŷ_i)² + (1/2) s_i ],

where θ are the neural network weights, y_i are the observations, and ŷ_i are the skin temperature estimations with corresponding logarithmic variance estimations s_i. At test time, we generate model predictions 19 times for each input, while keeping the dropouts within the BiLSTMs active, yielding 19 different prediction vectors, similar to an ensemble model. The mean of the 19 estimated skin surface temperatures is used as the mean estimate, while their variance quantifies the epistemic uncertainty. The aleatoric uncertainty is computed from the predicted variance estimates by mapping the predicted logarithmic variances into variances and taking their average. The total scalar variance is computed as the sum of the two variances. Accordingly, we construct a normal distribution with the computed total variance, centered at the mean skin temperature prediction, to be evaluated against the LSWT measurement.
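The loss and the variance decomposition described above can be sketched numerically; the function names here are illustrative, and `s` denotes the predicted log variance.

```python
import numpy as np

def gaussian_nll(y, y_hat, s):
    """Heteroscedastic Gaussian NLL of Kendall and Gal (2017); s = log variance."""
    return np.mean(0.5 * np.exp(-s) * (y - y_hat) ** 2 + 0.5 * s)

def combine_uncertainty(preds, log_vars):
    """Combine repeated MC-dropout passes (19 in this study) into a mean
    prediction and a total variance = epistemic + aleatoric.

    preds, log_vars: arrays of shape (n_passes, ...) from stochastic forward passes.
    """
    mean = preds.mean(axis=0)
    epistemic = preds.var(axis=0)               # spread across stochastic passes
    aleatoric = np.exp(log_vars).mean(axis=0)   # average predicted variance
    return mean, epistemic + aleatoric
```

The resulting normal distribution, N(mean, total variance), is then evaluated against each LSWT measurement in the particle filter.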

Results and discussion
In this section, we report on the inference results, with the initial focus on the model parameter inference. As the calibration mechanism operates on distributions, not scalar quantities, the process is more complicated than what is typically done. Therefore, this warrants a closer look at the posterior distributions and diagnostics. Then, we compare the results obtained using the best posterior parameter set, and finally, we evaluate the performance of the BiLSTM network as a predictor of skin temperature.

Hydrodynamic model calibration
The posterior distributions of the two hydrodynamic model parameters (Smagorinsky viscosity and Dalton number) were estimated in SPUX using the EMCEE sampler (see Fig. 2). For each of the 16 parameter sets, the EMCEE sampler updates a parameter when a better-performing one is found or when it is deemed to be stuck (no changes for 10 iterations). The PF is not guaranteed to choose the optimal global trajectory for a parameter set, given the computational constraints and the resetting of stuck chains, and therefore some uncertainty around the optimal parameter value is to be expected. In Fig. 4, we show the evolution of the Markov chain parameters in terms of their means and the 5th–95th percentiles. We can observe from the relative stationarity of the distributions that convergence to the true posterior distribution was likely achieved. We thus conclude that the optimal model parameters were localized with a sufficiently high degree of confidence. The inferred marginal posterior distributions are shown in Fig. 5 in orange, with the prior distributions shown in blue. The vertical red dashed lines indicate the best-found parameter set, with the values shown in the table to the right. Relative to their respective priors, the Dalton number is predicted with a relatively high degree of confidence, while the posterior for the Smagorinsky parameter is less well defined. This is to be expected, as the Dalton number has a stronger effect on model predictions and is therefore more sensitive. However, with more iterations, we expect that a more centered posterior for the Smagorinsky viscosity parameter would have been obtained.
We also consider the average redraw rate, i.e., the fraction of particles that form the basis for each sequential prediction, in Fig. 6. The survival rates dip significantly when remote sensing data are assimilated or when LéXPLORE data are available.

In situ data assimilation results
We present the results of the assimilated data predictions in this section, with the control run (CR) serving as the baseline for comparison. The data assimilation (DA) prediction was generated using the MAP values, which are given in Fig. 5 on the right. In addition, whenever possible, we compare the aggregate error metrics to the values given in Baracchini et al. (2020b) for their 2017 data (specifically, we do not use their 2019 data, which exhibit a noticeable drift in the deeper layers of the lake in comparison to the SHL2 observations). The performance of the model for the Buchillon, SHL2, and LéXPLORE in situ datasets is considered in this section. Figure 7 shows the temperature at the Buchillon station at depths of 1 m (left) and 35 m (right). The gray lines represent the measurements, the red line shows the CR, and the blue line is the calibrated and data-assimilated (DA) run. As is evident from the results at both depths, the CR already captures the seasonal variation and the high-frequency fluctuations quite accurately. The DA run improves the model performance in summer, where the CR underpredicts the near-surface temperature, as shown more clearly in the inset, which focuses on the predictions for July. For the entire dataset, the root mean square error (RMSE) decreases from 0.95 °C for the CR to 0.77 °C for the DA run. Overall, the improvement in RMSE and mean absolute error (MAE) across the various datasets is 4 %–15 % (summarized in Table 3). The main source of the performance improvement is the better resolution of the near-surface temperature in summer.
We now consider the vertical temperature column profiles from the SHL2 location, which provides monthly measurements at the deepest location in the lake. As both the CR and DA run predictions below 100 m are in agreement with the measurements, we focus on the upper portion of the column. In Fig. 8, we analyze the performance of the models for a few selected snapshots; the black dots represent measurements, the dashed red line is the CR, and the thick blue line is the DA run. The results show that the DA run tracks the observed temperature near the surface with greater accuracy, which also results in better modeling of the deeper-layer temperature. Below 60 m, the observations are followed quite accurately by both the CR and DA run, without any substantial difference between the models. Figure 9 compares the observed vertical column temperature at the LéXPLORE location to the DA run. The vast bulk of measurements for this location (both temperature and velocity) were obtained during two periods, as reflected in the figure. Data for the period 21 June-11 August are shown on the left, and the right side focuses on 25 October until 15 December. For simplicity, we refer to the respective time intervals as summer and autumn. Outside of those periods, the measurements were sparse and are therefore omitted from the comparison. The results show that the DA algorithm models the water column temperature quite accurately and, in particular, correctly reproduces the thermocline depth in the summer. Furthermore, the cooling cycle at the end of the year is captured quite well. Above the thermocline, the difference plots highlight that the DA run overpredicts the temperature in the summer months, which, as a result, generates a smaller and dissipating warm bias in autumn. The discrepancy could potentially be reconciled with a higher-resolution model or more accurate Secchi depth estimates.
In general, correcting temperature overprediction in the subsurface turbulent layer is a difficult problem exhibited in many studies (Soulignac et al., 2018; Ye et al., 2020), without a clear consensus on the underlying causes for each case.
Due to the coarse horizontal resolution of the mesh, the data assimilation algorithm primarily focused on tracking temperature. As a result, the acoustic Doppler current profiler (ADCP) data played only a secondary role in determining the trajectory of the model. In Fig. 10, we present the kinetic energy spectra computed from the LéXPLORE ADCP data (gray lines) and the DA run (blue lines) based on different datasets. Figure 10 (left), based on summer readings for the 15 m sensor, shows excellent agreement for the kinetic energy variability above the semi-diurnal (12 h) mode. For the same sensor location, model predictions underperform during the autumn (Fig. 10, right), indicating that the high-frequency internal wave modes are not being resolved by the hydrodynamic model. A potentially large contributing factor for the discrepancies is the relatively coarse spatial resolution used in the model.
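A kinetic energy spectrum of the kind shown in Fig. 10 can be estimated from a uniformly sampled horizontal velocity record at a fixed depth. A simple periodogram sketch (our illustration, not the analysis code of the study; it assumes detrended `u`, `v` series and a sampling interval `dt` in seconds):

```python
import numpy as np

def kinetic_energy_spectrum(u, v, dt):
    """Periodogram estimate of the horizontal kinetic energy spectrum.

    u, v: horizontal velocity components (m/s), uniformly sampled.
    dt:   sampling interval (s).
    Returns frequencies (Hz, zero mode excluded) and spectral density.
    """
    n = len(u)
    freqs = np.fft.rfftfreq(n, d=dt)
    # One-sided PSD of each component; KE spectrum = 0.5 * (PSD_u + PSD_v).
    psd_u = (np.abs(np.fft.rfft(u)) ** 2) * (2.0 * dt / n)
    psd_v = (np.abs(np.fft.rfft(v)) ** 2) * (2.0 * dt / n)
    return freqs[1:], 0.5 * (psd_u[1:] + psd_v[1:])
```

A record dominated by the semi-diurnal internal seiche would show a spectral peak near a 12 h period, which is the reference mode discussed above.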

LSWT assimilation using BiLSTM network
On average, an LSWT image provided 209 usable pixels and significantly affected particle survival rates. For example, in Fig. 6, most of the low survival rates are due to the AVHRR data, which is especially noticeable when LéXPLORE data are not available. In contrast to Baracchini et al. (2020b), where a highly selective criterion was applied, we use all the pixels that have an associated non-zero QF value. This allowed much more frequent remote sensing data assimilation, with 798 (of 2092 total) images usable for the data assimilation period. The use of LSWT enhanced the model predictions only modestly, with a 4 % reduction in the in situ RMSE. In Fig. 11, we show an example comparison between the observed LSWT (left), the hydrodynamic model bulk temperature (center), and the BiLSTM prediction (right) for a selected snapshot. The result shows that BiLSTM can predict the spatial variability and structure of LSWT images, although it also frequently generates an entirely different profile. In general, as these improvement results are not particularly informative, we instead focus on the overall performance of the assimilation model and the BiLSTM predictions.
The global training and performance of the BiLSTM model are summarized in Table 4. The training and testing used the means and spreads of the MeteoSwiss weather predictions in combination with the Meteolakes mean bulk temperature. The results show that the network significantly improves predictions of LSWT for the training set and achieves a 33 % reduction in RMSE for the test set. In the assimilated run, the bulk prediction difference is already small and only marginally worse than the BiLSTM test set performance. In fact, the BiLSTM network increases the RMSE by about 10 %, which is most likely attributable to the differences between the training data and the assimilation process. In general, as LSWT measurements carry significant uncertainty (1.3-1.5 °C RMSE), the analytic capability is limited by the lack of exact skin temperature measurements for 2019.
Surprisingly, the predictive capability of the BiLSTM seems to improve for the LSWT pixels with an associated QF in the range 2-5 (42 % of the net pixel count), with an RMSE of 1.92 °C versus 2.11 °C for the direct bulk comparison. For the highest-quality data (QF 6), however, BiLSTM does not improve the performance. In Fig. 12 (left), we analyze the performance of the BiLSTM (orange lines) against the hydrodynamic model bulk predictions (in blue) for the different QFs by plotting the mean RMSE with 10th-90th percentiles. The gray bar chart shows the total number of LSWT measurements for each QF level. Aside from QF 5, the bulk RMSE gradually increases for lower QFs, as expected. At the same time, the BiLSTM error is practically constant, with lower uncertainty, indicating the network's capability to predict lower-fidelity data. For QF 5, the discrepancy nearly doubles, indicating a significant issue with this LSWT data subset. In contrast to the relatively uniform spatial frequency of the other QFs over the lake, Fig. 12 (right) shows that most of the associated measurements occurred near the shore, where predictions are less accurate due to the resolution of the hydrodynamic model. However, as LSWT pixels with an associated QF 5 are extremely rare (Fig. 12, left), they do not significantly contribute to the overall result. To enable uncertainty quantification (UQ) for the PF, each individual bulk-to-skin prediction is also equipped with a normal uncertainty distribution, as described in Sect. 2.4.3. Since the network used the hour of the day and the weather prediction uncertainty as part of its training, we can expect some spatiotemporal variation in the predicted spreads. Therefore, we analyze those two factors here. In Fig. 13 (left), we show the hourly mean BiLSTM prediction difference with the LSWT data as a solid blue line and the 10th-90th percentiles of the BiLSTM UQ as shaded blue regions.
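The per-pixel normal uncertainty attached to each bulk-to-skin prediction can feed directly into the particle weights. A hypothetical sketch (names and structure are ours, not from SPUX) of the Gaussian log-likelihood of one particle's skin temperature predictions against the usable pixels of an LSWT image:

```python
import math

def lswt_log_likelihood(predicted, observed, sigma):
    """Gaussian log-likelihood of one particle against LSWT pixels.

    predicted: BiLSTM skin temperature estimates per usable pixel (degC).
    observed:  satellite LSWT values for the same pixels (degC).
    sigma:     per-pixel standard deviations from the BiLSTM UQ (degC).
    """
    ll = 0.0
    for p, o, s in zip(predicted, observed, sigma):
        ll += -0.5 * math.log(2.0 * math.pi * s * s) \
              - (p - o) ** 2 / (2.0 * s * s)
    return ll
```

Pixels with larger predicted spreads (e.g., nearshore) then penalize a mismatch less, which is the intended effect of the BiLSTM UQ on the resampling.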
The results suggest that the neural network UQ is capable of slightly better predicting the spreads for the different times of the day. In contrast, the default LSWT error model provides relatively uniform uncertainties across all times of the day, regardless of the MAE (Fig. 13, right). In general, however, the improvement is quite mild, potentially due to the low spatial resolution of the model. Finally, we consider the spatial pattern of BiLSTM predictions and, in particular, examine whether its UQ can predict regions of the lake with larger model discrepancies. In Fig. 14 (left), we present the mean BiLSTM prediction RMSE for the different pixel locations over the lake. As can be expected, the best performance is obtained for offshore pixels in the eastern portion of the lake (Grand Lac). The predictive capability of the network degrades nearshore and especially in the western portion (the Petit Lac). The BiLSTM uncertainty predictions follow a very similar pattern (Fig. 14, right), including the large increases in uncertainty near the northern shores. In part, this increase can most likely be attributed to the greater uncertainty in the atmospheric weather conditions near the land-water interface and to the fact that the majority of the observations for training BiLSTM come from offshore in the Grand Lac, which would likely create a bias in the model predictions.

Conclusions
We presented the SPUX-MITgcm framework, a novel approach to the calibration of a hydrodynamic model for a highly spatiotemporally heterogeneous observational dataset. The inference makes use of the ensemble affine invariant sampler (EMCEE) to infer the distribution of the model parameters coupled with a particle filter (PF) for stochasticity in atmospheric forcing. The PF relied on resampling existing trajectories, based on their observational likelihoods, to infer the most probable weather conditions over the lake. As a result, the PF generated physically realistic trajectories (at least with respect to the hydrodynamic model). In addition, to enable the proper assimilation of remotely sensed lake surface temperature, we developed a bi-directional long short-term memory network (BiLSTM) for estimating lake skin temperature based on a history of weather and bulk temperature predictions.
The particle filter provides a relatively small improvement to the model predictions (in contrast to other popular data assimilation schemes) but at no cost to the quality of the physical model. However, this approach requires a highly robust hydrodynamic model, as its corrective powers are limited. Despite the improvements, this approach is quite computationally costly, especially as a tool for inferring model parameters. In addition, as discussed in the Supplement, the sampler has significant difficulty with calibrating certain model parameters (although this issue could potentially be mitigated with a better error model). Therefore, we feel that a computationally cheaper method for parameter estimation (for example, an optimization algorithm instead of a sampler) might be the more productive approach. At the same time, an improved version of the particle filtering approach could provide a powerful option for operational forecasting models.
Code and data availability. The code used for these simulations, and an example portion of the data, are openly available. The SPUX source code is available at https://doi.org/10.5281/zenodo.5638313, the modified MITgcm repository can be found at https://doi.org/10.5281/zenodo.5634042 (Campin et al., 2021), and the repository for handling MITgcm runs is at https://doi.org/10.5281/zenodo.5637216. While we cannot openly publish the forcing data, a representative snapshot is available at https://renkulab.io/gitlab/artur.safin/datalakes-observational-data-snapshot (last access: 30 October 2021), and the 3D mean temperature and velocity predictions can be accessed at https://doi.org/10.5281/zenodo.5642898 (Safin et al., 2019). Finally, the reproducibility of the computational environment is enabled through RENKU (Swiss Data Science Center, 2021), an online platform for the storage, tracking, and replication of numerical codes, which enables users to launch the container directly from the RENKU site. The entire SPUX-MITgcm repository and the computational environment are available as a Docker container.