Prediction of algal blooms via data-driven machine learning models: an evaluation using data from a well-monitored mesotrophic lake

With increasing lake monitoring data, data-driven machine learning (ML) models might be able to capture the complex algal bloom dynamics that cannot be completely described in process-based (PB) models. We applied two ML models, the gradient boosting regressor (GBR) and the long short-term memory (LSTM) network, to predict algal blooms and seasonal changes in algal chlorophyll concentrations (Chl) in a mesotrophic lake. Three predictive workflows were tested: one based solely on available measurements, and the others applying a two-step approach that first estimates lake nutrients with limited observations and then predicts Chl using observed and pre-generated environmental factors. The third workflow was developed using hydrodynamic data derived from a PB model as additional training features in the two-step ML approach. The performance of the ML models was superior to that of a PB model in predicting nutrients and Chl. The hybrid model further improved the prediction of the timing and magnitude of algal blooms.


Section S1 Monitoring methods used at Lake Erken

A meteorological station on an island offshore from Uppsala University's Erken Laboratory provides measurements of wind speed, solar radiation, and air temperature. An automated monitoring system records water temperature profiles down to a depth of 15 m, with sensors placed at 0.5 m intervals. Water discharge is measured entering the lake from the largest inflow at Kristineholm and leaving at the outflow at Stensta (Fig. 1). These data have been further quality controlled and combined with data from other nearby meteorological stations to provide a long-term dataset suitable as input for model simulations (Moras et al., 2019). Since 1991, a consistent (1-2 week) monitoring program has collected integrated water samples from the epilimnion and hypolimnion during stratified conditions, or from the entire water column during isothermal conditions. Stream samples are collected from the main inflow at Kristineholm and from the outflow of the lake. All samples are analyzed by the Erken Laboratory for the major nutrient concentrations (e.g., NOX, NH4, PO4, Total P, Si), dissolved oxygen (O2), and Chl concentration. Water and nutrient loads input to the GOTM/SELMAPROTBAS model were calculated from the discharge and nutrient concentrations measured at Kristineholm (Fig. 1), which accounts for 50.7 % of the lake watershed. Inputs from the remaining watershed were estimated from the measured Kristineholm inputs, scaled by area to account for the remaining 49.3 % of the watershed area. Further details of the meteorological and hydrological data processing can be found in Moras et al. (2019) and Mesman et al. (2022).

Section S2 Hyperparameter settings in the LSTM

Essentially, the LSTM model defines a transition relationship for a hidden representation through an LSTM cell, which combines the input features at each time step with the information inherited from previous time steps. We tried different combinations of numbers of layers and neurons (1-3 layers, 20-200 neurons): larger numbers of layers and neurons did not noticeably improve the results but greatly increased the computational time, while smaller numbers gave worse results. Eventually, we used 3 hidden LSTM layers with 100 neurons in each layer, each followed by a dropout layer with a 0.01-0.03 dropout rate to regularize the network. The batch size and number of epochs were set to 10 and 100, respectively. Thus, the training samples are divided into 10 batches, and the internal model parameters are updated after working through each batch; the algorithm works through the entire training dataset 100 times (epochs). The 'MinMaxScaler' was used to pre-process the data for generalization purposes, and the mean absolute error was used as the loss function. The model was set to carry the memory of the previous 7 days (time steps = 7).
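The data preparation implied above (min-max scaling plus 7-day input windows) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are ours, and the resulting `(samples, 7, features)` arrays are what would feed the 3-layer, 100-neuron LSTM described above (dropout 0.01-0.03, batch size 10, 100 epochs, MAE loss).

```python
import numpy as np

def minmax_scale(x):
    # Scale each feature column to [0, 1], as sklearn's MinMaxScaler does.
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)

def make_sequences(features, target, time_steps=7):
    # Build (samples, time_steps, n_features) windows so the LSTM
    # carries the memory of the previous `time_steps` days.
    X, y = [], []
    for i in range(time_steps, len(features)):
        X.append(features[i - time_steps:i])
        y.append(target[i])
    return np.array(X), np.array(y)
```

With 20 daily records and 2 features, `make_sequences(..., time_steps=7)` yields 13 training samples, each a 7-day window of scaled features paired with the Chl value on the following day.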

Section S3 Calculations of hydrodynamic features
The mixing layer depth (ze) was computed from the GOTM-simulated vertical eddy diffusivity (Kz) profiles and was defined as the first depth, from the lake surface, at which Kz fell below a predefined threshold value (Wilson et al., 2020). It can be described as

ze = min{ zi : Kz,i < Kz,threshold },

where zi and Kz,i are the depth from the lake surface and the eddy diffusivity, respectively, in the i-th layer of the model. The threshold value Kz,threshold was set to 5×10⁻⁵ m² s⁻¹, based on the value described in Wüest and Lorke (2009) and Lin et al. (2021).
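A minimal sketch of this calculation, assuming the Kz profile is given as paired arrays of layer depths and diffusivities ordered from the surface downward (function and variable names are illustrative):

```python
KZ_THRESHOLD = 5e-5  # m^2 s^-1, after Wüest and Lorke (2009)

def mixing_layer_depth(depths, kz):
    # First depth from the surface where the eddy diffusivity falls
    # below the threshold; if it never does, the column is fully mixed
    # and the deepest layer is returned.
    for z, k in zip(depths, kz):
        if k < KZ_THRESHOLD:
            return z
    return depths[-1]
```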
Unlike the dynamically varying mixing layer depth derived from the modelled Kz profiles, the seasonal thermocline depth was estimated using Lake Analyzer (Read et al., 2011) based on the modelled temperature profile. Movement of the thermocline can allow nutrients released from the sediment to enter the upper water column, leading to nutrient enrichment. It can also lead to resuspension of cells or dormant forms of cyanobacteria into the water column, encouraging bloom development (Reichwaldt and Ghadouani, 2012).
The Wedderburn number Wn, introduced by Thompson and Imberger (1980), is used to estimate the likelihood of upwelling occurring in the lake. It is written as

Wn = g′ he² / (u*² Ls),

where g′ = g Δρ/ρh is the reduced gravity due to the change in water density Δρ between the hypolimnion (ρh) and epilimnion (ρe), he is the epilimnion depth, Ls is the lake fetch length (2700 m for Lake Erken), and u* is the wind-stress-induced water friction velocity, defined as

u* = (τw / ρw)^(1/2),

where τw is the wind shear stress (N m⁻²) on the water surface, computed by τw = CD ρair U². U is the wind speed (m s⁻¹) measured at 10 m above the water surface, and CD is the drag coefficient.

Table S1. Confusion matrix and the metrics derived from it.
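The chain of definitions above can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' implementation: the drag coefficient and air density values here (CD = 1.3×10⁻³, ρair = 1.2 kg m⁻³) are typical textbook values we assume for demonstration, since the paper's own CD expression was not recovered.

```python
import math

G = 9.81         # gravitational acceleration, m s^-2
RHO_AIR = 1.2    # air density, kg m^-3 (assumed)
C_D = 1.3e-3     # drag coefficient (assumed typical value)
L_S = 2700.0     # Lake Erken fetch length, m

def wedderburn(rho_e, rho_h, h_e, wind_speed, rho_w=1000.0):
    # Reduced gravity from the epilimnion/hypolimnion density difference.
    g_prime = G * (rho_h - rho_e) / rho_h
    # Wind shear stress on the surface and the resulting friction velocity.
    tau_w = C_D * RHO_AIR * wind_speed**2
    u_star = math.sqrt(tau_w / rho_w)
    # Wedderburn number: small values favour upwelling at the upwind end.
    return g_prime * h_e**2 / (u_star**2 * L_S)
```

For example, with a 2 kg m⁻³ density difference, a 5 m epilimnion, and a 5 m s⁻¹ wind, Wn is of order a few, i.e. tilting of the thermocline but no full upwelling; doubling the wind speed quarters Wn.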

Figure S1. (a, b) Ice break-up dates and ice-cover durations since 1975 (part of the data from Weyhenmeyer et al., 1999). The timing of the spring bloom in Lake Erken defined by (c, d) the maximum Chl peak and (e, f) the steepest daily change in Chl.

Figure S3. Comparison of Chl concentrations in each month over 2004-2018; the red and blue dots represent the data from 2019 and 2020, respectively.

Figure S4. Feature importance based on the 'feature_importances_' attribute of the GBR model in scenario 1.

Figure S5. (a) Timeseries of observed and predicted Chl from the GBR and LSTM models in workflow 2; (b) scatter plots of observations vs. the GBR and LSTM models. Panel (c) shows the observed and predicted algal bloom onsets in 2017-2020 using the same color coding as the previous panels. Results from the PB model simulation in Mesman et al. (2022) are also shown in (a) and (c).

Figure S6. Timeseries of six observed and predicted nutrients, (a) NOX, (b) PO4, (c) O2, (d) Total P, (e) NH4, and (f) Si, at the surface (-3 m) from the GBR and LSTM models in workflows 2 (W2) and 3 (W3), and from the PB model. The Si simulations in the PB model had not been optimized, so these are not shown in the figure. The RMSEs (mmol/m³) of the six nutrient variables are given below.

Figure S7. Timeseries of observed and predicted Chl from the GBR (left panels) and LSTM (right panels) models based on 7-day, 21-day, and 35-day sample intervals, via the leave-four-year-out shuffling-year test. Each row is a different 4-year period.

Figure S8. Timeseries of observed and predicted Chl from the GBR (left panels) and LSTM (right panels) models based on 7-day, 21-day, and 35-day sample intervals, via the leave-four-year-out shuffling-year test (same as Figure S7, but with a different x-axis).

Figure S9. Timeseries of observed and predicted Chl from the GBR (left panels) and LSTM (right panels) models based on 7-day, 21-day, and 35-day sample intervals, via the leave-four-year-out shuffling-year test (same as Figure S7, but with a different x-axis).

Figure S10. Timeseries of the observed surface water temperature and the temperature difference between surface water (averaged over the upper 3 m) and bottom water (15 m).

Table S2
Comparison of the ML models' performance based on RMSE, MAE, and R² for the training dataset (via 5-fold cross-validation) and the testing dataset.

Table S3
Coefficients of variation of the evaluation metrics when shuffling the training years to test 2019-2020.

Table S4
Coefficients of variation of the MAEs, RMSEs, and TPRs in the shuffled-year data-sparsity test.