Articles | Volume 19, issue 7
https://doi.org/10.5194/gmd-19-2657-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Validation strategies for deep learning-based groundwater level time series prediction using exogenous meteorological input features
Download
- Final revised paper (published on 07 Apr 2026)
- Preprint (discussion started on 08 Sep 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-3539', Anonymous Referee #1, 27 Sep 2025
- AC1: 'Reply on RC1', Fabienne Doll, 18 Nov 2025
- RC2: 'Comment on egusphere-2025-3539', Anonymous Referee #2, 04 Oct 2025
- AC2: 'Reply on RC2', Fabienne Doll, 18 Nov 2025
- RC3: 'Comment on egusphere-2025-3539', Anonymous Referee #3, 06 Oct 2025
- AC3: 'Reply on RC3', Fabienne Doll, 18 Nov 2025
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
AR by Fabienne Doll on behalf of the Authors (16 Dec 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (30 Dec 2025) by Dan Lu
RR by Anonymous Referee #3 (11 Jan 2026)
ED: Publish as is (27 Jan 2026) by Dan Lu
AR by Fabienne Doll on behalf of the Authors (10 Feb 2026)
Manuscript
Post-review adjustments
AA – Author's adjustment | EA – Editor approval
AA by Fabienne Doll on behalf of the Authors (10 Feb 2026)
Author's adjustment
Manuscript
EA: Adjustments approved (10 Feb 2026) by Dan Lu
0. This study aims to examine performance evaluation methods for ML-based groundwater-level prediction. Based on the abstract, the main contributions of the study appear to be twofold: (1) the use of time-lagged meteorological variables together with non-time-lagged groundwater data for groundwater-level prediction, and (2) the application of different performance evaluation strategies, including blocked 5-fold cross-validation (bl-CV), repeated out-of-sample validation (repOOS), and out-of-sample validation (OOS). Having read through the manuscript, I find this to be a commendable effort that provides a thoughtful and well-executed case study on groundwater-level prediction.
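To make the evaluation strategies referred to above concrete for readers, a minimal sketch of a blocked k-fold split is given below. This is an illustrative reconstruction, not the authors' code: it assumes bl-CV partitions the series into k contiguous, non-overlapping blocks, with each block used once for validation.

```python
def blocked_kfold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) lists for blocked k-fold CV.

    Each fold's validation set is one contiguous block of the time
    series; the remaining samples form the training set. Unlike
    shuffled k-fold CV, temporal ordering within blocks is preserved.
    """
    block = n_samples // k
    for i in range(k):
        start = i * block
        # The last block absorbs any remainder samples.
        stop = n_samples if i == k - 1 else start + block
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, val
```

With `n_samples=10, k=5`, the first fold validates on indices 0-1 and trains on 2-9, and so on through the series.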
1. However, my main concern—and hence the principal weakness of the study—is that, although the manuscript persuasively situates itself within the existing literature through comparisons and contrasts, its scope is constrained by the exclusive reliance on a 1-D CNN model. This raises the question of whether the conclusions regarding evaluation methods are fully justifiable, transferable, and robust across other ML/DL approaches. In addition, the absence of a benchmark comparison for the proposed approach further undermines the strength of the study.
2. I appreciate that the authors have already defined a clear research question for the current study. However, I would still encourage including more discussion of related research to better inform and engage GMD's broad and sophisticated readership. For example, it would be valuable to compare your findings with those of Shen et al. (2022), highlighting in what ways your results are consistent or divergent. In addition, while the manuscript presents a strong stance on the time-consecutive hypothesis, it would also be helpful to discuss the perspective proposed by Zheng et al. (2022), who emphasize the importance of distributional representativeness across train/validation/test sets and suggest that this consideration may reduce the need for k-fold cross-validation. Including such comparisons would strengthen the manuscript by situating your contribution more clearly within the broader context of current research.
3. Given that I do not have prior experience in groundwater-level modeling, I find that the opening paragraph of the Introduction does not clearly convey the nature of the problem. It remains unclear whether the study addresses point prediction, area-averaged prediction, or image-type prediction on structured or unstructured grids. In addition, it would be helpful to explain the common problem setups used in previous research along these lines, so that readers without domain expertise can better situate the present study.
4. At the end of the Introduction, please provide a clear definition of what is meant by stationary and non-stationary conditions in the context of this study. Since these concepts can depend on the choice of window size, it would be helpful if you could explicitly state how they are defined here.
5. In the Theory and Background section, you discuss each evaluation method individually. However, some of this information was already mentioned in the Introduction. While the content itself is sound, I recommend further polishing to reduce redundancy and improve delineation and readability.
6. Around line 110, the manuscript states: “…For time series prediction, random shuffling of the data is often considered problematic as it can break the temporal dependency of the data…”. I would ask the authors to clarify whether this claim is model-dependent. For instance, is it necessarily true for tree-based algorithms such as Random Forests, which do not rely on temporal (Markovian) state updates? While sequential models that rely on temporal dependencies (e.g., autoregressive or state-space models) may indeed be affected, models that only map input–output relationships may not experience the same issue. Could you clarify whether this limitation arises primarily from the choice of model, rather than from the evaluation method itself?
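To illustrate the distinction raised in this point, the sketch below (an assumption about the setup, not the authors' code) builds time-lagged samples. Because each sample's temporal context is embedded in its own feature vector, shuffling the resulting (X, y) pairs does not discard that context for a pure input–output mapper, whereas an autoregressive model consuming the raw sequence would be affected.

```python
def make_lagged_samples(series, n_lags):
    """Build (X, y) pairs where each input is the previous n_lags values
    and the target is the next value.

    The temporal history travels inside each feature vector, so
    shuffling these samples does not remove it for models that only
    learn an input-to-output mapping (e.g., Random Forests).
    """
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    return X, y
```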
7. Around line 175, you refer to the concept of weak stationarity. Could you please clarify what window size is being used to assess this property? Since the definition of weak stationarity can depend on the temporal window considered, this specification would help readers interpret the results correctly.
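To show why the window size matters for this point, a minimal diagnostic sketch is given below (illustrative only; the manuscript may use a formal test instead). Weak stationarity implies a roughly constant mean and variance, so rolling moments computed at one window size may look stable while a different window reveals drift.

```python
def rolling_moments(series, window):
    """Compute rolling mean and (population) variance over a sliding
    window.

    Roughly constant values across windows are consistent with weak
    stationarity *at this window size*; a different window can give a
    different verdict, which is why the chosen window should be stated.
    """
    means, variances = [], []
    for start in range(0, len(series) - window + 1):
        w = series[start:start + window]
        m = sum(w) / window
        means.append(m)
        variances.append(sum((x - m) ** 2 for x in w) / window)
    return means, variances
```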
8. For Section 3.3, I wonder whether the use of dropout would affect the results obtained under the different evaluation methods.
9. If I understand correctly, the authors use 80% of the in-set data for model development. Have you evaluated whether a smaller subset of this 80% could achieve comparable accuracy and robustness, and, if so, what the minimum percentage might be? Additionally, would reducing the total amount of data alter the study’s conclusions?
10. Figures 5–7 provide a reasonable and effective way of summarizing the results. That said, are there additional quantitative approaches that could be used to present the findings on spatial maps? Moreover, beyond the stationarity perspective, could further insights be derived in terms of predictive accuracy that would enrich the interpretation of the results?
11. The manuscript fixes the input meteorological sequence length at 52 weeks. Please clarify the basis for this choice. Have alternative horizons been tested? Should the optimal horizon be constant across sites, or might it vary with hydro-geo-climatic setting? If not constant, what insights can be derived from treating this as a site-specific (or region-specific) hyperparameter?
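To make the site-specific-hyperparameter suggestion concrete, a minimal selection loop is sketched below. This is purely illustrative: `score_fn` is a hypothetical callable returning a validation score (higher is better, e.g., NSE) for a given input-window length; running this per site would yield a site-specific horizon.

```python
def select_sequence_length(candidate_lengths, score_fn):
    """Return the input-window length with the best validation score.

    candidate_lengths: iterable of window lengths (e.g., [13, 26, 52]
        weeks); score_fn: hypothetical callable mapping a length to a
        validation score, higher being better. Applying this per site
        treats the horizon as a site-specific hyperparameter.
    """
    best_len, best_score = None, float("-inf")
    for n in candidate_lengths:
        s = score_fn(n)
        if s > best_score:
            best_len, best_score = n, s
    return best_len
```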
References:
Shen, H., Tolson, B.A. and Mai, J., 2022. Time to update the split‐sample approach in hydrological model calibration. Water Resources Research, 58(3), p.e2021WR031523.
Zheng, F., Chen, J., Maier, H.R. and Gupta, H., 2022. Achieving robust and transferable performance for conservation‐based models of dynamical physical systems. Water Resources Research, 58(5), p.e2021WR031818.