Articles | Volume 19, issue 7
https://doi.org/10.5194/gmd-19-2657-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Validation strategies for deep learning-based groundwater level time series prediction using exogenous meteorological input features
Download
- Final revised paper (published on 07 Apr 2026)
- Preprint (discussion started on 08 Sep 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-3539', Anonymous Referee #1, 27 Sep 2025
- AC1: 'Reply on RC1', Fabienne Doll, 18 Nov 2025
- RC2: 'Comment on egusphere-2025-3539', Anonymous Referee #2, 04 Oct 2025
- AC2: 'Reply on RC2', Fabienne Doll, 18 Nov 2025
- RC3: 'Comment on egusphere-2025-3539', Anonymous Referee #3, 06 Oct 2025
- AC3: 'Reply on RC3', Fabienne Doll, 18 Nov 2025
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
AR by Fabienne Doll on behalf of the Authors (16 Dec 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (30 Dec 2025) by Dan Lu
RR by Anonymous Referee #3 (11 Jan 2026)
ED: Publish as is (27 Jan 2026) by Dan Lu
AR by Fabienne Doll on behalf of the Authors (10 Feb 2026)
Manuscript
Post-review adjustments
AA – Author's adjustment | EA – Editor approval
AA by Fabienne Doll on behalf of the Authors (10 Feb 2026)
Author's adjustment
Manuscript
EA: Adjustments approved (10 Feb 2026) by Dan Lu
0. This study aims to examine performance evaluation methods for ML-based groundwater-level prediction. Based on the abstract, the main contributions of the study appear to be twofold: (1) the use of time-lagged meteorological variables together with non-time-lagged groundwater data for groundwater-level prediction, and (2) the application of different performance evaluation strategies, including blocked 5-fold cross-validation (bl-CV), repeated out-of-sample validation (repOOS), and out-of-sample validation (OOS). Having read through the manuscript, I find this to be a commendable effort that provides a thoughtful and well-executed case study on groundwater-level prediction.
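To make the evaluation strategies referred to above concrete for readers, a minimal sketch of a blocked k-fold split is given below. This is an illustrative reconstruction, not the authors' code: it assumes bl-CV partitions the series into k contiguous, non-overlapping blocks, with each block used once for validation.

```python
def blocked_kfold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) lists for blocked k-fold CV.

    Each fold's validation set is one contiguous block of the time
    series; the remaining samples form the training set. Unlike
    shuffled k-fold CV, temporal ordering within blocks is preserved.
    """
    block = n_samples // k
    for i in range(k):
        start = i * block
        # The last block absorbs any remainder samples.
        stop = n_samples if i == k - 1 else start + block
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, val
```

With `n_samples=10, k=5`, the first fold validates on indices 0-1 and trains on 2-9, and so on through the series.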
1. However, my main concern—and hence the principal weakness of the study—is that, although the manuscript persuasively situates itself within the existing literature through comparisons and contrasts, its scope is constrained by the exclusive reliance on a 1-D CNN model. This raises the question of whether the conclusions regarding evaluation methods are fully justifiable, transferable, and robust across other ML/DL approaches. In addition, the absence of a benchmark comparison for the proposed approach further undermines the strength of the study.
2. I appreciate that the authors have already defined a clear research question for the current study. However, I would still encourage including more discussion of related research to better inform and engage GMD's broad and sophisticated readership. For example, it would be valuable to compare your findings with those of Shen et al. (2022), highlighting in what ways your results are consistent or divergent. In addition, while the manuscript presents a strong stance on the time-consecutive hypothesis, it would also be helpful to discuss the perspective proposed by Zheng et al. (2022), who emphasize the importance of distributional representativeness across train/validation/test sets and suggest that this consideration may reduce the need for k-fold cross-validation. Including such comparisons would strengthen the manuscript by situating your contribution more clearly within the broader context of current research.
3. Given that I do not have prior experience in groundwater-level modeling, I find that the opening paragraph of the Introduction does not clearly convey the nature of the problem. It remains unclear whether the study addresses point prediction, area-averaged prediction, or image-type prediction on structured or unstructured grids. In addition, it would be helpful to explain the common problem setups used in previous research along these lines, so that readers without domain expertise can better situate the present study.
4. At the end of the Introduction, please provide a clear definition of what is meant by stationary and non-stationary conditions in the context of this study. Since these concepts can depend on the choice of window size, it would be helpful if you could explicitly state how they are defined here.
5. In the Theory and Background section, you discuss each evaluation method individually. However, some of this information was already mentioned in the Introduction. While the content itself is sound, I recommend further polishing to reduce redundancy and improve delineation and readability.
6. Around line 110, the manuscript states: “…For time series prediction, random shuffling of the data is often considered problematic as it can break the temporal dependency of the data…”. I would ask the authors to clarify whether this claim is model-dependent. For instance, is it necessarily true for tree-based algorithms such as Random Forests, which do not rely on temporal (Markovian) state updates? While sequential models that rely on temporal dependencies (e.g., autoregressive or state-space models) may indeed be affected, models that only map input–output relationships may not experience the same issue. Could you clarify whether this limitation arises primarily from the choice of model, rather than from the evaluation method itself?
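To illustrate the distinction raised in this point, the sketch below (an assumption about the setup, not the authors' code) builds time-lagged samples. Because each sample's temporal context is embedded in its own feature vector, shuffling the resulting (X, y) pairs does not discard that context for a pure input–output mapper, whereas an autoregressive model consuming the raw sequence would be affected.

```python
def make_lagged_samples(series, n_lags):
    """Build (X, y) pairs where each input is the previous n_lags values
    and the target is the next value.

    The temporal history travels inside each feature vector, so
    shuffling these samples does not remove it for models that only
    learn an input-to-output mapping (e.g., Random Forests).
    """
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    return X, y
```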
7. Around line 175, you refer to the concept of weak stationarity. Could you please clarify what window size is being used to assess this property? Since the definition of weak stationarity can depend on the temporal window considered, this specification would help readers interpret the results correctly.
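To show why the window size matters for this point, a minimal diagnostic sketch is given below (illustrative only; the manuscript may use a formal test instead). Weak stationarity implies a roughly constant mean and variance, so rolling moments computed at one window size may look stable while a different window reveals drift.

```python
def rolling_moments(series, window):
    """Compute rolling mean and (population) variance over a sliding
    window.

    Roughly constant values across windows are consistent with weak
    stationarity *at this window size*; a different window can give a
    different verdict, which is why the chosen window should be stated.
    """
    means, variances = [], []
    for start in range(0, len(series) - window + 1):
        w = series[start:start + window]
        m = sum(w) / window
        means.append(m)
        variances.append(sum((x - m) ** 2 for x in w) / window)
    return means, variances
```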
8. For Section 3.3, I wonder whether the use of dropout would affect the results obtained under the different evaluation methods.
9. If I understand correctly, the authors use 80% of the in-set data for model development. Have you evaluated whether a smaller subset of this 80% could achieve comparable accuracy and robustness, and, if so, what the minimum percentage might be? Additionally, would reducing the total amount of data alter the study’s conclusions?
10. Figures 5–7 provide a reasonable and effective way of summarizing the results. That said, are there additional quantitative approaches that could be used to present the findings on spatial maps? Moreover, beyond the stationarity perspective, could further insights be derived in terms of predictive accuracy that would enrich the interpretation of the results?
11. The manuscript fixes the input meteorological sequence length at 52 weeks. Please clarify the basis for this choice. Have alternative horizons been tested? Should the optimal horizon be constant across sites, or might it vary with hydro-geo-climatic setting? If not constant, what insights can be derived from treating this as a site-specific (or region-specific) hyperparameter?
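To make the site-specific-hyperparameter suggestion concrete, a minimal selection loop is sketched below. This is purely illustrative: `score_fn` is a hypothetical callable returning a validation score (higher is better, e.g., NSE) for a given input-window length; running this per site would yield a site-specific horizon.

```python
def select_sequence_length(candidate_lengths, score_fn):
    """Return the input-window length with the best validation score.

    candidate_lengths: iterable of window lengths (e.g., [13, 26, 52]
        weeks); score_fn: hypothetical callable mapping a length to a
        validation score, higher being better. Applying this per site
        treats the horizon as a site-specific hyperparameter.
    """
    best_len, best_score = None, float("-inf")
    for n in candidate_lengths:
        s = score_fn(n)
        if s > best_score:
            best_len, best_score = n, s
    return best_len
```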
References:
Shen, H., Tolson, B.A. and Mai, J., 2022. Time to update the split‐sample approach in hydrological model calibration. Water Resources Research, 58(3), p.e2021WR031523.
Zheng, F., Chen, J., Maier, H.R. and Gupta, H., 2022. Achieving robust and transferable performance for conservation‐based models of dynamical physical systems. Water Resources Research, 58(5), p.e2021WR031818.