Articles | Volume 19, issue 5
https://doi.org/10.5194/gmd-19-1917-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Assessing seasonal climate predictability using a deep learning application: NN4CAST
Download
- Final revised paper (published on 06 Mar 2026)
- Supplement to the final revised paper
- Preprint (discussion started on 18 Jul 2025)
- Supplement to the preprint
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-3162', Anonymous Referee #1, 26 Aug 2025
- AC1: 'Reply on RC1', Víctor Galván Fraile, 20 Oct 2025
- RC2: 'Comment on egusphere-2025-3162', Anonymous Referee #2, 09 Sep 2025
- AC2: 'Reply on RC2', Víctor Galván Fraile, 20 Oct 2025
- RC3: 'Comment on egusphere-2025-3162', Anonymous Referee #3, 22 Sep 2025
- AC3: 'Reply on RC3', Víctor Galván Fraile, 20 Oct 2025
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
AR by Víctor Galván Fraile on behalf of the Authors (20 Oct 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (21 Oct 2025) by Di Tian
RR by Anonymous Referee #2 (21 Nov 2025)
RR by Anonymous Referee #4 (01 Jan 2026)
ED: Reconsider after major revisions (04 Jan 2026) by Di Tian
AR by Víctor Galván Fraile on behalf of the Authors (23 Jan 2026)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (23 Jan 2026) by Di Tian
RR by Anonymous Referee #4 (11 Feb 2026)
ED: Publish subject to minor revisions (review by editor) (12 Feb 2026) by Di Tian
AR by Víctor Galván Fraile on behalf of the Authors (18 Feb 2026)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (19 Feb 2026) by Di Tian
AR by Víctor Galván Fraile on behalf of the Authors (20 Feb 2026)
Manuscript
General Comments:
The authors provide a tool that can be used to study seasonal predictability with basic deep learning methods. The code library provides a pipeline to preprocess data, train the model, evaluate it, and calculate metrics and attributions, based on a user-defined namelist and input files. Although the model will not achieve state-of-the-art skill, it does have potential for mechanistic studies through explainable AI. However, I do not believe the manuscript in its current state effectively communicates this message.
1. The analysis of the teleconnection between DJF tropical Pacific SST and MAM tropical Atlantic SST, and the related evaluation of the model, is not valid, because the input predictor region includes parts of the western tropical Atlantic. Looking through the individual Integrated Gradients attribution samples on Zenodo, it is clear that the largest attributions most often fall in this area rather than in the tropical Pacific. This is also confirmed by correlating the area-averaged SST in the target WTNA or SMSCU region with the input SST field. The skill shown in Figure 2 is therefore unrealistically inflated: it results from the inclusion of the west Atlantic in the input fields rather than from the Pacific-Atlantic teleconnection, as stated in the text (lines 266-267).
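The diagnostic suggested above can be sketched as follows. This is a minimal illustration with synthetic arrays, not NN4CAST code; the variable names (`sst_input`, `target_series`) and the assumption that the target box overlaps the predictor domain are hypothetical.

```python
# Correlate the area-averaged SST of a target box (e.g. WTNA) with every
# grid point of the input predictor field, to check whether the apparent
# source of skill is the tropical Pacific or western-Atlantic grid points
# that already sit inside the predictor domain.
import numpy as np

rng = np.random.default_rng(0)
n_years, n_lat, n_lon = 80, 20, 40
sst_input = rng.standard_normal((n_years, n_lat, n_lon))  # synthetic field

# Pretend the target box overlaps part of the input field
target_series = sst_input[:, 5:10, 30:35].mean(axis=(1, 2))

# Pearson correlation of the target series with each input grid point
anom_t = target_series - target_series.mean()
anom_f = sst_input - sst_input.mean(axis=0)
corr = (anom_f * anom_t[:, None, None]).sum(axis=0) / (
    np.sqrt((anom_f**2).sum(axis=0)) * np.sqrt((anom_t**2).sum())
)

# Correlations inside the overlapping box dominate on average, even for
# pure noise, simply because those points enter the target average.
inside = corr[5:10, 30:35].mean()
print(inside)
```

If the real correlation map peaks inside the western-Atlantic portion of the predictor domain, that supports the interpretation that the skill in Figure 2 reflects target-region leakage rather than a Pacific-Atlantic teleconnection.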
2. The discussion surrounding XAI in Figure 3 is unconvincing. Although the model attribution plot (Fig 3c) shows more spatial variability than the simple regression (Fig 3e), this does not necessarily mean there is added value. The work would benefit from further exploring the physical mechanisms associated with the Integrated Gradients attribution. There is no clear connection between the spatial variance in Fig 3c and the citation of Wade et al. 2023 in the text. How much does the attribution pattern change with different initial seeds? What is the sample size? Only a ~100-year record is being used, with even fewer El Niños, so I am skeptical of the robustness of the model attribution. Have you tried calculating attribution plots composited on warm WTNA or SMSCU conditions, rather than on ENSO?
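The seed-robustness check asked for above could look like the following sketch. This is not the NN4CAST implementation: Integrated Gradients is approximated with a midpoint Riemann sum, and an ensemble of perturbed linear models stands in for networks trained from different seeds (for a linear model the exact attribution is `(x - baseline) * w`, which makes the sketch easy to verify).

```python
# Integrated Gradients: attribution_i = (x_i - b_i) * mean gradient of f
# along the straight path from baseline b to input x.
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=64):
    """Midpoint Riemann approximation of Integrated Gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)   # (steps, n_features)
    avg_grad = grad_f(path).mean(axis=0)                 # mean gradient on path
    return (x - baseline) * avg_grad

rng = np.random.default_rng(1)
w = rng.standard_normal(6)
x = rng.standard_normal(6)
baseline = np.zeros(6)

# Toy "ensemble": linear models f_k(x) = w_k . x with perturbed weights;
# the gradient of each is the constant w_k, so IG equals x * w_k exactly.
seeds_attr = []
for k in range(10):
    w_k = w + 0.1 * rng.standard_normal(6)
    attr = integrated_gradients(
        lambda p, wk=w_k: np.tile(wk, (len(p), 1)), x, baseline)
    seeds_attr.append(attr)

seeds_attr = np.array(seeds_attr)
spread = seeds_attr.std(axis=0)   # per-feature spread across the "seeds"
print(seeds_attr.mean(axis=0), spread)
```

Reporting the ensemble-mean attribution with its per-pixel spread (or a sign-agreement fraction) across training seeds would make the attribution maps in Fig 3c far more convincing.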
3. The analysis of European precipitation is useful for showing how the predictability varies between different periods. However, the regression analysis in Figure 6 is confusing, because the exact same regression could be performed with only observational data, which would be more faithful and would yield the same conclusion regarding ENSO and European precipitation. Figure 5 shows the model can reproduce some of the same trends as observations, but it does not reveal any insight not already available from observations alone.
As in the previous analysis, the model does not seem to be directly capturing a connection between ENSO and European precipitation: the individual attribution plots on Zenodo mostly show that the model considers SST anomalies in the extratropical Pacific and Atlantic Ocean important. It could be useful to look at the attribution plots for precipitation in skillful regions during 1942-1969. Perhaps a change in the background state (e.g. the extratropical jet) alters the propagation of the extratropical Rossby wave trains that affect European precipitation, and thus the predictability.
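The observation-only regression suggested above is straightforward to compute; the following sketch uses synthetic arrays in place of the observational datasets (names `nino34` and `precip` are illustrative).

```python
# Regress gridded European precipitation directly on an observed ENSO
# index (e.g. Nino 3.4), with no deep learning model in the loop.
import numpy as np

rng = np.random.default_rng(2)
n_years, n_lat, n_lon = 70, 12, 18
nino34 = rng.standard_normal(n_years)                 # DJF ENSO index
true_slope = rng.standard_normal((n_lat, n_lon))      # imposed regression map
precip = (true_slope * nino34[:, None, None]
          + 0.5 * rng.standard_normal((n_years, n_lat, n_lon)))

# Least-squares regression coefficient at each grid point:
#   slope(i, j) = cov(precip_ij, nino34) / var(nino34)
x = nino34 - nino34.mean()
y = precip - precip.mean(axis=0)
slope = (y * x[:, None, None]).sum(axis=0) / (x**2).sum()
```

If this purely observational map matches Figure 6, the model-based regression adds nothing; the interesting result would be where (and when) they differ.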
4. In the introduction it is stated that “The idea behind NN4CAST is to mitigate the risk of treating deep learning methods as “black boxes”, thereby enabling users to identify sources of predictability and assess the sensitivity of predictions to variations in the training period and/or to the predictor region.” (line 80). However, the current manuscript does not really analyze the sensitivity to the training period or predictor region.
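The kind of training-period sensitivity test the introduction promises could be as simple as the following sketch. It is illustrative only (a one-parameter linear fit stands in for the network, and the nonstationary synthetic record is hypothetical): refit on sliding windows and record how out-of-window skill varies.

```python
# Refit the same model on sliding training windows and track skill,
# to quantify sensitivity of predictions to the training period.
import numpy as np

rng = np.random.default_rng(3)
n_years = 100
predictor = rng.standard_normal(n_years)
# Nonstationary relationship: the predictor-target link weakens mid-record
years = np.arange(n_years)
link = np.where((years > 40) & (years < 70), 0.2, 1.0)
target = link * predictor + 0.3 * rng.standard_normal(n_years)

window = 30
skill = []
for start in range(n_years - window):
    tr = slice(start, start + window)
    # Fit a one-parameter linear "model" on the training window
    beta = np.polyfit(predictor[tr], target[tr], 1)[0]
    # Evaluate correlation skill on all years outside the window
    mask = np.ones(n_years, bool)
    mask[tr] = False
    pred = beta * predictor[mask]
    skill.append(np.corrcoef(pred, target[mask])[0, 1])
```

A curve of `skill` against window start date would directly show whether the claimed predictability is stable or period-dependent, which is what the quoted sentence from line 80 leads the reader to expect.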
Specific comments: