H2CM (v1.0): hybrid modeling of global water–carbon cycles constrained by atmospheric and land observations

Baghirov, Zavud; Reichstein, Markus; Kraft, Basil; Ahrens, Bernhard; Körner, Marco; Jung, Martin

doi:10.5194/gmd-19-4467-2026

Articles | Volume 19, issue 10

https://doi.org/10.5194/gmd-19-4467-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Model description paper | 27 May 2026

Final revised paper (published on 27 May 2026)
Preprint (discussion started on 11 Jul 2025)

Interactive discussion

Status: closed

CEC1:
'Comment on egusphere-2025-3123 - No compliance with the policy of the journal', Juan Antonio Añel, 28 Jul 2025

Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".

https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
First, you have not shared the input data used in your work, both for simulations and comparisons. It is necessary that you share such data to ensure the replicability of your work.
Also, you have not shared the full output of your simulations, but aggregated monthly data. You must share the full daily data resulting from your simulations.
Therefore, please publish the mentioned data in one of the appropriate repositories and reply to this comment with the relevant information (link and a permanent identifier for it (e.g. DOI)) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy. Also, please remember to include a modified 'Code and Data Availability' section in any potentially reviewed manuscript, containing the information of the new repositories.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in our journal.
Juan A. Añel

Geosci. Model Dev. Executive Editor

Citation: https://doi.org/10.5194/egusphere-2025-3123-CEC1
- AC1:
  'Reply on CEC1', Zavud Baghirov, 30 Jul 2025
  Dear Juan A. Añel,
  Thank you very much for pointing this out.
  Please find below the links to the relevant datasets:
  
  H2CM – Model inputs and targets (e.g., constraints):
  
  https://doi.org/10.5281/zenodo.16575309
  
  H2CM – Daily simulations (carbon and water cycle parameters):
  
  https://doi.org/10.5281/zenodo.16572166
  
  We will ensure that this data is properly referenced in the Code and Data Availability section of the potentially revised version.
  Best regards,
  
  Zavud Baghirov
  
  Citation: https://doi.org/10.5194/egusphere-2025-3123-AC1
  - CEC2: 'Reply on AC1', Juan Antonio Añel, 31 Jul 2025
    
    Dear authors,
    Many thanks for addressing this issue so quickly. We can consider now the current version of your manuscript in compliance with the policy of the journal.
    Juan A. Añel
    Geosci. Model Dev. Executive Editor
    
    Citation: https://doi.org/10.5194/egusphere-2025-3123-CEC2
RC1:
'Comment on egusphere-2025-3123', Anonymous Referee #1, 08 Aug 2025
Review of H2CM (v1.0): hybrid modeling of global water-carbon cycles constrained by atmospheric and land observations
The authors present a new global hybrid model (H2CM) that couples terrestrial water and carbon cycles by blending physically based equations with neural network components, constrained by multiple observational data streams. The study is timely and potentially significant, given growing interest in machine learning augmented Earth system models. The authors clearly describe the model design, data constraints, and evaluation. The integration of a hybrid hydrological model (H2MV) with a conceptual carbon cycle model is new. The results demonstrate strong performance, notably in capturing seasonal carbon flux patterns that some process models miss. I find the work scientifically interesting and largely well executed. However, several clarifications and improvements are recommended:
Scientific significance and novelty

The authors propose the first global hybrid model explicitly coupling water and carbon cycles with ML-guided parameters. This addresses a recognized gap: the integration of observational constraints on both hydrology and carbon is novel. The model’s ability to reveal patterns (e.g., precipitation-use efficiency, water-use efficiency) demonstrates added value beyond traditional models. The work thus represents a significant advance toward next-generation hybrid land-surface models. I suggest that the authors highlight more explicitly how H2CM differs from and advances prior approaches. Similarly, highlight that hybrid modeling is still “young and evolving” (l.47) and that most previous work was at the proof-of-concept stage, underscoring H2CM’s novelty. If there are any other related models (even sub-global studies), a brief comparison would strengthen the novelty claim.
Methodology and model design

The model architecture is generally well described. H2CM extends H2MV hydrology by adding a carbon cycle (Eqs. 1-4). Transpiration is computed from FAPAR, potential ET, and a parameter alpha_T (Eq. 1). GPP is linked to transpiration via a NN learned WUE and a CO2-fertilization term beta (Eq. 2). NPP uses a NN learned CUE (Eq. 3), and heterotrophic respiration (Rh) follows a Q10 function (Eq. 4) with a NN learned basal respiration rate Rb. The modeling choices are physically plausible, and the coupling (via WUE linking T and GPP) is reasonable. Table 2 clarifies how each neural network is guided by selecting meaningful inputs (e.g., WUE depends on soil moisture, VPD, radiation). This guided-NN strategy improves interpretability.
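The chain of equations summarised in this paragraph can be sketched as follows. This is a minimal sketch, not the authors' implementation: the purely multiplicative forms, the reference temperature `t_ref`, the function names, and all numeric values are assumptions made for illustration. In H2CM the quantities `alpha_t`, `wue`, `cue` and `rb` are described as neural-network outputs; they are replaced here by plain numbers.

```python
def transpiration(fapar, et_pot, alpha_t):
    # Eq. 1 (assumed multiplicative form): T from FAPAR, potential ET and alpha_T
    return alpha_t * fapar * et_pot


def gpp(t, wue, beta, co2):
    # Eq. 2 (assumed form): GPP linked to T via WUE, scaled linearly by CO2
    return t * wue * beta * co2


def npp(gpp_value, cue):
    # Eq. 3: NPP as the carbon-use-efficiency fraction of GPP
    return cue * gpp_value


def rh(rb, q10, t_air, t_ref=15.0):
    # Eq. 4: Q10 temperature response; t_ref is a hypothetical reference temperature
    return rb * q10 ** ((t_air - t_ref) / 10.0)


# invented inputs, units left unspecified on purpose
t = transpiration(fapar=0.6, et_pot=4.0, alpha_t=0.8)
g = gpp(t, wue=3.0, beta=0.0025, co2=400.0)
n = npp(g, cue=0.5)
r_cold = rh(rb=1.0, q10=1.24, t_air=15.0)
r_warm = rh(rb=1.0, q10=1.24, t_air=25.0)
```

By construction of the Q10 function, a warming of 10 degrees multiplies `rh` by exactly `q10`, which is why the learned value of about 1.24 discussed later in this report implies a weaker temperature response than the literature range of 1.4 to 2.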
Is there one NN for each output variable? (ll.60) Why was it better to use several models? Did you perform experiments using one model for several outputs with various inputs? Please be more detailed here and explain why your approach was best.

You name your target variables for the ML tasks model constraints – but indeed there are no constraints in the model. Your constraints are predicted target variables by a ML algorithm and controlled by the performance of the ML model. What about out-of-sample inputs? They are not constrained and depend on the generality of your model.

The Greek variables (e.g., alpha_T) were trained by NNs – but it is not clear how you trained these parameters. Which target variable was used? These parameters seem to be hidden variables in the NNs, not target variables. Please be more precise about your ML architecture and provide a detailed ML model description. How is alpha_T integrated in your NN?

Please also clarify the CO2 dependency using beta in Eq. (2) so readers can understand how fertilization enters the model.

Also, how was the WUE learned in the model? On which spatial and temporal resolution are these parameters learned? I feel I do not have enough information to fully understand your underlying ML architecture.

145ff.: Are you using time series or only single time steps as input for your LSTM? I assume you were using time steps as the latter would not make sense. But it is not clearly described, and the description is open to misunderstanding.

I understand that you used a simplified overview of your model architecture. But more detailed information about the network architecture of your NN components is still missing. It would be helpful to have another figure especially for these components as well – as your model relies on the hybrid approach. How many layers, what number of neurons, training epochs, learning rate, any dropout or weight decay were used? How are the different NNs connected?

You also mention a FCNN for data compression: What kind of architecture was used here? Was it an unsupervised approach?

I do not see any hyperparameter tuning in the manuscript. How were model hyperparameters chosen and/or validated?

In Tab. 3 WUE and CUE are defined as ratios – but in Tab. 2 these variables are defined as functions depending on multiple variables, trained by a NN. Please be more precise here on how the definitions are meant for your approach.

The results are well presented. I am missing a short paragraph on the evaluation of the several trained NNs, for example on the performance of the WUE, CUE, etc. prediction alone. To increase confidence in the performance of H2CM, a brief description of the performance of the sub-variables would be helpful.

The NNs are trained by MSE Loss (Eq. 5) averaged equally over all data constraints. This implies that all constraints (TWS, SWE, ET, runoff, FAPAR, GPP, NEE, etc.) are treated with the same priority, regardless of their units or uncertainties. The authors should comment on this. Might some constraints dominate the loss? Have the authors normalized each variable or adjusted for data uncertainty? Some acknowledgment of observational errors (and how they might affect the loss weighting) is appropriate.
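To make the concern about equal weighting concrete, the sketch below compares per-constraint MSEs computed on raw values with the same MSEs computed after dividing each constraint by the standard deviation of its observations. The constraint names are borrowed from the list above, but the numbers are invented, and whether H2CM normalises its targets in this way is not established by this comment.

```python
from statistics import mean, pstdev


def mse(sim, obs):
    # mean squared error between two equally long sequences
    return mean((s - o) ** 2 for s, o in zip(sim, obs))


def scaled(values, ref):
    # divide by the standard deviation of the reference (observed) series
    sd = pstdev(ref)
    return [v / sd for v in values]


# hypothetical constraints with very different magnitudes;
# each simulation is off by the same relative amount (10 %)
obs = {"tws": [100.0, 200.0, 300.0], "gpp": [1.0, 2.0, 3.0]}
sim = {k: [1.1 * v for v in vals] for k, vals in obs.items()}

raw = {k: mse(sim[k], obs[k]) for k in obs}
norm = {k: mse(scaled(sim[k], obs[k]), scaled(obs[k], obs[k])) for k in obs}

# share of the summed loss that comes from the large-magnitude constraint
raw_share_tws = raw["tws"] / sum(raw.values())
norm_share_tws = norm["tws"] / sum(norm.values())
```

In this toy setting the large-magnitude constraint accounts for nearly the whole unnormalised loss, while after scaling both constraints contribute equally, since their relative errors are identical.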

The 10-fold CV is spatial only. Thus, it is not clear how well the model would predict an unseen year (e.g., a future year). I encourage the authors to comment on this limitation. If possible, as a future step, holding out later years for test could provide insight into model stability under changing climate.
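The difference between spatial and temporal hold-out can be illustrated with a small sketch that assigns blocks of grid cells to folds. The 5x5 block size is taken from a question raised elsewhere in this discussion and is not confirmed here; the grid dimensions, the round-robin assignment and the function name `block_fold` are assumptions, not the authors' procedure.

```python
def block_fold(row, col, n_block_cols, block=5, n_folds=10):
    # index of the block containing the grid cell, then round-robin fold assignment
    block_id = (row // block) * n_block_cols + (col // block)
    return block_id % n_folds


n_rows, n_cols = 20, 40            # hypothetical small grid
n_block_cols = n_cols // 5         # number of 5-cell-wide blocks per row of blocks
folds = {(r, c): block_fold(r, c, n_block_cols)
         for r in range(n_rows) for c in range(n_cols)}

test_fold = 0
train_cells = [cell for cell, f in folds.items() if f != test_fold]
test_cells = [cell for cell, f in folds.items() if f == test_fold]
```

With a split of this kind every time step of a held-out cell is withheld together, so the test measures transfer to unseen locations but says nothing about unseen years.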

Did you consider using, e.g., physics-informed neural networks (PINNs) instead of a simple FCNN to better control and constrain the underlying physical processes?

Overall, the methodology is sound and described in good detail. Small clarifications and additional details (especially on the neural-network implementation) would improve reader understanding and reproducibility.
The authors treat global parameters, e.g., beta, as learnable. Section 3.1.5 shows the learned Q10 is about 1.24, which is lower than typical literature values (1.4–2). Similarly, the learned beta values greatly exceed observational estimates. The authors rightly note this discrepancy and attribute it to equifinality and insufficient constraints. Please briefly discuss the implications: e.g., a high beta means the model might overestimate CO2 sensitivity if used for future scenarios. Emphasize that these global parameters are effectively unconstrained by data and could be fixed based on independent knowledge.
Model evaluation and benchmarking

Correlation and RMSE are mentioned, but it would help to provide bias or error values in the text or supplementary tables. E.g., “small RMSE for NEE IAV” (l.236), but exact numbers or global bias would be useful. A table summarizing global or zonal RMSE and bias for GPP, NEE, etc., in comparison to benchmarks would complement the discussion.

Reproducibility and transparency

It would be helpful to have additional documentation (README, installation instructions) and example scripts/notebooks to run the model. Now, all daily outputs are shared.
Interpretation and discussion of results

The authors could strengthen the interpretation by commenting on potential future applications. E.g., since the model currently lacks an energy cycle (mentioned as future work), are there plans to incorporate dynamic vegetation or disturbances (aside from fire emissions)?
Minor stuff
104: Please write Transpiration T to introduce the variable.

Figure 4: Too small and the solid black background confuses. I suggest making clean figures on a white background. The title is also too small and does not match the explanations given in the caption. Please double check that your presented data fit the presented titles in the figure.

In Tab. 1 the meteorological forcing data are described. Please briefly explain why you chose this mixture of data sources.

100: You use the Greek letter for globally constant parameters. Does it include spatially and temporally constant?

As the various datasets span different periods, the manuscript should explicitly state the time period used for training/evaluation. Ensure it is clear how these are aligned.

The authors may note that dynamic vegetation changes are not included due to static land use input, though FAPAR input does implicitly capture some phenological variability.

Median and range of Q10 across folds is mentioned. It may be useful to similarly report the spread of prediction metrics across the 10 CV models. This would indicate robustness.

The conclusion asserts that H2CM “accurately reproduces the monthly patterns” and “global patterns” of GPP and NEE. While this is supported by the results, it may sound slightly overconfident given some known biases. Perhaps soften to “reproduces major features of the seasonal and spatial patterns…”

Overall, the writing is professional and detailed, with only minor edits needed for polish.

Recommendation
I recommend major revisions before acceptance based on the recommendations above. The reported revisions will strengthen the paper's clarity and reproducibility but do not undermine the core findings.
Citation: https://doi.org/10.5194/egusphere-2025-3123-RC1
- AC2: 'Reply on RC1', Zavud Baghirov, 14 Nov 2025
  
  Dear reviewer 1,
  Thank you for your helpful review and thoughtful comments. We appreciate your time and feedback. Please find the attached PDF, which contains our detailed responses to all points raised. The document includes replies to both reviewers, as several comments overlapped and were addressed together.
  Best regards,
  
  Zavud Baghirov on behalf of the authors
  
  Citation: https://doi.org/10.5194/egusphere-2025-3123-AC2
RC2:
'Comment on egusphere-2025-3123', Anonymous Referee #2, 11 Aug 2025
The manuscript "H2CM (v1.0): hybrid modelling of global water-carbon cycles..." by Baghirov et al. addresses a relevant and timely topic: the hybrid modelling of the land surface and terrestrial biosphere. It reports on the architecture, training and evaluation of a hybrid prototype. In principle, I consider the paper suitable for the journal.
I also have a substantial number of general and specific questions however that the current version leaves open. In my opinion, the manuscript would be much clearer and more useful if they are addressed in the general framing and writing.

General points
I find it hard to understand to what extent this model can actually be considered "hybrid". I hardly see any process-based components in the model description. There are equations 1-4, but they are highly simplistic and high-level multiplicative relationships, far simpler than the complexity of the machine learning components, or typical components of process-based land surface models.
Moreover, the model supposedly captures the "water cycle" and "carbon cycle". Besides the fact that cycles would include atmosphere and ocean (otherwise the cycle is not closed), the model does not seem to simulate any carbon pools - only fluxes. If this model is supposed to be a step toward hybrid land surface modelling (that’s how I understand the framing and motivation), what should be the approach to model differential equations where state variables have memory? How would one implement a similar model into an Earth system model, and what conclusions do the authors draw from their results to this end? What is it in the results that allows conclusions about the best approaches to such hybrid modelling?
It is also unclear to me how soil moisture is modelled. There is reference to another recent study on what is called H2MV (Baghirov et al., 2025). I had a look there, but it seems to follow a similar approach in the sense that the model’s mechanistic complexity and structure is rather simple, while model results seem to be mainly determined by the machine learning components.
Achieving a good match with observations with such a model is of course beneficial, but I wonder how well the model is able to extrapolate to different climates. For example, will it generate realistic trends when forced with data from the historical period over several decades, including the global warming trend? If not, why do we need a hybrid approach at all? To what extent do the process-based parts in the model contribute to the good performance? What makes H2CM better than Fluxcom-X-base in some cases – is it really the process-based part or is it a better machine learning approach or data? And whatever the answer is: Can the authors show this somehow? They say that a hybrid model is not a "black box" like ML models, so this may be possible? If the performance overall is largely determined by the data-driven parts (including the way different neural networks are combined), I wonder whether the framing of "hybrid modelling" is really helpful, in contrast to pure data-driven modelling with a specific architecture.

Regarding the general architecture of the model, Fig. 2 is helpful, but it is difficult for me to understand how the model is actually trained. The neural networks seem to generate inputs to what the authors call the "process-based water-carbon cycle model", which then generates observable variables. When the loss function is minimised during training, in what way is the process-based component used? Does it not need to backpropagate information somehow in order to feed back to the neural networks and let them learn? Also, how do the authors use information on observational errors, specifically where different datasets on the same quantity (the two atmospheric co2 inversions) are used at the same time?
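One common way to train hybrid models of this kind is end-to-end gradient descent, in which the process equation is differentiable and the chain rule carries the error from the observable back to the learned quantity. The sketch below shows this for a single multiplicative relation with a hand-written derivative. It illustrates the general idea under that assumption and is not a description of how H2CM is trained; the names `t`, `wue`, `gpp_obs` and `train_wue`, and the numeric values, are hypothetical.

```python
def train_wue(t, gpp_obs, wue=1.0, lr=0.01, steps=500):
    # fit a single upstream quantity (wue) through a fixed process equation
    for _ in range(steps):
        gpp_sim = t * wue                         # process equation (forward pass)
        dloss_dgpp = 2.0 * (gpp_sim - gpp_obs)    # derivative of the squared error
        dgpp_dwue = t                             # derivative of the process equation
        wue -= lr * dloss_dgpp * dgpp_dwue        # chain rule, then a gradient step
    return wue


wue_fit = train_wue(t=2.0, gpp_obs=6.0)
```

The process equation is never fitted itself; it only shapes how the mismatch in the observable is translated into an update of the upstream quantity. In a full model the scalar `wue` would be a neural-network output and the derivative would be obtained by automatic differentiation.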
All observational datasets seem to always be used at the same time to train the model? Some parameters seem to be overconstrained. Which training data is actually important? How are physical constraints regarded, e.g. the conservation of mass? And why do the authors only train on a subset of grid cells but not time points?
Lastly, Section 3 in general shows several metrics, variables and regions, and evaluates H2CM. The choices of what to show here felt somewhat arbitrary to me, for example Sect. 3.3 and also Fig. 5. Why pick these examples? What is the key message that these results support? It would help if the authors presented clear arguments and criteria, and connected the results in an argumentative way.

More detailed points
The authors say that H2CM is a "global" model. What does this mean? As far as I see, it is a local (grid cell specific) model without any spatial interactions, hence the domain and grid are arbitrary.

Use of vocabulary: Note that the term grid refers to the spatial structuring of all grid cells. A grid cell refers to one spatial point. The authors often use "grid" even where they actually mean grid cell.

There are some typos; I suggest the authors read carefully before the next submission. Example: line 65-66: "the the", "objectives" (omit s), "withhold" (withheld), line 152 "compress"(es), line 263: "in in", Fig. B8 caption: "Runoff" should be lower case.

Table 1: shortwave and longwave radiation seem to not be distinguished. But in practice, this will matter much for GPP and other fluxes. What is the underlying assumption here? Also, what is "short-term" versus "long-term" in the last two lines of the table? It could make sense to add a column showing the time period available for each dataset.

line 105 (Eq. 1): How is ETpot computed?

line 114-117, incl. Eq. 2: beta is supposed to capture the CO2 fertilisation effect, but it is just a constant, independent of CO2. The fertilisation effect is captured already by the linear dependence of GPP on CO2. What does this linear dependence imply when using the model for a transient situation with strongly increasing CO2? When considering all factors of Eq. 2, does the model generate a similar relationship as e.g. typical DGVMs?

line 140: make clearer what you mean with labels "dynamic (recurrent)" and "static (fully connected)". Even though it may not be possible to draw the true architecture in Fig. 2, it would help to show different (idealised) icons for the NNs where these NNs have different architecture. If the figure becomes too busy: I don’t think one actually needs to show global maps for all variables (which are too small to see results anyway). This figure is about the structure not the actual data values.

line 184: I did not understand what the authors mean with "blocks". Are blocks the samples of 5x5 connected grid cells that are selected for training?

line 187-189: It is not really clear to me why validation on left-out time periods should not be possible.

line 191 and elsewhere. The authors cite Baghirov et al., 2025, but four references like that are listed in the reference list.

line 196-197: Parameters theta and beta are adjusted – but how (see above)? How does training work involving the "process-based" model (whatever that is, also see above)?

line 199-201: What does it mean that the loss function is applied for each data constraint?! Isn’t there one loss function where all different variables contribute? Or several loss terms? Then how to decide how important each loss is? Additionally, I don’t understand why the Carboscope dataset is treated differently from all others.

line 204: perhaps briefly mention what a z-transformation is.

line 207: What is a "CV fold"?

line 208-209: If all input is z-transformed, that means that all means are zero and standard deviation is 1? How then can the model be calibrated to respond to the correct mean values? For instance, how would the model respond to input temperature data that is 2°C higher than observed? This question also relates to the generalisability question above, and the question how the model responds to climate trends.
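For readers unfamiliar with the term, a z-transformation subtracts a mean and divides by a standard deviation. The sketch below assumes, as is common practice, that both statistics are computed once from the training data and then kept fixed; whether H2CM does exactly this is not stated in this comment. Under that assumption a warmer input is not re-centred to zero but maps to a correspondingly larger standardised value. The temperature values are invented.

```python
from statistics import mean, pstdev

train_temp = [8.0, 10.0, 12.0, 14.0, 16.0]   # hypothetical training temperatures (degC)
mu, sigma = mean(train_temp), pstdev(train_temp)


def z(x):
    # uses the fixed training statistics; they are not recomputed for new inputs
    return (x - mu) / sigma


z_train = [z(x) for x in train_temp]
z_warm = [z(x + 2.0) for x in train_temp]     # same series, 2 degC warmer
shift = mean(z_warm) - mean(z_train)
```

The standardised training series has zero mean, while the warmed series is offset by `2.0 / sigma`, so the absolute level of the input remains visible to the network. How the network responds to values outside the training range is a separate question that the transformation itself does not answer.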

line 223 (Eq. 7): This seems to be monthly anomalies. I would then not call that "Interannual variability"! And: If IAV is actually monthly variability, what is then the "monthly" values shown in Fig. 3? What is the difference? Is "monthly" the absolute data including seasonality, and "IAV" are the monthly anomalies?
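The distinction drawn here can be written down in a few lines: a mean seasonal cycle (MSC) is the average of each calendar month across years, and the anomalies are what remains after subtracting it from the monthly series. This is a generic sketch with made-up numbers; that it matches Eq. 7 of the manuscript exactly is an assumption.

```python
def mean_seasonal_cycle(series, period=12):
    # average of each calendar month over all years in the series
    return [sum(series[m::period]) / len(series[m::period]) for m in range(period)]


def anomalies(series, period=12):
    # monthly values minus the mean seasonal cycle of the same calendar month
    msc = mean_seasonal_cycle(series, period)
    return [x - msc[i % period] for i, x in enumerate(series)]


# two hypothetical years: identical seasonal shape, second year uniformly higher by 2
year = [1.0, 2.0, 4.0, 7.0, 9.0, 10.0, 10.0, 9.0, 7.0, 4.0, 2.0, 1.0]
monthly = year + [v + 2.0 for v in year]

msc = mean_seasonal_cycle(monthly)
anom = anomalies(monthly)
```

In this example `monthly` carries the full seasonal swing, whereas `anom` contains only the year-to-year offset (minus one in the first year, plus one in the second), which is the sense in which "monthly" and anomaly-based metrics evaluate different things.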

Fig. 3: (i) Why does the monthly data have much larger error than the monthly anomalies (IAV), whereas the other metrics look very good? (ii) Please make vertical axis ranges identical where possible. (iii) There is a lot of empty space in the figure, e.g. between bars. (iv) I don’t understand the difference between the columns. The training data is always the same, and the authors evaluate different variables? Why then two columns for NEE? Does the training data differ? (v) What determines the range covered by the boxes? Maximum and minimum error from what distribution?

Fig. 4: (i) The grey colour makes it too hard to see the text. (ii) titles per panel or column would help. (iii) absolute GPP values are hard to compare between columns, perhaps add difference plots. (iv) What is meant by "members" in each case? Members from the 10 subsamples of grid cells when training H2CM? And in case of TRENDY are members the individual models? Does the map then show the median from all models at each grid cell, i.e. each grid cell comes from a different vegetation model?

Fig. 5: (i) too grey (see above). (ii) "emerging global patterns" in what data? The trained model I guess? (iii) What are "folds"? Is this figure meant to show how realistic H2CM output is? Then we would need to see observations as comparison. Or is this result meant to offer new insights into land-atmosphere physics? Then this should be a clearer part of the framing in the abstract, introduction and conclusions.

Sect 3.2: What is it that makes H2CM better than Fluxcom? Can the authors demonstrate this? What are the implications for hybrid land modelling in general?

line 407: "the information is available" – which information?

line 408: "the model’s process formulations permit it" – which process formulations? Is there evidence that they restrict the results in some way?

line 411-414: How is the spread of results evidence for equifinality? Doesn’t equifinality mean the opposite, i.e. that different parameters (different models) lead to the same result?

A1: (i) What is k1, k2...? Why "k"? Are these the samples used for training different versions of the model? Every block here is 5x5 grid cells? (ii) The testing set seems to be 1/11th of the data, i.e. not 10%. (iii) And what about the evaluation set mentioned in the text which should be another ~10%? (iv) What is a "fold"?

Figs. B1-B8: (i) MSC is the time mean over the entire period? (ii) Do time series show a spatial average? Over what region? (iii) Why not the same period in all figures? Due to limited training data? (iv) Which of these time series are actually from the identical model and should be physically consistent? Maybe these can be put into one figure with several panels. (v) What is TWS in Fig. B5? Total water storage? Why does it differ so much from GRACE? Because of the low resolution of GRACE? (vi) In Fig. B6, SWE is snow water equivalent?

Fig. B9: again, remove white space to condense figure. "Water cycle constraints" – sounds like different constraints are used here compared to the other applications? Which data was used here? This should become clearer.

Appendix C: Why is there a "prior" and a "posterior" parameter which sounds like Bayesian statistics language? The method described in the appendix rather seems to nudge a parameter toward a specific target value, instead of calibrating it after starting from an initial value.

Some more info on the parameter calibration method would help.

line 466-468: The fact that the posterior equals the prior could also imply that the nudging (loss term) is just too strong? Why is it evidence for an underdetermined problem? Would it make sense to put a factor 0<f<1 in the definition of the loss term? Then the final parameters could be different?
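The question can be made concrete with a one-parameter toy problem. Assume, purely for illustration, a loss of the form `(a*p - y)**2 + f*(p - p_prior)**2`, where `a` is the sensitivity of the simulated quantity to the parameter `p` and `f` is the weight of the nudging term. This functional form, the function name and all values are assumptions and may differ from the loss term described in Appendix C. Setting the derivative to zero gives a closed-form minimiser.

```python
def posterior(a, y, p_prior, f):
    # minimiser of (a*p - y)**2 + f*(p - p_prior)**2, valid when a*a + f > 0
    return (a * y + f * p_prior) / (a * a + f)


p_prior = 1.5

# data insensitive to the parameter (a = 0): the prior is returned for any f > 0
flat_strong = posterior(a=0.0, y=4.0, p_prior=p_prior, f=1.0)
flat_weak = posterior(a=0.0, y=4.0, p_prior=p_prior, f=0.01)

# data informative (a = 2, so the data alone would give p = 2): result depends on f
info_strong = posterior(a=2.0, y=4.0, p_prior=p_prior, f=100.0)
info_weak = posterior(a=2.0, y=4.0, p_prior=p_prior, f=0.01)
```

In this toy case the two readings can be told apart by varying `f`: a posterior that stays at the prior however small `f` becomes points to a loss that is insensitive to the parameter, whereas a posterior that moves toward the data-implied value as `f` decreases points to a nudging term that was simply too strong.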

D1: What are transcom regions?
Citation: https://doi.org/10.5194/egusphere-2025-3123-RC2
- AC3: 'Reply on RC2', Zavud Baghirov, 14 Nov 2025
  
  Dear reviewer 2,
  Thank you for your helpful review and thoughtful comments. We appreciate your time and feedback. Please find the attached PDF, which contains our detailed responses to all points raised. The document includes replies to both reviewers, as several comments overlapped and were addressed together.
  Best regards,
  
  Zavud Baghirov on behalf of the authors
  
  Citation: https://doi.org/10.5194/egusphere-2025-3123-AC3

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Zavud Baghirov on behalf of the Authors (31 Jan 2026) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (03 Feb 2026) by Christoph Müller

RR by Anonymous Referee #2 (13 Feb 2026)

Suggestions for revision or reasons for rejection

Baghirov et al. have made noticeable efforts to address the concerns raised by me (and reviewer 1). In general, the revised manuscript is clearer than before, and I understand better now what the authors have done. I appreciate the authors' professionalism in addressing most concerns to some extent instead of dodging questions. Having said that, I believe that the manuscript can benefit more from further improvements in the presentation of the results. None of my points question the foundations of the scientific approach or the interpretation of results, but I think that the changes would substantially enhance the study's readability and therefore its success.

General suggestions
Framing: I understand now that the goal is to generate a reanalysis of carbon and water fluxes, and not to build a new land model (Sect. 3.4.1, page 22; and authors' replies to my questions). This makes sense. However, the title, abstract, and many other parts throughout the paper are in contradiction with this framing (or can at least be misunderstood). The abstract has virtually not changed since the last version and still starts with "We present the ... H2CM – a global model..." which is not wrong, but can be misleading. At least I initially expected a numerical model with time stepping schemes etc here. It would help if the abstract would state explicitly that the aim is to generate a flux reanalysis, using a combination of ML models, constrained by four algebraic linear equations linking T, GPP and NPP, and a simple equation for heterotrophic respiration Rh.
I also still wonder what the model is really made for if it's not predictions. The model is trained on observations and it is shown that it can match these observations, but why is the "reanalysis" it offers better than what we already have in the observational datasets? My impression from the figures is mostly that H2CM follows the training data. I suspect that the argument is that one can recover certain variables, features or regions that are undersampled in observations, because the model is able to transfer its skill to other grid cells or features, as the authors show. This purpose of the model, and the evidence for its usefulness, should be worked out more clearly throughout the paper.

Structure of the sections:
I still find the structure confusing. Very high-level and general information is provided rather late in the paper, while a lot of details are provided first.
For instance, as mentioned above, the aim of the paper is explicitly explained only in Sect. 3.4.1 (page 22) – very late in the paper. Instead of providing such context after the results, it would help to be aware of the scope from the start.
Also, I was missing a clear overview scheme of the whole approach, until I saw Fig. A1. I suggest replacing Fig. 2 with the current Fig. A1. Showing geographical maps is not needed there.
I am also convinced that the methods section should not start with the datasets, but with a high-level overview of the model, referring to the aforementioned figure. The order that would be most accessible to me would be: general approach, process-based equations, training method and loss function, datasets.
It would also help to put a brief overview paragraph in the introduction which explains the structure of the paper.
The information in the new paragraphs on page 22 and page 27 (Appendix point H) is extremely helpful, but it should be provided much much earlier, before the results are presented.

Relevance of the chosen hybrid structure:
The authors argue that the structural choices are beneficial, e.g. for the transparency and even causality of the model. While I generally agree, it is not obvious that the results are practically better than what could have been achieved by one large neural network directly translating from environmental variables to ecosystem fluxes. For example, line 512-513 state that the model's skill emerges from the synergy of process-based and ML components, but this is largely a claim and not a result. Also, the authors state that NEE is better than in Fluxcom and TRENDY. Is it because of the structural constraint of Eq. 1-4? Or just because the machine learning models applied here allow a better fit? The authors should reflect on this a bit more.

Performance and Evaluation
- line 291-294: I don't understand the argument here. Why should a lower variance reduce correlations? It should be the opposite: variance affects the rmse, but does not affect the correlation. Also, what is meant with "long-term monthly anomaly"? I don't see any short vs long-term results in the figures.
- Fig. 3: The monthly absolute data seems to perform almost identically to the mean seasonal data. One could remove the MSC bars and just say so, but I have no strong opinion about that. The fact that the mean seasonal cycle is captured with correlations very close to 1 implies that the phase is well captured, I guess. The SDR is also close to 1, which to me implies that the amplitude is also well captured (at least for GPP and NEE (OCO-2)). However, the rmse for GPP is very large compared to the deseasoned anomalies, which perform much worse regarding correlation and SDR. This is something I do not understand and which may need explaining. Lines 302-305 say that high RMSE "reflects errors in reproducing both the amplitude and phase of the seasonal cycle" – but why are Pearson's r and SDR almost perfectly at 1 then? Could it be that the annual cycle is so huge compared to the monthly anomalies that even a tiny relative error produces a comparably large rmse? Is that plausible?
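The conjecture in the Fig. 3 comment can be checked numerically. The sketch below uses synthetic series with invented amplitudes: a seasonal cycle of amplitude 100 simulated 5 % too large, and anomalies of amplitude 1 simulated as half signal plus half unrelated pattern. SDR is taken here to mean the ratio of simulated to observed standard deviation, which is an assumption about the manuscript's definition.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    # Pearson correlation from population covariance and standard deviations
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))


def rmse(x, y):
    return math.sqrt(mean((a - b) ** 2 for a, b in zip(x, y)))


def sdr(sim, obs):
    # standard deviation ratio, simulated over observed (assumed definition)
    return pstdev(sim) / pstdev(obs)


# seasonal cycle: large amplitude, simulated 5 % too large
obs_msc = [100.0 * math.sin(2 * math.pi * m / 12) for m in range(12)]
sim_msc = [1.05 * v for v in obs_msc]

# anomalies: small amplitude, simulated as half signal plus half unrelated pattern
obs_anom = [1.0, -1.0] * 6
other = [1.0, 1.0, -1.0, -1.0] * 3
sim_anom = [0.5 * a + 0.5 * b for a, b in zip(obs_anom, other)]

msc_scores = (pearson(sim_msc, obs_msc), sdr(sim_msc, obs_msc), rmse(sim_msc, obs_msc))
anom_scores = (pearson(sim_anom, obs_anom), sdr(sim_anom, obs_anom), rmse(sim_anom, obs_anom))
```

In this synthetic case the seasonal cycle has a correlation of essentially 1 and an SDR of 1.05, yet its RMSE is about five times that of the anomalies, whose correlation and SDR are both near 0.71. RMSE therefore scales with the amplitude of the signal, while correlation and SDR do not. The same example bears on the comment about lines 291-294: multiplying a series by a constant changes its RMSE and SDR but leaves its correlation unchanged.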

Details
- Fig. 2 should be replaced by A1 which shows the same structure but much more clearly. What I still miss in the figure is a representation of the training process: Shouldn't information be passed backwards in order to optimise parameters? Currently, the Figure shows a one-way flow of information. Also, if one could indicate which steps happen on daily time steps and which on monthly steps (e.g. using colours of text of boxes), that would help further. Ideally, one could even refer to the datasets listed in Table 1 by colour coding.
- While Table 1 is great to get an overview of the used datasets, it can still become clearer which variable is generated on which time step, and on which resolution, in the model itself. Lines 85-87 really help already, but imply that most data is remapped to 1°, but some data is on 1/30°. How can the model work with mixed resolutions?
- I am still not sure about the precise meaning of "data constraints" in this paper. Do the authors mean to distinguish the data that is used to train the LSTMs and FC-NNs in Fig. A1 from the data that is comparable to the output from the process-based part of the model (NEE, ET, TWS, ...)? Then please say so.
- line 126 and elsewhere: "the process-based component". To my understanding, the only process-based parts are equations 1-4 (and perhaps something in the soil water balance module that is not explained here). Can the authors please be more specific and refer to the equation(s) in each case (here Eq. 1)?
- line 135: I am still confused about the way CO2 fertilisation is implemented. Due to the linear relationship without offset (zero-order term), GPP goes to 0 when CO2 does, and it doubles when CO2 doubles (all else being equal). This does not appear to be realistic, and the value of beta_CO2 changes nothing about that relationship. If I understand the authors correctly (line 138-141), they claim that the overall fertilisation is mediated by the effect of CO2 on alpha_WUE appearing in the same equation. But stomatal response to CO2 is a different effect, so the equation seems to be confusing two different mechanisms. Moreover, CO2 is not even an explicit input to alpha_WUE, so I don't follow the argument here.
- line 154: I doubt if "stateful" is the best word here.
- line 166 and line 179, similar problem: "fully connected" is unclear, and line 167 "a dynamic NN" is confusing since the architecture is static and since all NNs simulate dynamics. It would help to add a sentence somewhere that makes clear what "static" versus "dynamic" means and find a good and consistent terminology for the NNs that generate spatial fields versus the ones that generate dynamics in time (time series).
- line 197: What is meant with "some of these... are directly constrained"? Are any variables indirectly constrained, because they inherit the improvement from the constrained variables (observables)?
- line 211-214: Mention that the data folds are sets of different grid cells. One still has to kind of guess otherwise.
- Table 3: "emerging global patterns" is a strange and unhelpful title. There are no patterns in the table.
- line 256: "each batch". It is not explained anywhere what a batch is, and this word only appears twice.
- Sect. 2.3.2, the loss function. I still don't understand from the paper how different data constraints are weighted. If they were all on the same time and space scales, I get it that they are equal. But does the global mean of the CarboScope data affect the loss term with equal weight as a single grid cell from any other dataset (i.e. almost not at all), or with the weight of any other global dataset? How is this implemented?
- line 303: It should be "anomalies", not "anomaly".
- Fig. 4: Each colour bar applies to three figures but is squeezed into the side of one figure. I suggest placing the bars next to the figures.
- All other geographical maps: Same problem, please place colour bars next to the maps.
- line 373: "expectation" (singular)
- line 374-375: The authors state that they have used TRENDY model data to constrain H2CM. This is confusing since I don't see TRENDY mentioned as training data in Table 1. I then remembered that this is mentioned in line 103-107, but I did not understand what precisely the "soft constraint" is and how it is implemented. Isn't it a strength of H2CM to rely on observed data and structural relationships, and not DGVMs which are often quite biased and oversimplified?
- Sect. 3.2: I wonder if the authors are actually too fair to TRENDY DGVMs by using the model ensemble median, which is probably closer to observations than any randomly chosen model. Comparing this median to only one realisation of H2CM feels like an unfair comparison in favour of TRENDY. H2CM only has to be better than the average DGVM, I believe (and would also be faster).
- line 437: "explains most of the variation in NEE" compared to which reference? OCO-2?
- line 438: why is Fluxcom so bad here (R2=0.16)? And isn't H2CM trained on Fluxcom? Should it not inherit the bias?
- line 441-443 "accurately reproduces..." and so on: please refer to the figures where one can see this.
- line 458: "drier", not "more dry"
- Fig. 7: which year does it show? The black point on the map (Fig. 7a) is impossible to see, make it red.
- line 508: insert "the" before "study by Lee"
- line 518: Bayesian with capital B
- line 544: What is meant here with "initialization"? The model does not have a time-stepping scheme.
- line 548 + following + line 691 + elsewhere: "% 100 ppm -1" or even "15%100ppm-1" looks confusing.
- line 552: Could the fact that beta_CO2 is not identifiable be related to the linearity of Eq. 2 as discussed above? Perhaps any value can be chosen and is then corrected by training alpha_WUE.
- line 654: I know it is often used, but I never understand what "end-to-end" actually means. Please be more specific.
- all time series shown in the figures: "across 10 CV folds": does that mean that the time series show spatial averages, and are also averaged over 10 randomly sampled folds?
- Fig. 3, C9, D1: mark the value 1 (or 0, depending on the figure) with a horizontal line.
- Appendix E: This means that for each parameter, we obtain one value l, and add all l's to the loss L in Eq. 5?
- line 693: I don't understand what is meant by "may partly reflect a strong nudging term", please rephrase and/or explain.
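
The concern raised for line 135 and line 552 (a linear CO2 term without offset, and a beta_CO2 that training can compensate through alpha_WUE) can be illustrated with a toy function. This is a minimal sketch under a loudly stated assumption: `gpp_linear` is a hypothetical, purely multiplicative form invented for this illustration and is not the paper's Eq. 2; the argument names and numbers are likewise made up.

```python
def gpp_linear(alpha_wue, beta_co2, co2, transpiration):
    # Hypothetical multiplicative form (assumption, NOT the paper's Eq. 2):
    # GPP is linear in CO2 with no offset term.
    return alpha_wue * beta_co2 * co2 * transpiration

co2 = 400.0            # illustrative CO2 value
transpiration = 3.0    # illustrative transpiration value

# 1) Doubling CO2 doubles GPP, whatever value beta_co2 takes.
ratio_beta_a = gpp_linear(2.0, 0.5, 2 * co2, transpiration) / gpp_linear(2.0, 0.5, co2, transpiration)
ratio_beta_b = gpp_linear(2.0, 0.125, 2 * co2, transpiration) / gpp_linear(2.0, 0.125, co2, transpiration)

# 2) Only the product alpha_wue * beta_co2 matters, so different parameter
#    pairs with the same product give identical GPP (non-identifiability).
gpp_pair_1 = gpp_linear(2.0, 0.5, co2, transpiration)
gpp_pair_2 = gpp_linear(4.0, 0.25, co2, transpiration)

print(ratio_beta_a, ratio_beta_b)
print(gpp_pair_1, gpp_pair_2)
```

In such a form, fitting the output cannot separate beta_CO2 from alpha_WUE. Whether this applies to H2CM depends on the actual structure of Eq. 2 and on any additional constraints placed on the parameters, which this sketch does not represent.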

RR by Anonymous Referee #1 (02 Mar 2026)

ED: Publish subject to minor revisions (review by editor) (10 Mar 2026) by Christoph Müller

AR by Zavud Baghirov on behalf of the Authors (22 Apr 2026)

ED: Publish as is (22 Apr 2026) by Christoph Müller

AR by Zavud Baghirov on behalf of the Authors (04 May 2026)

Short summary

We introduce a new global model that links how water and carbon move through land ecosystems. By combining process knowledge with artificial intelligence that learns from observations, we model daily changes in vegetation, water and carbon cycle processes. This model outperforms both purely data-driven and traditional process models, especially in dry and tropical regions. This advance could improve current understanding of water–carbon cycle relationships.