Baghirov et al. have made noticeable efforts to address the concerns raised by me (and reviewer 1). In general, the revised manuscript is clearer than before, and I understand better now what the authors have done. I appreciate the authors' professionalism in addressing most concerns to some extent instead of dodging questions. Having said that, I believe that the manuscript can benefit more from further improvements in the presentation of the results. None of my points question the foundations of the scientific approach or the interpretation of results, but I think that the changes would substantially enhance the study's readability and therefore its success.
General suggestions
Framing: I understand now that the goal is to generate a reanalysis of carbon and water fluxes, and not to build a new land model (Sect. 3.4.1, page 22; and authors' replies to my questions). This makes sense. However, the title, abstract, and many other parts throughout the paper contradict this framing (or can at least be misunderstood). The abstract has hardly changed since the last version and still starts with "We present the ... H2CM – a global model...", which is not wrong, but can be misleading. At least I initially expected a numerical model with time-stepping schemes etc. here. It would help if the abstract stated explicitly that the aim is to generate a flux reanalysis, using a combination of ML models, constrained by four algebraic linear equations linking T, GPP and NPP, and a simple equation for heterotrophic respiration Rh.
I also still wonder what the model is really made for if it's not predictions. The model is trained on observations and it is shown that it can match these observations, but why is the "reanalysis" it offers better than what we already have in the observational datasets? My impression from the figures is mostly that H2CM follows the training data. I suspect that the argument is that one can recover certain variables, features or regions that are undersampled in observations, because the model is able to transfer its skill to other grid cells or features, as the authors show. This purpose of the model, and the evidence for its usefulness, should be worked out more clearly throughout the paper.
Structure of the sections:
I still find the structure confusing. Very high-level and general information is provided rather late in the paper, while a lot of details are provided first.
For instance, as mentioned above, the aim of the paper is explicitly explained only in Sect. 3.4.1 (page 22) – very late in the paper. Instead of providing such context after the results, it would help to be aware of the scope from the start.
Also, I was missing a clear overview scheme of the whole approach until I saw Fig. A1. I suggest replacing Fig. 2 with the current Fig. A1. Showing geographical maps is not needed there.
I am also convinced that the methods section should not start with the datasets, but with a high-level overview of the model, referring to the aforementioned figure. The order that would be most accessible to me would be: general approach, process-based equations, training method and loss function, datasets.
It would also help to put a brief overview paragraph in the introduction which explains the structure of the paper.
The information in the new paragraphs on page 22 and page 27 (Appendix point H) is extremely helpful, but it should be provided much earlier, before the results are presented.
Relevance of the chosen hybrid structure:
The authors argue that the structural choices are beneficial, e.g. for the transparency and even causality of the model. While I generally agree, it is not obvious that the results are practically better than what could have been achieved by one large neural network directly translating from environmental variables to ecosystem fluxes. For example, lines 512-513 state that the model's skill emerges from the synergy of process-based and ML components, but this is largely a claim and not a result. Also, the authors state that NEE is better than in Fluxcom and TRENDY. Is it because of the structural constraint of Eq. 1-4? Or just because the machine learning models applied here allow a better fit? The authors should reflect on this a bit more.
Performance and Evaluation
- line 291-294: I don't understand the argument here. Why should a lower variance reduce correlations? It should be the opposite: variance affects the RMSE, but does not affect the correlation. Also, what is meant by "long-term monthly anomaly"? I don't see any short- vs long-term results in the figures.
- Fig. 3: The monthly absolute data seems to perform almost identically to the mean seasonal data. One could remove the MSC bars and just say so, but I have no strong opinion about that. The fact that the mean seasonal cycle is captured with correlations very close to 1 implies that the phase is well captured, I guess. The SDR is also close to 1, which to me implies that the amplitude is also well captured (at least for GPP and NEE (OCO-2)). However, the RMSE for GPP is very large compared to the deseasoned anomalies, which perform much worse regarding correlation and SDR. This is something I do not understand and which may need explaining. Lines 302-305 say that high RMSE "reflects errors in reproducing both the amplitude and phase of the seasonal cycle", but why are Pearson's r and SDR almost perfectly at 1 then? Could it be that the annual cycle is so huge compared to the monthly anomalies that even a tiny relative error produces a comparably large RMSE? Is that plausible?
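Both questions above can be explored with a small synthetic experiment. The following is only a minimal sketch and not the authors' evaluation code: the series, the amplitudes, the 5 % amplitude error, and the helper names `pearson_r`, `sdr`, `rmse` and `deseason` are assumptions made for illustration, and SDR is taken here to mean std(simulated) / std(observed). It checks that rescaling a series changes the RMSE but not Pearson's r, and that a seasonal cycle much larger than the anomalies can give r and SDR close to 1 together with a larger absolute RMSE than that of the deseasoned anomalies.

```python
import numpy as np


def pearson_r(obs, sim):
    """Pearson correlation between two 1-D series."""
    return float(np.corrcoef(obs, sim)[0, 1])


def sdr(obs, sim):
    """Standard deviation ratio, assumed here to be std(sim) / std(obs)."""
    return float(np.std(sim) / np.std(obs))


def rmse(obs, sim):
    """Root-mean-square error of sim relative to obs."""
    return float(np.sqrt(np.mean((sim - obs) ** 2)))


def deseason(x):
    """Remove the mean seasonal cycle from a monthly series of whole years."""
    by_year = x.reshape(-1, 12)
    return (by_year - by_year.mean(axis=0)).ravel()


rng = np.random.default_rng(0)
months = np.arange(12 * 20)  # 20 synthetic years of monthly data

# Hypothetical "observed" flux: large seasonal cycle plus small anomalies.
seasonal = 10.0 * np.sin(2 * np.pi * months / 12)
anom_obs = 0.3 * rng.standard_normal(months.size)
obs = seasonal + anom_obs

# Hypothetical "simulated" flux: 5 % amplitude error in the seasonal cycle
# and anomalies that are only partly correlated with the observed ones.
anom_sim = 0.5 * anom_obs + 0.15 * rng.standard_normal(months.size)
sim = 1.05 * seasonal + anom_sim

# Metrics on the full monthly series.
r_full, sdr_full, rmse_full = pearson_r(obs, sim), sdr(obs, sim), rmse(obs, sim)

# Halving the variance of the simulation leaves r unchanged but not the RMSE.
r_scaled, rmse_scaled = pearson_r(obs, 0.5 * sim), rmse(obs, 0.5 * sim)

# Metrics on the deseasoned anomalies.
obs_anom, sim_anom = deseason(obs), deseason(sim)
r_anom = pearson_r(obs_anom, sim_anom)
sdr_anom = sdr(obs_anom, sim_anom)
rmse_anom = rmse(obs_anom, sim_anom)

print(f"full series: r={r_full:.3f} SDR={sdr_full:.3f} RMSE={rmse_full:.3f}")
print(f"scaled sim : r={r_scaled:.3f} RMSE={rmse_scaled:.3f}")
print(f"anomalies  : r={r_anom:.3f} SDR={sdr_anom:.3f} RMSE={rmse_anom:.3f}")
```

If the real data behave like this toy case, the large RMSE for monthly GPP would be an effect of the absolute size of the seasonal cycle rather than evidence of poor phase or amplitude, and reporting a normalised RMSE alongside the absolute one might make Fig. 3 easier to interpret.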
Details
- Fig. 2 should be replaced by A1, which shows the same structure but much more clearly. What I still miss in the figure is a representation of the training process: Shouldn't information be passed backwards in order to optimise parameters? Currently, the figure shows a one-way flow of information. Also, if one could indicate which steps happen on daily time steps and which on monthly steps (e.g. using colours of text or boxes), that would help further. Ideally, one could even refer to the datasets listed in Table 1 by colour coding.
- While Table 1 is great for getting an overview of the datasets used, it could still be made clearer which variable is generated on which time step, and at which resolution, in the model itself. Lines 85-87 really help already, but imply that most data is remapped to 1°, while some data is on 1/30°. How can the model work with mixed resolutions?
- I am still not sure about the precise meaning of "data constraints" in this paper. Do the authors mean to distinguish the data that is used to train the LSTMs and FC-NNs in Fig. A1 from the data that is comparable to the output from the process-based part of the model (NEE, ET, TWS, ...)? Then please say so.
- line 126 and elsewhere: "the process-based component". To my understanding, the only process-based parts are equations 1-4 (and perhaps something in the soil water balance module that is not explained here). Can the authors please be more specific and refer to the equation(s) in each case (here Eq. 1)?
- line 135: I am still confused about the way CO2 fertilisation is implemented. Due to the linear relationship without offset (zero-order term), GPP goes to 0 when CO2 does, and it doubles when CO2 doubles (all else being equal). This does not appear to be realistic, and the value of beta_CO2 changes nothing about that relationship. If I understand the authors correctly (line 138-141), they claim that the overall fertilisation is mediated by the effect of CO2 on alpha_WUE appearing in the same equation. But stomatal response to CO2 is a different effect, so the equation seems to be conflating two different mechanisms. Moreover, CO2 is not even an explicit input to alpha_WUE, so I don't follow the argument here.
- line 154: I doubt if "stateful" is the best word here.
- line 166 and line 179, similar problem: "fully connected" is unclear, and line 167 "a dynamic NN" is confusing since the architecture is static and since all NNs simulate dynamics. It would help to add a sentence somewhere that makes clear what "static" versus "dynamic" means and find a good and consistent terminology for the NNs that generate spatial fields versus the ones that generate dynamics in time (time series).
- line 197: What is meant by "some of these... are directly constrained"? Are any variables indirectly constrained, because they inherit the improvement from the constrained variables (observables)?
- line 211-214: Mention that the data folds are sets of different grid cells. One still has to kind of guess otherwise.
- Table 3: "emerging global patterns" is a strange and unhelpful title. There are no patterns in the table.
- line 256: "each batch". It is not explained anywhere what a batch is, and this word only appears twice.
- Sect. 2.3.2, the loss function. I still don't understand from the paper how different data constraints are weighted. If they were all on the same time and space scales, I get it that they are equal. But does the global mean of the CarboScope data affect the loss term with equal weight as a single grid cell from any other dataset (i.e. almost not at all), or with the weight of any other global dataset? How is this implemented?
- line 303: It should be "anomalies", not "anomaly".
- Fig. 4: Each colour bar applies to three figures but is squeezed into the side of one figure. I suggest placing the bars next to the figures.
- All other geographical maps: Same problem, please place colour bars next to the maps.
- line 373: "expectation" (singular)
- line 374-375: The authors state that they have used TRENDY model data to constrain H2CM. This is confusing since I don't see TRENDY mentioned as training data in Table 1. I then remembered that this is mentioned in line 103-107, but I did not understand what precisely the "soft constraint" is and how it is implemented. Isn't it a strength of H2CM to rely on observed data and structural relationships, and not DGVMs, which are often quite biased and oversimplified?
- Sect. 3.2: I wonder if the authors are actually too fair to TRENDY DGVMs by using the model ensemble median, which is probably closer to observations than any randomly chosen model. Comparing this median to only one realisation of H2CM feels like an unfair comparison in favour of TRENDY. H2CM only has to be better than the average DGVM, I believe (and would also be faster).
- line 437: "explains most of the variation in NEE" compared to which reference? OCO-2?
- line 438: why is Fluxcom so bad here (R2=0.16)? And isn't H2CM trained on Fluxcom? Should it not inherit the bias?
- line 441-443 "accurately reproduces..." and so on: please refer to the figures where one can see this.
- line 458: "drier", not "more dry"
- Fig. 7: Which year does it show? The black point on the map (Fig. 7a) is impossible to see; please make it red.
- line 508: insert "the" before "study by Lee"
- line 518: Bayesian with capital B
- line 544: What is meant here by "initialization"? The model does not have a time-stepping scheme.
- line 548 + following + line 691 + elsewhere: "% 100 ppm -1" or even "15%100ppm-1" looks confusing.
- line 552: Could the fact that beta_CO2 is not identifiable be related to the linearity of Eq. 2 as discussed above? Perhaps any value can be chosen and is then corrected by training alpha_WUE.
- line 654: I know it is often used, but I never understand what "end-to-end" actually means. Please be more specific.
- all time series shown in the figures: "across 10 CV folds": does that mean that the time series show spatial averages, and are also averaged over 10 randomly sampled folds?
- Fig. 3, C9, D1: mark the value 1 (or 0, depending on the figure) with a horizontal line.
- Appendix E: This means that for each parameter, we obtain one value l, and add all l's to the loss L in Eq. 5?
- line 693: I don't understand what is meant by "may partly reflect a strong nudging term", please rephrase and/or explain.
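The concern raised for line 135 and line 552 can be made concrete with a minimal sketch. It assumes, purely for illustration, that Eq. 2 has the multiplicative form GPP = alpha_WUE * beta_CO2 * CO2 * T; this form, the function `gpp` and all numerical values are hypothetical and are not taken from the manuscript, so the actual equation may differ. Under that assumption, GPP is proportional to CO2 whatever the value of beta_CO2, and any rescaling of beta_CO2 can be absorbed exactly by alpha_WUE, which would make beta_CO2 non-identifiable from flux data alone.

```python
import numpy as np


def gpp(alpha_wue, beta_co2, co2, transpiration):
    """Hypothetical multiplicative form, assumed for illustration only."""
    return alpha_wue * beta_co2 * co2 * transpiration


co2 = np.array([280.0, 400.0, 560.0])  # ppm, illustrative values
transpiration = 2.0                    # arbitrary units
alpha, beta = 0.01, 1.5                # arbitrary parameter values

base = gpp(alpha, beta, co2, transpiration)

# Without an offset term, doubling CO2 doubles GPP for any beta_CO2 ...
doubling_ratio = gpp(alpha, beta, 2.0 * co2, transpiration) / base

# ... and GPP vanishes when CO2 does.
gpp_at_zero_co2 = gpp(alpha, beta, 0.0, transpiration)

# Rescaling beta_CO2 by c is compensated exactly by dividing alpha_WUE by c,
# so both parameter sets produce the same GPP.
c = 3.0
rescaled = gpp(alpha / c, beta * c, co2, transpiration)
same_output = bool(np.allclose(base, rescaled))

print(doubling_ratio, gpp_at_zero_co2, same_output)
```

If the actual Eq. 2 shares this structure, a prior or an independent constraint on either alpha_WUE or beta_CO2 would be needed to separate the two, and it would help if the authors stated which one they rely on.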
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
First, you have not shared the input data used in your work, both for simulations and comparisons. It is necessary that you share such data to ensure the replicability of your work.
Also, you have not shared the full output of your simulations, but aggregated monthly data. You must share the full daily data resulting from your simulations.
Therefore, please publish the mentioned data in one of the appropriate repositories and reply to this comment with the relevant information (a link and a permanent identifier for it, e.g. a DOI) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy. Also, please remember to include a modified 'Code and Data Availability' section in any potentially reviewed manuscript, containing the information of the new repositories.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor