I was not one of the original reviewers of the MS but I have reviewed their comments, the authors' replies and the amended MS.
I think the MS would benefit from some English copy editing, which should be provided by the journal. The paper is a hard slog to get through, in part because of the writing style.
Generally the paper could be shortened, as it contains a fair amount of discussion in the results section and thereby some overlap/repetition with the actual discussion section. Ideally a Results and Discussion section would be created along with a separate Conclusions, although I would understand if appetite was low for such a large rewrite.
As you can see from my comments below, I am quite critical of the variables chosen to assess the models, as well as the use of single reference (observation-based) datasets for comparison. I would usually request that more datasets be incorporated, as I view the use of a single dataset, when others are readily available, as insufficient to truly evaluate a model. However, in this MS the model results are sufficiently poor that additional datasets would not make much difference. I don't mean that in a disparaging sense; most coupled models perform relatively poorly against reference datasets. Still, I suggest that future papers not rely upon single reference datasets unless only one really exists.
You needn't cite this or anything, but I would point out that land surface models typically do poorly for LAI, e.g. see Seiler et al. 2022, so the model's poor LAI is by no means unique.
line 20 - what does 'weaker' FAPAR mean? Lower?
Why compare against NPP rather than the more directly observable GPP? GPP has at least four independent global products that could be compared to. NPP 'observations' on a global scale are always going to be a fairly derived/modelled product. Similar question for WUE, why not consider the ET and GPP themselves rather than a variable derived from them?
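One possible reason to prefer the component fluxes, sketched here with invented numbers (nothing below is taken from the MS or from any dataset): if WUE is taken as GPP divided by ET, biases of the same sign in both fluxes can cancel in the ratio, so a good WUE score need not mean either flux is well simulated. The function name and the units are my own assumptions for illustration.

```python
def wue(gpp, et):
    """Water use efficiency, assumed here to be GPP divided by ET."""
    return gpp / et

# Invented illustrative values, not from the manuscript or any product.
ref_gpp, ref_et = 1200.0, 600.0                       # reference fluxes
model_gpp, model_et = ref_gpp * 1.25, ref_et * 1.25   # both 25% too high

# Relative bias of each flux and of the derived ratio.
gpp_bias = model_gpp / ref_gpp - 1.0
et_bias = model_et / ref_et - 1.0
wue_bias = wue(model_gpp, model_et) / wue(ref_gpp, ref_et) - 1.0

print(f"GPP bias: {gpp_bias:+.0%}")
print(f"ET bias:  {et_bias:+.0%}")
print(f"WUE bias: {wue_bias:+.0%}")
```

In this constructed case both fluxes are clearly biased while the ratio shows no bias at all, which is why I would want to see GPP and ET evaluated directly.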
Lines 53 - 55 seem to imply that the land carbon cycle is switched off(?). If so, how do you calculate NPP? Even after fully reading the paper I still found this part puzzling. The paper seems to suggest there is no C cycle, then it proceeds to compare LAI (so leaves made from C) and NPP (C again).
Line 60 - this point about making sure the implementation is free of defects could actually go in the abstract. Otherwise the last line of the abstract sounds a bit optimistic, since the rest of the abstract seems to suggest that v3 and v4 share many of the same biases, which makes a reader wonder why there is so much optimism in the abstract's last sentence. To be more plain - reading the abstract without knowing that v3 and v4 are almost identical, except for the framework they are implemented in, makes one think the problems in JSBACH must be really stubborn to not improve much between versions! This point seems to have confused Ref #2, as their first comment implies they missed that one of the main motivations for this paper is to ensure the implementation was correct.
Table 1 should list the number of snow layers so the older version's two layers can be listed alongside the newer version's five.
line 142 - how can wood turnover not be implemented when you can still calculate NPP? Doesn't the NPP contribute to plant tissues, which would need to turn over to prevent a continual buildup? I still don't get this part.
line 198 - 'skin reservoir on the surface' = ponded water?
line 199 - 'as or in' - fix, confusing as is.
line 244 - 'a high LAI typically reduces albedo', really? Grass with an LAI of 2 will still be lighter (higher albedo) than needleleaf evergreen forest with an LAI of 4, so I am not sure what is meant there.
Why compare against only one global LAI product? Is there reason to suppose that product is without biases? Similar question about the other variables. It is not sufficient to compare against only one observation-based dataset if other reliable ones are available. To do otherwise assumes that all 'observations' are without bias - a naive assumption at best.
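To make the concern concrete, here is a minimal sketch of the kind of check I have in mind, using invented LAI values and hypothetical product names (`product_A`, `product_B`); none of these numbers come from the MS or from real products. It computes the mean model bias against each reference separately and checks whether the references even agree on the sign of the bias.

```python
from statistics import mean

def mean_bias(model, reference):
    """Mean of (model - reference) over paired values."""
    return mean(m - r for m, r in zip(model, reference))

# Invented regional-mean LAI values for three regions (illustration only).
model_lai = [1.8, 3.9, 5.2]
references = {
    "product_A": [1.5, 3.5, 4.8],   # hypothetical reference product
    "product_B": [2.0, 4.2, 5.5],   # hypothetical reference product
}

# Bias of the model against each reference, kept separate on purpose.
biases = {name: mean_bias(model_lai, ref) for name, ref in references.items()}

# Do all references agree on whether the model is too high or too low?
sign_agrees = all(b > 0 for b in biases.values()) or all(
    b < 0 for b in biases.values()
)

for name, bias in biases.items():
    print(f"{name}: mean bias {bias:+.2f}")
print("references agree on sign of bias:", sign_agrees)
```

With these made-up values the model looks too high against one product and too low against the other, so a conclusion drawn from either product alone would be misleading. Reporting the spread across references makes that uncertainty visible.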
Why move from two snow layers to five? I realize the full logic might be outside the scope of this paper but it puzzles me as to what would be gained. Is it just to have finer discretization?
Figure 1 is a nice inclusion.
Fig 6 - does 'year' in the top two plots stand for annual mean? Perhaps make that clearer. Same in Fig 7, Table 4, and perhaps others.
line 409 - RSM is not defined prior to use.
line 414 - it is easy to miss that the 'general rule' comment is for January. I would suggest rewording to make clear it is not intended to describe a year-round phenomenon.
Fig 11 - max amount of water possible above wilting point? Is this field capacity, or something else? It is better to use more standard terms, as the max amount of water possible above wilting point could include inches of water ponded on the surface - the definition given does not exclude that.
line 513 - 'The above named opposing northern mid latitudes NPP biases'. Consider revising for clarity; since this starts a new paragraph, it is hard to know what 'the above named' refers to.
line 562 - fix ref to appendix
Seiler, C., Melton, J. R., Arora, V. K., Sitch, S., Friedlingstein, P., Anthoni, P., Goll, D., Jain, A. K., Joetzjer, E., Lienert, S., Lombardozzi, D., Luyssaert, S., Nabel, J. E. M. S., Tian, H., Vuichard, N., Walker, A. P., Yuan, W., and Zaehle, S.: Are terrestrial biosphere models fit for simulating the global land carbon sink?, J. Adv. Model. Earth Syst., 14, https://doi.org/10.1029/2021ms002946, 2022.