Articles | Volume 18, issue 18
https://doi.org/10.5194/gmd-18-6517-2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
OpenBench: a land model evaluation system
Download
- Final revised paper (published on 29 Sep 2025)
- Supplement to the final revised paper
- Preprint (discussion started on 01 Apr 2025)
- Supplement to the preprint
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-1380', Anonymous Referee #1, 16 May 2025
- AC1: 'Reply on RC1', Zhongwang Wei, 09 Jun 2025
- AC3: 'Reply on AC1', Zhongwang Wei, 09 Jun 2025
- RC2: 'Comment on egusphere-2025-1380', Anonymous Referee #2, 27 May 2025
- AC2: 'Reply on RC2', Zhongwang Wei, 09 Jun 2025
- AC4: 'Reply on RC2', Zhongwang Wei, 09 Jun 2025
Peer review completion
AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload
AR by Zhongwang Wei on behalf of the Authors (09 Jun 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (10 Jun 2025) by Dalei Hao
RR by Mathew Lipson (23 Jun 2025)

RR by Anonymous Referee #1 (26 Jun 2025)
ED: Publish subject to minor revisions (review by editor) (03 Jul 2025) by Dalei Hao

AR by Zhongwang Wei on behalf of the Authors (05 Jul 2025)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (08 Jul 2025) by Dalei Hao

AR by Zhongwang Wei on behalf of the Authors (11 Jul 2025)
Manuscript
This paper presents a new software system, called OpenBench, to evaluate land surface models. OpenBench evaluates land surface models following a rigorous scientific method based on a wide range of statistical metrics and evaluation scores, allowing a quick and objective evaluation of various aspects of the models' results. OpenBench showcases its capabilities by presenting a range of analyses accompanied by a varied array of representations. Although one may deplore the general scattering of effort in the community in developing such tools, the paper is generally well written and successfully explains the advantages of the software. Using Python and well-supported packages to write the software is a solid choice, ensuring potential widespread adoption and continuous support of dependencies. The paper clearly highlights how OpenBench differs from existing tools through its support for a range of data types and model output formats, new variables linked to human activities, and the possibility of user extension for other datasets, models or variables. Although OpenBench uses common evaluation metrics and scores, the chosen set is pertinent and allows for an evaluation of a wide range of aspects of land surface model results. In addition, the paper explains how OpenBench differs in its handling and visualisation of the metrics and scores.
However, a few points of the paper need to be clarified. Firstly, the choice of a Fortran namelist format for the configuration file of a Python software is unusual. Fortran namelists are not the most flexible format for configuration files and are not well supported in Python. Common, popular choices such as YAML or JSON have much stronger support in Python and offer greater flexibility. It would be good to explain more clearly why the Fortran namelist format was chosen for OpenBench.
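For illustration, a minimal sketch of the kind of alternative the reviewer has in mind, using hypothetical configuration keys (a `basename` entry under a `general` group) that are not necessarily OpenBench's actual schema:

```python
# Sketch only: hypothetical configuration keys, not OpenBench's actual schema.
import f90nml  # Fortran-namelist reader (pip install f90nml)
import yaml    # YAML reader (pip install pyyaml)

# Fortran-namelist style, as OpenBench currently uses:
#   &general
#     basename = 'CoLM2024'
#   /
nml = f90nml.read("evaluation.nml")
basename_nml = nml["general"]["basename"]

# YAML alternative, natively nested and widely supported in Python:
#   general:
#     basename: CoLM2024
with open("evaluation.yaml") as f:
    cfg = yaml.safe_load(f)
basename_yaml = cfg["general"]["basename"]
```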
A few points required clarification in the description of the metrics and scores. In Table 2, the bias metrics are described as "the smaller is better, ideal value is 0". However, many of the metrics have an infinite range (-∞, ∞), in which case the smallest value is not 0 but -∞. It would be more accurate to say "the closer to 0 is better".
The text explaining the various metrics refers to quantities that do not appear in Table 2 and need to be clarified:
The variable naming in the calculation of nRMSEScore should be reviewed. The name "CRESM" is strange; shouldn't it be "CRMSE"? Additionally, the error is once called ε_rmse and once ε_cresm. (The standard centred-RMSE definition presumably intended is sketched after these points.)
The nPhaseScore score explanation needs to be reviewed. Several issues with it are likely linked and can be addressed together. I do not understand what "climatological mean cycles" (line 214) are; which cycles are referred to here? I also do not understand what is meant by "of evaluation time resolution". Finally, two mathematical symbols in the equations, λ and φ, are not explained.
Lastly, there is very little explanation of the nSpatialScore score. Why is that?
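For reference, a sketch of the standard centred (centralized) RMSE that the CRESM/CRMSE naming presumably refers to, assuming m and o denote the model and reference values of a variable v(t, x), overbars denote time means, and N is the number of timesteps between t_0 and t_f; the manuscript's exact notation may differ:

```latex
\mathrm{CRMSE}(x) = \sqrt{\frac{1}{N}\sum_{t=t_0}^{t_f}
  \Bigl[\bigl(m(t,x) - \overline{m}(x)\bigr) - \bigl(o(t,x) - \overline{o}(x)\bigr)\Bigr]^{2}}
```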
In the section showcasing the tool with some use cases, I disagree with the conclusion of the urban heat evaluation that "these findings highlight the importance of refined urban parameterization schemes in land surface models" (line 377). It is not clear why the results shown indicate this. The results indicate that the CoLM2024 model performs well except in a specific zone, but they do not show whether models with different parameterizations do better or worse. Although I strongly agree that refined urban parameterizations can perform better, I disagree that the results shown in this paper allow us to draw a conclusion on the importance of urban parameterization.
In the multiple-model comparison, at line 456, the text says CoLM2024 and TE are the best models for canopy transpiration and total runoff, whereas Figure 6 shows CLM5 and CoLM2024 are the best for total runoff. The text in this section also speaks of "superior performance". I would argue that a score of 0.54 for the runoff cannot be qualified as superior; "highest" might be a better choice of qualifier in this case.
In the multiple-model comparison section, I also question the choice of the vertical axis range in the parallel coordinates plot for the scores (Figures 6b and 7b). These plots would be more informative if OpenBench used the same range from 0 to 1 for all of them. The plots would then visually highlight not only the relative position of the various models, but also the overall quality of all the models (how far from 1 they all lie) and their relative performance (the spread of the lines would show whether the models perform similarly or very differently). It would make it harder to identify small differences between models, which is, in my view, an advantage, as small differences indicate similar performance. It is logical to keep the vertical axis range unchanged in the parallel coordinates plot for the metrics since, unlike scores, many metrics have an infinite range.
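As an illustration, a minimal sketch of the suggested fixed 0-1 axis, using hypothetical model names and scores rather than OpenBench output:

```python
# Sketch only: hypothetical scores, not produced by OpenBench.
import matplotlib.pyplot as plt

score_names = ["nBiasScore", "nRMSEScore", "nPhaseScore", "Overall"]
scores = {
    "Model A": [0.62, 0.55, 0.71, 0.63],
    "Model B": [0.58, 0.54, 0.69, 0.60],
}

fig, ax = plt.subplots()
for model, values in scores.items():
    ax.plot(range(len(score_names)), values, marker="o", label=model)
ax.set_xticks(range(len(score_names)))
ax.set_xticklabels(score_names)
ax.set_ylim(0, 1)  # fixed range: distance from the ideal score of 1 is visible at a glance
ax.set_ylabel("Score")
ax.legend()
plt.show()
```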
In the section comparing a model to multiple reference datasets, I find Figure 9 confusing. It presents a heatmap of various metrics for a model compared against several datasets. The same colormap is used for all metrics, with darker hues for higher metric values. Unfortunately, not all metrics indicate better agreement at higher values. Users then need to know the details of each metric to interpret the table instead of being visually guided by the figure. This representation would work better if OpenBench used different colormaps for different types of metrics: metrics that are best closest to 0 with the darkest hue at 0, metrics for which the smallest value is best with the darkest hue at the smallest values, and so on. I realise this is harder to put together, but it would greatly improve the representation.
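A minimal sketch of the idea, with hypothetical metric values and one colormap per metric type (a "closest to 0 is best" row would additionally need a norm centred on 0, e.g. matplotlib.colors.CenteredNorm):

```python
# Sketch only: hypothetical metric values, one colormap per metric type.
import numpy as np
import matplotlib.pyplot as plt

datasets = ["Ref A", "Ref B", "Ref C"]
rows = [
    ("RMSE", np.array([1.2, 0.9, 1.5]), "viridis_r"),  # smaller is better: reversed colormap
    ("KGE",  np.array([0.7, 0.8, 0.5]), "viridis"),    # larger is better: standard colormap
]

fig, axes = plt.subplots(len(rows), 1, sharex=True, figsize=(6, 2.5))
for ax, (metric, values, cmap) in zip(axes, rows):
    im = ax.imshow(values[np.newaxis, :], cmap=cmap, aspect="auto")
    ax.set_yticks([0])
    ax.set_yticklabels([metric])
    fig.colorbar(im, ax=ax)
axes[-1].set_xticks(range(len(datasets)))
axes[-1].set_xticklabels(datasets)
plt.tight_layout()
plt.show()
```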
Finally, the paper refers several times to the efficiency of the tool and points out the parallelisation using Dask. However, nothing in the paper substantiates this. It would be good to give some information about the resources used and the time needed to produce the analyses showcased in the paper, for example.
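For example, a minimal sketch of how such numbers could be collected for one run, assuming a hypothetical run_evaluation entry point standing in for an OpenBench evaluation:

```python
# Sketch only: run_evaluation is a hypothetical stand-in for an OpenBench run.
import time
from dask.distributed import Client, performance_report

client = Client(n_workers=4, threads_per_worker=2, memory_limit="8GB")
with performance_report(filename="openbench-profile.html"):  # per-task timing report
    start = time.perf_counter()
    # run_evaluation("evaluation.nml")
    elapsed = time.perf_counter() - start

n_workers = len(client.scheduler_info()["workers"])
print(f"Wall time: {elapsed:.1f} s on {n_workers} workers")
```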
Technical corrections:
Bold text indicates parts of the cited text that I modified to show needed corrections.
Lines 29 and 31: "various changes in the **Earth** system", "key components of **Earth** system models". "Earth", when referring to the planet, takes an uppercase.
Line 164: "For example, **bias** metrics", no uppercase to "bias".
Line 188: "For a given variable 𝒗(𝒕, 𝒙), where 𝒕 represents time and 𝒙 represents spatial coordinates**,** we first calculate". The first sentence here is not a sentence; replace the full stop after "coordinates" with a comma.
Line 194: "Where t0 and tf are the first and final **timesteps**, respectively." Replace singular with plural.
Line 200: "**Similarly** to nBiasScore, **we** first calculate the centralized **RMSE**:". "Similar" changed to "Similarly", "We" changed to "we", "RSME" changed to "RMSE", and remove bolding of nBiasScore.
Line 239: "In contrast, OpenBench **offers**". Replace "offering" with "offers".
Line 277: The sentence finishing with “making it possible to evaluate.” is incomplete. It should be combined with the next sentence.
Figure 3 legend: Replace with “An example of a scores heatmap for GPP classified by IGBP land cover.”
Line 293: Considering OpenBench does not provide any datasets, the part saying "while OpenBench integrates a comprehensive collection of datasets," would be more accurate as: "while OpenBench integrates **with** a comprehensive collection of datasets,"