Articles | Volume 18, issue 18
https://doi.org/10.5194/gmd-18-6517-2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
OpenBench: a land model evaluation system
Download
- Final revised paper (published on 29 Sep 2025)
- Supplement to the final revised paper
- Preprint (discussion started on 01 Apr 2025)
- Supplement to the preprint
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-1380', Anonymous Referee #1, 16 May 2025
- AC1: 'Reply on RC1', Zhongwang Wei, 09 Jun 2025
- AC3: 'Reply on AC1', Zhongwang Wei, 09 Jun 2025
- RC2: 'Comment on egusphere-2025-1380', Anonymous Referee #2, 27 May 2025
- AC2: 'Reply on RC2', Zhongwang Wei, 09 Jun 2025
- AC4: 'Reply on RC2', Zhongwang Wei, 09 Jun 2025
Peer review completion
AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload
AR by Zhongwang Wei on behalf of the Authors (09 Jun 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (10 Jun 2025) by Dalei Hao
RR by Mathew Lipson (23 Jun 2025)

RR by Anonymous Referee #1 (26 Jun 2025)
ED: Publish subject to minor revisions (review by editor) (03 Jul 2025) by Dalei Hao

AR by Zhongwang Wei on behalf of the Authors (05 Jul 2025)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (08 Jul 2025) by Dalei Hao

AR by Zhongwang Wei on behalf of the Authors (11 Jul 2025)
Manuscript
This paper presents a new software system, called OpenBench, to evaluate land surface models. OpenBench evaluates land surface models following a rigorous scientific method based on a wide range of statistical metrics and evaluation scores, allowing a quick and objective evaluation of various aspects of the models' results. OpenBench showcases its capabilities by presenting a range of analyses accompanied by a varied array of representations. Although one may deplore the general scattering of effort in the community in developing such tools, the paper is generally well written and successfully explains the advantages of the software. Using Python and well-supported packages to write the software is a solid choice, ensuring potential widespread adoption and continuous support of dependencies. The paper clearly highlights how OpenBench differs from existing tools through its support for a range of data types and model output formats, new variables linked to human activities, and the possibility of user extension for other datasets, models or variables. Although OpenBench uses common evaluation metrics and scores, the chosen set is pertinent and allows for an evaluation of a wide range of aspects of land surface model results. In addition, the paper explains how OpenBench differs in its handling and visualisation of the metrics and scores.
However, a few points of the paper need to be clarified. Firstly, the choice of a Fortran namelist format for the configuration file of a Python software is unusual. Fortran namelists are not the most flexible format for configuration files and are not well supported in Python. Common, popular choices such as YAML or JSON have much stronger support in Python and offer greater flexibility. It would be good to explain more clearly why the Fortran namelist format was chosen for OpenBench.
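For illustration, a minimal sketch of the kind of alternative the reviewer has in mind, using hypothetical configuration keys (a `basename` entry under a `general` group) that are not necessarily OpenBench's actual schema:

```python
# Sketch only: hypothetical configuration keys, not OpenBench's actual schema.
import f90nml  # Fortran-namelist reader (pip install f90nml)
import yaml    # YAML reader (pip install pyyaml)

# Fortran-namelist style, as OpenBench currently uses:
#   &general
#     basename = 'CoLM2024'
#   /
nml = f90nml.read("evaluation.nml")
basename_nml = nml["general"]["basename"]

# YAML alternative, natively nested and widely supported in Python:
#   general:
#     basename: CoLM2024
with open("evaluation.yaml") as f:
    cfg = yaml.safe_load(f)
basename_yaml = cfg["general"]["basename"]
```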
A few points required clarification in the description of the metrics and scores. In Table 2, the bias metrics are described as "the smaller is better, ideal value is 0". However, many of the metrics have an infinite range (-∞, ∞), in which case the smallest value is not 0 but -∞. It would be more accurate to say "the closer to 0 is better".
The text explaining the various metrics refers to quantities that do not appear in Table 2 and need to be clarified:
The variable naming in the calculation of nRMSEScore should be reviewed. The name "CRESM" is strange; shouldn't it be "CRMSE"? Additionally, the error is once called ε_rmse and once ε_cresm. (The standard centred-RMSE definition presumably intended is sketched after these points.)
The nPhaseScore score explanation needs to be reviewed. Several issues with it are likely linked and can be addressed together. I do not understand what "climatological mean cycles" (line 214) are; which cycles are referred to here? I also do not understand what is meant by "of evaluation time resolution". Finally, two mathematical symbols in the equations, λ and φ, are not explained.
Lastly, there is very little explanation of the nSpatialScore score. Why is that?
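For reference, a sketch of the standard centred (centralized) RMSE that the CRESM/CRMSE naming presumably refers to, assuming m and o denote the model and reference values of a variable v(t, x), overbars denote time means, and N is the number of timesteps between t_0 and t_f; the manuscript's exact notation may differ:

```latex
\mathrm{CRMSE}(x) = \sqrt{\frac{1}{N}\sum_{t=t_0}^{t_f}
  \Bigl[\bigl(m(t,x) - \overline{m}(x)\bigr) - \bigl(o(t,x) - \overline{o}(x)\bigr)\Bigr]^{2}}
```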
In the section showcasing the tool with some use cases, I disagree with the conclusion of the urban heat evaluation that "these findings highlight the importance of refined urban parameterization schemes in land surface models" (line 377). It is not clear why the results shown indicate this. The results indicate that the CoLM2024 model performs well except in a specific zone, but they do not show whether models with different parameterizations do better or worse. Although I strongly agree that refined urban parameterizations can perform better, I disagree that the results shown in this paper allow us to draw a conclusion on the importance of urban parameterization.
In the multiple-model comparison, at line 456, the text says CoLM2024 and TE are the best models for canopy transpiration and total runoff, whereas Figure 6 shows CLM5 and CoLM2024 are the best for total runoff. The text in this section also speaks of "superior performance". I would argue that a score of 0.54 for the runoff cannot be qualified as superior; "highest" might be a better choice of qualifier in this case.
In the multiple-model comparison section, I also question the choice of the vertical axis range in the parallel coordinates plot for the scores (Figures 6b and 7b). These plots would be more informative if OpenBench used the same range from 0 to 1 for all of them. The plots would then visually highlight not only the relative position of the various models, but also the overall quality of all the models (how far from 1 they all lie) and their relative performance (the spread of the lines would show whether the models perform similarly or very differently). It would make it harder to identify small differences between models, which is, in my view, an advantage, as small differences indicate similar performance. It is logical to keep the vertical axis range unchanged in the parallel coordinates plot for the metrics since, unlike scores, many metrics have an infinite range.
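As an illustration, a minimal sketch of the suggested fixed 0-1 axis, using hypothetical model names and scores rather than OpenBench output:

```python
# Sketch only: hypothetical scores, not produced by OpenBench.
import matplotlib.pyplot as plt

score_names = ["nBiasScore", "nRMSEScore", "nPhaseScore", "Overall"]
scores = {
    "Model A": [0.62, 0.55, 0.71, 0.63],
    "Model B": [0.58, 0.54, 0.69, 0.60],
}

fig, ax = plt.subplots()
for model, values in scores.items():
    ax.plot(range(len(score_names)), values, marker="o", label=model)
ax.set_xticks(range(len(score_names)))
ax.set_xticklabels(score_names)
ax.set_ylim(0, 1)  # fixed range: distance from the ideal score of 1 is visible at a glance
ax.set_ylabel("Score")
ax.legend()
plt.show()
```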
In the section comparing a model to multiple reference datasets, I find Figure 9 confusing. It presents a heatmap of various metrics for a model compared against several datasets. The same colormap is used for all metrics, with darker hues for higher metric values. Unfortunately, not all metrics indicate better agreement at higher values. Users then need to know the details of each metric to interpret the table instead of being visually guided by the figure. This representation would work better if OpenBench used different colormaps for different types of metrics: metrics that are best closest to 0 with the darkest hue at 0, metrics for which the smallest value is best with the darkest hue at the smallest values, and so on. I realise this is harder to put together, but it would greatly improve the representation.
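A minimal sketch of the idea, with hypothetical metric values and one colormap per metric type (a "closest to 0 is best" row would additionally need a norm centred on 0, e.g. matplotlib.colors.CenteredNorm):

```python
# Sketch only: hypothetical metric values, one colormap per metric type.
import numpy as np
import matplotlib.pyplot as plt

datasets = ["Ref A", "Ref B", "Ref C"]
rows = [
    ("RMSE", np.array([1.2, 0.9, 1.5]), "viridis_r"),  # smaller is better: reversed colormap
    ("KGE",  np.array([0.7, 0.8, 0.5]), "viridis"),    # larger is better: standard colormap
]

fig, axes = plt.subplots(len(rows), 1, sharex=True, figsize=(6, 2.5))
for ax, (metric, values, cmap) in zip(axes, rows):
    im = ax.imshow(values[np.newaxis, :], cmap=cmap, aspect="auto")
    ax.set_yticks([0])
    ax.set_yticklabels([metric])
    fig.colorbar(im, ax=ax)
axes[-1].set_xticks(range(len(datasets)))
axes[-1].set_xticklabels(datasets)
plt.tight_layout()
plt.show()
```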
Finally, the paper refers several times to the efficiency of the tool and points out the parallelisation using Dask. However, nothing in the paper substantiates this. It would be good to give some information about the resources used and the time needed to produce the analyses showcased in the paper, for example.
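For example, a minimal sketch of how such numbers could be collected for one run, assuming a hypothetical run_evaluation entry point standing in for an OpenBench evaluation:

```python
# Sketch only: run_evaluation is a hypothetical stand-in for an OpenBench run.
import time
from dask.distributed import Client, performance_report

client = Client(n_workers=4, threads_per_worker=2, memory_limit="8GB")
with performance_report(filename="openbench-profile.html"):  # per-task timing report
    start = time.perf_counter()
    # run_evaluation("evaluation.nml")
    elapsed = time.perf_counter() - start

n_workers = len(client.scheduler_info()["workers"])
print(f"Wall time: {elapsed:.1f} s on {n_workers} workers")
```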
Technical corrections:
Bold text indicates parts of the cited text that I modified to show needed corrections.
Lines 29 and 31: "various changes in the **Earth** system", "key components of **Earth** system models". "Earth", when referring to the planet, takes an uppercase.
Line 164: "For example, **bias** metrics", no uppercase to "bias".
Line 188: "For a given variable 𝒗(𝒕, 𝒙), where 𝒕 represents time and 𝒙 represents spatial coordinates**,** we first calculate". The first sentence here is not a sentence; replace the full stop after "coordinates" with a comma.
Line 194: "Where t0 and tf are the first and final **timesteps**, respectively." Replace singular with plural.
Line 200: "**Similarly** to nBiasScore, **we** first calculate the centralized **RMSE**:". "Similar" changed to "Similarly", "We" changed to "we", "RSME" changed to "RMSE", and remove bolding of nBiasScore.
Line 239: "In contrast, OpenBench **offers**". Replace "offering" with "offers".
Line 277: The sentence finishing with “making it possible to evaluate.” is incomplete. It should be combined with the next sentence.
Figure 3 legend: Replace with “An example of a scores heatmap for GPP classified by IGBP land cover.”
Line 293: Considering OpenBench does not provide any datasets, the part saying "while OpenBench integrates a comprehensive collection of datasets," would be more accurate as: "while OpenBench integrates **with** a comprehensive collection of datasets,"