This work is distributed under the Creative Commons Attribution 4.0 License.
Advanced climate model evaluation with ESMValTool v2.11.0 using parallel, out-of-core, and distributed computing
Abstract. Earth System Models (ESMs) allow numerical simulations of the Earth's climate system. Driven by the need to better understand climate change and its impacts, these models have become increasingly sophisticated over time, generating vast amounts of data. To effectively evaluate the complex state-of-the-art ESMs and ensure their reliability, new tools for comprehensive analysis are essential. The open-source, community-driven Earth System Model Evaluation Tool (ESMValTool) addresses this critical need by providing a software package with which scientists can assess the performance of ESMs using common diagnostics and metrics. In this paper, we describe recent significant improvements to ESMValTool's computational efficiency, which allow a more effective evaluation of these complex ESMs as well as of high-resolution models. These optimizations include parallel computing (executing multiple computation tasks simultaneously), out-of-core computing (processing data larger than the available memory), and distributed computing (spreading computation tasks across multiple interconnected nodes or machines). When comparing the latest ESMValTool version with a previous, not yet optimized version, we find significant performance improvements for many relevant applications running on a single node of a high-performance computing (HPC) system, ranging from 2.6 times faster runs in a multi-model setup up to 25 times faster runs when processing a single high-resolution model. By utilizing distributed computing on two nodes of an HPC system, these speedup factors can be further improved to 3.2 and 36, respectively. Moreover, especially on small hardware, evaluation runs with the latest version of ESMValTool also require significantly fewer computational resources than before, which in turn reduces power consumption and thus the overall carbon footprint of ESMValTool runs. For example, the previously mentioned use cases require 16 (multi-model evaluation) and 40 (high-resolution model evaluation) times fewer resources than the reference version. Finally, analyses which could previously only be performed on machines with large amounts of memory can now be conducted on much smaller hardware through the use of out-of-core computation. For instance, the high-resolution single-model evaluation use case can now be run on a machine with only 16 GB of memory despite a total input data size of 35 GB, which was not possible with earlier versions of ESMValTool. This enables running much more complex evaluation tasks on a standard laptop than before.
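The three optimization strategies named in the abstract can be illustrated with a minimal sketch using Dask, the library underlying ESMValTool's lazy data handling via Iris. The chunk sizes and worker settings below are arbitrary examples for illustration only, not ESMValTool's internal code or defaults.

```python
# Illustrative sketch of parallel, out-of-core, and distributed computing with Dask.
# Chunk sizes and worker settings are arbitrary examples, not ESMValTool internals.
import dask.array as da
from dask.distributed import Client

# Out-of-core computing: a chunked array can be far larger than available memory,
# because only a few chunks need to be held in memory at any one time.
data = da.random.random((100_000, 10_000), chunks=(1_000, 10_000))  # ~8 GB, held lazily

# Parallel (and, with a multi-node cluster, distributed) computing: a Dask client
# schedules the per-chunk tasks across several worker processes.
client = Client(n_workers=4, threads_per_worker=2, memory_limit="4GiB")

# The reduction is expressed lazily and only evaluated chunk by chunk on .compute().
zonal_mean = data.mean(axis=0).compute()

client.close()
```

On an HPC system the same client can instead be attached to a multi-node cluster (for example via dask_jobqueue), which is the general mechanism behind the two-node distributed runs mentioned above.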
Competing interests: The authors have the following competing interests: Some authors are members of the editorial board of Geoscientific Model Development. The authors also have no other competing interests to declare.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.
Status: open (until 12 Mar 2025)
RC1: 'Comment on gmd-2024-236', Anonymous Referee #1, 07 Feb 2025
Thank you for this article. This was generally clearly written and it is a good reference for ESMValTool users and developers. I do appreciate that real-world and clearly defined use cases have been analysed. The benefits of the work to the global model evaluation community are also clearly articulated.
I am therefore happy to recommend this article for publication in GMD but please find some comments that could improve the quality of the article.
One section in particular would benefit from some clarifications. I do find the section between lines 285 and 301 explaining scalability in Table 2 quite difficult to follow.
The explanation about coordinate files is difficult to understand. On the one hand, it is referred to as a serial operation (line 288); on the other hand, it is said to be based on reading many small files (line 290), which could lend itself to easy parallelism. On line 289 it is written that loading and processing coordinates takes 50% of the runtime, but without stating exactly for which case. Is it the reference case? It cannot be 50% equally across all experiments. Could this be explained more clearly?
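For illustration, reading many small files can indeed be parallelised straightforwardly; the sketch below uses a hypothetical directory layout and a plain thread pool, not ESMValTool's actual I/O code.

```python
# Hypothetical sketch of reading many small coordinate files concurrently
# instead of serially. Paths and reader are placeholders, not ESMValTool's I/O path.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import xarray as xr


def read_coordinates(path: Path) -> xr.Dataset:
    # I/O-bound work: the NetCDF library releases the GIL, so threads overlap the reads.
    with xr.open_dataset(path) as ds:
        return ds.load()


coordinate_files = sorted(Path("coordinate_files").glob("*.nc"))  # many small files

with ThreadPoolExecutor(max_workers=16) as pool:
    coords = list(pool.map(read_coordinates, coordinate_files))
```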
Line 296 says that some parts of the application profile are sensitive to the Lustre load and to the connectivity of the ESGF servers. This raises the question: can we trust the timings presented in Table 2, and if so, why?
Lines 298-302: this is a canonical example of Amdahl's law, but it is not stated entirely correctly in the text (for example, a scalability > 1 should remain true analytically).
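For reference, the textbook form of Amdahl's law, with parallel fraction p and N workers (standard notation, not necessarily that of the manuscript):

```latex
% Amdahl's law: speedup with parallel fraction p on N workers, and its limit.
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}.
```

Since 0 ≤ p ≤ 1, the denominator never exceeds 1 for N ≥ 1, so S(N) ≥ 1 analytically, which is the point about scalability made above.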
Citation: https://doi.org/10.5194/gmd-2024-236-RC1
RC2: 'Comment on gmd-2024-236', Anonymous Referee #2, 17 Feb 2025
Thank you for your submission. ESMValTool is clearly part of the production workflow for evaluating the output of Earth system models, and therefore its optimization is of significant importance to the climate community. The paper is well written and establishes a clear performance improvement of version 2.11.0 over the previous 2.8.0 version. The significant improvements are the possibility of out-of-core computing, which allows 2.11.0 to run on configurations where 2.8.0 reports out-of-memory errors, and the support for distributed execution. The paper clearly illustrates the resulting performance increase.
Having said that, there are some key deficiencies in the methodology. In particular, the scaling efficiency metric provided in Equation (2) leads to confusion in the subsequent performance analysis, particularly in runs using "1/16 node". For example, the remarkable scaling efficiencies in Table 2 (e.g., 16) and Table 3 (e.g., 40) occur in the cases where v2.11.0 can run on 1/16 of a node while v2.8.0 on 1/16 of a node runs into memory limitations and therefore cannot be used as the reference. These high values are an artifact of comparing the run time of v2.8.0 on a fully occupied node while assuming that v2.11.0 is taking only one sixteenth of the node and all the other cores are doing effective work. However, most schedulers, particularly on DKRZ Levante, will not allow the exploitation of cores at this fine granularity, and thus the high efficiencies for the 1/16-node case are not realizable. The confusion increases in Figure 4, where v2.8.0 can be run on 1/16 of a node and therefore becomes the reference. The dramatically *low* efficiencies on a full node at the bottom of Figure 4 are again an artifact of the assumption that the other fifteen cores of the node could actually be doing effective work, which they realistically cannot.
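To make the objection concrete, a generic scaling-efficiency metric of the kind discussed here divides the measured speedup by the ratio of allocated resources (this is the usual textbook form; the manuscript's Equation (2) may differ in notation):

```latex
% Generic scaling efficiency: measured speedup normalised by the resource ratio.
% T_ref, R_ref: run time and allocated resources of the reference run;
% T_new, R_new: run time and allocated resources of the run under test.
E = \frac{T_\mathrm{ref} / T_\mathrm{new}}{R_\mathrm{new} / R_\mathrm{ref}}
```

Charging a 1/16-node run only one sixteenth of a node inflates E by a factor of up to 16 whenever the remaining cores cannot actually be handed to other jobs, which is the artifact described above.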
My recommendation would be either to explain how all cores on the node can be sensibly occupied when ESMValTool occupies only 1/16 of a node, or to avoid the 1/16-node results entirely and concentrate on the objective improvements in the 1- and 2-node cases. The findings for the 1-node case are (1) v2.11.0 (threaded) is minimally slower or faster than v2.8.0 (threaded), but v2.11.0 (distributed) is significantly faster, e.g., 1.7-2.0x in many cases, and 22x in the exceptional case of "extract_levels". The scaling efficiency should only be used when comparing single-node execution to multi-node execution (in this case 2 nodes), which essentially then becomes a strong scaling analysis. There are still remarkable improvements to report, in particular the case the authors recount in lines 280-284.
In spite of my concerns about the methodology, this paper does illustrate solid optimizations of a production tool that is central to data analysis. As you point out in lines 55-58, it is a key responsibility of our community to minimize the impact and carbon footprint of climate computing. But you must also mention, and put into perspective, that the vast majority of that footprint comes from the simulations themselves rather than from diagnostic tools such as ESMValTool.
Citation: https://doi.org/10.5194/gmd-2024-236-RC2