the Creative Commons Attribution 4.0 License.
Reducing Time and Computing Costs in EC-Earth: An Automatic Load-Balancing Approach for Coupled ESMs
Abstract. Earth System Models (ESMs) are intricate models employed for simulating the Earth's climate, typically constructed from distinct independent components dedicated to simulating specific natural phenomena (such as atmosphere and ocean dynamics, atmospheric chemistry, land and ocean biosphere, etc.). In order to capture the interactions between these processes, ESMs utilize coupling libraries, which oversee the synchronization and field exchanges among independently developed codes typically operating in parallel as a Multi-Program, Multi-Data (MPMD) application.
The performance achieved depends on the coupling approach, as well as on the number of parallel resources and scalability properties of each component. Determining the appropriate number of resources to use for each component in coupled ESMs is crucial for efficient utilization of the High Performance Computing (HPC) infrastructures used in climate modelling. However, this task traditionally involves manual testing of multiple process allocations by trial and error, requiring significant time investment from researchers. This makes the process error-prone and, given the complexity of the task, often results in a loss of application performance. This paper introduces the automatic load-balance tool (auto-lb), a methodology and tool for determining the resource allocation to each component within coupled ESMs, aimed at improving the application's performance. Notably, this methodology is automatic and does not require expertise in HPC to improve the performance achieved by coupled ESMs. This is accomplished by minimizing the load imbalance: reducing each constituent's execution cost (core-hours), as well as minimizing the core-hours wasted in the synchronizations between them, without penalizing the execution speed of the entire model. This optimization is achieved regardless of the scalability properties of each constituent and the complexity of their dependencies during the coupling.
To achieve this, we designed a new performance metric called "Fittingness" to assess the performance of a coupled execution by evaluating the trade-off between parallel efficiency and application throughput. This metric is intended for scenarios where optimality can depend on various criteria and constraints. Aiming for maximum speed might not be desirable if it leads to a decrease in parallel efficiency and, therefore, an increase in the computational cost of the simulation.
The methodology was tested across multiple experiments using the widely recognized European ESM, EC-Earth3. The results were compared with real operational configurations, such as those used for the Coupled Model Intercomparison Project Phase 6 (CMIP6) and for the European Climate Prediction Project (EUCP), and validated on different HPC platforms. All experiments suggest that the current approaches lead to performance loss, and that auto-lb can achieve better results in both execution speed and core-hours needed. When compared to the EC-Earth standard-resolution CMIP6 runs, we achieved a configuration 4.7 % faster while also reducing the core-hours required by 1.3 %. Likewise, when compared to the EC-Earth high-resolution EUCP runs, the method presented showed an improvement of 34 % in speed, with a 6.7 % reduction in the core-hours consumed.
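The trade-off between throughput and cost described in the abstract can be illustrated with a minimal sketch. It does not implement auto-lb or the Fittingness metric, whose definition is not given here. It only computes three quantities commonly used to compare runs: simulated years per day (throughput), core-hours per simulated year (cost), and parallel efficiency relative to a baseline. The `Run` class, the function names, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One coupled run. All values used below are hypothetical."""
    cores: int               # total cores allocated across all components
    wallclock_hours: float   # wall-clock hours needed to simulate one year

    @property
    def sypd(self) -> float:
        # Simulated years per day: throughput of the coupled model.
        return 24.0 / self.wallclock_hours

    @property
    def chsy(self) -> float:
        # Core-hours per simulated year: cost of the run.
        return self.cores * self.wallclock_hours


def parallel_efficiency(base: Run, run: Run) -> float:
    # Baseline cost divided by the cost of `run`; 1.0 means perfect scaling.
    return base.chsy / run.chsy


def relative_change(old: float, new: float) -> float:
    # Percentage change from `old` to `new`.
    return 100.0 * (new - old) / old


# Hypothetical example: four times the cores, but only 2.5 times the speed.
baseline = Run(cores=128, wallclock_hours=4.0)
scaled = Run(cores=512, wallclock_hours=1.6)

print(f"throughput change: {relative_change(baseline.sypd, scaled.sypd):.1f} %")
print(f"cost change:       {relative_change(baseline.chsy, scaled.chsy):.1f} %")
print(f"parallel efficiency: {parallel_efficiency(baseline, scaled):.3f}")
```

In this made-up case the larger run is faster but costs more core-hours per simulated year, which is the situation a combined metric is meant to weigh.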
Status: open (until 02 Mar 2025)
-
CEC1: 'Comment on gmd-2024-155: No compliance with the policy of the journal', Juan Antonio Añel, 29 Oct 2024
reply
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have archived your code on Git repositories in servers that do not comply with the standards for long-term archival and accessibility. Therefore, please publish your code (EC-Earth versions and the prediction script) in one of the appropriate repositories and reply to this comment with the relevant information (link and a permanent identifier for it (e.g. DOI)) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy. Therefore, the current situation with your manuscript is irregular.
In this way, if you do not fix this problem, we will have to reject your manuscript for publication in our journal.
Also, you must include the modified Code and Data availability sections in a potentially reviewed manuscript, with the DOI of the code.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/gmd-2024-155-CEC1 -
AC1: 'Reply on CEC1', Sergi Palomas, 14 Nov 2024
reply
Good afternoon,
Apologies for any inconvenience. The code is now available on a FAIR-aligned platform, Zenodo.
Here is the DOI: https://doi.org/10.5281/zenodo.14163512
Please let me know if you would like me to upload a revised manuscript with the updated "Code Availability" section, and I'll be happy to do so.
Best regards,
Sergi
Citation: https://doi.org/10.5194/gmd-2024-155-AC1 -
CEC2: 'Reply on AC1', Juan Antonio Añel, 15 Nov 2024
reply
Dear authors,
Unfortunately, your reply does not address the issues I pointed out in my previous comment, and the new repository only contains the prediction script, which is useless to replicate your work. As I made clear in my previous comment, you must include in the repository the code of the EC-Earth3 model that you use in your work.
Regards,
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/gmd-2024-155-CEC2 -
AC2: 'Reply on CEC2', Sergi Palomas, 18 Nov 2024
reply
Dear Juan,
Apologies for the confusion. Regarding the EC-Earth3 model used in our work, we suggest the following wording for the "Code Availability" section of the manuscript (which includes the new repository for the prediction script and references to the Autosubmit workflow manager):
"""
The source code for the prediction script is publicly available at: https://doi.org/10.5281/zenodo.14163512 (Palomas, 2024).
The EC-Earth3 source code is accessible to members of the consortium through the EC-Earth development portal. Access to the EC-Earth3 source code can be requested from the EC-Earth community via the EC-Earth website: http://www.ec-earth.org (last access: 18 November 2024). Model codes developed at ECMWF, such as the IFS atmospheric model, are the intellectual property of ECMWF and its member states. Therefore, access to the EC-Earth3 source code requires signing a software license agreement with ECMWF.
The version of EC-Earth used in this study is tagged as 3.3.3.1 in the repository.
The Autosubmit workflow manager is available as a Python package on PyPi (https://pypi.org/project/autosubmit/, last access: 18 November 2024), with its documentation and user guide hosted at https://autosubmit.readthedocs.io/en/master/ (last access: 18 November 2024).
"""
In LaTeX:
\codeavailability{The source code for the prediction script is publicly available at: https://doi.org/10.5281/zenodo.14163512 \citep{prediction-script-zenodo}.
The EC-Earth3 source code is accessible to members of the consortium through the EC-Earth development portal. Access to the EC-Earth3 source code can be requested from the EC-Earth community via
the EC-Earth website: \url{http://www.ec-earth.org} (last access: 18 November 2024). Model codes developed at ECMWF, such as the IFS atmospheric model, are the intellectual property of ECMWF and its member states. Therefore, access to the EC-Earth3 source code requires signing a software license agreement with ECMWF.
The version of EC-Earth used in this study is tagged as 3.3.3.1 in the repository.
The Autosubmit workflow manager is available as a Python package on PyPi (\url{https://pypi.org/project/autosubmit/}, last access: 18 November 2024), with its documentation and user guide hosted at \url{https://autosubmit.readthedocs.io/en/master/} (last access: 18 November 2024).}
This requires adding a new entry in the bibliography:
@misc{prediction-script-zenodo,
Howpublished = {sergipalomas/auto-lb\_prediction-script: version for publication (v1.0)},
author = {Palomas, S},
year = {2024},
doi = {10.5281/zenodo.14163512}
}
Which results in this line in the References:
Palomas, S.: sergipalomas/auto-lb_prediction-script: version for publication (v1.0), https://doi.org/10.5281/zenodo.14163512, 2024.
We believe this aligns with what has been accepted in similar publications such as https://gmd.copernicus.org/articles/15/2973/2022/
Best regards,
Sergi
Citation: https://doi.org/10.5194/gmd-2024-155-AC2 -
CEC3: 'Reply on AC2', Juan Antonio Añel, 04 Dec 2024
reply
Dear authors,
Regarding your reply, it would be better if you stored the version of EC-Earth that you use in this work in a Zenodo private repository. In this way, we are sure that it is permanently stored and located with a DOI, and in the meantime you keep control over who can access it.
Please let us know if there is something that prevents you from doing this.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/gmd-2024-155-CEC3 -
AC3: 'Reply on CEC3', Sergi Palomas, 12 Dec 2024
reply
Dear Juan,
Unfortunately, we do not have the rights to upload the EC-Earth code to a private repository. Developing and using EC-Earth requires a software license from ECMWF to cover the atmosphere model component, IFS. This license is managed directly through the EC-Earth portal. Detailed information on obtaining the license can be found here: https://dev.ec-earth.org/projects/ecearth3/wiki/How_to_get_a_software_license_from_ECMWF.
If access to the EC-Earth code is required for review purposes, we can facilitate this through the appropriate channels.
I hope this better clarifies the situation. If this needs to be explicitly mentioned in the "Code Availability" section, we are happy to update it accordingly.
Best regards,
Sergi
Citation: https://doi.org/10.5194/gmd-2024-155-AC3 -
CC1: 'Reply on AC3', Etienne Tourigny, 13 Dec 2024
reply
The linked webpage is only accessible to registered users of the EC-Earth development portal. We can provide details on the procedure upon request.
Citation: https://doi.org/10.5194/gmd-2024-155-CC1
-
RC1: 'Comment on gmd-2024-155', Vadym Aizinger, 01 Feb 2025
reply
General comments:
The paper addresses a very significant issue: load imbalance in (large) parallel runs of coupled ESMs. The proposed solution and the developed tool represent a meaningful contribution to alleviating the waste of computational resources and giving users of ESMs better control over coupled simulation runs and their overhead.
The current version of the paper is quite polished and, with the exception of a handful of mostly technical errors, nearly ready for publication.
A suggestion that the authors might want to briefly discuss in the outlook: How difficult would it be to extend the proposed methodology to heterogeneous/hybrid architectures (e.g. CPU-GPU systems)?
Specific comments:
I suggest including a column with parallel efficiency in Table 1.
Technical comments:
- 'can not' should be 'cannot' at several places in text
- lines 178-181: since a single node is taken as the baseline, processors and processes should be replaced with nodes
- line 201: remove the multiplication dot in the denominator
- lines 313-314: it seems that Figures 5a and 5b should be referenced there instead of Figures 10a and 10b
- line 332: 'worse the original...' should read 'worse than the original...'
- Section 5.3: Figure 10 should probably be referenced somewhere within this section
Citation: https://doi.org/10.5194/gmd-2024-155-RC1 -
AC4: 'Reply on RC1', Sergi Palomas, 10 Feb 2025
reply
Please find the answers from the authors in the attached document.
-
RC2: 'Reply on AC4', Vadym Aizinger, 10 Feb 2025
reply
Thank you, the changes look very good.
Citation: https://doi.org/10.5194/gmd-2024-155-RC2
Model code and software
Prediction script Sergi Palomas https://earth.bsc.es/gitlab/spalomas/prediction-script
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
341 | 98 | 59 | 498 | 8 | 8 |