Refactoring the EVP solver for improved performance – a case study based on CICE v6.5
Abstract. This study focuses on the performance of the sea ice model CICE and its Elastic-Viscous-Plastic (EVP) dynamical solver. The study has been conducted in two steps. First, the standard EVP solver was extracted from CICE for experiments with refactored versions of it. Second, one refactored version was integrated and tested as part of the full model. Two dominant bottlenecks were revealed. The first is the number of MPI and OpenMP synchronization points required for halo exchanges during each time step, combined with the irregular domain of active sea ice points. The second is the lack of Single Instruction Multiple Data (SIMD) code generation.
The study refactors the standard EVP solver based on two generic patterns. The first pattern shows how general finite-difference operations on masked multi-dimensional arrays can be expressed in order to produce significantly better code generation; the primary change is that the memory access pattern goes from random access to direct access. The second pattern presents an alternative approach to handling static grid properties.
The measured single-core performance improves by more than a factor of five compared to the standard implementation. The refactored implementation strong-scales on an Intel® Xeon® Scalable Processor Series node until the available memory bandwidth of the node is exhausted. For the Intel® Xeon® CPU Max Series there is sufficient bandwidth to allow the strong scaling to continue across all the cores of the node, resulting in a single-node improvement factor of 35 over the standard implementation.
This study also shows improved performance on GPU processors.
Status: final response (author comments only)
CEC1: 'Comment on gmd-2024-40 - No compliance with GMD's policy', Juan Antonio Añel, 11 May 2024
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have not published the new code developed for your manuscript. This is mandatory before submission of the manuscript according to our policy, and you must include in the code and data availability section the link and DOI for the permanent repository where it is published.
Therefore, please, publish your code in one of the appropriate repositories, and reply to this comment with the relevant information (link and DOI) as soon as possible, as it should be available before the Discussions stage. Also, please, include the relevant primary input/output data.
If you do not fix this problem, we will have to reject your manuscript for publication in our journal. I should note that, given this lack of compliance with our policy, your manuscript should not have been accepted in Discussions. Therefore, the current situation with your manuscript is irregular.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/gmd-2024-40-CEC1
AC1: 'Reply on CEC1', Till Rasmussen, 11 May 2024
Dear Juan Antonio Añel/Editor
I replied by email in order to clarify the missing items. I hope that you have received it.
Best regards and thank you
Till Rasmussen
Citation: https://doi.org/10.5194/gmd-2024-40-AC1
CEC2: 'Reply on AC1', Juan Antonio Añel, 12 May 2024
Dear authors,
I have not received any such email. Anyway, Discussions is the right forum to discuss the issues of your submitted work, so please post your comments here.
Regards,
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/gmd-2024-40-CEC2
AC2: 'Reply on CEC2', Till Rasmussen, 12 May 2024
Dear Editor
We acknowledge your comment and will make an effort to fix this, but we are not entirely sure what is needed, as we have added the reference for the unit test, which is the primary part of the work. The unit test that all performance tests are built on is located as referenced in Rasmussen (24), which refers to a Zenodo archive:
Rasmussen, T. A. S., Poulsen, J., Ribergaard, M. H., & Rethmeier, S. (2024). dmidk/cice-evp1d: Unit test refactorization of EVP solver CICE (refactorevp1d_v0.1). EGU V. Zenodo. https://doi.org/10.5281/zenodo.10782548
Maybe there is a mistake in the reference Rasmussen (24) that is referred to in the code availability section of the manuscript? This is the main repository for this manuscript, and it includes the source code needed in order to run the performance test.
Integration test
The software is integrated back into CICE version 6.5.0. This version number is usually updated once or twice a year; however, no new version has been created since this was added to the code. The code is accepted in the repository, see:
https://github.com/CICE-Consortium/CICE/tree/main/cicecore/cicedyn/dynamics
Do we need a new version number, or can we "just" refer to a certain tag?
Best regards and thank you for your reply
Till Rasmussen
Citation: https://doi.org/10.5194/gmd-2024-40-AC2
CEC3: 'Reply on AC2', Juan Antonio Añel, 13 May 2024
Dear authors,
We cannot accept a tag in GitHub for your submitted work. The fact that the official repository of the model has not been updated is not an excuse not to publish the code if you want to submit a manuscript describing it to our journal. You must publish the model and the code developed in a permanent repository that complies with our policy, which GitHub does not. Otherwise, we will have to reject your manuscript for publication.
Regards,
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/gmd-2024-40-CEC3
AC3: 'Reply on CEC3', Till Rasmussen, 23 May 2024
Dear Editor
We have now collected the source code and data sets for demonstration of the 1D EVP solver. The title should be changed in order to refer to the new version of CICE (6.5.1).
The references are:
Unit test code for the performance results:
Rasmussen, T. A. S., Poulsen, J., Ribergaard, M. H., & Rethmeier, S. (2024). dmidk/cice-evp1d: Unit test refactorization of EVP solver CICE (refactorevp1d_v0.1). EGU V. Zenodo. https://doi.org/10.5281/zenodo.10782548
Input data for these tests:
Rasmussen, T. A. S. (2024). Input data for 1d EVP model [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11248366
Reference to CICE 6.5.1:
Elizabeth Hunke, Richard Allard, David A. Bailey, Philippe Blain, Anthony Craig, Frederic Dupont, Alice DuVivier, Robert Grumbine, David Hebert, Marika Holland, Nicole Jeffery, Jean-Francois Lemieux, Robert Osinski, Steketee, A., Till Rasmussen, Mads Ribergaard, Roach, L., Andrew Roberts, Matthew Turner, … Worthen, D. (2019). CICE-Consortium/CICE: CICE Version 6.5.1 (6.5.1). Zenodo. https://doi.org/10.5281/zenodo.11223920
Input data for the QC test:
CICE Consortium. (2021). CICE gx1 Grid and Initial Condition Data - 2021.08.16 (2021.08.16) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5208241
CICE Consortium. (2023). CICE gx1 JRA55do Forcing Data by year - 2023.07.03 (2023.07.03) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8118062
Best regards, Till
Citation: https://doi.org/10.5194/gmd-2024-40-AC3
CEC4: 'Reply on AC3', Juan Antonio Añel, 23 May 2024
Dear authors,
Many thanks for replying, storing the code in an acceptable repository, and solving this situation. We can now consider that the current version of your manuscript complies with our code and data policy.
Regards,
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/gmd-2024-40-CEC4
RC1: 'Comment on gmd-2024-40', Anonymous Referee #1, 24 May 2024
General comments
This paper provides a detailed description of the optimization of a component, the EVP dynamical solver, within a sea ice model for Earth system modeling, specifically focusing on Intel hardware, including CPUs and GPUs. The authors have achieved a 5x performance improvement on the specified hardware. While the optimization techniques are not new, their application to this particular software is, to the best of the referee's knowledge, original. The paper is well and clearly written and is certainly of general interest.
Specific comments
- All the performance results are normalized to the reference implementation. Although the comparison to the STREAM benchmark makes the argument about the memory bandwidth optimization convincing, a few absolute numbers in terms of achieved versus theoretical memory bandwidth could make the paper of broader interest. If some performance metric numbers are available, it would be good to provide some, or to show some of the optimizations on a roofline model plot.
- The paper only considers Intel hardware. This is OK, also considering that one of the authors is from Intel; nonetheless, further comments should be made on applicability to other hardware, in particular NVIDIA or AMD GPUs, since these may be more broadly available in the ESM HPC community. There is a reference to oneAPI; it is not clear from the paper if the OpenMP target is using some proprietary extension or if this should work out of the box on other hardware.
- l.369: "3.8 when AVX2 is used and 5.1 when aiming at AVX512": most of the argumentation is that the performance increase is gained because of better memory usage. It is therefore not clear why there is such a gain from vectorization; this would need a comment.
Technical corrections
- p1: One could say in the abstract, and maybe in the title, that CICE is a sea ice model.
- l.446: "it enforces lower frequency": maybe comment on whether this is a strict hardware requirement, or whether it could be deactivated?
- l.464: "The goal is improve performance of the full model " -> "The goal is to improve performance of the full model "
Citation: https://doi.org/10.5194/gmd-2024-40-RC1
AC4: 'Reply on RC1', Till Rasmussen, 08 Jun 2024
We would like to thank the reviewer for the comments. Replies can be found inline below.
- All the performance results are normalized to the reference implementation. Although the comparison to the STREAM benchmark makes the argument about the memory bandwidth optimization convincing, a few absolute numbers in terms of achieved versus theoretical memory bandwidth could make the paper of broader interest. If some performance metric numbers were available, it would be good to provide some, or to show some of the optimizations on a roofline model plot.
Response: We will add absolute numbers as requested, either as a plot or as a table.
- The paper only considers Intel hardware. This is OK, also considering that one of the authors is from Intel; nonetheless, further comments should be made on applicability to other hardware, in particular NVIDIA or AMD GPUs, since these may be more broadly available in the ESM HPC community. There is a reference to oneAPI; it is not clear from the paper if the OpenMP target is using some proprietary extension or if this should work out of the box on other hardware.
Response: The implementation uses only open standards; there are no proprietary extensions. Therefore, it is expected that similar results can be achieved on e.g. NVIDIA or AMD hardware. We will add a comment in order to clarify this.
- l.369: "3.8 when AVX2 is used and 5.1 when aiming at AVX512": most of the argumentation is that the performance increase is gained because of better memory usage. It is therefore not clear why there is such a gain from vectorization; this would need a comment.
Response: SIMD is a matter of both compute instructions and memory instructions, and it takes AVX512 load/store instructions to drive the high bandwidth of High Bandwidth Memory. We will add a comment to clarify this.
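For illustration, a minimal Fortran sketch (hypothetical names, not the actual CICE kernel) of the kind of loop this enables: once the active points are packed contiguously, the loop body contains no indirect indexing, so the compiler can emit packed SIMD loads and stores; with AVX512, each such instruction moves eight doubles.

```fortran
! Minimal sketch (hypothetical names, not the actual CICE kernel): a
! contiguous loop over the packed active points. With no indirect
! indexing in the body, the compiler can emit packed SIMD loads and
! stores (e.g. 512-bit loads/stores when compiling for AVX512).
subroutine stress_sketch(na, strength, zeta, ee)
   integer, intent(in)    :: na            ! number of active points
   real(8), intent(in)    :: strength(na)
   real(8), intent(out)   :: zeta(na)
   real(8), intent(inout) :: ee(na)
   integer :: iw
   !$omp simd
   do iw = 1, na
      zeta(iw) = 0.5d0 * strength(iw)
      ee(iw)   = ee(iw) + zeta(iw)
   end do
end subroutine stress_sketch
```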
Technical corrections
- p1: One could say in the abstract, and maybe in the title, that CICE is a sea ice model.
Response: This will be added.
- l.446: "it enforces lower frequency": maybe comment on whether this is a strict hardware requirement, or whether it could be deactivated?
Response: This is a general hardware feature that ensures the hardware does not overheat. It cannot be deactivated. We will add a comment.
For Intel this is called RAPL. See e.g. https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/running-average-power-limit-energy-reporting.html
- l.464: "The goal is improve performance of the full model " -> "The goal is to improve performance of the full model "
Response: This will be corrected.
Citation: https://doi.org/10.5194/gmd-2024-40-AC4
RC2: 'Comment on gmd-2024-40', Anonymous Referee #2, 07 Jun 2024
This paper analyzes the performance of the Elastic-Viscous-Plastic (EVP) solver from the CICE model. The EVP solver was extracted and refactored in order to improve performance through the change of memory access patterns from random to direct. The speed-up is 5 times on the 3rd Gen Intel Xeon Scalable processor, 13 times on the 4th Gen Intel Xeon processor, and 35 times on the Intel Xeon CPU Max Series due to the memory bandwidth differences. Improved performance can also be seen on GPU processors. This is a good advance for the EVP solver within CICE model development. However, there are some major issues to be addressed.
- This paper tests the CPU and GPU together (Fig. 6). Numerical accuracy is very important for the EVP solver. Is this a fair comparison? Particularly, ESMs are very sensitive to CICE. Lines 340-342 mention that "bit-for-bit" identical results are achieved in CICE (or just the algorithm itself). Can the authors confirm that identical results are achieved when they compare the performance, even for the GPU computation? That means you also use double precision? Did I miss something? If not, can the authors provide the accuracy level for double precision in CICE6? Is the performance affected by the accuracy used? Also, this study only considers Intel processors. What about other processors? If memory bandwidth is the only concern here, the brand dependence should be small. The authors should address and discuss these issues to ensure the robustness of the proposed change.
- This paper extracts the EVP solver from CICE6. The major improvement is done only within the single processor. Is this practical for a real application? Particularly, CICE6 currently uses a structured grid. The standard procedure uses indxti(ij) in order to avoid land points or unused points. Is the refactored algorithm in Listings 3 and 4 effective? For example, what if your ee(iw), iw and iw+1 are not pointing to neighboring points in the array? Also, the size of ee changes at every time step when considering this. Are you sure your algorithm will be faster? These real applications should be discussed.
- This paper only shows single-node performance. However, CICE uses multi-domain MPI. In practice, CICE uses hundreds of processes. Some discussion should be included on the performance if multiple nodes are used. What about the land or non-active points at different times? These points change with time, right? Do you consider the additional time spent tracking these changes? What about the load balance among different nodes/cores?
- Can the authors address the overall performance improvements of CICE based on this new refactorization? Or is the performance test based on the 5-year simulation in the CICE gx1 domain? There is no discussion of how the performance results are obtained. What is the model setup? Is this using the standard CICE example? The performance improvement in the global CICE example should be demonstrated. I suggest giving the timing of how much this can save of the standard CICE gx1 overall simulation time (e.g., the 5-year simulation in Fig. 2).
Finally, there are some grammatical errors within the text. The text and discussion also require some reorganization and better connections for improved presentation. Some paragraphs have only one sentence, e.g., lines 365, 395, 426, and 508. Is this on purpose? Further improvement of the English and careful proofreading by a native speaker are required. This paper is appropriate for publication in GMD after the above major issues and the following comments are considered.
- Lines 50-54 discuss how the number of active sea ice points varies in time. The algorithm doesn't consider this overhead in CICE. Can the authors discuss how that can affect the generated 1D array of ee in real time?
- Line 115: missing word after "Reinders and". Please check the wording throughout the manuscript.
- Section 3.2.2: the title of this section is "Capacity scaling-OpenMP and OpenMP target"; however, I didn't see any OpenMP discussed here.
- Line 437: I do agree that this approach is ideal for an unstructured grid. However, CICE is still based on a structured grid. Some comments should be included.
- Lines 446-450: these paragraphs are unclear. What do you mean by "Combined, they lead to slower overall execution"?
- Line 450: floating point operations are discussed here. Do you use the same floating point operations in the CPU and GPU configurations?
- Lines 457-460 discuss a major issue about cross-node communication. In the structured grid, we can easily define the halo region. However, the current change seems more ideal for an unstructured grid arrangement and parallelization. The improvement in a real CICE application could be better demonstrated.
- Lines 479-483: this section discusses MPMD parallelization. However, it is not clear to me how this can be implemented within an ESM and CICE. How does this discussion connect to this study?
- Lines 484-491: this discussion is very important. The overhead in the selected strategy needs to be addressed. Several potential strategies are mentioned but not fully addressed. We can see that the refactorization can enhance single-node performance. However, is this approach ideal for a real CICE application? The improvement in the global CICE model should be demonstrated.
Citation: https://doi.org/10.5194/gmd-2024-40-RC2
AC5: 'Reply on RC2', Till Rasmussen, 14 Jun 2024
Response: We would like to thank the reviewer for the comments.
Responses to the questions below are marked in bold. We have added relevant comments to the manuscript and clarified where needed based on these comments/questions. Specific comments are answered below.
A general comment: the prime objective of this manuscript is to describe the performance of a unit test of the EVP solver and not the full CICE model. The aim of the implementation within CICE is to demonstrate that it is functional, not to show "perfect" performance. Improved performance in the full system requires handling of e.g. MPI, as is also noted by the reviewer. Performance of the full CICE model is beyond the scope of this manuscript.
This paper analyzes the performance of the Elastic-Viscous-Plastic (EVP) solver from the CICE model. The EVP solver was extracted and refactored in order to improve performance through the change of memory access patterns from random to direct. The speed-up is 5 times on the 3rd Gen Intel Xeon Scalable processor, 13 times on the 4th Gen Intel Xeon processor, and 35 times on the Intel Xeon CPU Max Series due to the memory bandwidth differences. Improved performance can also be seen on GPU processors. This is a good advance for the EVP solver within CICE model development. However, there are some major issues to be addressed.
This paper tests the CPU and GPU together (Fig. 6). Numerical accuracy is very important for the EVP solver. Is this a fair comparison?
Response: It is true that performance results are shown for both CPUs and GPUs, and it is a bit unfair to compare one CPU node to 1-2 nodes of GPUs. The aim is to show that the scalability/performance can be achieved on both types of processors, especially with high-bandwidth CPUs, which are somewhat similar to GPUs.
Comparing different hardware types to each other is not entirely fair, as there are other issues that affect the choice of the right hardware, such as price or energy consumption. We will add a comment on this.
Particularly, ESMs are very sensitive to CICE. Lines 340-342 mention that "bit-for-bit" identical results are achieved in CICE (or just the algorithm itself). Can the authors confirm that identical results are achieved when they compare the performance, even for the GPU computation?
Response: Bit-for-bit reproducibility is ONLY measured for the EVP kernel and only across the 7 different implementations on the CPU. Small changes to the results are expected when hardware is changed (e.g. CPU to GPU). In addition, it is not possible to run the baseline test (v0) on GPUs, thus it is not possible to make a bit-for-bit comparison for GPUs.
The aim of the test is to verify the correctness of the EVP refactorization. First, it ensures that the changes to the Fortran code and the data structures used do not change the results. Second, it ensures that the threading introduced in the code does not change the results. This paper focuses on the EVP algorithm, but the considerations will be the same when bit-for-bit reproducibility is discussed for the full CICE model.
The algorithm is bit-for-bit reproducible with conservative flags and no optimization. When optimization flags are used, bit-for-bit results cannot be guaranteed, but the solution should not change significantly. This is the aim of the five-year run with CICE shown in Figure 2. The implementation into CICE is not the focus, but it was necessary in order to show that the full model was still functioning correctly with the new algorithm. Optimization of the implementation into the full CICE code is out of scope for this manuscript. Efficiency of the integration is our next step, which will also allow for simulations on multiple nodes.
That means you also use double-precision? Did I miss something? If not, can the authors provide the accuracy level for double precision in CICE6? Is the performance affected by the accuracy used?
Response: Yes, we use double precision (the same as in the rest of CICE), and this study is confined to that. Performance is affected by the choice of accuracy (single versus double precision). Lower accuracy would improve the performance, as it puts less pressure on the bandwidth. If one wants to implement this, it is important to check that no significant changes are seen in the results. For this reason, some variables may have to be kept in double precision at all times.
Also, this study only considers Intel processors. What about other processors? If memory bandwidth is the only concern here, the brand dependence should be small. The authors should address and discuss these issues to ensure the robustness of the proposed change.
Response: It is correct that the brand or choice of CPU influences the performance effect. Performance measurements are confined to the hardware and software that is tested. For instance, this study compares two types of hardware where bandwidth is the only difference, and this changes performance significantly. None of the changes to the source code are specific to the hardware, and the relative performance of the STREAM Triad on a given CPU will be a good proxy for estimating the performance effect of the refactoring on that CPU.
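For reference, the STREAM Triad kernel used here as the bandwidth proxy is simply the following loop (a minimal sketch):

```fortran
! Minimal sketch of the STREAM Triad kernel: a(i) = b(i) + q*c(i),
! i.e. two loads and one store per iteration, so its throughput is
! essentially a measure of sustained memory bandwidth.
subroutine triad(n, q, a, b, c)
   integer, intent(in)  :: n
   real(8), intent(in)  :: q, b(n), c(n)
   real(8), intent(out) :: a(n)
   integer :: i
   do i = 1, n
      a(i) = b(i) + q * c(i)
   end do
end subroutine triad
```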
This paper extracts the EVP solver from CICE6. The major improvement is done only within the single processor. Is this practical for a real application?
Response: We consider this approach rather common (e.g. the ESCAPE projects at ECMWF). It is also efficient to focus on one part of the code and its refactoring. Having said that, it is indeed a valid point that once the refactoring is completed there will be another phase that focuses on improving the integration into CICE. As mentioned, this is beyond the scope of this manuscript, but the discussion includes some thoughts on this topic.
Particularly, CICE6 currently uses a structured grid. The standard procedure uses indxti(ij) in order to avoid land points or unused points. Is the refactored algorithm in Listings 3 and 4 effective?
For example, what if your ee(iw), iw and iw+1 are not pointing to neighboring points in the array? Also, the size of ee changes at every time step when considering this. Are you sure your algorithm will be faster? These real applications should be discussed.
Response: Yes, the refactored algorithm is indeed effective and significantly faster, as we illustrate with the performance measurements. The stencil operations will point to non-contiguous memory cells, but this is handled with gather operations, and reading the data from neighboring cells is not dominant. We quantify the gather reads relative to the total in the paper to address this concern. Also, please note that there are no write operations into neighboring cells.
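A minimal sketch of the pattern (hypothetical names; the actual loops are shown in Listings 3 and 4 of the manuscript): neighbor indices are precomputed once, reads from neighbors become gather loads, and every write is a direct store into the local cell.

```fortran
! Hypothetical sketch of the refactored stencil pattern: the active
! points are packed into 1D arrays, the index of each eastern
! neighbor is precomputed once, reads from neighbors are gather
! loads, and all writes go directly into the local cell.
subroutine stencil_sketch(na, i_east, uvel, ee)
   integer, intent(in)    :: na
   integer, intent(in)    :: i_east(na)   ! precomputed neighbor index
   real(8), intent(in)    :: uvel(na)
   real(8), intent(inout) :: ee(na)
   integer :: iw
   do iw = 1, na
      ! gather read from the neighbor, direct write to ee(iw)
      ee(iw) = ee(iw) + uvel(i_east(iw)) - uvel(iw)
   end do
end subroutine stencil_sketch
```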
This paper only shows single-node performance. However, CICE uses multi-domain MPI. In practice, CICE uses hundreds of processes. Some discussion should be included on the performance if multiple nodes are used.
Response: Multi-node performance is out of scope, since this belongs to the upcoming integration phase. The manuscript includes a discussion of how we intend to integrate the refactored code properly performance-wise, and only after that point will it make sense to do a multi-node performance study.
What about the land or non-active points at different times? These points change with time, right? Do you consider the additional time spent tracking these changes? What about the load balance among different nodes/cores?
Response: The conversion back and forth from 2D is not included in the timings. In general, the 2D grid is an abstraction for the computer; ideally, the full code would use 1D arrays. Land points are always eliminated in the 1D solver and are not tracked. Different approaches can be used to distinguish between active (ice-covered) cells and non-active, ice-free cells: one can allocate all water points, or only active points, and if the number of active points increases above a maximum limit, one can reallocate. The most efficient method depends on the domain. Load balancing is tricky for sea ice models in both 1D and 2D. The 1D vector is more ideal for OpenMP, as there are fewer inactive points. The load balance for MPI is our next step, which we discuss in the manuscript.
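A minimal sketch (hypothetical names) of the 2D-to-1D conversion described above: the mask of active cells is scanned once, land and inactive cells are dropped, and the surviving cells are packed contiguously so the solver loops run over 1D arrays of length na.

```fortran
! Hypothetical sketch: build the 1D packing of active cells from a
! 2D mask. The lookup arrays i2d/j2d allow conversion back to 2D
! after the 1D solver has run.
subroutine build_1d_map(nx, ny, active, na, i2d, j2d)
   integer, intent(in)  :: nx, ny
   logical, intent(in)  :: active(nx, ny)
   integer, intent(out) :: na                      ! number of active points
   integer, intent(out) :: i2d(nx*ny), j2d(nx*ny)  ! 1D index -> 2D (i,j)
   integer :: i, j
   na = 0
   do j = 1, ny
      do i = 1, nx
         if (active(i, j)) then
            na = na + 1
            i2d(na) = i
            j2d(na) = j
         end if
      end do
   end do
end subroutine build_1d_map
```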
Can the authors address the overall performance improvements of CICE based on this new refactorization? Or is the performance test based on the 5-year simulation in the CICE gx1 domain? There is no discussion of how the performance results are obtained. What is the model setup? Is this using the standard CICE example? The performance improvement in the global CICE example should be demonstrated. I suggest giving the timing of how much this can save of the standard CICE gx1 overall simulation time (e.g., the 5-year simulation in Fig. 2).
Response: Performance testing has been based on a unit test of the EVP solver. The 5-year simulation is only there to show that the integration into CICE fulfills the requirement that the optimized algorithm does not create large changes in the model results. The gx1 domain is relatively small and thus not ideal for performance testing. The RASM and the DMI model setups are used instead, see Figure 1. These are both regional domains, but for this purpose, where we want to put pressure on the system, they have more active points.
Finally, there are some grammatical errors within the text. The text and discussion also require some reorganization and connection for a better presentation. Some paragraphs have only 1 sentence, e.g., line 365, 395, 426, 508. Is this on purpose?
Response: Lines 365 and 426 are artifacts of the figure being moved. Lines 395 and 508 will be connected to the paragraphs above/below.
Further improvement of the English and careful proofreading by a native speaker are required. This paper is appropriate for publication in GMD after the above major issues and the following comments are considered.
1. Lines 50-54 discuss how the number of active sea ice points varies in time. The algorithm doesn't consider this overhead in CICE. Can the authors discuss how that can affect the generated 1D array of ee in real time?
Response: None of our performance tests include the variation within CICE. This is beyond the scope of the paper, since a proper MPI strategy must be implemented first. The generated 1D vectors (including ee) will all have the same length.
2. Line 115: missing word after "Reinders and". Please check the wording throughout the manuscript.
Response: Fixed
3. Section 3.2.2: the title of this section is "Capacity scaling-OpenMP and OpenMP target"; however, I didn't see any OpenMP discussed here.
Response: The heading has been changed to match the content.
4. Line 437: I do agree that this approach is ideal for an unstructured grid. However, CICE is still based on a structured grid. Some comments should be included.
Response: The computer only sees data as 1D arrays in memory, regardless of the data structure. We will add comments.
5. Lines 446-450: these paragraphs are unclear. What do you mean by "Combined, they lead to slower overall execution"?
Response: To be updated.
6. Line 450: floating point operations are discussed here. Do you use the same floating point operations in the CPU and GPU configurations?
Response: Yes, the Fortran code within the OpenMP scope is the same for both CPU and GPU. The only difference is the OpenMP decoration around it.
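As a minimal sketch of what this means in practice (hypothetical names, not the actual CICE routines), the same loop body appears in both variants; only the plain OpenMP directives around it differ, with no proprietary extensions.

```fortran
! CPU variant: threads across the cores of one node.
subroutine substep_cpu(na, dte, rhs, ee)
   integer, intent(in)    :: na
   real(8), intent(in)    :: dte, rhs(na)
   real(8), intent(inout) :: ee(na)
   integer :: iw
   !$omp parallel do simd
   do iw = 1, na
      ee(iw) = ee(iw) + dte * rhs(iw)
   end do
end subroutine substep_cpu

! GPU variant: the identical loop body, offloaded with OpenMP target.
subroutine substep_gpu(na, dte, rhs, ee)
   integer, intent(in)    :: na
   real(8), intent(in)    :: dte, rhs(na)
   real(8), intent(inout) :: ee(na)
   integer :: iw
   !$omp target teams distribute parallel do simd map(to: rhs) map(tofrom: ee)
   do iw = 1, na
      ee(iw) = ee(iw) + dte * rhs(iw)
   end do
end subroutine substep_gpu
```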
Lines 457-460 discuss a major issue about cross-node communication. In the structured grid, we can easily define the halo region. However, the current change seems more ideal for an unstructured grid arrangement and parallelization. The improvement in a real CICE application could be better demonstrated.
Response: It is beyond the scope of the manuscript to optimize the MPI part of CICE; this is for the next phase. That being said, it is not optimal to have hundreds of MPI communications per time step with limited calculations within the EVP solver. This is the reason why OpenMP is by far the preferred communication method as long as execution stays on one node. The current OpenMP in CICE does not change the communications, as the blocks do not share memory. The plan is to implement local MPI for the different parts of CICE.
7. Lines 479-483: this section discusses MPMD parallelization. However, it is not clear to me how this can be implemented within an ESM and CICE. How does this discussion connect to this study?
Response: The aim of MPMD is to run with local MPI configurations within the different parts of CICE or the ESM. This will allow each part of the code to be executed on a limited number of nodes, which simplifies the communication and makes it cheaper. We have clarified this point in the text.
8. Lines 484-491: this discussion is very important. The overhead in the selected strategy needs to be addressed. Several potential strategies are mentioned but not fully addressed. We can see that the refactorization can enhance single-node performance. However, is this approach ideal for a real CICE application? The improvement in the global CICE model should be demonstrated.
Response: It is beyond the scope of this manuscript to demonstrate the full performance. We have clarified this point in the text.
Citation: https://doi.org/10.5194/gmd-2024-40-AC5