A New End-to-End Workflow for the Community Earth System Model (version 2.0) for CMIP6

The complexity of each Coupled Model Intercomparison Project (CMIP) grows with every new generation. The Phase 5 effort saw a large increase in the number of experiments performed and the number of variables requested compared to its predecessor, Phase 3. Many centers were not prepared for this demand, which stressed the resources of several centers, including the National Center for Atmospheric Research (NCAR). During Phase 5, we missed several deadlines and struggled to deliver data to the community for analysis. In preparation for the current generation, Phase 6, we examined the weaknesses in our workflow and addressed the performance issues with new software tools. Through this investment, we were able to publish approximately six times the amount of data to the community compared to the volume we produced in the previous generation, and we accomplished this in one-third of the time, an 18 times speedup in throughput. This paper discusses the improvements we have made to accomplish this success for Phase 6 and further improvements we hope to make for the next generation.

Figure 1. This flowchart describes the tasks that are executed within the CESM workflow in order to generate data for CMIP. The diagnostics task can be executed several times while the CESM model run is executed; the remaining tasks are each executed once when the CESM model run completes.

Although several existing tools could perform this set of operations, none of them satisfied all of the necessary requirements. For example, while CDO minimized the number of times output files were opened and closed, it did not easily enable parallel execution. Conversely, while Pagoda offered parallel execution, it did not minimize the number of opens and closes. The XML IO Server (XIOS) (Meurdesoif, 2020) is an IO library that is able to write publication-ready output directly from the model. While XIOS provides excellent performance, implementing this method would have required us to rewrite the IO interface within all of the modeling components, which would have taken more FTEs than were allotted for this project. We therefore decided to develop our own tools based on Python and the Message Passing Interface (MPI) library (Gropp et al., 1999) to enable parallelism. We chose Python because of its flexibility, available libraries, and quick prototyping ability (Perez et al., 2011; Oliphant, 2007), and the MPI4Py (Dalcin, 2019) library to enable parallelism.
The publication of CMIP5 data contributions to the ESGF was also a bottleneck within the data workflow. During CMIP5, the ESGF software stack was stressed when multiple organizations attempted to publish large amounts of data at the same time. Over the past few years, a team of individuals from around the world has been improving the ESGF software stack (Abdulla, 2019). These process improvements, along with the post-processing tools we developed, are described in the following subsections.

Time Series Generation
The first step within our post-processing workflow involves a transformation of the raw CESM output data from time slice into time series. This operation is represented by the "Convert Output Data into Time Series Data" task within Figure 1. Each of the CESM components produces output files that contain multiple variables in one time-slice chunk. Unfortunately, this is not an ideal format for distribution, because scientists are typically interested in evaluating a handful of variables at multiple time steps. Instead, the data are reformatted into a time-series format, where each file contains one or more time slices of a single variable.
Interestingly, the conversion of time-slice to time-series data was the single most expensive component of the CMIP5 workflow. While this operation is embarrassingly parallel due to the lack of data dependencies between variables, the serial CMIP5 workflow used individual NCO commands that opened, read, and wrote each individual time slice. Consider the number of file operations necessary to convert an entire data set containing $num_{TS}$ time slices and $num_{var}$ variables from time-slice to time-series format. Using the serial CMIP5 workflow, the execution consisted of $2 \times num_{TS} \times num_{var}$ open and close operations and $num_{TS} \times num_{var}$ read and write operations. We were able to significantly reduce the number of these expensive disk I/O operations through the creation of the PyReshaper (Paul et al., 2015). We next describe the PyReshaper tool, which was adopted into the CESM post-processing framework (Bertini and Mickelson, 2019).

The approach used by the PyReshaper is illustrated in Figure 2. An MPI rank is assigned one or more fields to read from the time-slice files and write to the time-series files. Each MPI rank $i$ operates independently and performs $num^{i}_{var} \times num_{TS} + 1$ open and close operations and $num^{i}_{var} \times num_{TS} + 1$ read and write operations, where $num^{i}_{var}$ is the number of fields assigned to MPI rank $i$. Given a sufficient amount of memory, it is possible to further reduce the number of write operations by writing multiple time slices to the filesystem in a single call. This task-based parallelism supports execution on as many MPI ranks as there are fields in the input data set. Ideally, if all input fields were the same size and the cost of reading data from and writing data to the filesystem were negligible, it would be possible to achieve a maximum speedup of $num_{var}$. In practice, the input fields are not all the same size and the filesystem costs are not negligible. We next describe the actual speedup the PyReshaper approach enables.
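To make the task-parallel pattern concrete, the sketch below converts time-slice files into one time-series file per field using MPI4Py and netCDF4. It is a minimal illustration of the approach rather than the PyReshaper implementation; the input file pattern and the assumption that "time" is the leading dimension of each field are hypothetical.

```python
# Minimal sketch of task-parallel time-slice to time-series conversion in the
# spirit of the PyReshaper; not its actual implementation. Assumes mpi4py and
# netCDF4 are installed; the input file pattern is hypothetical.
from mpi4py import MPI
import netCDF4
import glob

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

slice_files = sorted(glob.glob("case.cam.h0.*.nc"))  # one time slice per file

# Identify the time-varying fields from the first slice file.
with netCDF4.Dataset(slice_files[0]) as first:
    fields = [name for name, var in first.variables.items()
              if "time" in var.dimensions and name != "time"]

# Round-robin partition: each rank handles its own subset of fields with no
# inter-rank communication, since the conversion is embarrassingly parallel.
for field in fields[rank::size]:
    with netCDF4.Dataset(field + ".nc", "w") as out:
        for t, path in enumerate(slice_files):
            with netCDF4.Dataset(path) as src:
                var = src.variables[field]
                if t == 0:
                    # Create dimensions and the output variable once; the
                    # time dimension is left unlimited so slices can append.
                    for dim in var.dimensions:
                        length = None if dim == "time" else len(src.dimensions[dim])
                        out.createDimension(dim, length)
                    out.createVariable(field, var.dtype, var.dimensions)
                # Assumes "time" is the leading dimension of each field.
                out.variables[field][t, ...] = var[0, ...]
```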

In the performance evaluation of the PyReshaper, we measured the time it took to convert 10 years of monthly atmospheric data into the time-series format. This test configuration represents the conversion of approximately 180 GBytes of input data. The conversion took approximately 5 1/2 hours using the existing serial method on NCAR's Cheyenne supercomputer (Cheyenne, 2017). Figure 3 illustrates the performance improvements of the PyReshaper tool over the existing method. Note that using 144 MPI ranks, we achieved the same conversion in approximately 4 1/2 minutes.

The large improvement seen between 72 and 144 ranks is an indication of a load imbalance in the partitioning of fields to MPI ranks within the PyReshaper tool. Because the algorithm does not account for differences in processing cost between variables, some ranks can end up with more expensive three-dimensional variables to process while others may get only two-dimensional variables.
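For contrast, a cost-aware assignment would mitigate this imbalance. The sketch below shows a standard greedy longest-task-first partition over hypothetical per-field costs; it is illustrative only and is not what the PyReshaper does.

```python
# Illustration of cost-aware partitioning (not used by PyReshaper): greedily
# assign the most expensive remaining field to the least-loaded rank, keeping
# per-rank costs close together. The field costs below are hypothetical.
import heapq

field_costs = {"T": 30, "U": 30, "V": 30, "Q": 30, "TS": 1, "PSL": 1}

def greedy_partition(costs, n_ranks):
    # Min-heap of (current load, rank id, assigned fields) per rank.
    heap = [(0, r, []) for r in range(n_ranks)]
    heapq.heapify(heap)
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, r, fields = heapq.heappop(heap)
        fields.append(name)
        heapq.heappush(heap, (load + cost, r, fields))
    return heap

for load, rank, fields in sorted(greedy_partition(field_costs, 2)):
    print(f"rank {rank}: load={load} fields={fields}")
```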

Diagnostics Generation
One of the main ways NCAR scientists evaluate the output of CESM is to run the component diagnostic packages. This is represented by the "Diagnostics" task within Figure 1. The diagnostics consist of four separate packages which are used to evaluate atmosphere, ocean, ice, and land model output. Each of these packages creates a set of average files, or climatology files, plots the data against observations or another model run, and then creates an HTML document that links all of the plot image files.
To do this, each of these packages used a combination of shell scripts, NCO, and NCL (NCL, 2019) to analyze the data. Each package requires different types of climatologies and plot types, which creates unique performance characteristics for each of the packages. While previous efforts had enabled parallelism in the workflow (Woitaszek et al., 2011; Jacob et al., 2012), this work presented its own set of issues. Specifically, this approach resulted in poor performance for multiple file operations, and it had a steep learning curve for users. In order to create the climatology files in parallel and to reduce the expensive disk I/O operations, we developed the PyAverager tool (Paul et al., 2015; Mickelson et al., 2018). We also chose to call the NCL plotting scripts in parallel to improve performance further.
The parallelism strategy the PyAverager uses is illustrated in Figure 4. When the application begins, the pool of MPI ranks is partitioned into subcommunicators, and the climatologies to be computed are partitioned across the subcommunicators. One MPI rank in each subcommunicator is assigned to be the writer of the given climatology file. The field list is then partitioned across the remaining MPI ranks within the subcommunicator. Each of these ranks is responsible for retrieving its assigned fields, computing the correct climatology, and sending the result to the writer. After all fields have been written, the subcommunicator group begins computing the next climatology file it was assigned.
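The following sketch shows this two-level partitioning with MPI4Py subcommunicators. The climatology and field lists are hypothetical placeholders and the averaging itself is stubbed out, so it illustrates the communication pattern rather than the PyAverager implementation.

```python
# Sketch of the PyAverager-style pattern: climatologies are partitioned across
# subcommunicators; within each, rank 0 writes while the other ranks compute.
# Assumes at least two ranks per subcommunicator (e.g., mpirun -n 4).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

climatologies = ["JAN", "FEB", "DJF", "JJA"]   # hypothetical climatology list
fields = ["TS", "PSL", "U", "V", "T", "Q"]     # hypothetical field list

n_groups = 2
group = rank % n_groups
subcomm = comm.Split(color=group, key=rank)
subrank, subsize = subcomm.Get_rank(), subcomm.Get_size()

def compute_average(field, clim):
    # Placeholder: read the field and average it over the month/season.
    return 0.0

for clim in climatologies[group::n_groups]:
    if subrank == 0:
        # Writer: collect one (field, average) result per field, then write.
        for _ in range(len(fields)):
            field, avg = subcomm.recv(source=MPI.ANY_SOURCE, tag=0)
            # ... write avg for this field into the climatology file ...
    else:
        # Compute ranks: fields are partitioned round-robin amongst them.
        for field in fields[subrank - 1::subsize - 1]:
            subcomm.send((field, compute_average(field, clim)), dest=0, tag=0)
```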
Figure 5. The comparative speedup of creating climatology files from ten years of monthly atmospheric data. Four seasonal and twelve monthly climatology files were created. For comparison, the original method took approximately 26 1/2 minutes to generate the climatologies, while the PyAverager took approximately 46 seconds to create the same climatologies using 64 MPI ranks.

The second part of the diagnostics involves creating plots from the generated climatologies. Individual plotting scripts can take a long time to run, and run times vary from script to script. To improve performance further, the CESM post-processing framework calls the individual NCL scripts in parallel. Because there are no data dependencies within the scripts, we are able to execute them in parallel. Therefore, if we have as many MPI ranks available as we do plotting scripts, the performance is limited by the longest-running script.
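As a rough illustration of this fan-out, each MPI rank can simply invoke its share of the scripts as subprocesses. This is a minimal sketch with hypothetical script names, not the framework's actual launcher.

```python
# Minimal sketch of running independent NCL plotting scripts in parallel,
# one subset of scripts per MPI rank; the script names are hypothetical.
from mpi4py import MPI
import subprocess

scripts = ["plot_ts.ncl", "plot_psl.ncl", "plot_precip.ncl"]  # hypothetical

rank = MPI.COMM_WORLD.Get_rank()
size = MPI.COMM_WORLD.Get_size()

# With no data dependencies between scripts, each rank runs its share;
# total runtime is bounded below by the longest-running script.
for script in scripts[rank::size]:
    subprocess.run(["ncl", script], check=True)
```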
In order to evaluate the performance improvements, we compared the time it took to create twelve monthly and four seasonal climatology files with the PyAverager against the NCO tools run in serial. We operated on the same data that was used to evaluate the performance of the PyReshaper in the previous section, and all timings were performed on Cheyenne.

Figure 5 shows that the PyAverager is able to scale better than the PyReshaper because its workload is more evenly balanced. Recall that the PyAverager distributes the climatologies to be computed amongst the available subcommunicators and the variables amongst the ranks within each subcommunicator. For this particular problem, the individual tasks were all of similar size, so the work was more evenly distributed, and this led to the better scaling. The lack of improvement seen between 16 and 32 ranks occurred because the work was not perfectly evenly distributed and one subcommunicator ended up with slightly more work to do.

Conforming Data to Meet Specifications
The final step before publishing the data involves conforming the data to meet experiment specifications. This is represented by the "Conform Data to Experiment" task within Figure 1. This step is performed so that scientists can directly compare the data from different centers without any modifications. Some examples include renaming model variables, combining fields (e.g., adding or subtracting) to create one output field, converting units, verifying that the data reside on the specified grid, and checking that the correct attributes are attached to the files. The method used for CMIP5 required users to write code to make the required data transformations and to call the Climate Model Output Rewriter (CMOR) (Taylor et al., 2006) library to check for compliance and to add file attributes. This was usually written as serial code, and it took a long time to execute on a large data set. Generating the data was also an error-prone process that required expert knowledge to ensure data integrity. In order to meet the demands of CMIP6, we developed the PyConform tool (Paul et al., 2016) because we needed a tool with a flexible interface that could create variable output in parallel and still produce data that met the specification requirements.

Figure 6. An example of a PyConform job. Each MPI rank is responsible for creating a particular output data set. Its job is to retrieve the variable data it needs, map operations, execute these operations, and then write the data.
An example of a PyConform job is shown in Figure 6. For this application, we again relied on a task-parallel approach, parallelizing across output files. The input fields are found on the left side of the figure. These fields are operated on as they are fed through the system in order to produce the output fields on the right. There are a variety of operations that can be performed on the data, and this figure only shows a small subset. Several common functions and arithmetic operations are provided with the tool, but we could not account for all functions users may need. If more functionality is needed, we provide a framework in which users can create their own functions in Python and plug them into the framework.
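To illustrate the kind of user-defined function this enables, the sketch below registers two transformations (a unit conversion and a field combination) in a hypothetical registry. PyConform's actual plugin interface differs in its details; the registry, function names, and the density constant here are illustrative assumptions.

```python
# Hypothetical plugin registry illustrating user-supplied transformations of
# the sort described above; this is not PyConform's actual API.
import numpy as np

REGISTRY = {}

def register(name):
    """Register a transformation under the name used in output definitions."""
    def wrap(func):
        REGISTRY[name] = func
        return func
    return wrap

@register("lwe_rate_to_mass_flux")
def lwe_rate_to_mass_flux(rate, density=1000.0):
    # Convert a liquid-water-equivalent rate in m s-1 to kg m-2 s-1,
    # assuming a water density of 1000 kg m-3.
    return np.asarray(rate) * density

@register("sum")
def sum_fields(*fields):
    # Combine several input fields into one output field by addition.
    return np.sum(fields, axis=0)
```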
PyConform depends on the CMIP6 data request Python API, dreqPy (Juckes et al., 2020). This package interfaces with the CMIP6 data request database, which contains information regarding all of the fields within the request. This includes field names, descriptions, units, coordinates, and other specific information. Experiment information is also contained within the data request, specifying experiment descriptions and which fields are being requested for that experiment.
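As a brief sketch of how the data request can be queried, dreqPy's documented loadDreq entry point loads the bundled database; the collection and attribute names below are recalled from the dreqPy documentation and may vary between versions.

```python
# Querying the CMIP6 data request with dreqPy; collection/attribute names
# ('var', label, units, title) are assumptions based on its documentation.
from dreqPy import dreq

dq = dreq.loadDreq()                    # load the bundled data request database
for var in dq.coll['var'].items[:5]:    # 'var' holds the requested field records
    print(var.label, var.units, var.title)
```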
In order to evaluate the performance of the PyConform tool, we compared it against the Fortran code that we used for CMIP5. In this example we were limited to generating only fifty variables, because this was the set of variables common to both CMIP5 and CMIP6 for the atmosphere model. In our evaluation we found that the original method took approximately 9 1/2 minutes to generate the CMIP-compliant output, while PyConform took about 1 1/2 minutes to generate the same output using 16 MPI ranks. This provided more than a six times speedup over the existing method. Since this was a smaller problem, we chose to run the timing tests on a smaller number of ranks. When PyConform was executed in production mode for CMIP6, it generated thousands of variable files, and we were able to scale out to more MPI ranks efficiently.

Data Publication
The final step in the CMIP workflow within Figure 1 is the publication of reviewed experiments to the ESGF, which is the data distribution and access platform designated for sharing CMIP and related simulation data. NCAR operates an ESGF Data Node, which is a software application stack that includes tools for checking conformance to the CMIP6 metadata standards, serving netCDF data files using a THREDDS Data Server and Globus Transfer, replication services, automated citation generation, and experiment lifecycle support, including data retraction and re-publication.
A significant challenge with CMIP5 data publication was managing the velocity and complexity of publication using ad hoc communications, such as email. Given the post-processing challenges noted above, each experiment was published incrementally. This led to multiple versions of experiments and added unnecessary complexity to the publication process. A separate challenge was managing an under-development and changing ESGF software stack during production CMIP5 data publication. The burden of frequently updating the ESGF node, coupled with changing metadata requirements, led to further slowdowns in the overall process.
For CMIP6 the ESGF software components were significantly improved, owing to the increase in the diversity, complexity, and volume of data being managed, as well as the experiences of data managers and node operators during CMIP5. In addition, a number of new components were developed for CMIP6, including the PrePARE data QC tools, a data replication tool, the Errata Service, and the Citation Service. These components were tested through a series of five "Data Challenges", in which NCAR participated as a member of the CMIP Data Node Operations Team (CDNOT) from January to June 2018. These data challenges were performed in advance of the model data availability and served to harden and improve the ESGF software stack through a series of integration and other system-level tests. The significant improvements to the ESGF software stack and related tools vastly improved the rate of data publication for CMIP6. These performance improvements are shown in Figure 7. Because of these improvements, in the first two months of the CMIP6 publication process NCAR was able to smoothly publish 50 TB more than it had published in the full 25 months it took to publish data for the CMIP5 campaign.

Process Workflow
During the completion of the CMIP5 simulations, each of the processes illustrated in Figure 1 was an independent task, and they were not automatically run in succession. Another problem was that each of the tasks was run by a different individual, causing workflows to stop while they waited for someone to start the next task. For a run to have continuous forward progress,