Parallel I/O in Flexible Modelling System (FMS) and Modular Ocean Model 5 (MOM5)

We present an implementation of parallel I/O in the Modular Ocean Model (MOM), a numerical ocean model used for climate forecasting, and determine its optimal performance over a range of tuning parameters. Our implementation uses the parallel API of the netCDF library, and we investigate the potential bottlenecks associated with the model configuration, netCDF implementation, the underpinning MPI-IO library/implementations and Lustre filesystem. We investigate the performance of a global 0.25 resolution model using 240 and 960 CPUs. The best performance is observed when we limit the number of contiguous I/O domains on each compute node and assign one MPI rank to aggregate and to write the data from each node, while ensuring that all nodes participate in writing this data to our Lustre filesystem. These best-performance configurations are then applied to a higher 0.1 resolution global model using 720 and 1440 CPUs, where we observe even greater performance improvements. In all cases, the tuned parallel I/O implementation achieves much faster write speeds relative to serial single-file I/O, with write speeds up to 60 times faster at higher resolutions. Under the constraints outlined above, we observe that the performance scales as the number of compute nodes and I/O aggregators are increased, ensuring the continued scalability of I/O-intensive MOM5 model runs that will be used in our next-generation higher-resolution simulations.


Introduction
Optimal performance of a computational science model requires efficient numerical methods that are facilitated by the computational resources of the high performance computing (HPC) platform. For each calculation in the model, the operating system (OS) must provide sufficient access to the data so that the calculation can proceed without interruption. This is particularly true in highly parallelized models on HPC cluster systems, where the calculations are distributed across multiple compute nodes, often with strong data dependencies between the individual processes. I/O operations represent such a bottleneck, where one must manage the access of potentially large datasets by many processes while also relying on the available interfaces, typically provided by a Linux operating system to a POSIX parallel (or cluster) filesystem such as Lustre and through to distributed storage arrays. A poorly designed model can be limited by the data speed of an individual disk, or a poorly configured kernel may lack a parallel filesystem that is able to distribute the data transfer across multiple disks.
Datasets in climate modelling at the highest practical resolutions are typically on the order of gigabytes in size per numerical field, and dozens of such fields may be required to define the state of the model. For example, a double-precision floating point variable of an ocean model over a grid of approximately 0.1 • horizontal resolution and 75 vertical levels will typically require over 5 GiB of memory per field. Over 20 such fields may be necessary to capture the model state and preserve bitwise reproducibility, and the periodic storage of model output may involve a similar number of variables per diagnostic time step. A typical 1-year simulation can require reading hundreds of gigabytes of input data and can produce terabytes of model output. For disk speeds of 350 MB s −1 , a serial transfer of each terabyte would take approximately 1 h and can severely burden the model runtime. For such models, efficient I/O parallelization is a critical requirement, and the increase in future scalability requires further improvements in I/O efficiency. Parallel I/O can describe any skilful decomposition of the reading and writing of data across multiple threads, processes, compute nodes or physical storage. Many climate models, and ocean models in particular, can be characterized as hyperbolic partial differential equation (PDE) solvers, which are naturally decomposed into numerically solvable subdomains with only local data dependencies (Webb, 1996;Webb et al., 1997), and it is natural to consider parallel I/O operations which follow a similar decomposition.
In short, there are four fundamental approaches to model I/O, each with its respective tradeoffs, which are outlined in Table 1. The first three approaches are common when using a single file per process, although multiple problems can arise regardless of whether the I/O operation is single-threaded or distributed (Shan and Shalf, 2007).
In the simplest and most extreme case, the field is fully decomposed to match the computational decomposition of the model, so that the data used by each process element (PE), such as an MPI rank or an OpenMP thread, are associated with a separate file, i.e. "distributed I/O, single file per PE in Table 1". An example decomposition is shown in Fig. 1, where the numbered black squares denote the computational domain of each PE. I/O operations in this case are fully parallelized. But this can also require an increasing number of concurrent I/O operations, which can produce an abnormal load on the OS and its target filesystems when such a model is distributed over thousands of PEs (Shan and Shalf, 2007). It can also result in datasets which are distributed over thousands of files, which may require significant effort to either analyse or reconstruct into a single file.
At the other extreme, it is possible to associate the data of all PEs with a single file, denoted by the red border in Fig. 1. One method for handling single-file I/O is to allow all PEs to directly write to this file. Although POSIX I/O permits concurrent writes to a single file, it can often compound the issues raised in the previous case, where resource contentions in the filesystem must now be resolved alongside any contentions associated with the writing of the data itself. Such methods are rarely scalable without considerable attention to the underlying resource management, and hence we do not consider this method in the paper.
A more typical approach for single-file I/O is to assign a master PE which gathers data from all ranks and then serially writes the data to the output file. That is, "single-threaded, single file" in Table 1. While this approach avoids the issues of filesystem resourcing outlined above, it also requires either an expensive collective operation and the storage of the entire field into memory or a separation of the work into a sequence of multiple potentially expensive collectives and I/O writes. These two options represent the traditional tradeoff of memory usage versus computational performance, and both are limited to serial I/O write speeds.
In order to balance the desire for parallel I/O performance while also limiting the number of required files, one can use a coarser decomposition of the grid which groups the local domains of several PEs into a larger "I/O domain", i.e. "distributed I/O, single file per I/O domain" in Table 1. A representative I/O domain decomposition, with I/O domains delineated by the yellow borders, is shown in Fig. 1. Within each I/O domain, one PE is nominated to be responsible for the gathering and writing of data. This has the effect of reducing the number of I/O processes to the number of I/O domains, while still permitting some degree of scalability from the concurrent I/O. Several models and libraries provide implementations of I/O domains, including the model used in this study (Maisonnave et al., 2017;Dennis et al., 2011). A similar scheme for rearranging data from compute tasks to selective I/O tasks is proposed and implemented in the PIO (parallel I/O) library, which can be regarded as an alternative implementation of I/O domains (Edwards et al., 2019).
Because the I/O domain decomposition will produce fields that are fragmented across multiple files, this often requires some degree of preprocessing. For example, any model change which modifies the I/O domain layout, such as an increase in the number of CPUs, will often require that any fragmented input fields be reconstructed as single files. A typical 0.25 • global simulation can require approximately 30 min of postprocessing time to reconstruct its fields as single files; for global 0.1 • simulations, this time can be on the order of several hours, often exceeding the runtime of the model which produced the output.
One solution, presented in this paper, is to use a parallel I/O library with sufficient access to the OS and its filesystem, which can optimize performance around such limitations and provide efficient parallel I/O within a single file, i.e. "paral-  (Howison et al., 2010). This eliminates the need for writer PEs to allocate large amounts of memory and also avoids any unnecessary postprocessing of fragmented datasets into single files, while also presenting the possibility of efficient, scalable I/O performance when writing to a parallel filesystem.
In this paper, we focus on a parallel I/O implementation for the Modular Ocean Model (MOM), the principal ocean model of the Geophysical Fluid Dynamics Laboratory (GFDL) (Griffies, 2012). As MOM and its Flexible Modelling System (FMS) provide an implementation of I/O domains, it is an ideal platform to assess the performance of these different approaches in a realistic model simulation. For this study, we focus on the MOM5 release, although the work remains relevant to the more recent and dynamically distinct MOM6 model, which uses the same FMS framework.
We present a modified version of FMS which supports parallel I/O in MOM by using the parallel netCDF API, and we test two different netCDF implementations: the PnetCDF library (Li et al., 2003) and the pHDF5-based implementation of netCDF-4 (Unidata, 2015). When properly configured to accommodate the model grid and the underlying Lustre filesystem, both libraries demonstrate significantly greater performance when compared to serial I/O, without the need to distribute the data across multiple I/O domains.
In order to achieve the satisfied parallel I/O performance, it is necessary to determine the optimal settings across the hierarchy of I/O stack, including the user code, high-level I/O libraries, I/O middleware layer and parallel filesystem. There is a large number of parameters at each layer of the I/O stack, and the right combination of parameters is highly dependent on the application, HPC platform, problem size and concurrency. Designing and conducting the I/O tuning benchmark is the key task of this work. It is of particular relevance to MOM/FMS users who experience a bottleneck caused by I/O performance. But given the ubiquity of I/O in the HPC domain, the findings will be of interest to most researchers and members of the general scientific community.
The paper is outlined as follows. We first describe the basic I/O implementation of the FMS library and summarize our changes required to support parallel I/O. The benchmark process and tuning results are described and presented in the following section. Finally, we verify the optimal I/O parameter values by applying them to an I/O-intensive MOM simulation at higher resolution.

Parallel I/O implementations in FMS
The MOM source code, which is primarily devoted to numerical calculation, will rarely access any files directly and instead relies on FMS functions devoted to specific I/O tasks, such as the saving of diagnostic variables or the reading of an existing input file. Generic operations for opening and reading of file data occur exclusively within the FMS library, and all I/O tasks in MOM can be regarded as FMS tasks.
Within FMS, all I/O operations over datasets are handled as parallel operations and are accessed by using the mpp module, which manages the MPI operations of the model across ranks. The API resembles most POSIX-based I/O interfaces, and the most important operations are the mpp_open, mpp_read, mpp_write and mpp_close functions, which are outlined below.
Files are created or opened using the mpp_open function, which sets up the I/O control flags and identifies which ranks will participate in I/O activity. Each rank determines whether or not it is assigned as a master rank of its I/O domain and, if so, opens the file using either the netCDF nf_create or nf_open functions.
The mpp_write interface is used to write data to a file, and it supports fields of different data types and numbers of dimensions. Non-distributed datasets are contiguous in memory and are typically saved on every PE, and such fields are directly passed to the write_record function, which uses the appropriate netCDF nf_put_var function to write its values to disk.
When used with distributed datasets, mpp_write must contend with both the accumulation of data across ranks and the non-contiguity of the data itself, due to the values along the boundaries (or "halos") of the local PE domains, which are determined by the neighbouring PEs. The mpp_write function supports the various I/O methods described in Sect. 1. For single-threaded I/O, the data on each PE must first be stripped of local halo data from the field, and the interior values are copied onto a local contiguous vector. These vectors are first gathered onto a single master rank, which passes the data to the write_record function. The alternative is to use I/O domains, where each rank sends its data to the master PE of its local I/O domain in the same manner as the single-threaded method but where each I/O domain writes to its own file. When using I/O domains, a postprocessing step may be required to reconstruct the domain output into a single file.
The mpp_read function is responsible for reading data from files and is very similar to mpp_write in most respects, including the handling of distributed data. In this function, read_record replaces the role of write_record, and the netCDF nf_get functions replace the nf_put functions.
When I/O operations have been completed, mpp_close is called to close the file, which finalizes the file for use by other applications. This is primarily a wrapper to the netCDF nf_close function.
Because FMS provides access to distributed datasets as well as a mechanism for collecting the data into larger I/O domains for writing to disk, we concluded that FMS already contained much of the functionality provided by existing parallel I/O libraries and that it would be more efficient to generalize the I/O domain for both writing to files and passing data to a general-purpose IO libraries such as netCDF. By using FMS directly, there is no need to set up a dedicated I/O server with extra PEs, as done in other popular parallel I/O libraries such as XIOS (XIOS, 2020).
The major code changes relevant to the parallel I/O implementations are outlined below: -All implementations are fully integrated into FMS and are written in a way to take advantage of existing FMS functionality.
-netCDF files are now handled in parallel by invoking the nc_create_par and nc_open_par functions in the FMS file handler, mpp_open.
-All fields are opened with collective read/write operations, via the NF_COLLECTIVE tag. This is a requirement for accessing variables with unlimited time axis and also a necessary setting to achieve good I/O performance. When possible, the pre-filling of variables is disabled to shorten the file initialization time.
-Infrastructure for configuring MPI_Info settings has been added to allow fine tuning of the I/O performance at the MPI-IO level.
-The root PEs of I/O domains, which we identify as I/O PEs, are grouped into a new communicator via FMS subroutines and used to access the shared files in parallel.
-The FMS subroutine write_record is modified to specify the correct start position and size of data blocks in the I/O domain for each I/O PE. Development required approximately 1 month to implement a working feature, along with an additional month of work to troubleshoot more complex configurations related to land masking and the handling of I/O domains which only cover a subset of the total grid.

Parallel I/O performance benchmark
On large-scale platforms, I/O performance optimization relies on many factors at the architecture level (filesystem), the software stack (high-level I/O libraries) and the application (access patterns). Moreover, external noise from application interference and the OS can cause performance variability, which can mask the effect of an optimization.
Obtaining good parallel I/O performance on a diverse range of HPC platforms is a major challenge, in part because of complex interdependencies between I/O middleware and hardware. The parallel I/O software stack is comprised of multiple layers to support multiple data abstractions and performance optimizations, such as the high-level I/O library, the middleware layer and a parallel filesystem (Lustre, GPFS, etc.). A high-level I/O library translates data structures of the application into a structured file format, such as netCDF-3 or netCDF-4. Specifically, PnetCDF and parallel HDF5 are the parallel interfaces to the netCDF-3 and netCDF-4 file formats, respectively, and they are built on top of MPI-IO. The middleware layer, which in our case is an MPI-IO implementation, handles the organization and access optimization from many concurrent processes. The parallel filesystem handles any accesses to files stored on the storage hardware in data blocks.
While each layer exposes tuneable parameters for improving performance, there is little guidance for application developers on how these parameters interact with each other and how they affect the overall I/O performance. To address this, we select combinations of tuneable parameters at multiple I/O layers covering parallelization scales, application I/O layout, high-level I/O libraries, netCDF formats, data storage layouts, MPI-IO and the Lustre filesystem. Although there is a large space of tunable parameters at all layers of the parallel I/O stack, many parameters interact with each other, and only the leading ones need to be investigated.

I/O parameter space
With over 20 tunable parameters across the parallel I/O stack, it can become intractable to independently tune every parameter for a realistic ocean simulation. In order to simplify this process, we conduct a pre-selection process by executing a stand-alone FMS I/O program (test_mpp_io) which tests most of the fundamental FMS I/O operations over a domain of a size comparable to the lower-resolution MOM5 benchmarks. After running this simplified model over the complete range of I/O parameters, we found that most of the parameters had no measurable impact on performance, and we were able to reduce the number of relevant parameters to the list shown in Table 2, which are summarized below: -Application. As described in the introduction, the io_layout parameter is used to define the distribution of I/O domains in FMS. In the original distributed I/O pattern, multiple PEs are grouped into a single I/O domain within which a root I/O PE collects data from the other PEs and writes them into a separate file. In our parallel I/O implementation, the I/O domain concept is preserved in that data are still gathered from each I/O domain onto its root PE. The main difference is that these I/O PEs now direct their data to the MPI-IO library, which controls how the data are gathered and written to a single shared file. Retaining the I/O domain structures allows the application to reorganize data in memory prior to any I/O operations and enables more contiguous access to the file.
-High-level I/O library. In general, the data storage layout should match the application access patterns in order to achieve significant I/O performance gains. The data layout of netCDF-3 is contiguous, whereas netCDF-4 permits more generalized layouts using blocks of contiguous subdomains (or "chunks"). To simplify the I/O tuning, we use the default chunking layout of netCDF-4 files, so that we can focus on the impact of other I/O parameters. We consider the impact of chunking on performance in the high-resolution benchmark.
-MPI-IO. There are many parameters in the MPI-IO layer that could dramatically affect the I/O performance. MPI-IO distinguishes between two fundamental styles of I/O: independent and collective. We only consider collective I/O in this work as it is required for accessing netCDF variables with unlimited dimensions (typically the time axis). All configurable settings for independent functions are thus excluded. The collective I/O functions require process synchronization, which provides an MPI-IO implementation the opportunity to coordinate processes and rearrange the requests for better performance. For example, as the high-performance portable implementation packaged in MPICH and OpenMPI, ROMIO has two key optimizations, data sieving and collective buffering, which have demonstrated significant performance improvements over uncoordinated I/O. However, even with these improvements, the shared file I/O performance is still far below the single-file-per-process approach. Part of the reason is that shared file I/O incurs higher overhead due to filesystem locking, which can never happen if a file is only accessed by a unique process. In order to reduce such overhead, it is necessary to tune the collective operations. By reorganizing the data access in memory, collective buffering assigns a subset of client PEs as I/O aggregators. These aggregators gather smaller, non-contiguous accesses into a larger contiguous buffer and then write the buffer to the filesystem (Liao and Choudhary, 2008). Both I/O aggregators and collective buffer size can be set through MPI info objects (Thakur et al., 1999). For example, the number of aggregators per node is controlled by the MPI-IO hint cb_config_list, and the total number of aggregators is specified in cb_nodes. To simplify the benchmark configuration, we always set cb_nodes to the total number of PEs and leave cb_config_list to control the actual aggregator distribution over all nodes. The collective buffer size, cb_buffer_size, is the size of the intermediate buffer on an aggregator for collective I/O. We initially set the value to 64 kB in the lower-resolution model and then evaluate its impact on the I/O performance of the higher-resolution model.
-Lustre Filesystem. The positioning of files on the disks can have a major impact on I/O performance. On the Lustre filesystem, this can be controlled by striping the file across different OSTs (object storage targets). The Lustre stripe count, striping_factor, specifies the number of OSTs over which a file is distributed, and the stripe size, striping_unit, specifies the number of bytes written to an OST before cycling to the next OST. As there is limit of 165 stripes for a shared file on our Lustre filesystem, we set a range of stripe counts up to 165 to align the number of nodes. The stripe size should generally match the data block size of I/O operations (Turner and McIntosh-Smith, 2017); we find that the stripe size had limited effects on the write performance, and the default 1 MiB gave satisfactory I/O performance in our preselection process.

Configurations
The parallel I/O performance benchmark configurations are set up as shown in Table 3. Project size details: we run a suite of 1 d simulations of the 0.25 • global MOM-SIS model for each of the I/O parameters in Table 3. We then apply these results to a 1 d simulation of 0.1 • models and validate the parallel I/O performance benefits. Each simulation is initialized with prescribed temperature and salinity fields and is forced by prescribed surface fields. The compute domain is represented by the horizontal grid sizes of 1440 × 1080 and 3600 × 2700 for the 0.25 and 0.1 • models, respectively. Both configurations use a common 50-level vertical grid. Model output consists of several restart files in double-precision format and a diagnostic output file in single-precision format. In order to produce significant I/O loads for such a short run, diagnostic output is saved after every time step. In the 0.25 • configuration, the model writes 70 GB of data to the diagnostic file over 48 time steps with the 0.25 • configuration and writes 2.7 TB of data over 288 steps with the 0.1 • configuration model. Multiple independent runs are repeated, and the shortest time is shown for each case.
Domain layout: domain layout depends on the total number of PEs in use. Two distinct CPU configurations, 240 and 960 PEs, are considered for the 0.25 • model. The domain layout is 16 × 15 for 240 PEs and 32 × 30 for 960 PEs. In 0.1 • model, grids are distributed over 720 and 1440 PEs with the domain layout of 48 × 15 and 48 × 30, respectively. PEs are equally assigned in node majority along the x direction of the domain layout.
High-level I/O libraries and netCDF formats: the netCDF library provides parallel access to netCDF-4 formatted files based on the HDF5 library and netCDF-3 formatted files via the PnetCDF library. HDF5 maintains two version tracks, 1.8 series and 1.10 series, in order to maintain the file format compatibility and the enabling of new features, such as the collective metadata I/O or virtual datasets. We are interested in checking the I/O performance to access different formats via various libraries as listed in Table 3.
We rely on the FMS I/O timers to measure the time metrics on opening (mpp_open), reading (mpp_read), writing (mpp_write) and closing (mpp_close) files together with the total runtime. The metric time contains both I/O operations and communications for generation of restart and diagnostic files, and it takes the maximum wall time among all PEs. We do not attempt to compensate for variability associated with the Lustre filesystem, such as network activity or file caching, and rely on the ensemble to identify such variability.
Experiments are carried out on the NCI Raijin supercomputing platform. Each compute node consists of two Intel Xeon (Sandy Bridge) E5-2670 processors with a nominal clock speed of 2.6 GHz and containing eight cores, or 16 cores per compute node. Standard compute nodes have 64 GB of memory shared between the two processors. A Lustre filesystem having 40 OSSs (object storage servers) and 360 OSTs is mounted as the working directory via 56 Gb FDR InfiniBand connections.  Table 4.
We can see that all benchmarks are I/O intensive and they are driven by file initialization and writing operations. Specifically, writing a 4D dataset into the diagnostic file takes about 85 % of total elapsed time. All other times are notably shorter than mpp_write.
The time used in writing data into netCDF-4 formatted files is about 10 % longer than creating netCDF-3 formatted files. This reflects the fact that in serial I/O, the root PE holding the global domain data tends to write the file contiguously, and it matches the contiguous data layout of netCDF-3 better than the default block chunking layout of netCDF-4.
Most   In all benchmarks, the elapsed times for writing files in netCDF-4 and netCDF classic formats are very similar, as both are produced by utilizing the HDF5 1.10.2 library. We will thus report performance among three libraries, i.e. HDF5 1.8.20, HDF5 1.10.2 and PnetCDF 1.9.0.
The mpp_open metric measures both the opening time of input files and the creation time of output files. Its runtime versus I/O layout at 240 and 960 PE benchmarks is shown in Fig. 3. In all of the experiments, PnetCDF has shorter mpp_open time than HDF5 due to the simpler netCDF-3  file structure. Both runtime and variability are much less in 240 PEs than in 960 PEs, indicating higher filesystem contention as the number of PEs is increased. The mpp_read metric measures the time of all PEs to read data from the input files. Its dependence on I/O layout is shown in Fig. 4. As I/O layout is only applicable to write rather than read operations, mpp_read time should be unaffected by I/O layout, as demonstrated in the figure. We also observe no consistent difference in mpp_read due to the choice of I/O library. As with mpp_open time, the mpp_read time is much higher for 960 than 240 PEs, which we again attribute to the increased file locking times and OST contentions when using more PEs.
The majority of I/O time is due to mpp_write, which depends strongly on the choice of I/O layout, as shown in Fig. 5. In the 240 PE benchmarks, the write time drops quickly as we increase the number of I/O nodes (n y ) and more gently as the number of I/O PEs per node (n x ) is increased. The 960 PE benchmarks show a similar trend to the 240 PE results. The shortest write time of the 960 PE benchmarks is less than that of 240 PE ones, which indicates that parallel write time demonstrates the same degree of scalability. All libraries present similar mpp_write trend over I/O layout, as they approach the shortest mpp_write time with moderate number of PEs per node (i.e. two or four PEs per node).
The mpp_close metric measures the time to close files, which involves synchronizations across all I/O ranks. Its dependence on I/O layout is shown in Fig. 6. We observe that there is a notable loss of performance in the HDF5 1.8.20 library, which is exacerbated as both the number of nodes and I/O PEs per node are increased. As we shall demonstrate in a later section, this can be attributed to issues related to contentions between MPI operations and the use of the MPI_File_set_size function in a Lustre filesystem. This effect is mitigated, although still present, in the HDF5 1.10.2 library. In contrast to all HDF5 libraries, PnetCDF has negligible mpp_close time as there are fewer metadata operations in netCDF-3 than netCDF-4.
The total elapsed time versus I/O layout for all libraries are plotted in Fig. 7   PE and 4 × 30 for 960 PE, to explore the optimal settings of Lustre stripe count and MPI-IO aggregator.

Stripe count and aggregators
The Lustre stripe count and the number of MPI-IO aggregators can be set as MPI-IO hints when creating or opening a file and are the two major MPI-IO parameters affecting I/O performance. The MPI-IO hint striping_factor controls the total number of stripe counts of a file; cbnode sets the total number of collective aggregators; and cb_config_list controls the distribution of aggregators over each node. In ROMIO, there are competing rules which can change the interpretation of these parameters. For example, the total number of aggregators must not exceed the stripe count; otherwise, it will always be set to the stripe count. To simplify the parameter space, we adopt the actual number of aggregators (denoted as real_aggr) and stripe counts (denoted as real_stp_cnt) as the basic parameters in tuning the I/O performance.

PEs
The variations in each time metric versus the number of aggregators and stripe counts for each library are plotted in Fig. 8 for the 240 PE experiments.
The mpp_open time does not depend strongly on the number of aggregators. PnetCDF spends less mpp_open time than all HDF5 libraries.
The mpp_read time increases as the number of aggregators, and stripe counts are increased. Runtime is independent of library, as expected for a serial I/O operation.
The optimal mpp_write time is observed when the aggregator and stripe counts are set to 60. The overall mpp_write times are quite comparable among all HDF5 libraries, and they are slightly higher than PnetCDF, as observed in the I/O layout timings.
The mpp_close times of the HDF5-based libraries are independent of the number of aggregators and increase slightly as the stripe count is increased. HDF5 v1.8.20 spends a much greater time in mpp_close than HDF5 1.10.2. The mpp_close time is negligible for PnetCDF and shows no measurable dependence on aggregator and stripe count.
The total runtime shows similar dependences on stripe count and aggregators with mpp_write. The performance trend across libraries remains consistent over I/O tuning parameters, with PnetCDF showing the best performances, followed by HDF5 1.10.2 and HDF 1.8.20. The optimal parameters for read and write operations were observed when we set the number of aggregators and stripe count to 15 or 30. This corresponds to one or two aggregators per node, with all 15 nodes contributing to I/O operations.

PEs
The variations in each time metric on the number of aggregator and stripe count in all library/format bindings are plotted in Fig. 9 for 960 PE experiments.
The metrics for the 960 PE benchmarks show a similar trend to the 240 PE benchmarks. Both mpp_open and mpp_read times increase from 240 to 960 PE, in most cases by a factor of 2, due to the higher contentions due to accessing the same files. Using the smallest number of aggregators, namely 60 aggregators, one aggregator per node, together with an equal number of stripes, gives the best performance for both mpp_open and mpp_read times. The mpp_write times are shorter than those of 240 PE. As in previous results, PnetCDF shows the best performance, while HDF5 1.10.2 outperforms HDF5 1.8.20. We observe that the best write performance occurs when the number of aggregators and stripe counts are set to 60, or one per node. Overall, the total time is reduced when using 960 PEs.
In both the 240 and 960 PE experiments, the best I/O performance occurs when the Lustre stripe count matches the number of aggregators. Using a larger stripe count may degrade the performance, since each aggregator process must communicate with many OSTs and must contend with reduced memory cache locality when the network buffer is multiplexed across many OSTs (Bartz et al., 2015;Dickens and Logan, 2008;Yu et al., 2007).

I/O implementation profiling analysis
The above benchmark results show performance variances among different libraries and formats. In order to explore the source of differences in performance, we have developed an I/O profile to capture I/O function calls at multiple layers of the parallel I/O stack, including netCDF, MPI-IO and POSIX I/O, without requiring source code modifications. It provides a passive method for tracing events through the use of dynamic library preloading. It intercepts netCDF function calls issued by the application and reroutes them to the tracer, where the timestamp, library function name, target file name and netCDF variable name along with function arguments are recorded. The original library function is then called after these details have been recorded. It is applied similarly at the MPI-IO and POSIX I/O layers. We have disabled profiling of HDF5 and PnetCDF libraries, as both are intermediate layers. Profiling overheads were measured to be negligible in comparison to the total I/O time.
We apply the I/O profiler described above to the 240 PE benchmark experiments, using the optimal I/O parameters from the previous analysis. The profiling results are plotted in   As shown in Fig. 10, nc_close is the most time consuming netCDF function in the benchmark of HDF5 1.8.20/netCDF-4. Two underlying MPI-IO functions, MPI_File_write_at and MPI_ File_set_ size, consume the majority of time within nc_close. HDF5 metadata operations are comprised of many smaller writes, and the independent write function MPI_File_  write_at from each PE may give rise to large overheads due to repeated use of system calls. It is a known issue that using MPI_File_set_size on a Lustre filesystem which uses the ftruncate system call has an unfavourable interaction with the locking for the series of metadata communications which the HDF5 library makes during a file close (Howison et al., 2010). In practice, this leads to relatively long close times and prohibits I/O scalability.
Aside from the metadata operations, reading and writing netCDF variables are conducted collectively via MPI_File_read_at_all and MPI_File_write_at_all functions, which retain good I/O performance when processing non-contiguous data blocks.
In the HDF5 1.10.x track, collective I/O was introduced to improve the performance of metadata operations. Collective metadata I/O can improve performance by allowing the library to perform optimizations when reading the metadata, by having one rank read the data and broadcasting it to all other ranks. It can improve metadata write performance through the construction of an MPI-derived data type that is then written collectively in a single call. The call path flow of tuned 240 PE benchmark with HDF5 1.10.2/netCDF-4 is shown in Fig. 11. It shows that nc_close now invokes MPI_File_ write_ at_all instead of MPI_File_write_at in HDF5 1.10.2, and HDF5 1.10.2 spends less time than HDF5 1.8.20. Furthermore, HDF5 1.10.2 has been modified to avoid MPI_File_set_size calls when possible by comparing the EOA (end of allocation) of the library with the filesystem EOF (end-of-file) and skipping the MPI_File_ set_size call if the two matches. As a result, HDF5 1.10.2 spends much less time on nc_close function than HDF5 1.8.20. Aside from the metadata operations, the general write performance of the nc_put_vara_double and nc_ put_var1_double functions show similar performance in netCDF 1.10.2 and 1.8.20 when accessing netCDF-4 formatted files.
The call path flow of the tuned 240 PE benchmark with PnetCDF is shown in Fig. 12. Due to the simpler file structure of netCDF-3, the nc_close function spends a trivial amount of time in MPI_Barrier and MPI_file_sync rather than invoking expensive MPI_File_set_size function calls, which explains the much shorter mpp_close time in the benchmark experiments. In addition, the function nc_put_vara_double also spends less time than the HDF5 libraries, which implies that the access pattern matches the contiguous data layout of netCDF-3 performs in a better way than the default block chunking layout of netCDF-4.

Load balance
Load balance is another factor which may strongly affect I/O performance. In Fig. 13 we compare the time distribution over PEs in three layers of the major write call path between HDF5 1.10.2 and PnetCDF.
In the benchmark of the HDF5 1.10.2, both nc_put_ vara_ double and MPI_File_write_at_all functions are called by eight PEs per node, as configured in the I/O layout of 8 × 15. The POSIX write function is in-voked by two PEs per node, as configured by the MPI-IO aggregator configuration, real_aggr = 30. All three functions show good load balance, as one would expect since all I/O PEs participate in the collective I/O operations. There are overheads in the nc_put_vara_double and MPI_File_write_at_all functions, but there is a larger time gap between MPI_File_ write_ at_ all and the POSIX write call, which reflects the communication overhead among aggregators and other PEs associated with collective buffering. A similar pattern also appears in the PnetCDF profile. Although HDF5 1.10.2 and PnetCDF spend a similar amount of time on POSIX write calls, the aggregation overheads are much higher for HDF5. This suggests again that the conventional contiguous storage layout in netCDF-3 matches the access pattern better than the default block chunking layout of netCDF-4.

Serial read and parallel read
As indicated in the above benchmark experiments, the write performance is optimized by choosing an appropriate number of I/O PEs, aggregators and Lustre stripe counts. In contrast to mpp_write, the mpp_read time grows from 240 to 960 PE benchmarks and can potentially become a major performance bottleneck for a large number of PEs. Since I/O layout is not employed in the parallel read process and the input files may use different formats and data layouts, there is no means to skilfully tune the parallel read performance.
As noted earlier, the serial mpp_read time is relatively small and stable in both 240 and 960 PE benchmarks. This motivates us to combine the original serial read with the parallel write in order to approach the best overall I/O performance. The 960 PE benchmarks with an I/O layout of 4×30 and using serial read (denoted here as sread) and parallel write methods are shown for the HDF5 1.10.2 and PnetCDF libraries. The performance is compared with the parallel read benchmarks (denoted as pread) in Fig. 14.  The mpp_read time is much shorter in the serial read benchmarks than the parallel reads and it remains fixed as stripe count is increased. The mpp_open times increase with stripe count but are otherwise consistent across the four benchmarks shown. The serial read is unaffected by the write performance and file closing times. As a result, the net serial read time is overall shorter than parallel read times in both HDF5 1.10.2 and PnetCDF benchmarks.

I/O performance validation of 0.1 • model
The tuning results from the 0.25 • model suggests that the best parallel I/O performance could be achieved with the following settings: a parallel write with a moderate number I/O PEs per node to access the file, as defined by I/O layout; one or two aggregators per node, as defined by MPI-IO hints; a stripe count matching the number of aggregators, as defined by MPI-IO hints.
a serial read on input files with the same stripe count as parallel write.
In this section we apply the above settings to the 0.1 • model and measure their impact on I/O performance. As shown in previous results, the HDF5 1.8.20 library is overall slower than the HDF5 1.10.2 due to its higher metadata operation overheads, so we focus on the HDF5 1.10.2 and PnetCDF libraries.
The domain layouts of the 720 and 1440 PE runs are 48 × 15 and 48 × 30, respectively. We choose I/O layouts of 3 × 15 and 3 × 30 for 720 and 1440 PEs, respectively, so there is one I/O PE per node. The number of aggregators is also configured to one per node, and the stripe count is set to the total number of aggregators, i.e. 45 and 90 for the 720 and 1440 PE runs, respectively. For all benchmark experiments, we use serial independent reads and parallel writes. The measured time metrics in 720 and 1440 PE runs for the HDF5 1.10.2 and PnetCDF libraries are shown in Table 5. The timings of the original single-threaded single file I/O (SIO) pattern in 720 and 1440 PE runs are also listed for comparison.
As shown in Table 5, the original serial I/O pattern requires a very long time (about 6 h) to create a large diagnostic file (2.7 TB) and multiple restart files (75 GB) in 720 PE runs. The serial 1440 PE runs exceeded the platform job time limit of 5 h and could not be completed, but the lack of scalability of serial I/O indicated by 0.25 • model (Table 4) suggests that the total time would be comparable to the 720 PE runs. We noticed that the PnetCDF timings are 20 % faster than the HDF5 times, as also observed in the 0.25 • model benchmarks. Both libraries have similar non-I/O times at each level of PE count, which comprise less than 5 % of total runtime, demonstrating that the benchmarks are I/O intensive and that different libraries have no impact on the computation time.
The values of mpp_write in parallel I/O are much shorter than the serial times. In the 720 PE runs, the parallel write time is about 30 to 36 times faster than the serial time in both the HDF5 and PnetCDF libraries. Such speedups are reasonable relative to the 720 PE configuration, which uses 45 I/O PEs, aggregators and stripe counts. In the 1440 PE benchmark, which also doubles our number of I/O PEs, aggregators and stripe counts to 90, the parallel mpp_write runtime was further reduced by a factor of 2. We also observe that the non-I/O compute time of MOM from 720 to 1440 PE runs was reduced by a factor of 2, complementing the enhanced I/O scalability of the parallel I/O configuration and maintaining the high overall parallel scalability of the model for I/O intensive calculations.
The PnetCDF library shows better write performance than HDF5 in both serial and parallel I/O, as well as a much shorter time in mpp_close. To investigate such performance diversity, we have conducted further tests on changing the data layout of HDF5/netCDF-4.
All HDF5 performance results used the default block chunking layout, where the chunk size is close to 4 MB with a roughly equal number of chunks along each axis. We repeated these tests by customizing the chunk layout while keeping all other I/O parameters unchanged. The chunk layout, (ckx, cky), could be defined such that the global domain grids are divided into ckx and cky segments along the x and y axes, respectively. The mpp_write times and total runtimes of the 720 PE runs for chunking layouts spanning values of ckx∈{1, 2, 3, 4} and cky∈{1, 3, 5, 15} are plotted in Fig. 15. The performance of the default chunking layout of HDF5 and PnetCDF are also shown in the figure as a reference point.
The chunk layout of (1, 1) defines the whole file as a single chunk. In this case, it occupies the same contiguous data layout with PnetCDF. Not surprisingly, the mpp_write time of chunk layout (1, 1) is nearly the same as that of PnetCDF/netCDF-3 as shown in Fig. 16. In fact, the mpp_write time changes only slightly across cky values when for ckx = 1. On the other hand, changing ckx values for a fixed cky value give rise to a steeply increasing mpp_write time. Given the conventional contiguous storage layout of a 4D variable (t, z, y, x), the time dimension varies most slowly, z and y vary faster, and x varies fastest. This is also true within a chunk and increasing ckx will produce more non-contiguous chunks than increasing cky. This means an I/O PE needs more I/O operations to write a contiguous memory data block across multiple chunks along the increasing ckx than cky, and thus write times rise accordingly as shown in Fig. 14. An exception case is ckx = 3 as it used similarly short write time with ckx = 1. This is because it matches the number of x divisions of I/O layout (3, 15), and each I/O PE needs only one operation to write a line of data with the fixed y value. Instead, for ckx = 2 or ckx = 4, each I/O PE may use two or more write operations to write a line of y as it crosses multiple chunks. This makes the write time much longer for ckx∈{2, 4} than ckx∈{1, 3}.
The mpp_close time is negligible in all tests. By reducing the total number of chunks and thus the metadata operations overheads, the mpp_close time can also be controlled with the reasonable chunk layout. The total time presents the similar trend with mpp_write along different chunk layouts as shown in Fig. 15.
Choosing a good chunk layout depends strongly on the I/O layout settings. Using a single chunk in the netCDF-4 file is unnecessary as it resembles the same data layout as the netCDF-3 format. Adopting an I/O layout as the chunk shape is sufficient for achieving optimal performance if our intention is to create netCDF-4 formatted output files and to utilize more advanced features, such as compression and filtering operations.
Although benchmark tests in this work are highly I/O intensive to explore the performance of parallel I/O, the general simulation with less I/O workloads could also benefit from parallel I/O. To demonstrate it we conducted 8 d simulations of 0.1 • model with I/O frequencies in every 1 and 4 d. The I/O time and total runtime in each simulation from 720 and 1440 PEs runs are listed in Table 6. The produced ocean diagnostic files are 73 and 19 GB for 1 and 4 d I/O frequencies, respectively.
For 720 PEs the I/O time takes 6.09 % of total runtime for 1 d I/O frequency, and it reduces to 3.87 % of total runtime for a lower I/O frequency of 4 d. These are regarded as typical I/O workloads of normal model simulations at 5 % (Koldunov et al., 2019). The parallel I/O scheme could reduce the I/O weight to be less than 1 % of total run time in both netCDF-4 and netCDF-3 formats. It is noticed that total overheads from those one-time I/O operations such as mpp_read, mpp_open and mpp_close are comparable with and in most cases larger than mpp_write time due to the very limited number of write frequency, i.e. eight time steps for 1 d and two time steps for 4 d I/O frequencies. This gives rise to a weaker scalability between serial I/O and par-

Conclusions
We have implemented parallel netCDF I/O into the FMS framework of the MOM5 ocean model and presented re-sults which demonstrate the I/O performance gains relative to single-threaded single-file I/O. We present a procedure for tuning the relevant I/O parameters, which begins with identifying the I/O parameters that are sensitive to overall performance by using a light-weight benchmark program. We then systematically measure the impact of this reduced list of I/O parameters by running the MOM5 model at a lower (0.25 • ) resolution and determining the optimal values for these parameters. This is followed by a validation of the results in the higher (0.1 • ) resolution configuration. Several rules for tuning the parameters across multiple layers of the I/O stack are established to maintain the contiguous access patterns and achieve the optimal I/O performance. At the user application layer, I/O domains were defined to retain more contiguous I/O access patterns by mapping the scattered grid data to a smaller number of I/O PEs. We achieve the best performance when there is at least one I/O PE per node, and there can be additional benefits to using multiple I/O PEs per node, although an excessive number of I/O PEs per node can impede performance. At the MPI and Lustre levels of the I/O stack, it was found that the number of aggregators used in collective MPI-I/O operations and the number of Lustre stripe counts needed to be consistently restricted to no more than two per node in order to facilitate contiguous access and reduce the number of contentions between PEs.
An I/O profiling tool has been developed to explore overall timings and load balance of individual functions across the I/O stack. It was determined that the MPI implementation of particular I/O operations in the HDF5 1.8.20 library used by netCDF-4 caused significant overhead when accessing metadata and that these issues were largely mitigated in HDF5 1.10.2. Additional profiling of the PnetCDF 1.9.0 library showed that it did not suffer from such overhead, due to the simpler structure of the netCDF-3 format.
High-resolution MOM5 benchmarks using the 0.1 • configuration were able to confirm that the parallel I/O implementations can dramatically reduce the write time of diagnostic and restart files. Using parallel I/O enables the scaling of I/O operations in pace with the compute time and improves the overall performance of MOM5, especially when running an I/O-intensive configuration resembling our benchmark. The parallel I/O implementation proposed in this paper provides an essential solution that removes any potential I/O bottlenecks in MOM5 at higher resolutions in the future.
Although this work is applied to a model with a fixed regular grid, these results could be applied to a model with an unstructured mesh. Much of the work required to populate the I/O domains and to define chunked regions is required to produce contiguous streams of data which are passed to the I/O library. If the data are already stored as contiguous 1D arrays, then the task of dividing the data across I/O servers could be trivial. If more complex data structures are used, such as linked lists, then the buffering of data into contiguous arrays could add significant overhead to parallel I/O. An investigation of data compression is not a part of this work, as traditionally it can only be used in serial I/O. We note that the more recent version of HDF5, 1.10.2, intro-duced support for parallel compression, and it is expected that the netCDF library will soon follow. As the I/O layout generally picks up one to two I/O PE per compute node, it may produce chunks which are too large (i.e. too small a number of chunks) for efficient parallel compression. In this sense, the default chunk layout of netCDF4 should also be considered as it gains acceptable write performance and has suitable chunk sizes more suitable for parallel compression. Finally, it is explored that parallel I/O could not only largely accelerate I/O intensive model simulations but also prompt the scalability of a general case with typical I/O workloads.