A fast input/output library for high-resolution climate models

We describe the design and implementation of climate fast input/output (CFIO), a fast input/output (I/O) library for high-resolution climate models. CFIO provides a simple method for modelers to overlap the I/O phase with the computing phase automatically, so as to shorten the running time of numerical simulations. To minimize the code modifications required for porting, CFIO provides similar interfaces and features to parallel Network Common Data Form (PnetCDF), which is one of the most widely used I/O libraries in climate models. We deployed CFIO in three high-resolution climate models, including two ocean models (POP and LICOM) and one sea ice model (CICE). The experimental results show that CFIO improves the performance of climate models significantly versus the original serial I/O approach. When running with CFIO at 0.1° resolution with about 1000 CPU cores, we managed to reduce the running time by factors of 7.9, 4.6 and 2.0 for POP, CICE, and LICOM, respectively. We also compared the performance of CFIO against PnetCDF and PIO in different scenarios.


Introduction
Scientific computing for climate modeling has undergone radical changes over the past decade. One major trend is to increase the resolution of the models, so as to provide finer simulation of physical processes of the atmosphere, ocean, land, and sea ice. This trend is motivated by the availability of supercomputers with core counts in the range of tens to hundreds of thousands.
With a higher resolution, the amount of data generated by climate models will be significantly larger than before. In order to provide scientific data for the Fifth Assessment Report of the United Nations Intergovernmental Panel on Climate Change (IPCC AR5), modelers must run coupled climate models to simulate various types of climate change scenarios. The experiments in general last for months and generate hundreds of terabytes of data. The output of such a large amount of data results in severe performance degradation for numerical simulation experiments.
Most of the above libraries attempt to improve the I/O throughput through parallelization techniques. For real applications, the overall running time mainly consists of two phases: computing time and I/O time. While the above libraries are helpful for shortening the I/O time of large-scale data, the computing phase still needs to wait for the I/O phase in iterative simulations. In a sense, the I/O phase and the computing phase are still serial with respect to each other. There is an opportunity to improve I/O efficiency by overlapping the I/O phase and the computing phase.
With these issues in mind, we designed and implemented CFIO, a parallel I/O library that is specifically developed for climate models. The main idea of CFIO is to apply an I/O forwarding technique with a client-server architecture to provide automatic overlapping of I/O and computing. The strategy of overlapping I/O with computing as proposed in CFIO is complementary to existing parallel I/O libraries. Indeed, CFIO calls the PnetCDF functions directly to implement the parallel write and read on the CFIO server side. To minimize the code modifications required for porting, CFIO provides similar interfaces and features to PnetCDF, which is widely used by the climate community and different climate models.
We tested CFIO on three real climate models: the Parallel Ocean Program (POP, Smith et al., 2010), the Community Ice CodE (CICE, Hunke and Lipscomb, 2010) and the LASG/IAP climate system ocean model (LICOM, Yu et al., 2012). When running at 0.1° resolution with about 1000 CPU cores, we managed to decrease the running time by factors of 7.9, 4.6 and 2.0 for POP, CICE, and LICOM, respectively. We also compared the performance of CFIO against PnetCDF and PIO in different scenarios. Although CFIO has slightly lower throughput than PnetCDF and PIO for scenarios with only data output, CFIO decreases the I/O overhead compared to PnetCDF and PIO for scenarios with both data output and computations, resulting in better overall performance for real climate models.
The current release of CFIO is version 1.20. The source code and documentation for CFIO can be downloaded from GitHub (https://github.com/cfio/cfio).
The remainder of this paper is organized as follows. Section 2 discusses the motivation and the main idea of CFIO. The design and architecture of CFIO is presented in Sect. 3 in detail. Section 4 introduces the interface of CFIO and provides a simple example. Section 5 evaluates and analyses the performance of CFIO. Section 6 introduces related work. Conclusions and possible future work are discussed in Sect. 7.

Motivation
In traditional climate models, the computing phase and the I/O phase run alternately. The computing phase performs the simulation for a certain period of time, and then the I/O phase outputs the results following each computing phase. In other words, the computing phase and the I/O phase of traditional climate models are serial.
In fact, for most current climate models, the initial conditions or data sets for processing are loaded at the starting phase. Then the restart files, which contain all of the initial-condition information that is necessary to restart from a previous simulation, are written to disk at a fixed frequency. Finally, the history files, which include all of the diagnostic variables, are also written to disk at a certain frequency. In general, there are no random seeks and no read-after-write operations, and write operations are usually append-only for all of the relevant parts of the initial files, restart files and history files.
Because of the append-only data access patterns, the computing step does not need to wait for the completion of the last I/O step. Motivated by this observation, we consider the possibility of overlapping the I/O phase with the computing phase. Another advantage of overlapping the computing phase with the I/O phase is that the efficiency of computing and storage resources can be improved. With the serial method, the computing resource is idle during the I/O phase; conversely, the storage resource is idle during the computing phase. With the parallel method, the computing phase and the I/O phase are both pipelined, and the computing and storage resources are always fully utilized.

Design of CFIO
This section describes the general design of CFIO. We first introduce the system architecture of CFIO and discuss the I/O forwarding technique, which is the main method to achieve the overlapping of the computing phase and the I/O phase. We then analyze the maximum possible speedup we can achieve by using CFIO. We also discuss the design options for synchronous and asynchronous communication methods of I/O forwarding.

System architecture of CFIO
Overlapping computation with communication and I/O is an established method for improving the performance of a parallel program. CFIO takes advantage of this computing pattern to reduce the I/O overhead, and uses I/O forwarding to automate the overlapping of I/O and computing. When a climate model uses CFIO as its I/O method, the entire MPI communicator consists of a group of computing processes and an extra group of I/O processes. For example, if we execute the original parallel program with 32 processes and want to use 4 CFIO processes to execute I/O operations, we submit a parallel job with 36 processes.
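As an illustration of this process layout, the following minimal Fortran sketch partitions MPI_COMM_WORLD into a computing group and an I/O group. CFIO performs an equivalent split internally, so the rank assignment and group size below are illustrative assumptions rather than CFIO's actual code.

   program comm_split_sketch
      use mpi
      implicit none
      integer :: ierr, rank, nprocs, color, subcomm
      integer, parameter :: n_io = 4   ! e.g. 4 I/O processes out of 36

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! Assumed layout: the last n_io ranks act as I/O servers,
      ! the remaining ranks run the model.
      if (rank >= nprocs - n_io) then
         color = 1   ! I/O group
      else
         color = 0   ! computing group
      end if
      call MPI_Comm_split(MPI_COMM_WORLD, color, rank, subcomm, ierr)

      ! Computing ranks use subcomm as the model communicator;
      ! I/O ranks enter the server loop and wait for forwarded requests.
      call MPI_Finalize(ierr)
   end program comm_split_sketch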
The I/O forwarding technique has the following advantages. First, for each computing node, forwarding I/O requests to other nodes reduces the local competition for CPU and memory resources. Second, the independent I/O processes provide a large memory buffer, which makes certain optimizations possible, such as data aggregation and rearrangement: the non-contiguous writing of small data blocks can be transformed into contiguous writing of large data blocks, which can significantly improve the performance of the parallel file system.
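The following Fortran sketch illustrates the aggregation idea; it is an assumed simplification for exposition, not CFIO's actual buffering code. Small blocks received from clients are taken to be sorted by target file offset, so copying them into one staging buffer lets the server issue a single large write.

   subroutine aggregate_blocks(nblk, blksize, blocks, staging)
      implicit none
      integer, intent(in)  :: nblk, blksize
      real,    intent(in)  :: blocks(blksize, nblk)   ! one small block per column
      real,    intent(out) :: staging(blksize*nblk)   ! contiguous staging buffer
      integer :: i

      do i = 1, nblk
         staging((i-1)*blksize+1 : i*blksize) = blocks(:, i)
      end do
      ! One large write of 'staging' now replaces nblk small writes.
   end subroutine aggregate_blocks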
The system architecture of CFIO is shown in Fig. 3. We use a client-server mechanism to deal with the forwarding and handling of I/O requests. The CFIO client is co-located with the computing process and provides the climate model with a series of interfaces for accessing the model data. When an I/O request is generated in a computing process, the CFIO client packs the request into a message and sends it to the CFIO server via MPI communication. For high-resolution climate models, we observe a significantly higher number of write operations than read operations: the initial conditions are read only once, whereas the result files are written many times. There is thus little opportunity to overlap read operations with computing. The current CFIO v1.20 therefore focuses on write operations; all parallel read operations in CFIO v1.20 call the corresponding PnetCDF functions directly.
As shown in Fig. 4, the data is forwarded from the computing process to the I/O process. Thus, the total running time of the simulation consists of the computing time, the I/O forwarding time and the I/O time. One scenario that we must consider is that the I/O time is greater than the computing time and cannot be hidden by the computing phase. In this scenario, we can increase the number of I/O processes to solve the problem. More I/O processes provide a larger buffer pool, which can accommodate more data, and also lead to a faster writing speed. In this way, we can further reduce the I/O time and completely overlap the I/O phase with the computing phase.

Speedup analysis
In this section, we formulate an analytical model to estimate the maximum possible speedup of a program when switching to CFIO. We denote the running time of a model with its default I/O approach and with CFIO as $T_{\mathrm{origin}}$ and $T_{\mathrm{cfio}}$, respectively. As shown in Fig. 4, $T_{\mathrm{origin}}$ and $T_{\mathrm{cfio}}$ can be calculated as follows:

$$T_{\mathrm{origin}} = T_{\mathrm{compute}} + T_{\mathrm{io}}, \quad (1)$$

$$T_{\mathrm{cfio}} = \max\left(T_{\mathrm{compute}} + T_{\mathrm{send}},\; T_{\mathrm{recv}} + T_{\mathrm{io}}\right), \quad (2)$$

where $T_{\mathrm{compute}}$ and $T_{\mathrm{io}}$ are the computing time and the I/O time in one simulation step, and $T_{\mathrm{send}}$ and $T_{\mathrm{recv}}$ are the time for sending I/O requests at the client and the time for receiving I/O requests at the server. If the I/O time cannot be hidden by the computing phase, $T_{\mathrm{cfio}}$ equals $T_{\mathrm{recv}} + T_{\mathrm{io}}$. As mentioned above, this scenario can be avoided by increasing the number of I/O processes, so in the ideal case $T_{\mathrm{cfio}} = T_{\mathrm{compute}} + T_{\mathrm{send}}$. The speedup $S$ of using CFIO can be calculated as

$$S = \frac{T_{\mathrm{origin}}}{T_{\mathrm{cfio}}} = \frac{T_{\mathrm{compute}} + T_{\mathrm{io}}}{T_{\mathrm{compute}} + T_{\mathrm{send}}}. \quad (3)$$

An upper bound on the speedup (neglecting $T_{\mathrm{send}}$) can be derived as

$$S \le \frac{T_{\mathrm{compute}} + T_{\mathrm{io}}}{T_{\mathrm{compute}}} = 1 + \frac{T_{\mathrm{io}}}{T_{\mathrm{compute}}}. \quad (4)$$

Equation (4) means that the upper bound of the speedup with CFIO is determined by the ratio of the I/O time to the computing time of the original program: the greater the proportion of I/O time in the entire running time, the greater the speedup CFIO can achieve.
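As a concrete check, the CICE experiment in Sect. 5 measures an I/O-to-computing time ratio of $T_{\mathrm{io}}/T_{\mathrm{compute}} = 4.2$, so Eq. (4) bounds the achievable speedup at $S \le 1 + 4.2 = 5.2$, matching the value quoted there.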

Communication method for I/O forwarding
In the original program, the total running time of the simulation consists of the computing time and the I/O time. In the improved program with CFIO, once the data has been forwarded from client to server, the computing phase and the I/O phase can be executed in parallel, so the ideal running time only includes the computing time and the I/O forwarding time.
Comparing the above two cases, we believe that the benefit of overlapping I/O with computing comes from the fact that the I/O forwarding time over a high-speed network is much less than the I/O time on a parallel file system, especially for high-resolution climate models with a large amount of output data.
There are two options when designing the communication method for I/O forwarding: synchronous and asynchronous. The synchronous and asynchronous approaches discussed here concern only the data communication used to shorten the I/O forwarding time.
Our initial design for I/O forwarding used the asynchronous communication approach. In this approach, all of the I/O requests are packed into a client buffer; forwarding is then performed by a separate sending thread during the computing phases. This permits I/O forwarding to overlap with computing, which implies that the major overhead of calling an asynchronous CFIO function is a memory copy.
However, after performing many experiments, we observed that the asynchronous communication approach leads to network resource competition between the computing phase and the I/O forwarding phase. The competition overhead is negligible when we run the climate model on a small number of cores. However, when the number of cores increases to several hundred, the competition leads to a significant increase in the computing time, which completely negates the benefits of overlapping the I/O forwarding with computing.
In contrast, the synchronous communication approach is a better choice for larger-scale computing. With this approach, the communication needed by the computing phase does not occur during the I/O forwarding phase, so the network resource competition is avoided. The effects of the asynchronous and synchronous communication methods used in CFIO are compared in Sect. 5.1.
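For illustration, the synchronous client-side forwarding path reduces to a blocking send, as in the Fortran sketch below; the routine name and message layout are assumptions, since the paper does not show CFIO's wire protocol.

   subroutine forward_write(field, n, server_rank, tag, comm)
      use mpi
      implicit none
      integer, intent(in) :: n, server_rank, tag, comm
      real,    intent(in) :: field(n)   ! packed write request
      integer :: ierr

      ! Blocking send: returns only once the buffer can be reused, so
      ! no client-side sending thread competes with the model's own
      ! MPI traffic during the computing phase.
      call MPI_Send(field, n, MPI_REAL, server_rank, tag, comm, ierr)
   end subroutine forward_write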
Note that synchronous communication can lead to buffer exhaustion in the I/O processes because of the bursty I/O behavior of climate models. In this case, the CFIO client has to remain idle until the CFIO server finishes handling some of the buffered requests and releases sufficient buffer space. Because the I/O pattern of a climate model is known, buffer exhaustion can be avoided by launching enough servers, the number of which is controlled by the user. How to determine the optimal number of CFIO servers for a particular program and machine environment is the subject of ongoing work.

The CFIO interface
As the netCDF format is the de facto data format in the climate community, we chose to inherit the netCDF format to minimize the required effort in terms of code updates and data post-processing when switching to CFIO. In netCDF, writing a new data set involves a sequence of operations: creating the data set; defining the dimensions, variables, and attributes; ending define mode; writing the variable data; and closing the data set file. CFIO supports all of the functions that are required to perform this series of operations, and these functions can be classified into three categories. There are four additional functions that involve initialization and finalization of the library and an operation related to the I/O forwarding. All of the additional functions are shown in Table 1.
For the requirement of consistency across all computing processes, all CFIO functions are defined as collective I/O operations. For example, when a climate model intends to write a new data set, all of the computing processes should call the CFIO functions in the same sequence, and the same arguments should be passed into the functions. Listing 1 shows a simple example of outputting data with CFIO. This example outputs sea surface temperature (SST) data with latitude and longitude dimensions. The cfio_init function is used to describe the dimensions of the output array and the number of CFIO servers. The function takes three arguments: LAT_PROC, LON_PROC and ratio. LAT_PROC and LON_PROC describe the latitude and longitude decompositions of the horizontal domain among the computing processes; ratio stands for the proportion of computing processes to I/O processes. If we run the example application with N processes, there will be N/ratio processes acting as I/O processes. The cfio_proc_type function is called to indicate the type of the local process. The computing processes run the computing code and call the CFIO data access functions to output data; the I/O processes are launched automatically to run the CFIO server. The cfio_put_vara_real function is called to output the variable data, and the cfio_end_io function is called to send a signal indicating that the current I/O phase is finished. This signal is used for the management of the buffer and for the communication between the sender and the receiver.
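Listing 1 is not reproduced here, but the Fortran sketch below reconstructs its shape from the description above; the exact argument lists, the constant naming and the omitted definition calls are assumptions, not the definitive CFIO API.

   program sst_output_sketch
      implicit none
      ! Hypothetical constant; the real value would come from a CFIO module.
      integer, parameter :: CFIO_PROC_CLIENT = 0
      integer, parameter :: LAT_PROC = 4, LON_PROC = 8  ! domain decomposition
      integer, parameter :: ratio = 8    ! computing-to-I/O process proportion
      integer :: ncid, varid, ptype
      integer :: start(2), count(2)
      real    :: sst(600, 300)           ! local block of the SST field

      call cfio_init(LAT_PROC, LON_PROC, ratio)
      call cfio_proc_type(ptype)         ! computing or I/O process?

      if (ptype == CFIO_PROC_CLIENT) then
         ! Create the data set and define dims/vars as with netCDF
         ! (calls omitted; they would set ncid and varid), then:
         start = (/ 1, 1 /)
         count = (/ 600, 300 /)
         call cfio_put_vara_real(ncid, varid, start, count, sst)
         call cfio_end_io(ncid)          ! signal that this I/O phase is done
      end if
      ! I/O processes run the CFIO server automatically.
   end program sst_output_sketch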
This example shows that the I/O forwarding is implemented automatically and the complicated underlying mechanisms are opaque to the modelers. If the original program already uses the netCDF interfaces, the user does not need to perform any extra programming other than adding a few configuration function calls and switching the prefix "nf90" (the prefix of the standard netCDF Fortran 90 interfaces) to "cfio" when calling the I/O functions.

Experiments
We conducted our experiments on the Tansuo100 supercomputer at Tsinghua University. The supercomputer consists of 740 nodes, each of which has two 2.93 GHz Intel Xeon X5670 6-core processors and 32 GB of memory. The nodes are connected through an InfiniBand network, which provides a maximum bandwidth of 40 Gb s⁻¹. The file system is Lustre, with 1 metadata server (MDS) and 40 object storage targets (OSTs). The peak writing performance of this file system is 4 GB s⁻¹. The node operating system is Red Hat Enterprise Linux 5.5 x86_64. All of the programs in our experiments were compiled with the Intel compiler v11.1, and the MPI environment is Intel MPI v4.0.2.
In the following sections, we first describe our evaluation of CFIO on three climate models, and then provide a comparison of the performance of CFIO and PnetCDF in various scenarios. For the standalone POP, CICE and LICOM test cases, we downloaded the models from their official websites. The official standalone versions only support netCDF.
In each of the following three cases, we obtain the best achievable performance of the model by turning off all I/O operations and compare the result with CFIO. Because of the benefits brought by overlapping I/O with computing, using CFIO comes closest to this best performance. Therefore, we believe our proposed forwarding scheme can be a useful complement to the existing parallel I/O libraries.

POP case study

The POP output files consist of restart files, history files and movie files. The restart file is generated for each simulated day. History and movie files are generated for each simulated hour. The variables included in the output netCDF files are two-dimensional arrays of 3600 × 2400, representing the spatial domain, and three-dimensional arrays of 3600 × 2400 × 40, in which the third dimension represents sea depth. The total size of the final output files is 315 GB. In this experiment, POP at 0.1° resolution ran for 440 iterations to simulate 2 days.
We recorded the overall POP running time with CFIO and compared the results with POP running with the default I/O and with NO-I/O. The overall running time with NO-I/O describes the pure computation time, which can be used as the upper bound of the performance achievable by completely overlapping I/O and computing. Figure 5 shows the experimental results. Our current design requires the number of computing processes to be a multiple of the number of I/O processes; therefore, the case of 64 I/O processes and 160 computing processes is not yet covered in our current experiments.
As expected, CFIO outperformed the default I/O approach in POP with both 32 and 64 I/O processes. When running with 1280 computing processes and 32 I/O processes, the overall running time of POP decreased from 3246 to 471 s, which means that we obtained a 6.9× speedup for POP with CFIO, already close to the upper bound given by the NO-I/O case. The performance with 64 I/O processes is better than that with 32 I/O processes: in the case of 1280 computing processes, the overall running time was further reduced by 12 % (from 471 to 413 s) when switching from 32 to 64 I/O processes. This translates into an increased speedup of 7.9× compared to the original performance of POP.
We also compared the POP running time for each of the two communication approaches discussed in Sect. 3.3. To obtain an accurate understanding of the impact of I/O forwarding, we measured the computing time and the I/O time separately in POP. We also evaluated the NO-I/O case to measure the pure computing time unaffected by any I/O operations.
Figure 6 shows the performance results for the different communication methods when running with 32 I/O processes. We observed that with the synchronous communication method, the computing time is always close to that of the NO-I/O case. With the asynchronous communication method, however, the computing time became significantly larger than in the NO-I/O case when POP scaled to a larger number of cores. POP running with 160 computing processes was the only case in which the asynchronous communication method achieved a shorter running time than the synchronous method. With the number of computing processes increased to 320, the total running time with the asynchronous communication method became larger than with the synchronous method, due to the communication conflicts between the I/O forwarding and the computation. When running with 1280 computing processes, the computing time with the asynchronous communication method increased to 1101 s, around 3 times larger than in the NO-I/O case. Based on these results, we chose synchronous communication as our default communication method.

CICE case study
CICE is a sea ice model that has also been developed at Los Alamos National Laboratory. It is the sea ice component model of CESM. In general, CICE uses the same horizontal grid resolution as POP, and it partitions the data arrays equally across all of the computing processes by using the same two-dimensional data decomposition as POP. CICE uses netCDF as the output file format for history files, and a binary file format for the restart files. Similar to POP, when outputting a history file in CICE, the output data are gathered by one process and then written to disk by calling a serial netCDF interface. Because CFIO only supports the netCDF format, we only used CFIO to output history files in CICE, and the output of restart files was disabled. We used CICE version 4.1 at 0.1° resolution in this experiment. CICE ran for 960 iterations to simulate 40 days. History files are generated for each simulated day. The variables included in the output netCDF files are two-dimensional arrays of size 3600 × 2400. The output files have a fixed size of 80 GB in total.
We recorded the overall CICE running time with CFIO and compared the results with CICE running with the default I/O and with NO-I/O. Figure 7 shows the experimental result. The number of computing processes varied from 160 to 1280. Comparing the running time of CICE with default I/O and with NO-I/O, we clearly see that the I/O brings a significant overhead to both the running time and the scalability of the program. CFIO outperformed the default I/O approach in CICE with both 32 and 64 I/O processes. When running with 1280 computing processes, the running time of CICE with 64 I/O processes was 204 s, which is less than the 226 s running time with 32 CFIO servers. In terms of scalability, CICE with CFIO demonstrated similar behavior to CICE with NO-I/O. Compared to CICE with the default I/O (running time 928 s), we achieved a 4.6× speedup by using 64 I/O processes. The speedup in the case of CICE was slightly lower than in the case of POP. The main reason is that the I/O load of CICE is not as heavy as that of POP. Since the running time of CICE with NO-I/O (the pure computing time) was 178 s, the ratio of the I/O time to the computing time is 4.2. Based on Eq. (4), we can infer that the maximum speedup from using CFIO in the CICE case is 5.2, compared with the maximum speedup of 9.8 for POP.

LICOM case study
LICOM is an ocean model developed by the State Key Laboratory of Numerical Modeling for Atmospheric Sciences and Geophysical Fluid Dynamics (LASG) of the Institute of Atmospheric Physics (IAP) in China. It is the ocean component model of the LASG/IAP Earth System Model FGOALS-g2. LICOM also partitions the data arrays equally across all of the computing processes by using a two-dimensional data decomposition of the horizontal domain. LICOM uses netCDF as the output file format. The output data, including restart files and history files, are gathered by one process and then written to disk by calling a serial netCDF interface.
In this experiment, we used LICOM version 2 at 0.1° resolution to simulate 10 days. History files are generated for each simulated day, and only one restart file is generated at the end of the program. The output file variables are two-dimensional arrays of 3602 × 1683, in the spatial domain, and three-dimensional arrays of 3602 × 1683 × 55, in which the third dimension represents sea depth. The output files have a fixed size of 144 GB in total.
Figure 8 shows the test results for LICOM. In this experiment, the number of computing processes varied from 200 to 800. The scalability of LICOM is somewhat poorer than that of POP and CICE: when scaling to 800 computing processes, the computing performance of LICOM started to degrade. Therefore, we used a maximum of 800 instead of 1280 computing processes in this experiment. LICOM running with 800 computing processes and 50 I/O processes had a running time of 4561 s. Compared to the original running time of 9101 s, LICOM obtained a 2.0× speedup by using CFIO. The running time with NO-I/O (the pure computing time) was 4383 s, so the ratio of the I/O time to the computing time is 1.07. Based on Eq. (4), the maximum speedup from using CFIO in the LICOM case is 2.07.

Comparing CFIO with PnetCDF and PIO
PnetCDF was developed to support parallel I/O for netCDF. It is built on top of MPI-IO to take advantage of collective I/O optimizations. PIO is an application-level parallel I/O library that was developed for CESM. PIO supports several back-end I/O libraries, including MPI-IO, netCDF, and PnetCDF.
In our experiments, we tested the newest PnetCDF v1.4.0 and PIO v1.6.0, and chose PnetCDF as the back-end method of PIO to obtain good parallel throughput. We refer interested readers to Dennis et al. (2012) for more detail about the performance of PnetCDF and PIO. The default stripe count of our Lustre file system is 1; we changed this setting to the maximum of 40 (e.g., via the lfs setstripe utility) to get the best write performance.
To compare the performance of CFIO, PnetCDF and PIO, we designed two MPI test programs to evaluate different scenarios. The first MPI test program outputs a 32 GB data set with 500 variables in one large netCDF file to evaluate the write bandwidth of CFIO, PnetCDF and PIO. The second MPI test program simulates the typical I/O patterns of climate models to show the advantage of the I/O forwarding technology.
In the first program, every variable is a two-dimensional array of 4096 × 2048 double-precision floating-point numbers. The data arrays are partitioned equally across all of the computing processes, using a two-dimensional data decomposition. The size of the output data per client process decreases as the number of computing processes increases.
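A typical way to derive each rank's block in such a decomposition is sketched below; this is an assumed layout, since the paper does not list the benchmark's decomposition code.

   subroutine block_decomp(rank, px, py, nx, ny, start, count)
      implicit none
      integer, intent(in)  :: rank, px, py   ! process grid of px*py ranks
      integer, intent(in)  :: nx, ny         ! global array size
      integer, intent(out) :: start(2), count(2)
      integer :: ix, iy

      ix = mod(rank, px)            ! position of this rank in the grid
      iy = rank / px
      count(1) = nx / px            ! assumes px divides nx, py divides ny
      count(2) = ny / py
      start(1) = ix * count(1) + 1  ! 1-based, netCDF-style start index
      start(2) = iy * count(2) + 1
   end subroutine block_decomp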
Figure 9 shows the throughput of CFIO as a function of the number of CFIO servers. The throughput of PnetCDF and PIO is shown for comparison; the horizontal axis for PnetCDF and PIO stands for the number of clients that call the PnetCDF functions. We see that the throughput of CFIO increased with the number of CFIO servers but stopped increasing when the number of CFIO servers reached 128. This is mainly due to the limited number of storage devices in the Lustre file system. The writing throughput of CFIO reached approximately 1 GB s⁻¹ when using 128 servers and 512 clients. The same pattern was observed for PnetCDF, which achieved a throughput of approximately 1.24 GB s⁻¹ when using 128 clients. The curve for PIO is very close to that of PnetCDF, peaking at 1.2 GB s⁻¹ for 128 clients.
These results show that the throughput of CFIO is approximately 10 % less than that of PnetCDF because of the overhead associated with I/O forwarding. Although CFIO provides slightly lower throughput than PnetCDF, we will show that its practical performance is better than that of PnetCDF in the realistic scenarios emulated by the second program.
In the second program, we emulated a computing and I/O pattern that is typical of common climate models. There are a total of 40 loop iterations in the program. In each loop iteration, the program takes 7.5 s to perform floating-point computations and produces 3.2 GB of data. There are no intercommunication operations during the computing phases. For comparison purposes, we evaluated four different cases: CFIO, PnetCDF, PIO, and NO-I/O, where NO-I/O means that all I/O operations are disabled in the second program.
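The structure of the emulated loop is sketched below with hypothetical helper names; the iteration count, timing and data volume follow the description above.

   program io_pattern_sketch
      implicit none
      integer :: step
      do step = 1, 40
         call do_computation()     ! ~7.5 s of floating-point work
         call write_output(step)   ! 3.2 GB of output per iteration:
                                   ! serial I/O blocks here, whereas CFIO
                                   ! returns after forwarding, so the
                                   ! write overlaps the next iteration
      end do
   contains
      subroutine do_computation()
         ! placeholder for the benchmark's floating-point kernel
      end subroutine do_computation
      subroutine write_output(step)
         integer, intent(in) :: step
         ! placeholder for the PnetCDF, PIO, or CFIO output calls
      end subroutine write_output
   end program io_pattern_sketch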
Figure 10 shows the overall running time of the test program. Without any I/O operations, the total running time was 300 s. When running with 128 clients, the total running time using PnetCDF and PIO was 417 and 431 s, respectively; PIO takes a small amount of time, about 14 s, to initialize the I/O decomposition. The corresponding time using CFIO with 128 servers and 128 clients was 323 s. This result shows that CFIO decreases the I/O overhead compared to PnetCDF and PIO.
Figure 10 also shows that the performance of CFIO improves with more servers. The I/O overhead with CFIO is the cost of the I/O in the last loop iteration, which cannot be overlapped, plus the cost of I/O forwarding. As the throughput of CFIO grows with the number of servers, the cost of the I/O in the last loop iteration is reduced, and the cost of I/O forwarding also naturally decreases as the number of CFIO servers increases.

Related work

Fu et al. (2010) quantified the overhead of overlapping I/O and computation using real partitioned computational fluid dynamics (CFD) solver data. They showed that it is possible to use a small portion (3 to 6 %) of the processes in the MPI communicator as I/O processes to achieve an actual writing bandwidth of 2.3 GB s⁻¹ and a latency-hiding writing bandwidth of more than 21 TB s⁻¹ on the IBM Blue Gene/L supercomputer.
Collective buffering is an attractive and practical optimization method in MPI-IO. It is also called two-phase I/O because it breaks the I/O operation into two stages. For a collective write operation, the first stage uses the aggregators, a subset of the MPI processes, to aggregate the data into a temporary buffer on the aggregator nodes. In the second stage, the aggregators ship the data from the aggregator nodes to the I/O servers. The advantage of two-phase I/O is that fewer nodes need to communicate with the I/O servers, which reduces resource contention. ROMIO (Thakur et al., 1999), one of the portable MPI-IO implementations, uses the two-phase optimization to improve I/O performance. ROMIO's two-phase optimization designates some MPI ranks as I/O aggregators, though ROMIO's aggregators are assigned to file regions, not to clients. The data model of ROMIO is a linear stream of bytes, whereas the data model of CFIO is array-oriented. It is worth noting that CFIO implements a form of the two-phase optimization used in ROMIO on top of the Lustre file system.
I/O forwarding has also been used by Prost et al. (2001), Oldfield et al. (2006), Nisar et al. (2008), Fu et al. (2010), Docan et al. (2010) and May (2001) to reduce the impact of I/O on computing. The IBM Blue Gene series of supercomputers (Yu et al., 2006) uses independent I/O nodes to handle I/O requests, which are generated on the compute nodes and forwarded to the I/O nodes. DataStager, designed by Abbasi et al. (2009), is a data staging service that provides asynchronous data extraction for ADIOS. ADIOS provides a simple function and an external XML file to configure the data structure and I/O methods; by switching parameters in the XML file, users can choose an optimal I/O method for their application according to the runtime environment. In ADIOS, a novel BP file format is designed to decrease the overhead of maintaining metadata consistency. DataStager uses server-directed I/O to manage asynchronous communication for data transfer. The DataStager research found that the asynchronous method for data transfer can significantly impact the performance of tightly coupled parallel programs, and DataStager implements two schedulers to reduce this impact: the phase-aware scheduler prevents background data transfer during the communication phase, which is predicted by DataStager or marked by the application developers, and the rate-limiting scheduler manages the number of concurrent requests made to compute nodes to control the data transfer rate.
For climate models, Dennis et al. (2012) introduced an application-level parallel I/O library named PIO. It provides the flexibility to adapt to the different I/O requirements of the different component models of CESM. PIO utilizes a form of I/O forwarding in which a portion of the compute nodes are selected to collect the output data and rearrange the data in memory into a more I/O-friendly decomposition, a process called data rearrangement. Through data rearrangement, PIO achieves better I/O throughput with less memory consumption because it leads to fewer calls into the back-end I/O libraries. However, PIO cannot overlap I/O with computing; that is, PIO can shorten the I/O time but not hide it. Palmer et al. (2011) also proposed a specialized parallel data I/O method for the Global Cloud Resolving Model. This method avoids the creation of very large numbers of files. The output data layout linearizes the data in a consistent way that is independent of the number of processors used to run the simulation and provides a convenient format for subsequent analysis of the data.

The design of CFIO was inspired by the techniques of overlapping I/O, two-phase I/O and I/O forwarding described above. In comparison, CFIO focuses on the requirements of high-resolution climate models and provides automatic overlapping of I/O with computing so as to shorten the entire simulation time of the climate models, not just the I/O part. CFIO uses I/O forwarding to perform the overlapped I/O on remote processes, so the overhead of managing multiple threads is avoided. In addition, CFIO provides synchronous functions that perform the I/O overlapping automatically, so no modifications of the existing climate modeling code for asynchronous functions are necessary.

Conclusions
In this article, we presented a parallel I/O library, CFIO, which provides automated overlapping of I/O and computing. CFIO uses similar interfaces to PnetCDF, so as to minimize the required code modification when porting. The experimental results show that CFIO outperforms PnetCDF and PIO in typical climate modeling scenarios. We also compared the performance of using different communication methods, and we found that the synchronous communication method performs better when a program is running on a larger number of cores.
For future work, we plan to conduct more experiments on different machines with different file systems. We will also adopt and test CFIO in more climate models. The MPI communication method that we used for I/O forwarding requires further optimization, and the method for determining the optimal number of CFIO servers for specific climate models still needs to be studied.

Fig. 2. The schematic diagram of the I/O forwarding technique.

Fig. 5. The overall running time of POP with different I/O approaches.

Fig. 6. The running time for computing and I/O with different communication methods when running with 32 CFIO servers.

Fig. 7. The overall running time of CICE with different I/O approaches.

Fig. 8. The overall running time of LICOM with different I/O approaches.

Fig. 10. The overall running time of the MPI test application with CFIO, PnetCDF and PIO.