Data transfer means transferring data fields from a sender to a receiver. It is a fundamental and frequently used operation of a coupler. Most state-of-the-art couplers currently use an implementation based on the point-to-point (P2P) communication of the message passing interface (MPI) (referred to as the "P2P implementation" hereafter). In this paper, we reveal the drawbacks of the P2P implementation when the parallel decompositions of the sender and the receiver are different: low communication bandwidth due to small message sizes, a variable and high number of MPI messages, and network contention. To overcome these drawbacks, we propose a butterfly implementation for data transfer. Although the butterfly implementation outperforms the P2P implementation in many cases, it degrades performance when the sender and the receiver have similar parallel decompositions or when the number of processes used for running models is small. To guarantee optimal data transfer performance, we design and implement an adaptive data transfer library that combines the advantages of both the butterfly implementation and the P2P implementation. Because the adaptive data transfer library automatically uses the best implementation for data transfer, it outperforms the P2P implementation in many cases while not decreasing performance in any case. The adaptive data transfer library is now publicly available and has been integrated into the C-Coupler1 coupler to improve the performance of data transfer. We believe that other couplers can also benefit from it.

Climate system models (CSMs) and Earth system models (ESMs) are fundamental tools for simulating, predicting, and projecting climate. A CSM or an ESM generally integrates several component models, such as an atmosphere model, a land surface model, an ocean model, and a sea-ice model, into a coupled system to simulate the behaviours of the climate system, including the interactions between its components. The number of coupled models worldwide keeps growing: for example, the number of coupled model configurations in the Coupled Model Intercomparison Project (CMIP) has increased from fewer than 30 (used for CMIP3) to more than 50 (used for CMIP5).

High-performance computing is an essential technical support for model
development, especially at ever-higher model resolutions. Modern
high-performance computers integrate an increasing number of processor cores
for ever-higher computational performance. Therefore, efficient
parallelization, which enables a model to utilize more processor cores for
acceleration, has become a technical focus in model development, and a number
of efficiently parallelized component models have emerged. For example,
the Community Ice CodE (CICE; Hunke and Lipscomb, 2008; Hunke et al., 2013) at 0.1

A coupler is an important component in a coupled system. It links component models together to construct a coupled model, and controls the integration of the whole coupled model (Valcke et al., 2012). A number of couplers are now available, e.g. the Model Coupling Toolkit (MCT; Jacob et al., 2005), the Ocean–Atmosphere–Sea Ice–Soil (OASIS) coupler (Redler et al., 2010; Valcke, 2013; Valcke et al., 2015), the Earth system modelling framework (ESMF; Hill et al., 2004), the CPL6 coupler (Craig et al., 2005), the CPL7 coupler (Craig et al., 2012), the flexible modelling system (FMS) coupler (Balaji et al., 2006), the bespoke framework generator (BFG; Ford et al., 2006; Armstrong et al., 2009), and the community coupler version 1 (C-Coupler1; Liu et al., 2014).

A coupler generally has a much smaller overhead than the component models in current coupled systems. However, it is potentially a time-consuming component in future coupled models, because more and more component models (such as land-ice, chemistry, and biogeochemical models) will be coupled into a coupled model, and the coupling frequency between component models will keep increasing. Data transfer is a fundamental and frequently used operation in a coupler. It is responsible for transferring data fields between the processes of two component models and for rearranging data fields among processes of the same component model for parallel data interpolation.

A coupler may become a bottleneck for efficient parallelization of future
coupled models. The most obvious reason is that the current implementation of
data transfer in a state-of-the-art coupler may not be efficient enough. For
example, due to the low efficiency of data transfer, the coupling from a
component model with a horizontal grid (576

In this study, we first propose a butterfly implementation of data transfer. Since the point-to-point (P2P) communication of the message passing interface (MPI) (referred to as the "P2P implementation" hereafter) and the butterfly implementation can outperform each other in different cases (Sect. 5), we next develop an adaptive data transfer library that includes both implementations and adaptively selects the better one for data transfer. Performance evaluation demonstrates that such a library significantly outperforms the P2P implementation in most cases and does not degrade the performance in any case. This library has been imported into C-Coupler1 with a slight code modification. We believe that other couplers can also benefit from it.

The remainder of this paper is organized as follows. We briefly introduce the implementation of data transfer in existing couplers in Sect. 2. Details of the butterfly implementation and the adaptive data transfer library are presented in Sects. 3 and 4, respectively. The performances of data transfer implementations are evaluated in Sect. 5. Conclusions are given in Sect. 6.

Almost all state-of-the-art couplers use a similar implementation for data transfer. To achieve parallel data transfer, MCT first generates a communication router (i.e. the data mapping between processes) according to the parallel decompositions (the distribution of grid points among the processes) of the sender and the receiver, and then uses the P2P communication of the MPI to transfer the data. A data field is transferred from a process of the sender to a process of the receiver only when the two processes own common grid points; this approach is referred to as the "P2P implementation" for short.
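The router generation described above can be sketched as follows. This is an illustrative reconstruction, not MCT's actual API: the set-based representation of parallel decompositions and the function name `build_router` are assumptions.

```python
def build_router(send_decomp, recv_decomp):
    """Build a P2P communication router from two parallel decompositions.

    send_decomp[i] / recv_decomp[j] are the sets of global grid-point
    indices owned by sender process i / receiver process j (hypothetical
    representation).  A message between i and j is needed only when the
    two processes own common grid points.
    """
    router = {}
    for i, src_pts in enumerate(send_decomp):
        for j, dst_pts in enumerate(recv_decomp):
            common = src_pts & dst_pts
            if common:
                # one MPI message will carry exactly these grid points
                router[(i, j)] = sorted(common)
    return router
```

When the two decompositions differ strongly, many (i, j) pairs share only a few grid points, which is exactly the small-message problem analysed below.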

Average execution time of the P2P implementation when transferring
14 2-D fields from CLM3 to GAMIL2. In each test, the atmosphere model GAMIL2
and the land surface model CLM3 have the same number of processes; they do
not share the same computing nodes. The horizontal grid of the 14 2-D fields
contains 7680 (128

Variation of bandwidth (

Since MCT has already been imported into OASIS3–MCT, the CPL6 coupler, and the CPL7 coupler, these couplers also use the P2P implementation for data transfer. Although the other couplers, such as ESMF, OASIS4, the FMS coupler, and C-Coupler1, do not directly import MCT, they also use the P2P implementation for data transfer.

In this work, we first investigate the performance characteristics of the P2P
implementation, and therefore derive a benchmark from a real coupled model
GAMIL2 (Grid-Point Atmospheric Model of IAP LASG, version 2)–CLM3 (Community Land Model version 3), which couples the atmosphere model GAMIL2 (Li et al., 2013) with CLM3 (Oleson et al., 2004; Dickinson et al., 2006), a land
surface model. GAMIL2 and CLM3 share the same horizontal grid of 7680
(128

In this benchmark, there is only the data transfer with the P2P
implementation between the sender and the receiver with the same horizontal
grid as GAMIL2–CLM3. The parallel decomposition of the sender is derived from
CLM3, and the parallel decomposition of the receiver is derived from GAMIL2.
A high-performance computer called Tansuo100 at Tsinghua University, China, is
used for the performance tests. It has 700 computing nodes, each of which
contains two six-core Intel Xeon X5670 CPUs and 32 GB main memory. All
computing nodes are connected by a high-speed InfiniBand network with peak
communication bandwidth of 5 GB s−1.

To evaluate the parallel performance of the P2P implementation, 14 2-D coupling fields are transferred between the sender and the receiver. In each test, the sender and the receiver use the same number of processes. Since there are 12 processor cores on each computing node, the number of processes is set to be an integral multiple of 12. The sender and the receiver are located on different computing nodes and the communication of the P2P implementation must go through the InfiniBand network.

Figure 1 demonstrates that the P2P implementation scales poorly when the parallel decompositions of the sender and receiver are different. It is well known that communication performance heavily depends on message size. As shown in Fig. 2, the achieved P2P communication bandwidth generally increases with message size, so when the message size is small (for example, smaller than 4 KB), the achieved communication bandwidth is very low. The message size in the P2P implementation decreases as the number of model processes increases (Fig. 3), indicating that the communication bandwidth becomes lower when increasing the number of processes. The performance of data transfer also heavily depends on the number of MPI messages. As shown in Fig. 4, the variation of the average number of MPI messages in the P2P implementation is consistent with the variation of the execution time in Fig. 1: both increase with the number of processes from 6 to 48, and decrease with the number of processes from 96 to 192. A lower execution time of the P2P implementation would be obtained with even more processes, since the average number of MPI messages would decrease further (the maximum number of processes in Figs. 1 and 4 is limited to 192 because GAMIL2–CLM3 is not further accelerated beyond that).

Variation of message size of the P2P implementation (

Variation of the number of MPI messages of one process (

To further reveal possible reasons for the poor parallel scalability, we compare the ideal performance and the actual performance in Fig. 5. The ideal performance is much better than the actual performance, and the ratio between the two significantly increases with the number of processes. The significant gap between the ideal performance and the actual performance is due to network contention: for example, when multiple P2P communications share the same sender process or receiver process, they must be serialized.

The drawbacks of the P2P implementation when the sender and the receiver use different parallel decompositions can be identified as low communication bandwidth due to small message size, a variable and high number of MPI messages, and network contention. To overcome these drawbacks, a prospective solution is to organize the transfer of data using a better algorithm, e.g. the butterfly algorithm (Fig. 6), which has already been studied in computer science (Chong and Brewer, 1994; Foster, 1995; Heckbert, 1995; Hemmert and Underwood, 2005; Kim et al., 2007; Jan et al., 2013; Petagon and Werapun, 2016). With respect to hardware, the traditional butterfly algorithm and its transformations have been used to design networks (Chong and Brewer, 1994; Kim et al., 2007); with respect to software, the butterfly algorithm has been used to improve parallel algorithms with all-to-all communications (Foster, 1995), e.g. the fast Fourier transform (FFT; Heckbert, 1995; Hemmert and Underwood, 2005), matrix transposition (Petagon and Werapun, 2016), and sorting (Jan et al., 2013).

Ideal and actual bandwidths of the P2P implementation (

An example of the butterfly kernel with eight processes. Each
coloured row stands for one process (

Unfortunately, the classical butterfly algorithm cannot be used as-is to improve data transfer, because it requires that each process communicate with every other process, that the communication load among processes be balanced, and that the number of processes be a power of 2. In practice, data transfer for model coupling has different characteristics: a process needs to communicate with only a subset of the other processes, the communication load among processes is generally unbalanced, and the number of processes cannot be restricted to a power of 2. Therefore, we propose here a new implementation of data transfer involving an additional butterfly kernel to transfer data from the sender with the source parallel decomposition to the receiver with the target parallel decomposition. As the number of processes of the butterfly kernel must be a power of 2, while the numbers of processes of the sender and the receiver need not be, the butterfly kernel has its own source and target parallel decompositions, and process mappings are required from the sender onto the butterfly kernel and from the butterfly kernel onto the receiver (see Fig. 7). Next, we present the butterfly kernel and the process mappings.
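The butterfly kernel's routing can be sketched as a small simulation (a hedged illustration, not the library's actual code): at stage s, each process exchanges data with the partner whose rank differs in bit s, forwarding every item whose destination rank differs from the current holder's rank in that bit. After log2(p) stages, every item has reached its destination, and each process sends exactly one message per stage.

```python
import math

def butterfly_exchange(initial):
    """Simulate the butterfly kernel on p = 2^k processes.

    initial[r] is the list of (dest_rank, payload) items held by rank r
    under the source decomposition (illustrative representation).
    """
    p = len(initial)
    k = int(math.log2(p))
    assert 1 << k == p, "butterfly kernel requires a power-of-2 process count"
    buf = [list(items) for items in initial]
    for stage in range(k):
        bit = 1 << stage
        new = [[] for _ in range(p)]
        for r in range(p):
            for dest, payload in buf[r]:
                if (dest ^ r) & bit:
                    # destination differs in this bit: hand the item to
                    # the stage partner, fixing one address bit per stage
                    new[r ^ bit].append((dest, payload))
                else:
                    new[r].append((dest, payload))
        buf = new
    return buf
```

Because each rank bundles all items bound for its single stage partner into one message, the per-stage message count is fixed at one per process, at the cost of forwarding some data through intermediate ranks.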

The butterfly implementation is composed of three parts: the butterfly kernel, the process mapping from the sender to the butterfly kernel, and the process mapping from the butterfly kernel to the receiver.

The first question for the butterfly kernel is how to decide its number of
processes. Any process of the sender or receiver can be used as a process for
the butterfly kernel. Given that the total number of unique processes of the
sender and receiver is
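One natural rule, assumed here purely for illustration, is to set the butterfly kernel's process count to the largest power of 2 not exceeding the total number of unique processes of the sender and receiver:

```python
def kernel_process_count(n_unique):
    """Largest power of 2 not exceeding n_unique (an illustrative rule
    for choosing the butterfly kernel's process count)."""
    if n_unique < 1:
        raise ValueError("need at least one process")
    # bit_length() - 1 is the exponent of the highest set bit
    return 1 << (n_unique.bit_length() - 1)
```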

The butterfly kernel is responsible for rearranging the distribution of data
among the processes from the source parallel decomposition to the target
parallel decomposition. Given the number of processes

To reveal the advantages and disadvantages of the two implementations, we
measure the characteristics of the two implementations based on the benchmark
introduced in Sect. 2.2. The results show that the total amount of data
transferred by the butterfly implementation is larger than that transferred by the P2P
implementation (Fig. 8), which is the major disadvantage of the butterfly
implementation. Meanwhile, compared with the P2P implementation, the
butterfly implementation can have the following advantages:

a bigger message size, yielding better communication bandwidth (Fig. 9);

a more balanced and smaller number of MPI messages among processes (Fig. 10);

ordered communications among processes and fewer concurrent communications (Fig. 10), which can dramatically reduce network contention.

Total amount of data transferred by P2P implementation and butterfly
implementation (

Average message size transferred by P2P implementation and butterfly
implementation (

We now introduce the process mappings from the sender to the butterfly kernel and from the butterfly kernel to the receiver. To minimize the overhead of the process mapping from the butterfly kernel to the receiver, we map one or multiple processes of the butterfly kernel onto a process of the receiver if the butterfly kernel has more processes than the receiver; otherwise, we map a process of the butterfly kernel onto one or multiple processes of the receiver. In other words, there is no many-to-many process mapping between the butterfly kernel and the receiver. Similarly, there is no many-to-many process mapping between the sender and the butterfly kernel.
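A block-wise assignment honouring this one-to-many / many-to-one constraint could look like the sketch below; the exact assignment rule used by the library is not spelled out here, so the contiguous block mapping is an assumption.

```python
def map_processes(n_kernel, n_recv):
    """Sketch of the process mapping between the butterfly kernel and
    the receiver: never many-to-many.

    Returns mapping[k] = list of receiver ranks served by kernel rank k.
    """
    mapping = {k: [] for k in range(n_kernel)}
    if n_kernel >= n_recv:
        # several kernel processes feed one receiver process each
        for k in range(n_kernel):
            mapping[k].append(k * n_recv // n_kernel)
    else:
        # one kernel process feeds several receiver processes
        for r in range(n_recv):
            mapping[r * n_kernel // n_recv].append(r)
    return mapping
```

The same shape of mapping applies, mirrored, between the sender and the butterfly kernel.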

Processes of the sender or the receiver may be unbalanced in terms of the data size transferred, which may result in unbalanced communications among processes of the butterfly kernel. As mentioned in Sect. 3.1, at each stage of the butterfly kernel, all processes are divided into a number of pairs, each of which is involved in P2P communications. To improve the balance of communications among the processes in the butterfly kernel, one solution is to try to make the process pairs at each stage more balanced in terms of the data size of P2P communications, so we propose to reorder the processes of the sender or the receiver according to data size. At the first stage, we pick out the process with the largest data size and the process with the smallest data size from the remaining processes that have not been paired, to generate a process group. For the next stage, the outputs of two process groups from the previous stage are paired into bigger process groups in a similar way. After finishing the iterative pairing throughout all stages, all processes of the sender or the receiver are reordered.
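The iterative pairing can be sketched as follows (an illustrative reconstruction: `data_sizes` gives the data size transferred by each process, and the process count is assumed here to be a power of 2):

```python
def reorder_by_pairing(data_sizes):
    """Reorder process ranks so that butterfly stage partners are
    balanced in transferred data size."""
    groups = [[r] for r in range(len(data_sizes))]
    size = lambda g: sum(data_sizes[r] for r in g)
    while len(groups) > 1:
        groups.sort(key=size)
        paired = []
        while groups:
            smallest = groups.pop(0)   # smallest remaining group
            largest = groups.pop(-1)   # largest remaining group
            # merge them so the two halves become stage partners
            paired.append(largest + smallest)
        groups = paired
    return groups[0]
```

Pairing the largest remaining group with the smallest keeps the total data size of each pair close to the average, so the P2P communications within each butterfly stage finish at roughly the same time.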

The iterative pairing also requires the number of processes to be a power of
2. Given that the number of processes of the sender (or receiver) is

Maximum number of MPI messages, average number of MPI messages and
minimum MPI messages in P2P implementation and butterfly implementation
(

An example of process mappings, given that the sender has 5
processes (

Figure 11 shows an example of the process mapping, where the sender has 5
processes (

Now, we have two kinds of implementations (the P2P implementation and the
butterfly implementation) for data transfer. Although the butterfly
implementation can effectively improve the performance of data transfer in
many cases (examples are given in Sect. 5), it has some drawbacks: (1) it
generally has a larger total amount of data transferred than the P2P
implementation; (2) its number of stages is

An example of the adaptive data transfer library with eight processes, where stage 2 of the butterfly implementation is skipped and replaced by P2P communication of three MPI messages per process.

As introduced in Sect. 3.1, the butterfly implementation is divided into multiple stages. In fact, the data transfer in one stage can be viewed as a P2P implementation with only one MPI message per process. Inspired by this fact, we design an adaptive approach that combines the butterfly and P2P implementations, where some stages of the butterfly implementation are skipped and replaced by P2P communication with more MPI messages per process. When all stages of the butterfly implementation are skipped, the adaptive data transfer library completely switches to the original P2P implementation. That is to say, the adaptive data transfer library can adaptively choose the optimal implementation between the P2P implementation and the butterfly implementation. Figure 12 shows an example of the adaptive data transfer library with eight processes, where stage 2 of the butterfly implementation is skipped and replaced by P2P communication of three MPI messages per process.

The most significant challenge of such an adaptive approach is to determine which stage(s) of the butterfly implementation should be skipped. The first attempt was to design a cost model that can accurately predict the performance of data transfer in various implementations. We eventually gave up this approach as it was almost impossible to accurately predict the performance of the communications on a high-performance computer, especially when a lot of users share the computer to run various applications. Performance profiling, which means directly measuring the performance of data transfer, is more practical to determine an appropriate implementation, because the simulation of Earth system modelling always takes a long time to run. Figure 13 shows our flow chart of how the adaptive data transfer library determines an appropriate implementation. It consists of an initialization segment and a profiling segment. The initialization segment generates the process mappings and a candidate implementation that is a butterfly implementation with no skipped stages. The profiling segment iterates through each stage of the butterfly implementation to determine whether the current stage should be skipped or kept. In an iteration, the profiling segment first generates a temporary implementation based on the candidate implementation where the current stage is skipped, and then runs the temporary implementation to get the time the data transfer takes. When the temporary implementation is more efficient than the candidate implementation, the current stage is skipped and the temporary implementation replaces the candidate implementation. When the profiling segment finishes, the appropriate implementation is set to be the candidate implementation. To reduce the overhead introduced by the adaptive data transfer library, the profiling segment truly transfers the data for model coupling. 
In other words, even before the optimal implementation has been determined, the profiling segment already delivers the coupling data.
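The keep-or-skip loop of the profiling segment can be sketched as follows; `run_transfer` is a hypothetical callback that performs one real data transfer using the given set of kept stages and returns its measured execution time.

```python
def choose_implementation(stages, run_transfer):
    """Sketch of the profiling segment: try skipping each butterfly
    stage in turn and keep the skip only if the measured transfer time
    improves.  Skipping every stage degenerates to the plain P2P
    implementation."""
    kept = list(stages)
    best_time = run_transfer(kept)        # this call moves real coupling data
    for stage in stages:
        trial = [s for s in kept if s != stage]
        trial_time = run_transfer(trial)  # so does every trial
        if trial_time < best_time:
            kept, best_time = trial, trial_time
    return kept
```

Because every call to `run_transfer` delivers real coupling data, the profiling adds no wasted transfers; its only extra cost is running some transfers with a sub-optimal stage set.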

A flow chart for determining an appropriate implementation of the adaptive data transfer library.

In this section, we empirically evaluate the adaptive data transfer library by comparing it with the P2P implementation and the butterfly implementation. Both toy models and realistic models (GAMIL2–CLM3 and CESM, the Community Earth System Model) are used for the performance evaluation. GAMIL2–CLM3 was introduced in Sect. 2.2. CESM (Hurrell et al., 2013) is a state-of-the-art ESM developed by the National Center for Atmospheric Research (NCAR). All experiments are run on the high-performance computer Tansuo100.

Next, we will evaluate the overhead of initialization, the performance of transferring data fields between two toy models and between different realistic component models, and the performance of rearranging data fields within a component model for parallel interpolation.

We first evaluate the initialization overhead of the data transfer implementations. As shown in Fig. 14, the initialization overhead of each implementation increases with the number of processes. The initialization overhead of the butterfly implementation is a little higher than that of the P2P implementation, while the initialization overhead of the adaptive data transfer library is 2–3-fold higher than that of the P2P implementation, because the adaptive data transfer library spends extra time on performance profiling (see Sect. 4). Considering that a data transfer instance is initialized only once but executed many times in a coupled model, we conclude that the initialization overhead of the adaptive data transfer library is reasonable, especially when the simulation runs for a very long time.

Initialization time (

Average execution time (

The factors that can impact the performance of a data transfer implementation
generally include the number of MPI messages, the size of the data to be
transferred (also referred to as the number of fields in this evaluation) and
the number of processes used. In this subsection, we evaluate the impact of
each factor on the performance of data transfer for different
implementations. We first build two toy models that both use the same
logically rectangular grid of 192

In the first experiment, we fix the number of processes to be 1024 and the number of coupling fields to be 10, while varying the number of MPI messages in the P2P implementation. In each test, all processes of the sender have the same number of MPI messages. As the number of MPI messages is determined by the parallel decompositions of the sender and the receiver, we design an algorithm (Algorithm 1) that can generate the parallel decompositions of the two toy models according to the average number of MPI messages of the sender in the P2P implementation. Figure 15 shows the execution time of one data transfer with different implementations when increasing the number of MPI messages per sender process in the P2P implementation from 1 to 90. The P2P implementation can outperform the butterfly implementation when the number of MPI messages is small (e.g. smaller than 12 in Fig. 15), while the butterfly implementation can outperform the P2P implementation when the number of MPI messages is big (e.g. bigger than 12 in Fig. 15). The adaptive data transfer library can adaptively choose the optimal implementation from the P2P implementation and the butterfly implementation and, moreover, it improves the performance based on the butterfly implementation when the number of MPI messages is big, since some butterfly stages of the butterfly implementation are skipped. When the number of MPI messages is 90, the adaptive data transfer library can achieve a 19.2-fold performance speed-up compared to the P2P implementation.

Average execution time (

In the second experiment, we fix the number of processes and the number of MPI messages per sender process in the P2P implementation, and vary the number of coupling fields transferred. Figure 16 shows the execution time of one data transfer with different implementations in this experiment. The results show that the execution time of each implementation increases with data size. When the number of MPI messages per sender process in the P2P implementation is small (Fig. 16a, b), the performance of the butterfly implementation is poorer than that of the P2P implementation, especially when the number of 2-D coupling fields gets bigger. When the number of MPI messages per sender process in the P2P implementation is big (Fig. 16c, d), the butterfly implementation significantly outperforms the P2P implementation; however, the advantage of the butterfly implementation decreases when increasing the number of coupling fields. The results also demonstrate that the adaptive data transfer library can adaptively choose the optimal implementation from the P2P implementation and the butterfly implementation, and can further improve the performance based on the butterfly implementation.

In the third experiment, we fix the number of MPI messages per sender process in the P2P implementation to be 24 and the number of coupling fields transferred to be 10, and vary the number of processes used. Figure 17 shows the execution time of one data transfer with different implementations when varying the number of processes. The P2P implementation outperforms the butterfly implementation when a small number of processes are used (e.g. smaller than 256 in Fig. 17), while the butterfly implementation outperforms the P2P implementation when a large number of processes are used (e.g. larger than 256 in Fig. 17). Similar to the above two experiments, the adaptive data transfer library can adaptively choose the optimal implementation from the P2P implementation and the butterfly implementation.

Average execution time (

Model resolutions keep increasing, so how do the data transfer implementations perform at higher model resolutions? A higher model resolution means that a model will use more processes to accelerate a simulation, while the average number of grid points per process can remain constant. Considering that grid points are always balanced among the processes of a model, in this evaluation we assign each process (which runs on a unique processor core) of the toy models around 96 grid points, while allowing processes to have different numbers of MPI messages and different message sizes (the average number of MPI messages of the sender in the P2P implementation is 34). As shown in Fig. 18, although the execution times of all data transfer implementations increase with the number of processes (from 64 to 1024), the butterfly implementation significantly outperforms the P2P implementation. The adaptive data transfer library therefore adaptively chooses the butterfly implementation, and even slightly outperforms it when each model uses more than 512 processes, because some butterfly stages are skipped.

Average execution time (

In this subsection, we evaluate the performance using two realistic models:
GAMIL2–CLM3 (horizontal resolution of 2.8

For CESM, we use the data transfer between the coupler CPL7 (Craig et al.,
2012) and the land surface model CLM4 (Oleson et al., 2004), where 32 2-D
coupling fields on the CLM4 horizontal grid (the grid size is
144

Average execution time (

For GAMIL2–CLM3, we use the data transfer from CLM3 to GAMIL2 where 14 2-D
coupling fields on the GAMIL2 horizontal grid (whose grid size is
128

Average execution time (

Besides data transfer between different component models, there is another
kind of data transfer in model coupling that rearranges data inside a model
for parallel interpolation of fields between different grids. Here, we use
the data rearrangement for the parallel interpolation from the atmosphere
grid (whose grid size is 144

Average execution time (

With the performance improvement of data transfer, we expect that the
adaptive data transfer library will improve the performance of coupled
models. For this evaluation, we first imported the adaptive data transfer
library into C-Coupler1, used it in the coupled model GAMIL2–CLM3, and
measured performance results. As shown in Fig. 22, the adaptive data transfer
library achieves higher speed-up with respect to the whole model time (when
the P2P implementation is used as the baseline) for GAMIL2–CLM3 when using
more than 16 processes. When each component model uses 128 processes, the
butterfly implementation achieves

Performance improvement with respect to the whole model time for the coupled model GAMIL2–CLM3 achieved by the butterfly implementation and the adaptive data transfer library, using the P2P implementation as the baseline.

Data transfer is a fundamental and frequently used operation in a coupler. This paper showed that the P2P implementation currently used in most state-of-the-art couplers for data transfer is inefficient when the parallel decompositions of the sender and the receiver are different, and further revealed the corresponding performance bottlenecks. We showed that the butterfly implementation can outperform the P2P implementation in many cases but degrades the performance in some cases, for example when a small number of processes are used to run models or when the parallel decompositions of the sender and receiver are similar. We therefore designed and implemented an adaptive data transfer library that automatically chooses an optimal implementation between the P2P implementation and the butterfly implementation and also further improves the performance based on the butterfly implementation through skipping some butterfly stages. Compared to the P2P implementation, the adaptive data transfer library can improve the performance of data transfer when the parallel decompositions of the sender and the receiver are different.

The initialization overhead of the adaptive data transfer library could become expensive when using a large number of processes. A future version of the library will allow users to record the results of performance profiling offline, saving the profiling time in subsequent runs of the same coupled model.

The source code of the adaptive data transfer library version 1.0 is
available at

This work is supported in part by the Natural Science Foundation of China (no. 41275098), the National Grand Fundamental Research 973 Program of China (no. 2013CB956603), and the Tsinghua University Initiative Scientific Research Program (no. 20131089356). Edited by: S. Valcke