Parallelizing a serial code: open-source module, EZ Parallel 1.0, and geophysics examples

Parallel computing can offer substantial speedup of numerical simulations in comparison to serial computing, since parallel computing uses many processors simultaneously rather than a single processor. However, it typically also requires substantial time and effort to convert a serial code into a parallel code. Here, a new module is developed to reduce the time and effort required to parallelize a serial code. The tested version of the module is written in the Fortran programming language, although the framework could also be extended to other languages (C++, Python, Julia, etc.). The Message Passing Interface is used to allow for either shared-memory or distributed-memory computer architectures. The software is designed for solving partial differential equations on a rectangular two-dimensional or three-dimensional domain, using finite difference, finite volume, pseudo-spectral, or other similar numerical methods. Examples are provided for two idealized models of atmospheric and oceanic fluid dynamics: the two-level quasi-geostrophic equations, and the stochastic heat equation as a model for turbulent advection-diffusion of either water vapor and clouds or sea surface height variability. In tests of the parallelized code, the strong scaling efficiency of the finite difference code is roughly 80% to 90%, achieved by adding only roughly 10 new lines to the serial code. Therefore, EZ Parallel provides great benefits with minimal additional effort.


Introduction
Numerical simulations of climate and other geophysical systems can be very computationally expensive in both memory and time, especially when the domain consists of a large physical region. For many years, a solution to these problems has been parallel computing (e.g., Drake and Foster, 1995), in which several processors are utilized simultaneously to divide the computational load.
However, the process of parallelizing a serial code may require a significant amount of time and effort, especially if one is not already familiar with parallel computing algorithms and platforms, such as CUDA, OpenMP, and the Message Passing Interface (MPI). Furthermore, in some cases a complete overhaul of the original software is necessary to take advantage of multi-processor systems, for instance if one desires a parallel code that is optimized to the greatest extent possible. Even though this is not always the case, it contributes to the general impression that parallel codes are complicated. As a result, parallelizing a serial code is typically viewed as a major undertaking.
The goal of the new module, EZ Parallel (EZP), is to allow the user to make relatively small changes to their original code to create a new code that can leverage the benefits of parallelism.
The EZP software should be useful in several different scenarios. For example, simulations of idealized models are commonly used, such as quasi-geostrophic (QG) models for the atmosphere or ocean (e.g., Thompson and Young, 2007; Edwards et al., 2020a, b), stochastic models of cloud patterns (e.g., Hottovy and Stechmann, 2015; Ahmed and Neelin, 2019; Khouider and Bihlo, 2019), tropical circulation models (e.g., Neelin and Zeng, 2000; Lin and Neelin, 2002; Khouider and Majda, 2005; Lin, 2009), and coupled ocean-atmosphere models for El Niño-Southern Oscillation (ENSO) (e.g., Zebiak and Cane, 1987; Xie and Jin, 2018; Thual et al., 2016, 2019), to name only a few examples. Such idealized models have typically been simulated using serial codes, in large part due to the large time and effort required to parallelize a serial code. The EZP software can help in these scenarios to bring the benefits of parallelism with a minimal amount of additional effort. These scenarios all share a common setup of a set of partial differential equations (PDEs) on a two-dimensional (2D) or three-dimensional (3D) domain, and the EZP module is designed for such a setup. An example of a setup that would not be immediately suitable for EZP is a discrete element model (DEM) for sea ice floes (e.g., Herman, 2016), since it is a model for individual particles (ice floes) rather than a PDE model; nevertheless, this scenario could possibly also be parallelized with some further modifications to the EZP module. Also, for some large modeling projects, a more detailed rewriting of a code may be needed beyond the minimalist approach described in the present paper, if fully optimized code is desired; nevertheless, even in this scenario, EZP could provide a starting point for parallelism which could be further refined and optimized. Along these lines, EZP is written using MPI, so it can be used for either shared-memory or distributed-memory computer architectures.
A related type of tool is automatic parallelization, which is intended to take a serial code and, without requiring the user to change the code at all, automatically implement parallelization in its execution (Griebl, 2004; Bondhugula et al., 2008; Benabderrahmane et al., 2010; Kraft et al., 2018). Such tools are typically designed to treat a variety of different codes with great generality. In their simplest forms, they detect "for loops" in the code and, if possible, break up the tasks using multithreading or parallelization. Some commercial tools that use these and other strategies include the Matlab Parallel Computing Toolbox™ and Matlab Parallel Server™ or Matlab Distributed Computing Server™. A disadvantage of Matlab is that it is often slower than other programming languages (Fortran, C++, etc.) for PDEs with time-stepping. For those other languages, the Intel® Fortran and C++ Compilers also have automatic parallelization capabilities, as does the freely available GNU Compiler Collection (GCC). The commercial tools are not freely available and therefore may not be an option for many users.
Automatic parallelization is typically limited to shared-memory computer architectures, which limits the amount of parallelization that can be achieved. Also, for some codes, automatic parallelization may not achieve parallelism or may not provide a great amount of speedup. Despite these limitations, it may be a good choice for some users. However, some users may also have concerns about using a "black box" to automatically parallelize a code, and may prefer to have some control over, and some knowledge of, the parallelization process. For such users, EZP can help with parallelization with a minimal amount of user effort, while also engaging the user in the parallelization process and allowing the user to further optimize the parallelization if desired.
In what follows, we describe how to parallelize a serial code using EZP in Section 2, and present the performance improvements obtained through parallelization with EZP in Section 3. The examples demonstrate the parallelization of two serial Fortran codes, including one finite difference code and one pseudo-spectral code, and both versions are available at the GitHub repository https://github.com/jasonlturner/EZ_PARALLEL_project. We hope that the comments in the programs, the documentation included in the repository, and our discussion here make the module accessible and easy to modify for more personalized use.

Parallelizing a serial code
In this section, we describe how to parallelize a serial code using EZP. As a conceptual framework for doing so, we assume that the code, whether serial or parallel, consists of two main stages:
1. Initialization (including reading input files and creating the domain grid), and
2. Time-stepping (including stepping the domain grid forward in time and writing output files).
Algorithm 1 is pseudocode of a serial simulation in this format. In creating the parallel code in Alg. 2, the first new addition is a line of code to call the CREATE_PARALLEL_SCHEME subroutine of the EZP module. This subroutine decomposes the domain grid; while this is done automatically for the user, we describe some details of the domain decomposition in what follows as background information.
In a parallel simulation, the domain grid is divided into several sub-grids, as illustrated in Fig. 1. Each sub-grid is handled by a single processor. A vertical slab decomposition is utilized by EZP. To elaborate on the definition of this grid decomposition, suppose the two-dimensional simulation domain is discretized by a grid with N_x points in the horizontal and N_y points in the vertical, and let n be the number of processes used to execute the simulation. We denote variables local to a process with a superscript (k), with k = 0, ..., n − 1.
The domain decomposition in EZP is defined by the creation of what we call a parallel scheme, executed by the CALL CREATE_PARALLEL_SCHEME command in the code. The parallel scheme contains all the information needed for the grid decomposition, as well as various parallel computing functions that we will discuss in the following subsections.
One task of the parallel scheme is to divide the N_x columns of the grid as evenly as possible among the n processors. For example, consider a grid with N_x = 32 columns to be divided among n = 3 processors. The value of N_x would be adjusted by the CREATE_PARALLEL_SCHEME subroutine to the total number of columns of the local sub-grid. In other words, N_x would be automatically changed to N_x^(k), the sum of the number of columns in the sub-grid interior and in the sub-grid boundary. In this example, sub-grids 0 and 1 would have 11 columns in their interiors, while sub-grid 2 would have 10 columns in its interior (see Fig. 1).
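Though EZP itself is written in Fortran, the column-splitting logic can be sketched in Python (the helper name here is hypothetical, not part of the EZP API):

```python
def split_columns(nx, n):
    """Divide nx grid columns as evenly as possible among n processors.

    The first (nx % n) sub-grids receive one extra column, mirroring the
    decomposition described in the text (hypothetical helper, not EZP).
    """
    base, extra = divmod(nx, n)
    return [base + 1 if k < extra else base for k in range(n)]

# The example from the text: 32 columns over 3 processors.
print(split_columns(32, 3))  # [11, 11, 10]
```
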
Each sub-grid also has sub-grid boundary points, which are also included in each value of N_x^(k). The sub-grid boundary points represent information that may be needed from neighboring sub-grids. For example, consider the following centered-difference discretization of the 2D Laplacian operator:

∇²u_(i,j) ≈ (u_(i+1,j) − 2u_(i,j) + u_(i−1,j))/Δx² + (u_(i,j+1) − 2u_(i,j) + u_(i,j−1))/Δy².   (1)

Figure 1. The decomposition of a grid (top) into vertical slabs (bottom). Note that we cannot divide the 32 columns of the grid evenly among three processors, so the grid is decomposed as nearly evenly as possible, with two sub-grids with 11 columns and one sub-grid with 10 columns.

As illustrated in Fig. 2, points on the right-most column of a sub-grid (aside from the right-most sub-grid) will require information from the neighboring sub-grid, and likewise for points on the left-most column of a sub-grid. We will elaborate on sub-grid boundary communication in Subsection 2.3; for now we discuss only the effect of sub-grid boundary points on the value of N_x^(k). As each sub-grid requires one column from each of its neighbors to execute the numerical scheme, sub-grids 0 and 2 would have one column in their boundaries, while sub-grid 1 would have two columns in its boundaries. Therefore, the sub-grids would have N_x^(0) = 12, N_x^(1) = 13, and N_x^(2) = 11 columns, respectively, and the domain decomposition has thus been determined.
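To make the bookkeeping concrete, here is a Python sketch (with a hypothetical helper name) of how interior and boundary columns combine into the local widths N_x^(k):

```python
def local_widths(nx, n):
    """Interior columns per sub-grid plus halo (boundary) columns.

    Interior sub-grids have neighbors on both sides (2 halo columns);
    the left-most and right-most sub-grids have only one neighbor.
    Hypothetical sketch of the count CREATE_PARALLEL_SCHEME produces.
    """
    base, extra = divmod(nx, n)
    interiors = [base + 1 if k < extra else base for k in range(n)]
    halos = [1 if k in (0, n - 1) else 2 for k in range(n)]
    if n == 1:
        halos = [0]  # a lone sub-grid has no neighbors
    return [i + h for i, h in zip(interiors, halos)]

# The example from the text: 32 columns over 3 processors.
print(local_widths(32, 3))  # [12, 13, 11]
```
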

In this subsection, we explain that a user needs to make no changes to either the INITIALIZE_GRID or TIMESTEP subroutines to turn the serial Alg. 1 into the parallelized Alg. 2.
After creating the parallel scheme, the next step in the code is to define the initial conditions. The INITIAL_CONDITION subroutine in the original serial code must be of the form described in Alg. 3 in order to be parallelized with EZP. Specifically, EZP relies on the use of an initial condition function f(x, y) with a reference point (x_ref, y_ref) located in the top-left of the grid. In this form, no changes are required to parallelize the serial code with EZP.

Figure 2. The template for the discretization of the Laplacian from Eq. (1) at the boundary of two sub-grids. Due to this discretization, the sub-grid boundary consists of one column from each of its neighbors. Continuing the example from Fig. 1, the sub-grids will then have, including boundary columns, a total of 12, 13, and 11 columns, respectively.
We will now explain how EZP parallelizes the INITIALIZE_GRID and TIMESTEP subroutines. Once the parallel scheme has been created, the INITIALIZE_GRID subroutine operates on the local sub-grid of the processor on which it is called, and the TIMESTEP subroutine works on the local level as well. In the original serial code (Alg. 1), the TIMESTEP subroutine updates the interior of the grid using values from its boundary during each time-step. Through the creation of a parallel scheme (see Subsection 2.2), the TIMESTEP subroutine is automatically changed to update the interior of a sub-grid using values from the sub-grid boundary. Since the boundary of a sub-grid in the parallel code corresponds to the interior of the grid, it is necessary to update the boundary of the sub-grids at each time-step as well. Since the boundary of one sub-grid corresponds to the interior of a neighboring sub-grid, we may update the sub-grid boundary through inter-processor communication. This inter-processor communication is handled by the SHARE_SUBGRID_BDRY subroutine within the EZP module, and it is handled automatically for the user after adding the one new line of code to call the subroutine.
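The effect of SHARE_SUBGRID_BDRY can be illustrated with a serial Python stand-in, where the MPI sends and receives are replaced by direct array copies (a sketch only; the name reuse and the one-halo-column-per-side layout are our assumptions):

```python
import numpy as np

def share_subgrid_bdry(subgrids):
    """Serial stand-in for SHARE_SUBGRID_BDRY: copy each sub-grid's edge
    interior column into its neighbor's boundary (halo) column.

    Each sub-grid here stores one halo column on each side for simplicity;
    EZP performs the same exchange with MPI communication.
    """
    for k in range(len(subgrids) - 1):
        left, right = subgrids[k], subgrids[k + 1]
        left[:, -1] = right[:, 1]   # right halo of k <- first interior column of k+1
        right[:, 0] = left[:, -2]   # left halo of k+1 <- last interior column of k

# Three sub-grids whose values equal their owner's rank.
grids = [np.full((2, 4), float(k)) for k in range(3)]
share_subgrid_bdry(grids)
print(grids[1][0, 0], grids[1][0, -1])  # 0.0 2.0
```

After the exchange, sub-grid 1's halo columns hold its neighbors' interior values, so its TIMESTEP stencil can proceed without further communication.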

The final stage of the code is the WRITE_OUTPUT subroutine of Alg. 1. The user will need to make some minor modifications to their own WRITE_OUTPUT subroutine in order to create the parallel version in Alg. 2.
To see the changes that need to be made to the WRITE_OUTPUT subroutine, consider a serial code that outputs a single file named 'output.out'. Without changes, each processor in the parallelized code would attempt to write its individual output to 'output.out' simultaneously, leading to a variety of issues (e.g., since they all use the same file name, one processor may write an 'output.out' file that overwrites the 'output.out' file created by another processor). To amend this issue, the parallel scheme contains an identification number that is unique to each processor. This identification number, also called a processor rank (Message Passing Interface Forum, 2015), is stored in the procID member of the parallel scheme. The user will need to change the name of the output file to include the processor rank, and then each processor will write its output to a unique file.
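For instance, the renaming might look like the following Python sketch (the zero-padded suffix format is our choice for illustration, not mandated by EZP):

```python
def rank_filename(base, proc_id):
    """Make a serial output name unique per processor by appending the
    processor rank, e.g. 'output.out' -> 'output_002.out' for rank 2.
    Hypothetical helper illustrating the renaming described in the text."""
    stem, dot, ext = base.rpartition(".")
    return f"{stem}_{proc_id:03d}.{ext}" if dot else f"{base}_{proc_id:03d}"

print(rank_filename("output.out", 2))  # output_002.out
```
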

Additional changes for pseudo-spectral codes
If pseudo-spectral numerical methods are used, then some additional changes must be made beyond what was described above for finite difference methods, finite volume methods, etc. In particular, we discuss three tasks: parallel fast Fourier transforms (FFTs), spectral derivatives, and zero-padding. The EZP module provides subroutines for accomplishing these three tasks in parallel. To parallelize, a user needs to replace the calls to these three types of subroutines (FFT subroutines, spectral derivative subroutines, and zero-padding subroutines) with the parallel subroutines provided by the EZP module. EZP will then automatically handle the parallelization. Some additional details are as follows.
FFTs in many serial codes are handled by a single call to a subroutine which performs the intermediate steps. For example, FFTW has an FFTW_CREATE_PLAN subroutine to initialize FFTs and an FFTW_ONE subroutine to execute FFTs (Frigo and Johnson, 2018). To parallelize these, EZP includes CREATE_SCHEME_FFT for initializing FFTs, and EXECUTE_SCHEME_FFT and EXECUTE_SCHEME_IFFT for executing forward and inverse FFTs. A user may parallelize their FFTs by simply replacing their original FFT-related subroutine calls with these. The parallel FFTs of EZP utilize the one-dimensional FFTs of DFFTPACK (Swarztrauber, 1985; Pumphrey, 1985) as well as the global redistribution algorithm described in Dalcin et al. (2019).
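The structure of such a slab-decomposed 2D FFT (1D transforms along the locally held dimension, a global redistribution, then 1D transforms along the other dimension) can be sketched serially in Python, with a plain transpose standing in for the MPI redistribution:

```python
import numpy as np

def fft2_via_slabs(grid):
    """2D FFT as two rounds of 1D FFTs separated by a transpose, mirroring
    the slab-decomposed parallel algorithm. Serial sketch only; in EZP the
    transpose is a global redistribution across processors."""
    step1 = np.fft.fft(grid, axis=0)           # 1D FFTs along the locally held dimension
    redistributed = step1.T                    # stand-in for the global redistribution
    step2 = np.fft.fft(redistributed, axis=0)  # 1D FFTs along the newly local dimension
    return step2.T                             # undo the transpose

x = np.random.default_rng(0).standard_normal((8, 16))
print(np.allclose(fft2_via_slabs(x), np.fft.fft2(x)))  # True
```

In the parallel setting, each processor performs the first round of 1D FFTs on its own slab without communication; only the redistribution step requires all-to-all communication.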
Since the entire grid is contained in a single processor in a serial code, it is straightforward to take a spectral derivative. To parallelize a serial code, all spectral derivatives must be changed, as each sub-grid corresponds to a different range of wavenumbers. There are two ways to execute a spectral derivative in EZP: (i) EXECUTE_SCHEME_SPEC_DRV and EXECUTE_SCHEME_ISPEC_DRV, and (ii) CREATE_SPEC_DRV. The former pair must be initialized with the CREATE_SCHEME_SPEC_DRV subroutine, and takes the desired forward or inverse spectral derivative of the local sub-grid.
The latter returns an array containing the coefficients of the desired spectral derivative of the local sub-grid, providing greater flexibility.
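The key point, that each sub-grid owns a different range of wavenumbers, can be illustrated with a serial Python sketch (helper names are hypothetical, not the EZP API):

```python
import numpy as np

def local_wavenumbers(nx, n, k):
    """Integer x-wavenumbers owned by sub-grid k when nx spectral columns
    are split as evenly as possible among n processors (hypothetical
    helper mirroring the slab decomposition described in the text)."""
    all_kx = np.fft.fftfreq(nx, d=1.0 / nx)  # 0, 1, ..., nx/2 - 1, -nx/2, ..., -1
    base, extra = divmod(nx, n)
    widths = [base + 1 if j < extra else base for j in range(n)]
    start = sum(widths[:k])
    return all_kx[start:start + widths[k]]

def spec_drv_local(fhat_local, kx_local):
    """Spectral x-derivative restricted to one sub-grid: multiply the
    local coefficients by i*kx for the locally owned wavenumbers."""
    return 1j * kx_local[np.newaxis, :] * fhat_local

# Check: reassembling the local derivatives of sin(x) yields cos(x).
nx, n = 16, 3
x = 2 * np.pi * np.arange(nx) / nx
fhat = np.fft.fft(np.tile(np.sin(x), (4, 1)), axis=1)
parts, col = [], 0
for k in range(n):
    kx = local_wavenumbers(nx, n, k)
    parts.append(spec_drv_local(fhat[:, col:col + len(kx)], kx))
    col += len(kx)
df = np.fft.ifft(np.concatenate(parts, axis=1), axis=1).real
print(np.allclose(df, np.cos(x)))  # True
```
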

In pseudo-spectral codes for simulating turbulence, zero-padding is often used to eliminate the aliasing error which occurs in non-linear functions (Orszag, 1971; Donnelly and Rust, 2005). For the same reason that all spectral derivatives must be changed to parallelize a serial code, zero-padding must also be handled differently. To accomplish this, EZP provides the CREATE_SCHEME_ZERO_PAD subroutine to create a parallel scheme for a zero-padded grid from a parallel scheme for the original grid, as well as the EXECUTE_SCHEME_ZERO_PAD and EXECUTE_SCHEME_IZERO_PAD subroutines to zero-pad and inverse zero-pad the grid.
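The zero-padding (3/2-rule) idea itself, independent of the parallel aspects, can be sketched in one dimension in Python (a minimal illustration, not the EZP implementation):

```python
import numpy as np

def zero_pad(fhat, m):
    """Insert zeros at high wavenumbers, extending an n-point spectrum to
    m points (m >= 3n/2 for quadratic nonlinearities), and rescale so the
    inverse FFT returns physical values on the finer grid."""
    n = len(fhat)
    half = n // 2
    out = np.zeros(m, dtype=complex)
    out[:half] = fhat[:half]            # non-negative wavenumbers
    out[m - (n - half):] = fhat[half:]  # negative wavenumbers
    return out * (m / n)

def dealiased_product(f, g):
    """Pointwise product of two periodic signals computed on a zero-padded
    grid, then truncated back: aliasing error of the product is removed."""
    n = len(f)
    m = 3 * n // 2
    fp = np.fft.ifft(zero_pad(np.fft.fft(f), m))
    gp = np.fft.ifft(zero_pad(np.fft.fft(g), m))
    prod_hat = np.fft.fft(fp * gp)
    half = n // 2
    trunc = np.concatenate([prod_hat[:half], prod_hat[m - (n - half):]]) * (n / m)
    return np.fft.ifft(trunc)

x = 2 * np.pi * np.arange(16) / 16
f, g = np.sin(x), np.cos(x)
print(np.allclose(dealiased_product(f, g).real, f * g))  # True
```
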
Geophysics examples

We have utilized EZP to parallelize two example codes to demonstrate its abilities and efficacy: a two-level quasi-geostrophic (QG) equations solver (Subsection 3.1) and a stochastic heat equation solver (Subsection 3.2). The former utilizes a pseudo-spectral method for computing spatial derivatives and a six-stage adaptive additive Runge-Kutta method for time-stepping, which demonstrates the performance gains of using parallel two-dimensional FFTs over serial ones. The latter utilizes a finite difference method for spatial derivatives and a forward-Euler time-stepping scheme, which demonstrates the performance gains of decomposing the grid into sub-grids, which requires sub-grid boundary communication.

Two-level quasi-geostrophic equations
The two-level QG equations serve as a simple but fully non-linear fluid model featuring baroclinic instability in a 2D periodic domain:

∂q_ψ/∂t + J(ψ, q_ψ) + J(τ, q_τ) + β ∂ψ/∂x + U ∂q_τ/∂x = −ν∇^(2s) q_ψ − κ∇²(ψ − τ),   (2)

∂q_τ/∂t + J(ψ, q_τ) + J(τ, q_ψ) + β ∂τ/∂x + U ∂q_ψ/∂x + U k_d² ∂ψ/∂x = −ν∇^(2s) q_τ + κ∇²(ψ − τ).   (3)

These equations describe the evolution of the barotropic (q_ψ) and baroclinic (q_τ) modes of the potential vorticity (e.g., Qi and Majda, 2016). These quantities are also determined by the corresponding barotropic and baroclinic streamfunctions ψ, τ, via q_ψ = ∇²ψ and q_τ = ∇²τ − k_d² τ. Further, k_d is the baroclinic deformation wavenumber, β is the Rossby parameter, J(A, B) = (∂A/∂x)(∂B/∂y) − (∂A/∂y)(∂B/∂x) is the Jacobian operator, U represents a large-scale vertical shear (equal in strength with opposite directions for each mode), ν∇^(2s) q_i represents the hyperviscosity, and κ∇²(ψ − τ) represents Ekman friction on the lower level of the flow. For more details on the two-level QG equations, see Vallis (2006); Qi and Majda (2016).
The implementation of the numerical method is as follows. The pseudo-spectral method uses Eqs. (2)-(3) transformed into spectral space, which changes all spatial derivatives into scalar multiplications. The hyperviscosity, Ekman friction, and mean shear terms are handled in spectral space, and the non-linear multiplications in the Jacobian terms are handled in physical space (utilizing zero-padding for de-aliasing). We utilize these evaluations in each stage of an adaptive additive Runge-Kutta (AARK) method to progress forward in time. The AARK method is based on the ARK4(3) scheme of Kennedy and Carpenter (2003). Excluding USE statements and MPI initialization and finalization, approximately 85 lines of code were changed between the serial solver and the parallel solver. Most of these changes involve simple replacements, as described in Subsection 2.5; for instance, every call to a serial FFT subroutine is replaced by a call to the parallel FFT subroutine provided in the EZP module.
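As an illustration of the pseudo-spectral treatment of the Jacobian terms (derivatives as scalar multiplications in spectral space, products formed pointwise in physical space), here is a minimal serial Python sketch, omitting the zero-padding de-aliasing step for brevity:

```python
import numpy as np

def jacobian(A, B, L=2 * np.pi):
    """Pseudo-spectral Jacobian J(A, B) = A_x B_y - A_y B_x on a periodic
    square of side L: derivatives via multiplication by i*k in spectral
    space, products formed pointwise in physical space. Minimal sketch
    without the de-aliasing used in the actual solver."""
    n = A.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
    kx = k[np.newaxis, :]  # wavenumbers along axis 1 (x-direction)
    ky = k[:, np.newaxis]  # wavenumbers along axis 0 (y-direction)
    Ah, Bh = np.fft.fft2(A), np.fft.fft2(B)
    Ax = np.fft.ifft2(1j * kx * Ah).real
    Ay = np.fft.ifft2(1j * ky * Ah).real
    Bx = np.fft.ifft2(1j * kx * Bh).real
    By = np.fft.ifft2(1j * ky * Bh).real
    return Ax * By - Ay * Bx

x = 2 * np.pi * np.arange(16) / 16
X, Y = np.meshgrid(x, x)
print(np.allclose(jacobian(np.sin(X), np.sin(Y)), np.cos(X) * np.cos(Y)))  # True
```
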
For our own tests, we utilized the high performance computing (HPC) cluster serviced by the University of Wisconsin-Madison's Center for High Throughput Computing to investigate the scaling performance of the implementation of the EZP library to parallelize a serial two-level QG equation code. A dedicated partition was used, whose 35 allocatable nodes each have a single 10-core (20-thread) Intel Xeon E5-2670 v2 CPU at 2.50 GHz, allowing for a total of 700 processes to be run simultaneously.
Simulations were run with a random initial condition on a grid with dimensions N_x = N_y = 2048 and 10^3 time-steps, using a range of two to 128 processors, utilizing eight processors per node for runs of eight or more processors. We chose not to utilize all 20 available processors per node due to the known lack of optimization of MPI_ALLTOALL on shared-memory systems (Kumar et al., 2008). Due to the memory-intensive nature of the six-stage AARK method used, there was not enough memory available for a single-processor run with N_x = N_y = 2048. The physical parameters in the simulation correspond to the atmosphere in mid-latitudes, specifically: β = 2.5, k_d = 4, U = 0.2, κ = 0.05, and ν = 2.98023 × 10^−22 (Qi and Majda, 2016). File output was disabled for these runs, and we analyzed the average execution time per time-step for consistency between runs. The results of these runs are included in Fig. 4.
The observed strong scaling efficiency is 73.0% ± 7.8%, which implies that doubling the number of processors leads to an approximate 40% reduction in execution time. The difference between the observed scaling and the "ideal" scaling may be attributed to the non-parallelizable portions of the code (Amdahl, 1967), e.g., inter-processor communication and serial one-dimensional FFTs. Since "ideal" scaling would provide a 50% reduction in execution time (per each doubling of the number of processors), the EZP performance of a 40% reduction is somewhat close to ideal, especially considering that it is achieved without a major overhaul of the serial code structure.
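The arithmetic connecting efficiency to per-doubling time reduction can be sketched as follows (our reading of the convention used here, stated as an assumption: ideal scaling halves the time per doubling, so an efficiency E corresponds to roughly a 0.5·E fractional reduction):

```python
def strong_scaling_efficiency(t_ref, p_ref, t_p, p):
    """Strong scaling efficiency: achieved speedup from p_ref to p
    processors, divided by the ideal speedup p / p_ref."""
    return (t_ref / t_p) / (p / p_ref)

def per_doubling_reduction(efficiency):
    """Approximate fractional reduction in execution time per doubling of
    processors, assuming ideal scaling gives a 50% reduction (an
    assumption about the convention used in the text)."""
    return 0.5 * efficiency

# Perfectly scaling timings: doubling processors halves the time.
print(strong_scaling_efficiency(100.0, 2, 50.0, 4))  # 1.0
print(per_doubling_reduction(0.730))  # 0.365
```
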

Stochastic heat equation
As an example of a different type of model, a tropical precipitation and water vapor model (Hottovy and Stechmann, 2015) is used, which takes the form of a modified stochastic heat equation:

∂q/∂t = b_0 ∇²q − (1/τ)(q − q*) + F + D_* Ẇ(t, x, y),   (4)

where q(t, x, y) is the integrated column water vapor at horizontal location (x, y), b_0 is a spatial interaction constant (h^−1), τ is the relaxation time (h), q* = 65 mm is the relaxation target, F is the external forcing (mm h^−1), D_*² is the stochastic forcing variance (mm² h^−1), and Ẇ(t, x, y) is spatiotemporal white noise. This linear equation is designed to capture idealizations of precipitation, evaporation, and turbulent advection-diffusion of water vapor. Although it is simplistic, the model displays a variety of characteristics that conform to a statistical physics perspective of tropical convection, such as a fractal, scale-free distribution of cloud cluster sizes (Hottovy and Stechmann, 2015; Stechmann and Hottovy, 2016). It may be shown that this equation exhibits infinite variance at the stationary state if Ẇ(t, x, y) is not regularized; however, the discretization of the domain provides this regularization. Such types of models are also used in oceanography for modeling sea surface height variability (e.g., Samelson et al., 2016).
To solve this model numerically, we utilized a centered-difference approximation of the spatial derivatives with a forward-Euler discrete temporal derivative to reformulate Eq. (4) as

q_(i,j)^(k+1) = q_(i,j)^k + Δt [ b_0 ((q_(i+1,j)^k − 2q_(i,j)^k + q_(i−1,j)^k)/Δx² + (q_(i,j+1)^k − 2q_(i,j)^k + q_(i,j−1)^k)/Δy²) − (q_(i,j)^k − q*)/τ + F ] + D_* √(Δt/(ΔxΔy)) W_(i,j)^k,   (5)

where the W_(i,j)^k are independent standard Gaussian random variables, for spatial grid points i, j = 1, ..., N and time-step k. The quantities with subscripts i, j and superscript k are the discrete versions of their continuous counterparts in Eq. (4) evaluated at the (i, j)-th column of the atmosphere and time-step k, while Δx, Δy, and Δt are the horizontal, vertical, and temporal grid spacings. Unlike Hottovy and Stechmann (2015), we utilize Dirichlet boundary conditions for simplicity. Excerpts of a generalized version of this code are included in Appendix A.
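The forward-Euler update with centered differences and Dirichlet boundaries can be sketched in Python (a stand-in for illustration, not the Fortran code of Appendix A; the √(Δt/(ΔxΔy)) noise scaling is the standard discretization of space-time white noise):

```python
import numpy as np

def step_stochastic_heat(q, dt, dx, dy, b0, tau, q_star, F, D_star, rng):
    """One forward-Euler step of the centered-difference discretization of
    the stochastic heat equation, with Dirichlet boundaries (edge values
    held fixed). Sketch only; parameter names mirror the text."""
    interior = q[1:-1, 1:-1]
    lap = ((q[1:-1, 2:] - 2 * interior + q[1:-1, :-2]) / dx**2
           + (q[2:, 1:-1] - 2 * interior + q[:-2, 1:-1]) / dy**2)
    noise = rng.standard_normal(interior.shape) * np.sqrt(dt / (dx * dy))
    q_new = q.copy()
    q_new[1:-1, 1:-1] = (interior
                         + dt * (b0 * lap - (interior - q_star) / tau + F)
                         + D_star * noise)
    return q_new

# With no forcing and no noise, the relaxation target is a fixed point.
rng = np.random.default_rng(0)
q = np.full((6, 6), 65.0)
q1 = step_stochastic_heat(q, 0.1, 5.0, 5.0, 0.2, 10.0, 65.0, 0.0, 0.0, rng)
print(np.allclose(q1, 65.0))  # True
```

In a parallel run, the only addition per step is the halo exchange described in Subsection 2.3 before the Laplacian stencil is applied.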
A dedicated partition of the HPC cluster at the University of Wisconsin-Madison's Center for High Throughput Computing was also used to investigate the scaling performance of the implementation of the EZP library to parallelize a serial stochastic heat equation code. Simulations were run with a grid size of N_x = N_y = 8192 and 10^3 time-steps, using a range of one to 128 processors, utilizing eight processors per node for runs of eight or more processors. File output was disabled for these runs.
The results of these runs are included in Figure 5.
The observed strong scaling efficiency was 86.9% ± 1.2%, which implies that doubling the number of processors leads to an approximate 45% reduction in execution time. The difference between the observed scaling and the "ideal" scaling may be attributed to the non-parallelizable portions of the code. With "ideal" scaling, one would obtain a 50% reduction in execution time. Overall, a parallel performance of a 45% reduction is nearly ideal, and it was achieved by changing only roughly 10 lines of code between the serial solver and the parallel solver.
Utilizing the parallelized version of the code, we were able to generate data for a very large system over a long period of time, specifically N_x = N_y = 8192 with uniform grid spacing Δx = Δy = 5 km and 2^24 uniform time-steps.

The latter required a change to approximately 10 lines of code, and exhibited a strong scaling efficiency of 86.9% ± 1.2%.
Both simulations were initialized at the statistical steady state, as described by Equations (15) and (16) of Hottovy and Stechmann (2015).

With the development of this module, we have provided some tools that will be of use for parallelizing serial codes. In the future, we hope to implement the following features to increase the performance and functionality of EZP: (i) different default domain decompositions, e.g., horizontal slabs, or rectangular chunks not containing an entire column or row of the grid, (ii) native compatibility with domains of different dimensionalities, e.g., three-dimensional grids and n-dimensional parallel FFTs, (iii) expanded FFT options, e.g., utilizing parallel one-dimensional FFTs for transforms along dimensions split between processors, and (iv) compatibility with the C and C++ languages. We have begun the development process for EZP in the C++ programming language, although it has not yet undergone rigorous testing. It would be interesting to also pursue similar software in other languages, such as Julia and Python. Further, the EZP module is not limited to 2D geophysical systems and could be used more broadly to expand the scope of various computational projects. Although it was presented here for 2D examples, it can be used for systems of various dimensionalities. For example, one could declare a one-dimensional grid as a 2D grid with a single column, or use the EZP subroutines on 2D slices of 3D grids.
Algorithm 6 includes code for the time-stepping subroutine of the solver.