OpenArray v1.0: a simple operator library for the decoupling of ocean modeling and parallel computing

Rapidly evolving computational techniques are creating a large gap between scientific aspiration and code implementation in climate modeling. In this work, we design a simple computing library to bridge the gap and decouple the work of ocean modeling from parallel computing. This library provides 12 basic operators that feature user-friendly interfaces, effective programming, and implicit parallelism. Several state-of-the-art computing techniques, including computation graphs and just-in-time compilation, are employed to parallelize the seemingly serial code and speed up the ocean models. The operator interfaces are designed in the native Fortran programming language to flatten the learning curve. We further implement a highly readable and efficient ocean model that contains only 1860 lines of code but achieves 91 % parallel efficiency in strong scaling and 99 % parallel efficiency in weak scaling with 4096 Intel CPU cores. This ocean model also exhibits excellent scalability on the heterogeneous Sunway TaihuLight supercomputer. This work presents a promising alternative tool for the development of ocean models.


Introduction
Many earth system models have been developed in the past several decades to improve the predictive understanding of the earth system (Bonan and Doney, 2018; Collins et al., 2018; Taylor et al., 2012). These models are becoming increasingly complicated, and the amount of code has expanded from a few thousand lines to tens of thousands or even millions of lines. In terms of software engineering, an increase in code causes the models to be more difficult to develop and maintain.

The complexity of these models mainly originates from three aspects. First, more model components and physical processes have been embedded into the earth system models, leading to a 10-fold increase in the amount of code (e.g., Alexander and Easterbrook, 2015). Second, some heterogeneous and advanced computing platforms (e.g., Lawrence et al., 2018) have been widely used by the climate modeling community, resulting in a 5-fold increase in the amount of code (e.g., Xu et al., 2015). Last, most of the model programs need to be rewritten due to the continual development of novel numerical methods and meshes. The adoption of novel numerical methods and technologies produced in the fields of computational mathematics and computer science has been limited in climate science because of the extremely heavy burden of program rewriting and migration. In addition, supercomputers are increasingly built on heterogeneous architectures which are equipped with two or more kinds of cores or processing elements (Shan, 2006), thus creating a much more sophisticated computing environment for climate modelers than ever before (Bretherton et al., 2012). Clearly, transitioning the current earth system models to the next generation of computing environments will be highly challenging and disruptive. Overall, complex codes in earth system models combined with rapidly evolving computational techniques create a very large gap between scientific aspiration and code implementation in the climate modeling community.
To reduce the complexity of earth system models and bridge this gap, a universal and productive computing library is a promising solution. By establishing an implicitly parallel and platform-independent computing library, complex models can be simplified and will no longer need explicit parallelization and porting, thus effectively decoupling the development of ocean models from complicated parallel computing techniques and diverse heterogeneous computing platforms.
Many efforts have been made to address the complexity of parallel programming for numerical simulations, such as operator overloading, source-to-source translators, and domain-specific languages (DSLs). Operator overloading supports customized data types and provides simple operator and function interfaces to implement the model algorithm. This technique is widely used because the implementation is straightforward and easy to understand (Corliss and Griewank, 1994; Walther et al., 2003). However, it tends to be inefficient because overloaded execution induces numerous unnecessary intermediate variables, consuming valuable memory bandwidth. Using a source-to-source translator offers another solution. As indicated by the name, this method converts one language, usually strictly constrained by self-defined rules, to another (Bae et al., 2013; Lidman et al., 2012). However, it requires tremendous work to develop and maintain a robust source-to-source compiler. Furthermore, DSLs can provide high-level abstract interfaces that use mathematical notations similar to those used by domain scientists, so that scientists can write much more concise and straightforward code. Some outstanding DSLs, such as ATMOL (van Engelen, 2001), ICON DSL (Torres et al., 2013), STELLA (Gysi et al., 2015), and ATLAS (Deconinck et al., 2017), are used by the numerical modeling community.
Although they resemble source-to-source techniques, DSLs are newly defined languages and produce executable programs instead of target languages. The new syntax therefore makes it difficult for modelers to master DSLs. In addition, most DSLs are not yet supported by robust compilers due to their relatively short history. Most source-to-source translators and DSLs still do not support the rapidly evolving heterogeneous computing platforms, such as the Chinese Sunway TaihuLight supercomputer, which is based on the homegrown Sunway heterogeneous many-core processors and located at the National Supercomputing Center in Wuxi.

Other methods, such as Coarray Fortran and C++ templates, provide alternative approaches. Using Coarray Fortran, a modeler has to control the reading and writing operations of each image (Mellor-Crummey et al., 2009); in a sense, one has to manipulate the images in parallel instead of writing serial code. C++ templates, in turn, are usually suitable only for small amounts of code and are difficult to debug (Porkoláb et al., 2007).
Inspired by the philosophy of operator overloading, source-to-source translation, and DSLs, we integrated the advantages of these three methods into a simple computing library called OpenArray. The main contributions of OpenArray are as follows:

- Ease of use. Modelers can write simple operator expressions in Fortran to solve partial differential equations (PDEs). The entire program appears to be serial, and the modelers do not need to know any parallel computing techniques. We summarized 12 basic generalized operators to support the whole computation of a particular class of ocean models that use the finite difference method and a staggered grid.
- High efficiency. We adopt several advanced techniques, including intermediate computation graphs, asynchronous communication, kernel fusion, loop optimization, and vectorization, to reduce memory bandwidth consumption and improve efficiency. The performance of programs implemented with OpenArray is comparable to that of the original, manually optimized parallel programs.
- Portability. OpenArray currently supports both CPU and Sunway platforms; more platforms, including GPUs, will be supported in the future. The complexity of cross-platform migration is moved from the models into OpenArray, so applications based on OpenArray can be migrated seamlessly to any supported platform.

Furthermore, we developed a numerical ocean model based on the Princeton Ocean Model (POM; Blumberg and Mellor, 1987) to test the capability and efficiency of OpenArray.
The new model is called the Generalized Operator Model of the Ocean (GOMO). Because the parallel computing details are completely hidden, GOMO consists of only 1860 lines of Fortran code and is more easily understood and maintained than the original POM. Moreover, GOMO exhibits excellent scalability and portability on both central processing unit (CPU) and Sunway platforms.

The remainder of this paper is organized as follows. Section 2 introduces some concepts and presents the detailed mathematical descriptions for formulating PDEs as operator expressions. Section 3 describes the detailed design and optimization techniques of OpenArray. The implementation of GOMO is described in Sect. 4. Section 5 evaluates the performance of OpenArray and GOMO. Finally, the discussion and conclusion are given in Sects. 6 and 7, respectively.

Array
To achieve this simplicity, we designed a derived data type, Array, which inspired our project name, OpenArray. The new Array data type comprises several pieces of information, including a 3-dimensional (3-D) array to store the data, a pointer to the computational grid, a Message Passing Interface (MPI) communicator, the size of the halo region, and other information about the data distribution. All this information is used to manipulate the Array as an object and thereby simplify parallel computing. In traditional ocean models, calculations for each grid point, and thus the i, j, and k loops in the horizontal and vertical directions, are unavoidable. The advantage of treating an Array as an object is a significant reduction in the number of loop operations in the models, making the code more intuitive and readable. When using the OpenArray library in a program, one can use type(Array) to declare new variables.
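To make the bundled information concrete, here is a hypothetical Python analogue of the Fortran derived type `type(Array)`; the field names and defaults below are our own illustrative assumptions, not the actual OpenArray definition.

```python
from dataclasses import dataclass

# A sketch of what the Array object bundles together (assumed fields).
@dataclass
class Array:
    data: list            # the 3-D payload (nested lists in this sketch)
    grid_pos: int         # position on the abstract staggered grid
    halo: int = 1         # width of the halo region
    comm: object = None   # placeholder for the MPI communicator

# declare a 2x2x1 field located at abstract grid point 3
eta = Array(data=[[[0.0]] * 2 for _ in range(2)], grid_pos=3)
assert eta.halo == 1 and eta.grid_pos == 3
```

Because all operators receive the whole object, they can consult the halo width and data distribution without the modeler writing explicit loops.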

Operator
To illustrate the concept of an operator, we first take the 2-dimensional (2-D) continuity equation for the sea surface elevation as an example:

∂η/∂t + ∂(DU)/∂x + ∂(DV)/∂y = 0,  (1)

where η is the surface elevation, U and V are the zonal and meridional velocities, and D is the depth of the fluid column. We choose the finite difference method and the staggered Arakawa C grid scheme, which are adopted by most regional ocean models. In the Arakawa C grid, D is calculated at the cell centers, the U component is calculated at the left and right sides of the variable D, and the V component is calculated at the lower and upper sides of the variable D (Fig. 1). The variables (D, U, V) located at different positions own different sets of grid increments. Taking the term ∂(DU)/∂x as an example, we first apply linear interpolation to obtain the value of D at the U point, represented by tmpD. Through a backward difference of the product of tmpD and U, the discrete expression of ∂(DU)/∂x is obtained.
Discretized on the C grid with a leapfrog time step, Eq. (1) can be written as

η^(t+1)(i, j) = η^(t−1)(i, j) − 2Δt · { [tmpD(i+1, j)·U(i+1, j) − tmpD(i, j)·U(i, j)] / dx(i, j) + [tmpDy(i, j+1)·V(i, j+1) − tmpDy(i, j)·V(i, j)] / dy(i, j) },  (4)

where tmpD(i, j) = 0.5 · (D(i, j) + D(i−1, j)) and tmpDy(i, j) = 0.5 · (D(i, j) + D(i, j−1)) are the depths interpolated to the U and V points, dx(i, j)* = 0.5 · (dx(i, j) + dx(i−1, j)) and dy(i, j)* = 0.5 · (dy(i, j) + dy(i, j−1)) are the grid increments at those points, and the superscripts (t+1) and (t−1) denote the surface elevations at the (t+1) and (t−1) time steps. To simplify the discrete form, we introduce shorthand notation for differentiation (δ_x^f, δ_y^b) and interpolation ((‾)_x^f, (‾)_y^b). The δ and overbar symbols define the differential and average operators, respectively; the subscript x or y denotes that the operation acts in the x or y direction, and the superscript f or b denotes that the approximation is forward or backward. Table 1 lists the detailed definitions of the 12 basic operators. The term var denotes a 3-D model variable. All 12 operators for the finite difference calculations are named using three letters in the form [A|D][X|Y|Z][F|B]. The first letter is A or D, indicating an average or a differential operator; the second letter is X, Y, or Z, giving the direction of the operation; and the last letter is F or B, for a forward or backward operation. The dx, dy, and dz are the distances between two adjacent grid points along the x, y, and z directions.
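The [A|D][X|Y|Z][F|B] naming scheme can be illustrated with a minimal sketch of four of the twelve operators in plain Python (1-D, uniform grid). OpenArray's real operators act on distributed 3-D Arrays with per-variable grid increments; this only shows the underlying stencils.

```python
# Minimal 1-D stencil sketches of AXB, AXF, DXB, DXF (assumed boundary
# handling: the undefined end point is simply set to zero).
def AXB(var):
    """Backward average in x: 0.5*(var(i) + var(i-1))."""
    return [0.0] + [0.5 * (var[i] + var[i - 1]) for i in range(1, len(var))]

def AXF(var):
    """Forward average in x: 0.5*(var(i+1) + var(i))."""
    return [0.5 * (var[i + 1] + var[i]) for i in range(len(var) - 1)] + [0.0]

def DXB(var, dx):
    """Backward difference in x: (var(i) - var(i-1)) / dx."""
    return [0.0] + [(var[i] - var[i - 1]) / dx for i in range(1, len(var))]

def DXF(var, dx):
    """Forward difference in x: (var(i+1) - var(i)) / dx."""
    return [(var[i + 1] - var[i]) / dx for i in range(len(var) - 1)] + [0.0]

D = [2.0, 4.0, 6.0, 8.0]
assert AXB(D)[1:] == [3.0, 5.0, 7.0]       # averages at the staggered points
assert DXF(D, 1.0)[:-1] == [2.0, 2.0, 2.0]  # constant slope recovered
```

The Y- and Z-direction operators follow the same pattern along the other two array axes.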
Using the basic operators, Eq. (4) is expressed as

η^(t+1) = η^(t−1) − 2Δt · [ δ_x^f((D‾)_x^b · U) + δ_y^f((D‾)_y^b · V) ].  (5)

Thus, in named-operator form,

elf = elb − 2 · dt · (DXF(AXB(D) · U) + DYF(AYB(D) · V)).  (6)

Then, Eq. (6) can be easily translated into a line of code using operators (the bottom left part of Fig. 2). Compared with the pseudo-code (the right part), the corresponding implementation using operators is more straightforward and more consistent with the equations.

Next, we use the operators in the shallow water equations, which are more complicated than those in the previous case. Assuming that the flow is in hydrostatic balance, that the density and viscosity coefficients are constant, and neglecting molecular friction, the shallow water equations are

∂η/∂t + ∂(DU)/∂x + ∂(DV)/∂y = 0,  (7)

∂(DU)/∂t + ∂(DU²)/∂x + ∂(DUV)/∂y − fDV = −gD ∂η/∂x + μ [∂²(DU)/∂x² + ∂²(DU)/∂y²],  (8)

∂(DV)/∂t + ∂(DUV)/∂x + ∂(DV²)/∂y + fDU = −gD ∂η/∂y + μ [∂²(DV)/∂x² + ∂²(DV)/∂y²],  (9)

where f is the Coriolis parameter, g is the gravitational acceleration, and μ is the coefficient of kinematic viscosity. Using the Arakawa C grid and a leapfrog time difference scheme, the discrete forms represented by operators are shown in Eqs. (10)-(12).
As the shallow water equations are solved, spatial average and differential operations are called repeatedly. Implementing these operations by hand is troublesome, so it is favorable to abstract these common operations from the PDEs and encapsulate them into user-friendly, platform-independent, and implicitly parallel operators. As shown in Fig. 3, we require only three lines of code to solve the shallow water equations. This more realistic case suggests that even more complex PDEs can be constructed and solved by following this elegant approach.

Abstract staggered grid
Most ocean models are implemented on the staggered Arakawa grids (Arakawa and Lamb, 1981; Griffies et al., 2000). The variables in ocean models are allocated at different grid points, and the calculations that use these variables are performed after several reasonable interpolations or differences. When differential operations are called on a staggered grid, the difference between adjacent points must be divided by the grid increment to obtain the final result. Setting the correct grid increment is troublesome work for modelers and extremely error-prone, especially when the grid is nonuniform. Therefore, we propose an abstract staggered grid to support flexible switching of operator calculations among different staggered grids. When the grid information is provided in the initialization phase of OpenArray, a set of grid increments, including the horizontal increments (dx(i, j), dy(i, j)) and the vertical increment (dz(k)), is combined with each corresponding physical variable through grid binding. Thus, the operators can implicitly apply the correct grid increments to different Array variables, even if the grid is nonuniform.

As shown in Fig. 4, the cubes in panels (a), (b), (c), and (d) are the minimum abstract grids, each accounting for 1/8 of the volume of the cube in panel (e). The eight points of each cube are numbered sequentially from 0 to 7, and each point has its own set of grid increments, i.e., dx, dy, and dz. For example, all the variables of an abstract Arakawa A grid are located at point 3. For the Arakawa B grid, the horizontal velocity Arrays (U, V) are located at point 0; the temperature (T), the salinity (S), and the depth (D) are located at point 3; and the vertical velocity Array (W) is located at point 7. For the Arakawa C grid, Array U is located at point 2 and Array V is located at point 1. In contrast, for the Arakawa D grid, Array U is located at point 1 and Array V is located at point 2.
When we call the average and differential operators of Table 1, e.g., on the abstract Arakawa C grid, the position of Array D is point 3, and the average operator AXB acting on Array D changes the position from point 3 to point 1. Since Array U is also allocated at point 1, the operation AXB(D) · U is allowed. The subsequent differential operator then changes the position of Array DXF(AXB(D) · U) from point 1 back to point 3.
The jumping rules of the different operators are given in Table 2. Due to the design of the abstract staggered grids, the jumping rules for the Arakawa A, B, C, and D grids are fixed: a change in the position of an array is determined only by the direction of the operator acting on that array.
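The jumping rules can be sketched in a few lines of Python. Here we assume that each of the eight cube points is encoded with one bit per direction, so an operator acting in a given direction toggles the corresponding bit; the concrete bit-to-direction mapping below is our own illustrative assumption, and the authoritative rules are those of Table 2.

```python
# Assumed bit encoding of the eight abstract grid points (illustrative).
DIR_BIT = {'x': 1, 'y': 2, 'z': 4}

def jump(pos, direction):
    """New abstract-grid position after an operator acts in `direction`."""
    return pos ^ DIR_BIT[direction]

def require_same_position(pos_a, pos_b):
    """Arithmetic between two Arrays is only legal at the same point."""
    if pos_a != pos_b:
        raise ValueError(f"misplaced operands: point {pos_a} vs point {pos_b}")

# Two successive x-direction operators return to the starting point,
# mirroring how DXF(AXB(D)) lands back where D started.
p = jump(jump(3, 'x'), 'x')
assert p == 3
```

This is also how the misplaced-operand check can work in practice: every operator updates the position attribute of its result, and every arithmetic function verifies that its operands share a position before proceeding.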
If users change the Arakawa grid type, the position information of each physical variable first needs to be reset (as shown in Fig. 4), and then the discrete form of each equation needs to be redesigned. We take Eq. (1) switching from the Arakawa C grid to the Arakawa B grid as an example. The positions of the horizontal velocity Arrays U and V are changed to point 0, while Array η and Array D stay the same. The discrete form changes from Eq. (4) to Eq. (13), and the corresponding operator implementation changes from Eq. (6) to Eq. (14).
The position information and jumping rules are used to implicitly check whether the discrete form of an equation is correct. The grid increments are hidden inside the differential operators, which keeps the code simple and clean. In addition, since the rules apply to multiple staggered Arakawa grids, modelers can flexibly switch an ocean model between different Arakawa grids. Notably, users of OpenArray must input the correct position of each array in the initialization phase; the position is an input parameter when declaring an Array. An error will be reported if an operation is performed between misplaced points.

Figure 3. Implementation of the shallow water equations by basic operators. elf, el, and elb denote the sea surface elevation at times (t + 1), t, and (t − 1), respectively. Uf, U, and Ub denote the zonal velocity at times (t + 1), t, and (t − 1), respectively. Vf, V, and Vb denote the meridional velocity at times (t + 1), t, and (t − 1), respectively. Variable aam denotes the viscosity coefficient.

Although most of the existing ocean models use finite difference or finite volume methods on structured or semi-structured meshes (e.g., Blumberg and Mellor, 1987; Shchepetkin and McWilliams, 2005), some ocean models use unstructured meshes (e.g., Chen et al., 2003; Korn, 2017) or even the spectral element method (e.g., Levin et al., 2000). In our current work, we design the basic operators only for finite difference and finite volume methods on structured grids. More customized operators for other numerical methods and meshes will be implemented in future work.

Design of OpenArray
Through the operator notation in Table 1, ocean modelers can quickly convert discrete PDEs into the corresponding operator expression forms. The main purpose of OpenArray is to make complex parallel programming transparent to the modelers. As illustrated in Fig. 5, we use a computation graph as an intermediate representation, meaning that the operator expressions written in Fortran are translated into a computation graph with a particular data structure. OpenArray then uses the intermediate computation graph to analyze the dependencies of the distributed data and produce the underlying parallel code. Finally, we use stable and mature compilers, such as the GNU Compiler Collection (GCC), the Intel compiler (ICC), and the Sunway compiler (SWACC), to generate executable programs for the different back-end platforms. These four steps and some related techniques are described in detail in this section.

Operator expression
Although the basic generalized operators listed in Table 1 are only suitable for first-order differences, higher-order differences and even more complicated operations can be composed from these basic operators. For example, a second-order difference operation can be expressed as δ_x^f(δ_x^b(var)). Supposing the grid distance is uniform, the corresponding discrete form is [var(i+1, j, k) + var(i−1, j, k) − 2 · var(i, j, k)]/dx².
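The composition can be verified numerically with a small sketch (plain Python, uniform 1-D grid, simplified boundary handling): applying the forward difference to the backward difference reproduces the second-order stencil at interior points.

```python
# 1-D sketches of the backward and forward x-difference operators.
def dxb(v, dx):
    return [0.0] + [(v[i] - v[i - 1]) / dx for i in range(1, len(v))]

def dxf(v, dx):
    return [(v[i + 1] - v[i]) / dx for i in range(len(v) - 1)] + [0.0]

dx = 0.5
var = [x * x for x in (0.0, 0.5, 1.0, 1.5, 2.0)]   # f(x) = x^2
second = dxf(dxb(var, dx), dx)
for i in range(1, len(var) - 1):
    stencil = (var[i + 1] + var[i - 1] - 2.0 * var[i]) / dx**2
    assert abs(second[i] - stencil) < 1e-12
    assert abs(stencil - 2.0) < 1e-12   # exact second derivative of x^2
```

For f(x) = x², the composed operator recovers the exact second derivative (2.0) at every interior point, because the second-order stencil is exact for quadratics.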
When these operators are used to express a discrete PDE, the code and the formula are very similar; we call this effect "the self-documenting code is the formula". Figure 6 shows the one-to-one correspondence between each item in the code and the items in the sea surface elevation equation. The code is very easy to write and understand. Clearly, the basic operators and the combined operators greatly simplify the development and maintenance of ocean models. The complicated parallelization and optimization techniques are hidden behind these operators. Modelers no longer need to care about such details, can escape from the "parallelism swamp", and can therefore concentrate on the scientific issues.

Intermediate computation graph
Considering the example in Fig. 6, to compute the term DXF(AXB(D) · u) with the traditional operator overloading method, one first computes AXB(D) and stores the result in a temporary array (named tmp1), then executes (tmp1 · u) and stores the result in a new array, tmp2. The last step is to compute DXF(tmp2) and store the result in a new array, tmp3. The numerous temporary arrays consume a considerable amount of memory, making the efficiency of operator overloading poor.

To solve this problem, we convert an operator expression into a directed acyclic graph, consisting of basic data and function nodes, to implement so-called lazy expression evaluation (Bloss et al., 1988; Reynolds, 1999). Unlike the traditional operator overloading method, we overload all arithmetic functions to generate an intermediate computation graph rather than to obtain the result of each function. This method is widely used in deep-learning frameworks, e.g., TensorFlow (Abadi et al., 2016) and Theano (Bastien et al., 2012), to improve computing efficiency. Figure 7 shows the procedure of parsing the operator expression of the sea surface elevation equation into a computation graph. The input variables in the square boxes include the sea surface elevation (elb), the zonal velocity (u), the meridional velocity (v), and the depth (D). Variable dt2 is a constant equal to 2 · dt. The final output is the sea surface elevation at the next time step (elf). The operators in the circles have been overloaded in OpenArray. In summary, all the operators provided by OpenArray are functions for Array calculations, in which the "=" notation is the assignment function, the "−" notation is the subtraction function, the "*" notation is the multiplication function, the "+" notation is the addition function, DXF and DYF are the differential functions, and AXF and AYF are the average functions.
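The lazy-evaluation trick can be sketched in a few lines of Python: the overloaded operators record graph nodes instead of computing, and the whole expression is evaluated in one pass only when requested. `Node`, `_wrap`, and `eval` are illustrative names for this sketch, not OpenArray's API.

```python
# A toy computation graph built by operator overloading.
class Node:
    def __init__(self, op, *children, value=None):
        self.op, self.children, self.value = op, children, value

    def __add__(self, other): return Node('+', self, _wrap(other))
    def __sub__(self, other): return Node('-', self, _wrap(other))
    def __mul__(self, other): return Node('*', self, _wrap(other))

    def eval(self):
        if self.op == 'leaf':
            return self.value
        a, b = (c.eval() for c in self.children)
        return {'+': a + b, '-': a - b, '*': a * b}[self.op]

def _wrap(x):
    return x if isinstance(x, Node) else Node('leaf', value=x)

# elb - dt2 * (u + v) becomes a graph; nothing is computed until eval()
elb, u, v = Node('leaf', value=10.0), Node('leaf', value=2.0), Node('leaf', value=3.0)
expr = elb - _wrap(0.5) * (u + v)
assert expr.op == '-'        # still an unevaluated graph, no temporaries
assert expr.eval() == 7.5
```

Because the intermediate results never materialize as named temporaries, the engine is free to fuse the whole tree into a single pass over the data, which is exactly what the next section exploits.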

Code generation
Given a computation graph, we designed a lightweight engine to generate the corresponding source code (Fig. 8). Each operator node in the computation graph is called a kernel, and the sequence of all kernels in a graph is usually fused into one large kernel function. The underlying engine therefore schedules and executes the fused kernel only once and obtains the final result directly, without any auxiliary or temporary variables. At the same time, the scheduling overhead of the computation graph and the startup overhead of the basic kernels are reduced.
Most scientific computing applications are limited by memory bandwidth and cannot fully exploit the computing power of a processor. Fortunately, kernel fusion is an effective optimization method to improve memory locality: when two kernels process the same data, fusing them keeps the shared data in memory. Prior to kernel fusion, the computation graph is analyzed to find the operator nodes that can be fused, and the analysis results are stored in several subgraphs. Users can access any individual subgraph by assigning it to an intermediate variable for diagnostic purposes. Given a series of subgraphs, the underlying engine dynamically generates the corresponding kernel function in C++ using just-in-time (JIT) compilation techniques (Suganuma and Yasue, 2005). The JIT compiler used in OpenArray can fuse a number of operators into one large compiled kernel. The benefit of fusing operators is to alleviate the memory bandwidth limitation and improve performance compared with executing the operators one by one. To generate a kernel function from a subgraph, we first add the function header and variable definitions according to the names and types in the Array structures, then add the loop headers based on the dimension information, and finally perform a depth-first walk on the expression tree to convert the data, operator, and assignment nodes into a complete C++ expression, including variable loads, arithmetic operations, and the assignment.
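The depth-first walk can be sketched as follows: an expression tree is flattened into a single fused C++-style statement, so one generated loop body replaces a chain of per-operator loops. The tuple-based tree and the emitted string format are illustrative assumptions, not OpenArray's actual intermediate representation.

```python
# Depth-first emission of a fused C-like loop from an expression tree.
def emit(node):
    """Depth-first walk producing one C-like expression string."""
    if isinstance(node, str):          # a leaf: a variable reference
        return node
    op, left, right = node             # an interior node: (op, lhs, rhs)
    return f"({emit(left)} {op} {emit(right)})"

def fuse(target, tree):
    body = f"{target}[i] = {emit(tree)};"
    return f"for (int i = 0; i < n; ++i) {{ {body} }}"

# elf = elb - dt2 * tmp, fused into a single generated loop
tree = ('-', 'elb[i]', ('*', 'dt2', 'tmp[i]'))
kernel = fuse('elf', tree)
assert kernel == "for (int i = 0; i < n; ++i) { elf[i] = (elb[i] - (dt2 * tmp[i])); }"
```

In the real engine the emitted C++ string would then be handed to the JIT compiler; the key point is that the whole subgraph becomes one loop body, touching each array element once.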
Notably, although the time to compile a single kernel function is short, practical applications usually run for thousands of time steps, so the overhead of generating and compiling the kernel functions for the computation graph at every step would be extremely high. Therefore, we generate a fused kernel function only once for each subgraph and put it into a function pool. Later, when the same computation subgraph is encountered, the corresponding fused kernel function is fetched directly from the pool.

Since the arrays in OpenArray are distributed among different processing units, and the operators need to use data from neighboring points, it is necessary to check data consistency before fusion to ensure correctness. The data-splitting method used for the distributed arrays can greatly affect computing performance. The current method in OpenArray is the widely used block-based strategy: when solving PDEs on structured grids, the simulated domain is divided into blocks that are distributed to different processing units. However, the differential and average operators always require neighboring points to perform array computations, so without library support ocean modelers would have to frequently call the corresponding functions to carefully control the communication of the local boundary regions.
Therefore, we implemented a general boundary management module to implicitly maintain and update the local boundary information so that the modelers no longer need to handle the message communication themselves. The boundary management module uses asynchronous communication to update and maintain the data in the boundary regions, which allows computation and communication to overlap. These asynchronous communication procedures are invoked implicitly when a basic kernel or a fused kernel is called, ensuring that the parallel details are completely transparent to the modelers. For the global boundary conditions of limited physical domains, the values at the physical border are always set to zero within the operators and operator expressions. In realistic cases, the global boundary conditions are set by a series of functions (e.g., radiation, wall) provided by OpenArray.
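What the boundary-management module maintains can be illustrated with a toy halo exchange between two blocks of a 1-D domain split by the block-based strategy: each block keeps one ghost cell per side and copies its neighbour's edge value into it before a stencil sweep. OpenArray does this implicitly with asynchronous MPI; this serial copy is only a sketch of the data movement.

```python
# Toy halo exchange between two neighbouring 1-D blocks (serial sketch).
def exchange_halo(left, right):
    left[-1] = right[1]     # left block's right ghost <- right's first interior
    right[0] = left[-2]     # right block's left ghost <- left's last interior

full = [float(i) for i in range(10)]
left = full[:5] + [0.0]     # interior cells 0..4 plus a right ghost cell
right = [0.0] + full[5:]    # a left ghost cell plus interior cells 5..9
exchange_halo(left, right)
assert left[-1] == 5.0 and right[0] == 4.0
```

In the real module the two copies become nonblocking sends/receives posted before the fused kernel runs, so interior computation proceeds while the ghost cells are in flight.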

Portable program for different back-end platforms
With the help of dynamic code generation and JIT compilation technology, OpenArray can be migrated to different back-end platforms. Several basic libraries, including the Boost C++ libraries and the Armadillo library, are required. The JIT compilation module is based on the low-level virtual machine (LLVM), so in principle the module can only be ported to platforms that support LLVM. If LLVM is not supported, as on the Sunway platform, one can generate the fused kernels in advance by running the ocean model on an x86 platform. If the target platform uses CPUs with acceleration cards, such as GPU clusters, it is necessary to add control statements to the CPU code, including data transmission, calculation, and synchronous and asynchronous statements. In addition, the accelerating solution should involve the selection of the best parameters, e.g., "blockDim" and "gridDim" on GPU platforms. In short, the code generation module of OpenArray also needs to be refactored to generate code for the different back-end platforms; applications based on OpenArray can then be migrated seamlessly to the target platform. Currently, we have implemented the corresponding source code generation modules for Intel CPUs and Sunway processors in OpenArray.
According to the TOP500 list released in November 2018, the Sunway TaihuLight is ranked third in the world, with a LINPACK benchmark rating of 93 petaflops provided by Sunway many-core processors (or Sunway CPUs). As shown in Fig. 9, every Sunway CPU includes 260 processing elements (or cores) that are divided into four core groups. Each core group consists of 64 computing processing elements (CPEs) and a management processing element (MPE) (Qiao et al., 2017). The CPEs handle large-scale computing tasks, and the MPE is responsible for task scheduling and communication. The relationship between the MPE and the CPEs is like that between a CPU and a many-core accelerator, except that they are fused into a single Sunway processor sharing a unified memory space. To make the most of the computing resources of the Sunway TaihuLight, we generate kernel functions for the MPE, which is responsible for thread control, and for the CPEs, which perform the computations. The kernel functions are fully optimized with several code optimization techniques (Pugh, 1991), such as loop tiling, loop aligning, single-instruction multiple-data (SIMD) vectorization, and function inlining. In addition, due to the high memory access latency of the CPEs, we accelerate data access by issuing direct memory access instructions in the kernels to transfer data between the main memory and the local memory (Fu et al., 2017).

Figure 9. The MPE-CPE hybrid architecture of the Sunway processor. Every Sunway processor includes 4 core groups (CGs) connected by the network-on-chip (NoC). Each CG consists of a management processing element (MPE), 64 computing processing elements (CPEs), and a memory controller (MC). The Sunway processor uses the system interface (SI) to connect with outside devices.

Implementation of GOMO
In this section, we introduce how to implement a numerical ocean model using OpenArray. The most important step is to derive the primitive discrete governing equations in operator expression form; the remaining work is then completed by OpenArray.
The fundamental equations of GOMO are derived from POM. GOMO features a bottom-following, free-surface, staggered Arakawa C grid. To efficiently resolve the rapid surface fluctuations, GOMO uses the mode-splitting algorithm inherited from POM, addressing the fast-propagating surface gravity waves and the slow-propagating internal waves in barotropic (external) and baroclinic (internal) modes, respectively. The details of the continuous governing equations, the corresponding operator expression forms, and descriptions of all the variables used in GOMO are listed in Appendices A-C, respectively.

Figure 10 shows the basic flow diagram of GOMO. At the beginning, we initialize OpenArray to make all operators suitable for GOMO. After loading the initial values and the model parameters, the distance information is bound to the differential operators through grid binding. In the external mode, the main cost is computing the 2-D sea surface elevation η and the column-averaged velocities (Ua, Va). In the internal mode, 3-D array computations predominate in order to calculate the baroclinic motions (U, V, W), the tracers (T, S, ρ), and the turbulence closure scheme (q², q²l) (Mellor and Yamada, 1982), where U, V, and W are the velocity fields in the x, y, and σ directions; T, S, and ρ are the potential temperature, the salinity, and the density; and q²/2 and q²l/2 are the turbulence kinetic energy and the turbulence kinetic energy multiplied by the turbulence length scale, respectively.
When a user dives into the GOMO code, the main time-stepping loop appears to run on a single processor. However, as described above, implicit parallelism is the most prominent feature of a program using OpenArray. The operators in OpenArray, not only the difference and average operators but also the "+", "−", "*", "/", and "=" operators in the Fortran code, are all overloaded for the special data structure Array. The seemingly serial Fortran code is implicitly converted into parallel C++ code by OpenArray, and the parallelization is hidden from the modelers. Because the complicated parallel optimization and tuning processes are decoupled from the ocean modeling, we completely implemented GOMO based on OpenArray in only 4 weeks, whereas the implementation might take several months or even longer using the MPI or CUDA libraries.
In comparison with the existing POM and its multiple variants, including (to name a few) the Stony Brook Parallel Ocean Model (sbPOM), mpiPOM, and POMgpu, GOMO has less code but is more powerful in terms of compatibility. As shown in Table 3, the serial version of POM (POM2k) contains 3521 lines of code. sbPOM and mpiPOM are parallelized using MPI, while POMgpu is based on MPI and CUDA-C; their codes grow to 4801, 9680, and 30 443 lines, respectively. In contrast, the code of GOMO shrinks to 1860 lines. Moreover, GOMO provides the same functionality as the other models while using the least code (Table 4), since the complexity has been transferred to OpenArray, which itself comprises about 11 800 lines of code.
In addition, poor portability considerably restricts the use of advanced hardware in oceanography. Thanks to OpenArray, GOMO is adaptable to different hardware architectures, such as the Sunway processor. Modelers do not need to modify any code when changing platforms, eliminating the heavy burden of porting code. As computing platforms become increasingly diverse and complex, GOMO becomes more powerful and attractive than machine-dependent models.

Results
In this section, we first evaluate the basic performance of OpenArray using benchmark tests on a single CPU platform. After checking the correctness of GOMO through an ideal seamount test case, we use GOMO to further test the scalability and efficiency of OpenArray.

Benchmark testing
We choose two typical PDEs and their implementations from Rodinia v3.1, a benchmark suite for heterogeneous computing (Che et al., 2009), as the original versions. For comparison, we re-implement these two PDEs using OpenArray. In addition, we add two other test cases. As shown in Table 5, the 2-D continuity equation is used to solve the sea surface height, and its continuous form is shown in Eq. (1). The 2-D heat diffusion equation is a parabolic PDE that describes the distribution of heat over time in a given region. Hotspot is a thermal simulation used for estimating processor temperature on structured grids (Che et al., 2009; Huang et al., 2006); we test one 2-D case (Hotspot2D) and one 3-D case (Hotspot3D) of this program. The average runtime over 100 iterations is taken as the performance metric. All tests are executed on a single workstation with an Intel Xeon E5-2650 CPU. The experimental results show that the performance of the OpenArray versions is comparable to that of the original versions.

Validation tests of GOMO
The seamount problem proposed by Beckmann and Haidvogel (1993) is a widely used ideal test case for regional ocean models. It is a stratified Taylor column problem that simulates the flow over an isolated seamount with a constant salinity and a reference vertical temperature stratification. An eastward horizontal current of 0.1 m s⁻¹ is added at model initialization. The southern and northern boundaries are closed. If the Rossby number is small, an obvious anticyclonic circulation is trapped by the seamount in the deep water.
Using the seamount test case, we compare the GOMO and sbPOM results. The configurations of both models are exactly the same. Figure 11 shows that GOMO and sbPOM both capture the anticyclonic circulation at 3500 m depth. The shaded plot shows the surface elevation, and the arrow plot shows the current at 3500 m. Panels 11a, b, and c show the results of GOMO, sbPOM, and the difference (GOMO minus sbPOM), respectively. The differences in the surface elevation and deep currents between the two models are negligible (Fig. 11c).

The weak and strong scalability of GOMO
The seamount test case is used to compare the performance of sbPOM and GOMO in a parallel environment. We use the X86 cluster at the National Supercomputing Center in Wuxi, China, which provides at most 5000 Intel Xeon E5-2650 v2 CPU cores for our account. Figure 12a shows the result of a strong-scaling evaluation, in which the model size is fixed at 2048 × 2048 × 50; the dashed line indicates the ideal speedup. For the largest run with 4096 processes, GOMO and sbPOM achieve 91 % and 92 % parallel efficiency, respectively. Figure 12b shows the weak scalability of sbPOM and GOMO. In the weak-scaling test, the model size for each process is fixed at 128 × 128 × 50, and the number of processes is gradually increased from 16 to 4096. Taking the performance at 16 processes as the baseline, we find that the parallel efficiencies of GOMO and sbPOM with 4096 processes are 99.0 % and 99.2 %, respectively.

Testing on the Sunway platform
We also test the scalability of GOMO on the Sunway platform. Taking the runtime of GOMO on 10 000 Sunway cores with a grid size of 4096 × 4096 × 50 as the baseline, the parallel efficiency of GOMO still reaches 85 % at 150 000 cores, as shown in Fig. 13. However, we notice that the scalability declines sharply when the number of cores exceeds 150 000. Two factors lead to this decline. First, the block size assigned to each core decreases as the number of cores increases, causing more communication during boundary region updates. Second, some steps cannot be accelerated even when more computing resources are available; for example, the time spent on creating the computation graph, generating the fusion kernels, and JIT compilation cannot be reduced. Even though the fusion-kernel codes are generated and compiled only once at the beginning of a job, this step takes about 2 min. In this sense, OpenArray performs better when processing large-scale data, and GOMO is more suitable for high-resolution scenarios. In the future, we will further optimize the communication and graph-creation modules to improve the efficiency at large core counts.

Discussion
As we mentioned in Sect. 1, the advantages of OpenArray are its ease of use, high efficiency, and portability. Using OpenArray, modelers without any parallel computing skills or experience can write simple operator expressions in Fortran to implement complex ocean models. The resulting models can run on any CPU or Sunway platform on which the OpenArray library has been deployed; we call this effect "write once, run everywhere". Other similar libraries (e.g., ATMOL, ICON DSL, STELLA, and COARRAY) require users to manually control the boundary communication and task scheduling to some extent. In contrast, OpenArray implements completely implicit parallelism with user-friendly interfaces and a familiar programming language.

However, several problems remain to be solved in the development of OpenArray. The first issue is computational efficiency. Once a variable resides in a processor register or in the highest-speed cache, it should be reused as much as possible before being replaced; ideally, no variable should be moved more than once per time step. The memory overhead introduced by overloading techniques is usually high because of unnecessary data movement and unavoidable cache misses. The current efficiency and scalability of GOMO are close to those of sbPOM because we have adopted a series of optimization methods, such as a memory pool, computation graphing, JIT compilation, and vectorization, to alleviate the pressure on memory bandwidth. However, we admit that we cannot fully solve the memory-bandwidth limitation at present. Time skewing (Frigo and Strumpen, 2005), a cache-oblivious algorithm for stencil computations, is a promising direction since it can exploit temporal locality optimally throughout the entire memory hierarchy. In addition, the polyhedral model may be another potential approach; it uses an abstract mathematical representation based on integer polyhedra to analyze and optimize the memory access pattern of a program.
The second issue is that the current OpenArray version cannot support customized operators. When modelers try another higher-order advection scheme or any other numerical scheme, the 12 basic operators provided by OpenArray may not suffice. We are considering a template mechanism to support customized operators. The rules of operation would be defined in a template file, in which the calculation form of each customized operator is described by a regular expression; to add a customized operator, users would only need to append a regular expression to the template file.
OpenArray and GOMO will continue to be developed, and three key improvements are planned for the coming years. First, we are developing a GPU version of OpenArray. The guiding principle is to keep hot data in GPU memory, or to exchange it directly between GPUs, and to avoid returning data to the main CPU memory. NVLink provides high bandwidth and outstanding scalability for GPU-to-CPU and GPU-to-GPU communication and addresses the interconnect issue for multi-GPU and multi-GPU/CPU systems.
Second, data input/output is becoming a bottleneck of earth system models as their resolution increases rapidly. At present we encapsulate the PnetCDF library to provide simple I/O interfaces, such as the load and store operations. A climate fast input/output (CFIO) library (Huang et al., 2014) will be integrated into OpenArray in the next few years. CFIO is approximately 220 % faster than PnetCDF because it overlaps I/O with computation; merging it into a future version of OpenArray is expected to improve performance further.

Finally, like most ocean models, GOMO faces a load imbalance issue. We are adding more effective load-balancing schemes, including space-filling curves (Dennis, 2007) and curvilinear orthogonal grids, to OpenArray in order to reduce the computational cost of land points.

OpenArray is a product of the collaboration between oceanographers and computer scientists. It played an important role in simplifying the porting work on the Sunway TaihuLight supercomputer. We believe that OpenArray and GOMO will continue to be maintained and upgraded, and we aim to promote them to the modeling community as development tools for future numerical models.

Conclusions
In this paper, we design a simple computing library (OpenArray) to decouple ocean modeling from parallel computing.

OpenArray provides 12 basic operators that are abstracted from PDEs and extended to the governing equations of ocean models. These operators feature user-friendly interfaces and implicit parallelism. Furthermore, several state-of-the-art optimization mechanisms, including computation graphing, kernel fusion, dynamic source code generation, and JIT compilation, are applied to boost performance. The experimental results show that the performance of a program using OpenArray is comparable to that of a well-designed program written in Fortran. Based on OpenArray, we implement a numerical ocean model (GOMO) with high productivity, enhanced readability, and excellent scalability. Moreover, GOMO scales well on both CPU systems and the Sunway platform. Although more realistic tests are needed, OpenArray may signal the beginning of a new frontier in future ocean modeling by ingesting basic operators and cutting-edge computing techniques.

Appendix A: Continuous governing equations
The equations governing the baroclinic (internal) mode in GOMO are the three-dimensional hydrostatic primitive equations.
where F_u, F_v, F_q², and F_q²l are the horizontal kinematic viscosity terms of u, v, q², and q²l, respectively; F_T and F_S are the horizontal diffusion terms of T and S, respectively; and W is the wall proximity function.
The equations governing the barotropic (external) mode in GOMO are obtained by vertically integrating the baroclinic equations.
where F_Ua and F_Va are the horizontal kinematic viscosity terms of U_A and V_A, respectively, and G_Ua and G_Va are the dispersion terms of U_A and V_A, respectively. The subscript "A" denotes vertical integration. The discrete governing equations of the baroclinic (internal) mode expressed by operators are shown below, where F_u, F_v, F_q², and F_q²l are the horizontal kinematic viscosity terms of u, v, q², and q²l, respectively, and F_T and F_S are the horizontal diffusion terms of T and S, respectively.
The discrete governing equations of the barotropic (external) mode expressed by operators are shown below. The description of each symbol in the governing equations is listed below.
