Interactive comment on "GPU accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC) Earth system model (version 2.52)" by Michail Alvanos and Theodoros Christoudias

The authors present the implementation and evaluation, with respect to performance and accuracy, of a source-to-source translation tool for KPP kernels in the EMAC Earth system model. CUDA kernels and the corresponding calling code are generated by parsing Fortran code generated by KPP. A representative benchmark is evaluated on two platforms using two different generations of Nvidia GPUs. The performance impact of various optimizations is studied using microbenchmarks. Furthermore, the tool is released open source under a permissive MIT license, and the specific release used to produce the manuscript is deposited on Zenodo with a DOI, both of which are commendable.


Introduction
One of today's great scientific challenges is to predict how climate will change locally as gas concentrations change over time. The study of chemistry-climate interactions represents an important and, at the same time, difficult task of global Earth system research. The emerging issues of climate change, ozone depletion, and air quality, which are challenging from both scientific and policy perspectives, are represented in chemistry-climate models (CCMs). Understanding how the chemistry and composition of the atmosphere may change over the 21st century is essential in preparing adaptive responses or establishing mitigation strategies.
The global atmosphere-chemistry model ECHAM/MESSy (EMAC) is a numerical chemistry and climate simulation system that includes submodels describing tropospheric and middle atmosphere processes and their interaction with oceans, land, and human influences (Jöckel et al., 2010). It uses the second version of the Modular Earth Submodel System (MESSy2) to link multi-institutional computer codes. The core atmospheric model is the fifth-generation European Centre Hamburg general circulation model (Roeckner et al., 2006). The EMAC model runs on several platforms, but it is currently unsuitable for massively parallel computers due to its scalability limitations. In climate simulation applications, the numerical integration by chemical kinetics solvers can take up to 90 % of execution time (Christou et al., 2016). To achieve realistic simulation times, researchers are forced to limit the resolution of model simulations.
This paper describes a method of accelerating the chemical kinetics calculations on modern high-performance supercomputers using GPU accelerators. We present a source-to-source parser, written in the Python programming language, that transforms the MESSy chemical kinetics FORTRAN source code into CUDA source code, suited for running on CUDA-enabled general-purpose graphics processing unit (GPGPU) accelerators. The parser transforms the FORTRAN code auto-generated by KPP (Sandu and Sander, 2006; Damian et al., 2002) into CUDA-compatible accelerated code, allowing all of the different numerical integration solvers to be offloaded to GPU accelerators. The parser also makes the appropriate changes in the MESSy software distribution for linking the accelerated code during the compilation phase.

Published by Copernicus Publications on behalf of the European Geosciences Union.
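The flavor of such a source-to-source transformation can be illustrated with a minimal, hypothetical sketch (the function name and the handful of patterns below are illustrative only; the actual parser handles the full grammar of the KPP-generated Fortran):

```python
import re

def fortran_to_cuda(line):
    """Translate a tiny subset of KPP-style Fortran to CUDA C.
    Illustrative only; a real parser covers the full generated code."""
    # REAL(kind=dp) :: var  ->  double var;
    m = re.match(r"\s*REAL\(kind=dp\)\s*::\s*(\w+)", line, re.IGNORECASE)
    if m:
        return "double %s;" % m.group(1)
    # Fortran 1-based array access A(i) -> C 0-based A[i-1]
    line = re.sub(r"(\w+)\((\d+)\)",
                  lambda g: "%s[%d]" % (g.group(1), int(g.group(2)) - 1),
                  line)
    return line.strip()

print(fortran_to_cuda("REAL(kind=dp) :: C"))
print(fortran_to_cuda("  Vdot(1) = A(75) - A(1)"))
```

Because the KPP output is machine-generated and highly regular, pattern-based rewriting of this kind is tractable without a full compiler framework.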
The paper is organized as follows: Sect. 1.1 describes the MESSy/MECCA frameworks and the parallelization approaches, and Sect. 1.2 the potential of GPU accelerators. Section 1.3 discusses previous work related to this research. In Sect. 2, we present our implementation of atmospheric chemical kinetics parallelization, the source-to-source parser code, and GPU-specific optimizations, including memory optimizations, control code restructuring, and the refactoring of the EMAC source code. An evaluation of the resulting GPU-accelerated climate model appears in Sect. 3. We summarize the main outcomes and present our conclusions and planned future work in Sect. 4.

The EMAC framework
The numerical global atmosphere-chemistry model EMAC (ECHAM/MESSy Atmospheric Chemistry) is a modular global model that simulates the chemistry and dynamics of the stratosphere and troposphere. The model includes different submodels for the calculation of concentrations in the atmosphere, their interaction with the ocean and land surfaces, and the anthropogenic influences. The EMAC model runs on several platforms, but it is currently unsuitable for massively parallel computers due to its scalability limitations and large memory requirements per core.
The MESSy submodel MECCA executes the gas-phase chemical kinetics independently for each grid cell, because there are no dependencies between physical neighbors and no limitations by vertical adjacency relations. In a typical configuration of MESSy with 155 species and 310 chemical reactions, MECCA takes 70 % of the simulation execution time (Christou et al., 2016). The percentage of execution time can go up to 90 % in simulations with more complex chemistry.
Currently, EMAC uses coarse-grained parallelism based on the Message Passing Interface (MPI). However, the current approach does not benefit from the accelerators that exist in modern hybrid high-performance computing (HPC) architectures. This puts severe limitations on current climate-length global simulations of atmospheric chemistry and pollution transport in terms of portability, complexity, and resolution.
EMAC uses the Kinetic PreProcessor (KPP) (Sandu and Sander, 2006; Damian et al., 2002), an open-source general analysis tool, to formulate the chemical mechanism. KPP integrates very efficient numerical analysis routines and automatically generates FORTRAN and C code that computes the time evolution of chemical species from a specification of the chemical mechanism in a domain-specific language. Taking a set of chemical reactions and their rate coefficients as input, KPP generates code for the resulting coupled ordinary differential equations (ODEs). Solving the ODE system allows the temporal integration of the kinetic system. Efficiency is obtained by exploiting the sparsity structure of the Jacobian matrix.
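The construction KPP performs can be illustrated with a toy mechanism (a sketch, not KPP's actual output: KPP emits the equivalent right-hand side as unrolled Fortran/C code rather than interpreting a reaction list at runtime):

```python
def make_rhs(reactions):
    """Build the ODE right-hand side dC/dt for a list of reactions.
    Each reaction is (reactants, products, rate coefficient k)."""
    def rhs(conc):
        dcdt = {s: 0.0 for s in conc}
        for reactants, products, k in reactions:
            rate = k
            for s in reactants:       # mass-action rate: k * product of
                rate *= conc[s]       # reactant concentrations
            for s in reactants:
                dcdt[s] -= rate       # reactants are consumed
            for s in products:
                dcdt[s] += rate       # products are formed
        return dcdt
    return rhs

# Toy mechanism: O3 + NO -> NO2 + O2 with k = 1e-2
rhs = make_rhs([(("O3", "NO"), ("NO2", "O2"), 1e-2)])
d = rhs({"O3": 10.0, "NO": 5.0, "NO2": 0.0, "O2": 0.0})
```

The stiffness of this coupled system, which varies strongly with the rate coefficients, is what motivates the implicit Rosenbrock solvers used throughout the paper.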
The biggest challenge to address in the application is the imbalance caused by the adaptive time-step integrator solving the differential equations that describe the chemical mechanism. The varying light intensity at sunrise and sunset, in combination with concentrations of precursors and gases (such as NOx and O3), leads to photochemical reactions (mostly over midlatitudes in the stratosphere) that heavily alter the stiffness of the ODEs. For example, Fig. 1a presents the cumulative number of execution steps required for the integration process for each model column, and Fig. 1b presents the maximum difference in the number of steps between cells in each column. The difference in the number of steps within and between columns provides an indication of the strong imbalance created between the execution times of different processes.
The ECHAM atmospheric dynamical circulation phase of EMAC scales only up to approximately a few hundred cores (Christou et al., 2016) due to the heavy all-to-all communication overhead of the spectral decomposition. At higher levels of parallelism, at or beyond approximately 1000 cores, the MECCA load imbalance due to the photochemistry also becomes a limiting factor.

GPU accelerators
A common trend in computing today is the utilization of hardware accelerators that efficiently execute codes rich in data parallelism to form high-performance heterogeneous systems. Graphics processing units (GPUs) are commonly used as accelerators due to the high peak performance they offer. GPU accelerators are module extensions connected to the CPU and memory subsystem via an interconnect. They feature on-board memory that provides high-bandwidth, medium-latency access to data. In contrast with general-purpose CPUs, GPU processors are designed for massive parallelism, issuing thousands of simple parallel instructions. Programming a GPU accelerator can be a hard and error-prone process that requires specially designed programming models, such as CUDA (NVidia, 2015b) and OpenCL (Munshi, 2009). To unlock the potential performance of accelerators, programmers often apply incremental optimizations to their applications.

Related developments
There are numerous efforts documented in the literature to improve the performance of climate-chemistry model simulations, specifically targeting chemical kinetics.
The adoption of intrinsic optimizations for sparse linear algebra (Zhang et al., 2011; Jacobson and Turco, 1994) is one of the first improvements implemented to speed up simulations of chemical kinetics. In a chemical kinetics solver, the majority of the time is spent solving the linear systems using implicit integration mechanisms. Only a fraction of the matrix entries are different from zero, allowing for more efficient implementations that exploit the sparsity of the structures. The sparsity structure depends on the chemical reactions, and thus the linear algebra can be prepared offline, before the execution of the application. In a typical chemical mechanism, the pattern of chemical interactions leads to a sparse Jacobian with the majority of entries equal to zero. KPP uses sparse linear algebra to reduce the execution time. The sparsity structure depends only on the chemical network and not on the values of concentrations or rate coefficients. Thus, the lookup tables are constant, allowing KPP to unroll the loops and remove the indirect memory accesses.
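The key point, that the sparsity pattern is fixed by the chemical network and can therefore be precomputed once, can be sketched as follows (a simplified illustration with hypothetical helper names; KPP performs the equivalent step at code-generation time and then unrolls the resulting loops entirely):

```python
def build_sparsity(jac_pattern):
    """Precompute the nonzero coordinate list of a fixed Jacobian pattern.
    The pattern depends only on the chemical network, so this runs once."""
    return [(i, j) for i, row in enumerate(jac_pattern)
                   for j, nz in enumerate(row) if nz]

def jac_times_vec(values, nonzeros, v, n):
    """Sparse matrix-vector product over the precomputed index list only;
    zero entries are never touched."""
    out = [0.0] * n
    for (i, j), a in zip(nonzeros, values):
        out[i] += a * v[j]
    return out

pattern = [[1, 1, 0],
           [0, 1, 0],
           [1, 0, 1]]
nz = build_sparsity(pattern)   # computed once, reused every time step
y = jac_times_vec([2.0, 1.0, 3.0, 4.0, 5.0], nz, [1.0, 2.0, 3.0], 3)
```

With constant index tables, the generated code can replace the inner loops with straight-line statements, which removes the indirect memory accesses mentioned above.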
Researchers have also improved the performance of the chemical kinetics portions of the code by using data- and instruction-level parallelism (DLP and ILP) when solving the chemical kinetics equations. Approaches include the introduction of single instruction, multiple data (SIMD) instructions and annotation using the OpenMP programming model. However, these approaches often rely on the ability of the compiler to auto-vectorize the code, which very often misses opportunities (Zhang et al., 2011), and the OpenMP implementation burdens performance due to the high overhead of scheduling and managing threads.
A more coarse-grained approach is to use grid- or box-level parallelization (Linford, 2009, 2010; Christoudias and Alvanos, 2016). The application breaks the grid or box into cells, allowing the calculation of concentrations independently between cells. Thus, the application assigns an equal number of cells to each thread to allow embarrassingly parallel execution of the chemical solvers. The biggest drawback of this approach is the limited parallelism and the imbalance created due to the photochemical processes in the lower stratosphere. A similar approach is used for the parallelization of chemical kinetics on the Cell processor and GPU accelerators by Damian et al. (2002). However, these approaches exhibit the same limitations: limited parallelism and imbalance. Moreover, researchers effectively have to rewrite their applications every time they run on accelerators. Current commercial approaches to parallelizing chemical kinetics use fine-grained parallelism that is suitable only when the numbers of species and chemical reactions are large enough to justify the use of accelerators.
A different organization of the workload using adaptive mesh refinement (AMR) was proposed as a way to locally increase the resolution for an area of interest (Keppens et al., 2003). An adaptive unstructured mesh representation in climate models can improve the overall performance of the application (Schepke and Maillard, 2012; Schepke, 2014). Compared to traditional horizontal partitioning, these solutions offer greater scalability. However, communication demands become dominant at high process counts, and this approach ignores both the potential imbalance created by the chemical reactions and the heterogeneity of modern HPC machines.
An earlier prototype of the application in this paper is outlined in Christoudias and Alvanos (2016), focusing on the challenges of using GPU accelerators to exploit node-level heterogeneity. This paper significantly expands on the previous work, both in detailed implementation and optimization. Moreover, this paper presents a performance evaluation of the first public release of the source code (Alvanos and Christoudias, 2017a).

Implementation
This section presents the implementation of the source-to-source parser and the challenges addressed. The parser is written in the Python programming language and generates CUDA (NVidia, 2015b) compatible solvers by parsing the auto-generated FORTRAN code output by KPP. This approach avoids distributing an additional compiler framework, such as the ROSE compiler framework (Quinlan and Liao, 2011; Quinlan et al., 2012), that may contain additional software dependencies under a different license, and at the same time allows easy distribution and maintainability. The process is automated and allows the parser to work with all possible user-defined chemical mechanisms, without requiring changes by the end user to the accelerated kernel code. The FORTRAN compiler links the output CUDA object file with the rest of the application.
The user executes the parser from the messy/util directory to transform the code. The parser modifies the messy/smcl/messy_mecca_kpp.f90 file and places a single call to the CUDA source file that contains the accelerated code (messy/smcl/messy_mecca_kpp_acc.cu) and a wrapper function for issuing the parallel kernels and copying the data to and from the GPU. The solver supports all five variations of the Rosenbrock solvers available in KPP (Ros2, Ros3, Ros4, Rodas3, and Rodas4). The parser is also responsible for making the changes to the EMAC source makefile for linking of object files during compilation. The user must modify the configuration of the EMAC model to include the CUDA runtime during the linking phase. Similar to the FORTRAN implementation, the computation is subdivided into runtime-specified arrays of columns. The memory of each array is transferred to the GPU global memory, and each grid box is calculated on an individual GPU thread.
The CUDA chemical kinetics solver comprises three steps, also presented diagrammatically in Fig. 2. Each task is offloaded using a separate call:

1. The first step is the calculation of the reaction rate coefficients. The variable values are stored in a global array inside the GPU and used in the second step.
2. The second step is the ODE solver, which includes all linear algebra functions. The computational kernel contains five optional variations of the Rosenbrock family of numerical solvers, specified by the user at runtime, as supported by KPP/KP4.

3. The third and final step of the solver is the statistical reduction of the results, which demands very limited computational time compared with the other steps.
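The three offloaded steps can be sketched as three kernel launches from a host-side wrapper (a simplified outline with hypothetical names and argument lists; the generated code passes many more arguments and performs the host-device copies around these calls):

```cuda
// Hypothetical kernel signatures for the three offloaded steps.
__global__ void update_rconst(double *rconst, const double *temp, int ncells);
__global__ void rosenbrock_solve(double *conc, const double *rconst,
                                 int solver_id, int ncells);
__global__ void reduce_stats(const int *istatus, int *out, int ncells);

void kpp_integrate_gpu(double *d_conc, double *d_rconst, const double *d_temp,
                       int *d_istatus, int *d_out, int solver_id, int ncells)
{
    const int block = 64;                      // block size used in the paper
    const int grid  = (ncells + block - 1) / block;
    // Step 1: reaction rate coefficients, stored in GPU global memory
    update_rconst<<<grid, block>>>(d_rconst, d_temp, ncells);
    // Step 2: Rosenbrock ODE solver, one GPU thread per grid box
    rosenbrock_solve<<<grid, block>>>(d_conc, d_rconst, solver_id, ncells);
    // Step 3: cheap statistical reduction of the integrator status
    reduce_stats<<<grid, block>>>(d_istatus, d_out, ncells);
}
```

Keeping the rate coefficients in a global array between the first and second launches avoids recomputing them inside the solver kernel.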
The parser replaces the code of the integrator loops located in the messy_mecca_kpp.f90 file with a single call to the CUDA-accelerated code. The CUDA source file includes an entry function that is responsible for moving the data to and from the GPU accelerator and issuing the computation kernels. Algorithm 1 presents the outline of the integrator code and Algorithm 2 presents the outline of the GPU glue code. Each GPU thread calculates the chemical concentrations of an individual cell. The temporary arrays used between steps of the solver are allocated in stack memory, with the exception of the RCONST array. The RCONST array contains the reaction rate coefficients that are recalculated in the first step of the integration process or at each substep of the numerical solver, for an ODE system with varying (rate) coefficients, which has to be integrated as a non-homogeneous system. The developed source-to-source parser automatically recognizes these coefficients.
Coalesced access to global memory is limited by the stack occupancy of each CUDA thread. If the number of available registers increases or the local memory becomes larger, then the performance gain from coalescing memory accesses can be significant. The trend in GPU architectures is toward larger on-chip local memory.
The global GPU memory size suffices for offloading the full set of data, and it is not a limiting factor for the present and future projected nominal chemistry mechanism complexity. The performance is also not limited by the overuse of any functional unit. Global and local memory are used to store all required data, and local variables are used in each grid box for temporary values during computation, with cache misses and register usage being the current performance-limiting factors. The block size of the GPU kernels is set to 64, and it can be changed to 128 for increased efficiency on future GPU architectures if applicable. The maximum number of cells that can be offloaded to the GPU is 12 288. This sets an upper limit of 128 on the value of the ECHAM5 NVL runtime parameter, assuming 90 levels for the atmosphere.
The accelerated CPU process requires a chunk of the GPU video RAM (VRAM), whose size depends on the number of species and reaction constants in the MECCA mechanism. The required stack memory can be calculated as follows: stack_allocated_memory = (number_of_elements × 13 + lu_nonzero × 2 + spilled_variables) × sizeof(double). For example, in a workload with 155 species and 310 reactions, the required stack memory, including the temporary allocated arrays, is (155 × 13 + 2 × 1500 + 1200) × 8 = 49 720 bytes, approximately 50 KB. Note that the size of the spilled variables depends on the complexity of the source code. The total memory required for a kernel with 12 288 cells is 12 288 × (stack_allocated_memory + global_allocated_mem) = 12 288 × 52 000 ≈ 638 MB. Moreover, during GPU code loading, the GPU driver unrolls the loops and expands the offloaded code, making it a limiting factor for allocated memory. For instance, the aforementioned chemistry workload requires at least 2 GB of VRAM to be available on the Fermi architecture, or more than 6 GB on the newer Pascal architecture. We believe that the additional allocated memory is due to driver-side allocation of additional caches and buffers to achieve the best performance. Furthermore, to support the newest features, such as the Unified Virtual Memory model, additional memory allocation is required. Thus, the total memory required for each process depends not only on the size of the chemistry but also on the complexity of the source code (number of reactions) and the architecture of the GPU accelerator.
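The arithmetic of the memory estimate can be checked directly: 155 × 13 + 2 × 1500 + 1200 = 6215 doubles, i.e., 49 720 bytes, consistent with the ≈ 50 KB per cell quoted in the next section. A small sketch (the function name is illustrative; the formula is the one given above):

```python
def stack_bytes(n_elements, lu_nonzero, spilled, sizeof_double=8):
    """Per-thread stack memory estimate from the formula in the text."""
    return (n_elements * 13 + lu_nonzero * 2 + spilled) * sizeof_double

per_cell = stack_bytes(155, 1500, 1200)   # ~50 KB of stack per cell
total = 12288 * 52_000                    # stack + global, for 12 288 cells
print(per_cell, total)                    # ~638 MB total per process
```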

Challenges and optimizations
To achieve the best possible performance and efficiency, three challenges had to be addressed. First, the biggest limitation on the performance of the chemical kinetics kernel is the memory required. To solve the ODEs and calculate the new concentrations, more memory is required in total than the available on-chip memory. For instance, a set of 155 species and 310 reactions requires ∼ 50 KB of stack memory per cell. Thus, running multiple threads concurrently on the GPU forces the use of global memory for storing the data. Second, the source code complexity of the solver kernel increases the usage of registers and limits GPU streaming multiprocessor (SM) utilization. Third, the number of steps for solving the differential equations differs between cells. This creates thread divergence, limiting the performance.
To address these challenges, we implemented incremental improvements in the source code running on the GPUs. Our aim is to improve the GPU occupancy, reduce the memory impact, simplify the source code, and reduce the idle time.

Occupancy improvement
The high register pressure limits the occupancy, limits the number of concurrent threads per SM, and increases the overall execution time of the kernel. There are two ways of overcoming this challenge: either setting a flag during compilation that limits the number of registers or applying the launch bounds qualifier (__launch_bounds__) to a specific kernel. Our implementation uses the second option as a generic, future-proof approach. The downside of limiting the register usage is the increase of register spilling, causing an increase of stack memory usage. The compiler allocates additional stack memory for spill loads and stores, creating additional local memory traffic that does not fit into the on-chip memory. Thus, the application execution time is dominated by the global memory access latency.
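The per-kernel qualifier can be sketched as follows (kernel name, arguments, and the bound values are illustrative, not the values used in the generated code):

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// caps register usage for this kernel only, instead of a global
// -maxrregcount compile flag that would affect every kernel.
__global__ void __launch_bounds__(64, 8)
kpp_rosenbrock_kernel(double *conc, const double *rconst, int ncells)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < ncells) {
        // ... per-cell integration; registers that exceed the cap are
        // spilled by the compiler to local (stack) memory ...
    }
}
```

The second argument asks the compiler to keep enough blocks resident per SM, which is what makes the approach future-proof across architectures with different register file sizes.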

Memory optimizations
GPU accelerators employ a wide memory bus that allows for high-bandwidth transfers at the cost of high latency.To achieve the best execution efficiency on the GPU, we apply a number of memory optimizations: (i) better utilization of the on-chip memory, (ii) privatization of each thread data structure, and (iii) prefetching.
Each SM contains a small amount of local memory that can be used as shared memory between threads or as Level-1 (L1) cache. The size of the temporary matrices is larger than the available on-chip memory. Thus, only a small portion of the data can be stored in this memory. In particular, the concentrations of chemical species that remain unchanged by chemical kinetics are stored there. To increase the utilization of the local memory, we increased the portion of the L1 cache relative to the shared memory through the cudaFuncCachePreferL1 runtime call.
The solver must keep intermediate results in temporary arrays during the different steps of the solver. The temporary arrays are larger than the available on-chip memory, forcing us to declare the arrays in global memory. Although accesses to these arrays can be coalesced, overhead still occurs due to cache misses. Moreover, different execution paths of the kernel can complicate and limit the coalescing of the data. To overcome this, we privatized the majority of the matrices by either replacing them with scalar variables or by using stack-allocated arrays, allowing simplified temporary array indexing code.
The last of the memory optimizations is to employ prefetching of data. Prefetching may lead to unpredictable behavior, especially in massively parallel architectures such as GPUs. There are four possible prefetch instructions that can be applied to the chemical kinetics. Microbenchmarks showed better results when using the prefetch.global.L1 and prefetch.local.L1 inline Parallel Thread Execution (PTX) assembly instructions for fetching the data into the L1 caches.
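The global-memory variant of the inline PTX prefetch can be wrapped in a small device helper like the following (a sketch; the helper name is illustrative, and the pointer must reside in the global state space for this variant):

```cuda
// Issue a prefetch of the cache line containing ptr into the L1 cache.
// The .local variant used in the paper is analogous, with a pointer in
// the thread's local (stack) state space.
__device__ __forceinline__ void prefetch_global_l1(const void *ptr)
{
    asm volatile("prefetch.global.L1 [%0];" :: "l"(ptr));
}
```

Because prefetch is only a hint, a mispredicted access pattern wastes bandwidth rather than causing a fault, which is why microbenchmarking each variant, as described above, is the practical way to choose among them.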

Control code simplification
The GPU cores are relatively simple units, and their computational efficiency depends on the absence of control code. To increase the instruction-level parallelism and avoid possible branch divergence, the implementation adopts three commonly used techniques: (i) lookup tables, (ii) loop unrolling, and (iii) branch elimination. The lookup tables are used for selecting the solver and setting the proper values in specific variables. The benefits of loop unrolling are most pronounced in the preparation of the Rosenbrock (ROS) solver when using the sparse matrix. Finally, limited branch elimination by fusing loops or merging branches also improves the execution time.

Decreasing the GPU idle time
The biggest challenge when using accelerators in hybrid HPC architectures is the imbalance created by the uneven workload of tasks. While the application uses GPUs as accelerators, only the CPU cores responsible for communication and handling of the GPUs are active, leaving the remaining cores idle. Allocating more processes on the unused CPU cores creates more imbalance between the accelerated processes and the CPU-only processes. To address this, the goal is to increase the GPU utilization by assigning more than one process per GPU.
There are two ways to do this: (i) over-subscription and (ii) using the Multi-Process Service (MPS) (Nvidia, 2015a). The first way is to allow more than two CPU cores (MPI processes) per node to offload work to the accelerators. The downside of this approach is that only one process can access the GPU at a time (exclusive access). Although there will be some benefit compared to the case of using a single process, it is possible that some of the tasks will underutilize the available hardware. The number of GPUs per node and the VRAM available on each GPU dictate the total number of CPU cores that can run simultaneously. An alternative approach is to use MPS, which allows concurrent execution of kernels and memory transfers from different processes on the same node. The latter requires a GPU accelerator with compute capability 3.5 or higher.
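For reference, a typical MPS setup on a node looks like the following configuration fragment (directory paths are illustrative; the control daemon and environment variables are standard parts of the CUDA MPS facility):

```shell
# Start the MPS control daemon before launching the MPI job
# (requires a GPU with compute capability >= 3.5).
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d

# ... run the MPI job; kernels and copies from different ranks on this
#     node now share one GPU context and can overlap ...

# Shut the daemon down after the job completes
echo quit | nvidia-cuda-mps-control
```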

Future GPU kernel optimizations
The execution time of individual chemical kinetics tasks depends on the input data of each grid cell. Different task execution times create an imbalance between tasks running on the same GPU. A promising approach to address this challenge is the use of dynamic parallelism: the ability of each GPU thread to call other kernels and spawn more GPU threads to increase the available concurrency.

Results and evaluation
This section presents (i) the total impact on simulation time and (ii) the accuracy and correctness of the accelerated model. Three different hardware and software environments were used to measure the impact of the code acceleration (Table 1):

- The first was an iDataPlex dx360 M3 compute node that contains two Intel® Xeon® X5650 six-core processors running at 2.6 GHz coupled with two Tesla M-series M2070 GPU accelerators (Fermi architecture). The application is compiled using the Intel compiler (ifort ver. 14.0.2) for improved native performance.
- The second was a JURECA (Jülich Research on Exascale Cluster Architectures) compute node that contains two Intel® Xeon® 12-core E5-2680 v3 processors running at 2.5 GHz coupled with two Tesla K80 (Kepler architecture) accelerators. The application is compiled using the Intel compiler (ifort ver. 16.0.4) for improved native performance. The runtime environment during execution includes --cpus-per-task=2 to schedule the execution between actual CPU threads and cores.
- The third was an IBM® S822LC compute node equipped with two 10-core 2.92 GHz POWER8 processors (Fluhr et al., 2014; Stuecheli, 2013) with turbo up to 4 GHz. Simultaneous multithreading is set to 4 for optimum performance in HPC applications. The application is compiled using the IBM compiler (xlf ver. 15.1.5). The execution of the mpirun command includes --map-by L2cache --bind-to core:overload-allowed to reduce contention of the cache and functional units.
To test the model scalability within a compute node, the evaluation uses a representative benchmark simulation with a horizontal resolution of 128 grid points in the longitudinal direction and 64 grid points in the latitudinal direction, with 90 vertical levels (T42L90MA). The Ros3 ordinary differential equation solver, a three-stage, L-stable pair of order 3(2) (Sandu et al., 1997), is used as the chemistry batch option, along with model namelist setup NML_SETUP=E5M2/02b_qctm, without dynamics nudging and with the diagnostic submodels D14CO, DRADON, S4D, and VISO switched off. The simulation runs with the SCAV (scavenging) submodel disabled to reduce deviations for soluble species whose aqueous-phase chemistry is solved with KPP1 by the submodel SCAV.

Application performance
This section compares the performance of the CPU-only and the GPU-accelerated versions of the application. The evaluation uses only one node, to avoid any MPI communication interference in the results, and limits the period of simulated time to 24 h. The execution time does not include the initialization of the application and the file I/O operations.
The accelerated version of the kernel achieves an order of magnitude performance improvement over the CPU-only version, as shown in Table 3.We note that the theoretical performance difference between the CPU core and the accelerator is larger in the P100 platform.The performance improvement of the kernel can provide an indication of the expected application production performance improvement.
Table 4 presents the execution time in seconds and the gain in performance between the GPU-accelerated and CPU-only versions for the two platforms. The node-to-node comparison allows us to compare the accelerated version with the best performance achieved using the CPU-only version with all available cores. The performance gain of the application differs between the available platforms. The M2070 accelerator contains only 6 GB of memory, limiting the number of processes that can be offloaded in parallel to the accelerator. Moreover, the M2070 accelerator does not support MPS (Nvidia, 2015a). Thus, in this experiment, we use the CPU to run the remaining processes instead of the accelerators. Despite the significant 2.27× performance gain over execution with two processes, the performance gain over using the entire node is limited. The accelerated processes complete each time step before the CPU-only processes, causing severe load imbalance between different MPI processes. Thus, the benefit of having additional accelerators is not reflected in the total attainable performance, and an advanced scheduling approach is required between the CPU cores and accelerators.
On the K80 and P100 platforms, the application uses MPS to interact with the accelerators using more than two MPI processes. On the POWER8 platform, the best CPU-only performance is achieved using over-subscription of the CPUs, with two MPI processes assigned per physical core. The accelerated version achieves a 1.75× performance gain over the best-performing CPU-only configuration, clearly showing the benefit of accelerating the chemical kinetics on GPUs. The accelerated version with 40 MPI processes achieves lower performance than with 20 MPI processes (not shown).

Numerical and model accuracy
The performance gain in ODE systems with varying rate coefficients is greater due to the recalculation of the coefficients inside the solver. The portion of accelerated workload increases in the case of a non-homogeneous system, resulting in a greater performance gain when GPU accelerators are used, due to the massively parallel architecture. Table 5 shows the results for 24 h of simulated time using the M2070 and K80 accelerators. The achieved performance gain increases from 1.19× to 1.56× when recalculating RCONST at each substep. The changes required for the automatic recalculation of RCONST at each substep are included under the dev branch of the source-to-source parser.
To validate the correctness and evaluate the accuracy of the accelerated implementation, we calculate the numerical and modeling difference over 2 years of simulation time. The expected level of relative accuracy (relative numerical difference) for the chemical kinetics calculation is 0.1 %, i.e., about three accurate digits (Zhang et al., 2011). To calculate the relative difference, we compare the output of chemical species concentrations between the CPU-only and GPU-accelerated versions after the first time step. The results show a median difference of less than 0.000000001 %, with the maximum difference depending on the number of iterations for solving the equations. This is well within the accuracy criterion, asserting the numerical correctness of the GPU kernel.
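The accuracy check can be expressed as a short sketch (hypothetical helper names; `cpu` and `gpu` stand for the concentration outputs of the two versions, and the relative difference is taken over the mean, as described for Fig. 5):

```python
def relative_difference(cpu, gpu):
    """Pairwise relative difference in percent, guarding near-zero values."""
    diffs = []
    for a, b in zip(cpu, gpu):
        mean = 0.5 * (a + b)
        diffs.append(0.0 if mean == 0.0 else abs(a - b) / mean * 100.0)
    return diffs

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

# Tiny illustration: two near-identical concentration vectors
d = relative_difference([1.0, 2.0, 4.0], [1.0, 2.0000002, 4.0])
```

Reporting the median rather than the maximum separates the systematic architecture/compiler difference from the few cells where the integrator takes a different number of steps.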
The variance in floating point results between different architectures is well known and observed in scientific applications that require high-precision floating point operations (Corden and Kreitzer, 2012; Langlois et al., 2015). The ifort compiler stores intermediate results with an extended precision that provides greater precision than the basic 64-bit floating point format. The GPU accelerators, on the other hand, do not support extended precision values, reducing the floating point accuracy of the results. Furthermore, the REAL(16) declaration in FORTRAN programs is implemented in software on the POWER8 architecture. Despite the small differences in the results, the model remains stable when running over a 2-year simulation time period.
To examine the impact of the inherent data difference on model accuracy, we compare the aggregated mass of chemical species between the CPU-only and GPU-accelerated versions over 2 years of simulation time. Figure 3 presents the distributions of the aggregated mass difference of chemical species between the accelerated and non-accelerated versions. For simplicity, we include 19 important chemical species. The results show that the median difference in aggregated mass is less than 5 % for 18 out of 19 species and higher for HCOOH. The differences are well within the expected margin stemming from architecture and compiler implementations (not specific to GPUs). The impact of scavenging amplifies the minor errors created by the different architectures. Figure 4 presents the relative difference after a 1-year simulation with the SCAV module enabled. In this case, median values fall within the 5 % limits, with a wider range of extreme values compared to the simulation with the SCAV module disabled. The largest deviations from zero occur for soluble species whose aqueous-phase chemistry is very important and is solved with KPP1 by the SCAV submodel. The ODE system for aqueous-phase chemistry is notoriously much stiffer than that for the gas phase.
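The per-species aggregated-mass comparison amounts to the following sketch. The species names are taken from the text, but the mass values and deviations are toy numbers chosen only to mimic the reported pattern (most species under the 5 % margin, HCOOH above it); they are not results from the simulations.

```python
def aggregated_mass_diff_percent(mass_cpu, mass_gpu):
    # Percent difference in globally aggregated mass per species.
    # Inputs are dicts mapping species name -> total mass (arbitrary units).
    return {s: 100.0 * abs(mass_cpu[s] - mass_gpu[s]) / mass_cpu[s]
            for s in mass_cpu}

# Illustrative (toy) aggregated masses for three of the 19 species.
masses_cpu = {"O3": 3.3e12, "SO2": 1.2e9, "HCOOH": 4.5e8}
masses_gpu = {"O3": 3.3e12 * 1.001,   # 0.1 % deviation
              "SO2": 1.2e9 * 0.98,    # 2 % deviation
              "HCOOH": 4.5e8 * 1.08}  # 8 % deviation, above the margin
diffs = aggregated_mass_diff_percent(masses_cpu, masses_gpu)
```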
Figure 5 compares the output of the CPU and GPU simulations and their relative difference for the zonal mean O3 and the surface concentrations of SO2, NH3, and OH. The relative difference is calculated as the difference over the mean of the two simulations. The largest relative differences appear in areas with the lowest (close to zero) concentrations of chemical species. The results demonstrate the correctness of the accelerated code and its numerical accuracy.
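The "difference over the mean" metric can be sketched as below; the epsilon guard and the toy field values are assumptions for illustration. The example also shows why near-zero concentrations dominate the relative-difference maps: a tiny absolute discrepancy divided by a tiny mean yields a large relative value.

```python
import numpy as np

def rel_diff_over_mean(a, b, eps=1.0e-30):
    # Relative difference normalised by the mean of the two fields;
    # eps avoids 0/0 where both simulations report zero concentration.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a - b) / (0.5 * (a + b) + eps)

# Toy grid points: one with a large concentration, one near zero.
cpu_field = np.array([1.0, 1.0e-12])
gpu_field = np.array([1.0 + 1.0e-6, 1.1e-12])
rd = np.abs(rel_diff_over_mean(cpu_field, gpu_field))
```

The near-zero point shows a relative difference orders of magnitude larger than the well-populated one, even though its absolute discrepancy is far smaller.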

Conclusions
In this paper, we presented the acceleration of the KPP chemical kinetics solver in the EMAC chemistry-climate model, using a source-to-source parser to transform the solver into CUDA-accelerated code. The parser supports all Rosenbrock-family numerical solvers (Ros2, Ros3, Ros4, Rodas3, Rodas4) available in the KPP numerical library. A performance evaluation using two CUDA-enabled accelerators shows an achieved speed-up of up to 20.4× in kernel execution time and up to 1.75× in node-to-node application performance. A comparison of the aggregated global mass of 19 chemical species between the CPU-only and GPU-accelerated versions of the application was performed to verify the correctness of the transformation. The numerical accuracy was also ascertained, with the relative difference found to be less than 5 % when comparing the output of the accelerated kernel to that of the CPU-only code for the 19 chemical species after 2 years of simulation time.
The approach followed, including the computational workload division, and the developed GPU solver code can potentially serve as the basis for hardware acceleration of the numerous geoscientific models that rely on KPP for atmospheric chemical kinetics. This work is a marked step forward toward increasing the resolution of climate simulations that include chemical kinetics and expanding the use of GPU accelerators in Earth system modeling.
Code availability. A consortium of institutions continuously develops the Modular Earth Submodel System (MESSy). The usage of MESSy and access to the source code are licensed to all affiliates of institutions which are members of the MESSy Consortium. Institutions can become a member of the MESSy Consortium by signing the MESSy Memorandum of Understanding. More information can be found on the MESSy Consortium website (http://www.messy-interface.org).
The FORTRAN-to-CUDA code-to-code compiler is developed in CUDA and Python and is included in the EMAC model development release. In addition, the source code is included in a public repository under an open license to allow the developer community to contribute (Alvanos and Christoudias, 2017b; The Cyprus Institute, 2016). It parses the auto-generated MECCA KPP solver code and produces a CUDA library that can be linked to and called directly from within the MESSy FORTRAN code at compilation time. In addition, the source code contains a test script to validate the correctness of the transformation and evaluate the performance using the five available Rosenbrock solvers. When executed, the test script (i) checks that the parser executes without errors, (ii) checks that the produced file compiles without errors, (iii) executes the application to detect runtime errors and memory violations, and (iv) compares the output of the accelerated version with that of the serial FORTRAN version for possible differences.
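The final validation step (iv) reduces to a tolerance comparison between the two outputs. A minimal Python sketch is shown below; the helper name is hypothetical, and the tolerance of 1e-3 is an assumption motivated by the 0.1 % accuracy criterion quoted earlier, not the test script's actual setting.

```python
import numpy as np

def outputs_match(serial, accelerated, rel_tol=1.0e-3):
    # Step (iv) of the validation sketch: compare the accelerated output
    # against the serial FORTRAN reference, element by element, within a
    # relative tolerance. The denominator guard handles zero reference
    # concentrations.
    serial = np.asarray(serial, dtype=float)
    accelerated = np.asarray(accelerated, dtype=float)
    denom = np.maximum(np.abs(serial), 1.0e-30)
    return bool(np.all(np.abs(serial - accelerated) / denom <= rel_tol))

ok = outputs_match([1.0, 2.0], [1.0005, 2.001])   # within 0.1 %
bad = outputs_match([1.0], [1.01])                # 1 % deviation, rejected
```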

Figure 1. Number of integration steps for column MECCA kernel execution (a) and intra-column maximum difference in execution steps (b). The adaptive time-step integrator shows a non-uniform runtime caused by stratospheric photochemistry and natural and anthropogenic emissions.

Figure 2. Flow chart of the tasks' offload execution on GPU accelerators.

Figure 3. Aggregated mass difference between the accelerated and non-accelerated versions in the final month after 2 years of simulation time for different chemical species with the SCAV (scavenging) submodel disabled.

Figure 4. Aggregated mass difference between the accelerated and non-accelerated versions in the final month after 1 year of simulation time with the SCAV submodel enabled.

Figure 5. Output of CPU-only (a, d, g, j), GPU-accelerated (b, e, h, k) 2-year simulations, and their relative difference (c, f, i, l) for different chemical species: O3 zonal mean (a-c) and surface level OH, NH3, and SO2 concentrations (d-f, g-i, and j-l, respectively).

Table 1. Hardware configurations used for performance evaluation.

Table 2. Experimental configuration of the simulation.

Table 3. Median execution time and achieved speed-up for the non-accelerated (CPU-only) code on a single core and the accelerated version of the kernel for the three platforms. The extracted execution timing results are for the time step after the initialization of the simulation.

Table 4. Application execution time and achieved speed-up of the three node configurations for 24 h of simulated time.

Table 5. Application execution time in seconds and achieved speed-up with/without recalculation of RCONST at each KPP substep.

Table B2. Results of running the kernel on a K80 CUDA-enabled accelerator hosted in a node equipped with a Xeon E5-2680 v3 processor at 2.5 GHz.