Lagrangian models are fundamental tools to study atmospheric transport processes and for practical applications such as dispersion modeling for anthropogenic and natural emission sources. However, large-scale Lagrangian transport simulations with millions of air parcels or more can become numerically costly. In this study, we assessed the potential of exploiting graphics processing units (GPUs) to accelerate Lagrangian transport simulations. We ported the Massive-Parallel Trajectory Calculations (MPTRAC) model to GPUs using the open accelerator (OpenACC) programming model. The trajectory calculations conducted within the MPTRAC model were fully ported to GPUs, i.e., except for feeding in the meteorological input data and for extracting the particle output data, the code operates entirely on the GPU devices without frequent data transfers between CPU and GPU memory. Model verification, performance analyses, and scaling tests of the Message Passing Interface (MPI) – Open Multi-Processing (OpenMP) – OpenACC hybrid parallelization of MPTRAC were conducted on the Jülich Wizard for European Leadership Science (JUWELS) Booster supercomputer operated by the Jülich Supercomputing Centre, Germany. The JUWELS Booster comprises 3744 NVIDIA A100 Tensor Core GPUs, providing a peak performance of 71.0 PFlop s⁻¹.
Lagrangian transport models are frequently applied to study chemical and dynamical processes of the Earth's atmosphere. They have important practical applications in modeling and assessing the dispersion of anthropogenic and natural emissions from local to global scale, for instance, for air pollution
A wide range of Lagrangian transport models has been developed for research studies and operational applications during the past decades
In this study, the Massive-Parallel Trajectory Calculations (MPTRAC) model is applied to exploit the potential of conducting Lagrangian transport simulations on graphics processing units (GPUs). MPTRAC was first described by
The idea of using specialized computation units for scientific computing goes back to the early 1980s, when co-processors like Intel's 8087 and 8231 were introduced. More than 20 years have passed since graphics processing units (GPUs) were first leveraged for non-graphical, general-purpose calculations
Large-scale and long-term Lagrangian transport simulations for climate studies or inverse modeling applications can become very compute-intensive
GPUs bear the potential not only to calculate the solutions of Lagrangian transport problems more quickly but also to obtain them in a much more energy-efficient manner. For this study, we ported our existing Lagrangian transport model MPTRAC to GPUs by means of the open accelerator (OpenACC) programming model. Next to offloading calculations to GPUs, the code is also capable of distributing computing tasks via the Message Passing Interface (MPI) over the compute nodes and via Open Multi-Processing (OpenMP) over the CPU cores of a heterogeneous supercomputer. A detailed evaluation and performance assessment of MPTRAC on GPUs was conducted on the Jülich Wizard for European Leadership Science (JUWELS) system
Lagrangian transport simulations are often driven by global meteorological reanalyses or forecast data sets. The MPTRAC model has been used with the National Centers for Environmental Prediction and National Center for Atmospheric Research (NCEP/NCAR) reanalysis 1
We provide a comprehensive description of the MPTRAC model in Sect.
Figure
Call graph of the most relevant functions of the MPTRAC model. The individual functions are sorted following the IPO approach; see text for details. Black boxes highlight parts of the code that are ported to GPUs.
Three main input functions are available, i.e.,
The processing functions of MPTRAC provide the capabilities to calculate kinematic trajectories of the particles using given
Finally, the model output is directed via a generic function (
All the processing functions (
In order to enable trajectory calculations, MPTRAC requires an input file providing the initial positions of the air parcels. The initial positions are defined in terms of time, log-pressure height, longitude, and latitude. Internally, MPTRAC applies pressure as its vertical coordinate. For convenience, the initial pressure
Internally, MPTRAC uses the time in seconds since 1 January 2000, 00:00 UTC as its time coordinate. Tools are provided along with the model to convert the internal time from and to UTC time format. The internal time step
Unless the initial positions of the trajectories are derived from measurements, a set of tools provided with MPTRAC can be used to create initial air parcel files. The tool
A tool to modify air parcel positions is
Next to the particle positions, the other main input data required by MPTRAC are meteorological data from a global forecasting or reanalysis system. At a minimum, MPTRAC requires 3-D fields of temperature
The model requires the input data in terms of separate NetCDF files at regular time steps. The meteorological data have to be provided on a regular horizontal latitude–longitude grid.
Once the mandatory and optional meteorological input variables have been read in from disk, MPTRAC provides the capability to calculate additional meteorological variables from the input data. This includes the options to calculate geopotential heights, potential vorticity, the tropopause height, cloud properties such as cloud layer depth and total column cloud water, the convective available potential energy (CAPE), and the planetary boundary layer (PBL). Having the option to calculate these additional meteorological data directly in the model helps to reduce the disk space needed to save them as input data. This is particularly relevant for large input data sets such as ERA5. It is also an advantage in terms of consistency if the same algorithms can be applied to infer the additional meteorological variables from different meteorological data sets. The algorithms and selected examples on the meteorological data preprocessing are presented in the Supplement of this paper.
The upper boundary of the model is defined by the lowest pressure level of the meteorological input data. The lower boundary of the model is defined by the 2-D surface pressure field
We recognize that pressure might not be the best choice for the vertical coordinate of a Lagrangian model, in particular for the boundary layer, because a broad range of surface pressure variations needs to be represented appropriately by a fixed set of pressure levels. However, as the scope of MPTRAC is on applications covering the free troposphere and the stratosphere, we consider the limitation of having reduced vertical resolution in terms of pressure levels near the surface acceptable for now. The next release of MPTRAC will allow the user to choose between pressure and the isentropic-sigma hybrid coordinate (
For global meteorological data sets, MPTRAC applies periodic boundary conditions in longitude. During the trajectory calculations, air parcels may sometimes cross either the pole or the longitudinal boundary of the meteorological data. If an air parcel crosses the North or South Pole, its longitude will be shifted by 180°.
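As a minimal sketch, this boundary treatment for a single parcel position may look as follows; the function is illustrative and not the exact MPTRAC implementation.

```c
/* Illustrative boundary treatment for a single parcel position (degrees);
   a minimal sketch, not the exact MPTRAC implementation. */
void wrap_position(double *lon, double *lat) {

  /* pole crossing: reflect the latitude and shift the longitude by 180 deg */
  if (*lat > 90.0) {
    *lat = 180.0 - *lat;
    *lon += 180.0;
  } else if (*lat < -90.0) {
    *lat = -180.0 - *lat;
    *lon += 180.0;
  }

  /* periodic boundary conditions in longitude */
  while (*lon < -180.0)
    *lon += 360.0;
  while (*lon >= 180.0)
    *lon -= 360.0;
}
```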
In MPTRAC, 4-D linear interpolation is applied to sample the meteorological data at any given position and time. Although higher-order interpolation schemes may achieve better accuracy, linear interpolation is considered a standard choice in Lagrangian particle dispersion models
Following the memory layout of the data structures used for the meteorological input data in our code and to make efficient use of memory caches, the linear interpolations are conducted first in pressure, followed by latitude and longitude, and finally in time. Interpolations can be conducted for a single variable or all the meteorological data at once. Two separate index look-up functions are implemented for regularly gridded data (longitude and latitude) and for irregularly gridded data (pressure). Interpolation weights are kept in caches to improve efficiency of the calculations.
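For illustration, the index look-up on the irregularly spaced pressure levels can be implemented as a binary search, followed by a 1-D linear interpolation. This is a sketch assuming monotonically decreasing pressure levels; the function names are hypothetical.

```c
/* Locate index ip such that p[ip] >= pt >= p[ip+1] on a monotonically
   decreasing pressure grid with np levels (binary search, O(log np)). */
static int locate_irr(const double *p, int np, double pt) {
  int ilo = 0, ihi = np - 1;
  while (ihi - ilo > 1) {
    int i = (ilo + ihi) / 2;
    if (pt >= p[i])
      ihi = i;
    else
      ilo = i;
  }
  return ilo;
}

/* 1-D linear interpolation of the variable a to pressure pt. */
static double intpol_1d(const double *p, const double *a, int np, double pt) {
  int ip = locate_irr(p, np, pt);
  double w = (p[ip] - pt) / (p[ip] - p[ip + 1]);   /* interpolation weight */
  return (1.0 - w) * a[ip] + w * a[ip + 1];
}
```

For the regularly gridded longitudes and latitudes, the index follows directly from the grid spacing without a search, which is why the two look-up functions are kept separate.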
For various studies, it is interesting to investigate the sensitivity of the results with respect to the spatial and temporal resolution of the meteorological input data. For instance,
Note that downsampling has not been implemented rigorously for the time domain. Nevertheless, the user can specify the time interval at which the meteorological input data should be ingested. If filtering of the downsampled data is required, downsampling in time needs to be done externally by the user by applying tools such as the Climate Data Operators (CDO)
The advection of an air parcel, i.e., the position
Calculations of the numerical solution of the trajectory equation require coordinate transformations between Cartesian coordinate distances
The coordinate transformation between
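A sketch of one step of the explicit midpoint method, including the transformation of the wind components from m s⁻¹ to changes in longitude and latitude in degrees, is given below. The interpolation routine intpol_uvw() and the Earth radius value are assumptions made for this illustration.

```c
#include <math.h>

#define RE 6371.0e3               /* Earth radius [m] (assumed value) */
#define DEG2RAD (M_PI / 180.0)

/* Hypothetical interpolation routine: samples u, v [m/s] and the vertical
   velocity omega [Pa/s] from the meteorological data at time t and
   position (lon, lat, p). */
void intpol_uvw(double t, double lon, double lat, double p,
                double *u, double *v, double *w);

/* One explicit midpoint step of length dt [s] for a single air parcel. */
void advect(double *lon, double *lat, double *p, double t, double dt) {
  double u, v, w;

  /* first stage: winds at the initial position */
  intpol_uvw(t, *lon, *lat, *p, &u, &v, &w);
  double lonm = *lon + 0.5 * dt * u / (RE * cos(*lat * DEG2RAD)) / DEG2RAD;
  double latm = *lat + 0.5 * dt * v / RE / DEG2RAD;
  double pm = *p + 0.5 * dt * w;

  /* second stage: winds at the midpoint position and time */
  intpol_uvw(t + 0.5 * dt, lonm, latm, pm, &u, &v, &w);
  *lon += dt * u / (RE * cos(latm * DEG2RAD)) / DEG2RAD;
  *lat += dt * v / RE / DEG2RAD;
  *p += dt * w;
}
```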
As an example, Fig.
Trajectory calculations for the Northern Hemisphere subtropical jet (case A) and the Southern Hemisphere polar jet (case B) from 1 January 2017, 00:00 UTC, to 9 January 2017, 00:00 UTC. Trajectories were launched at positions A and B at log-pressure altitudes of 11.25 and 8.5 km, respectively. Calculations used hourly ERA5 horizontal winds and vertical velocity data for input. Black curves show trajectories calculated without diffusion. Black dots indicate 24 h time intervals. Colored dots show calculations for two sets of 1000 trajectories for cases A and B with turbulent diffusion and subgrid-scale wind fluctuations being considered.
Rather complex parametrizations of atmospheric diffusivity are available for the planetary boundary layer
In addition to turbulent diffusion, the effects of unresolved subgrid-scale winds, also referred to as mesoscale wind perturbations, are considered. The starting point for this approach is the separation of the horizontal wind and vertical velocity vector
Figure
The contour surfaces show the zonal mean standard deviations of
Figure
Absolute horizontal transport deviations (AHTDs,
Note that several studies with Lagrangian models provided estimates of atmospheric diffusivities
The spatial resolution of global meteorological input data is often too coarse to allow for explicit representation of convective up- and downdrafts. Although the downdrafts may occur on larger horizontal scales, the updrafts are usually confined to horizontal scales below a few kilometers. Furthermore, convection occurs on timescales of a few hours; i.e., hourly ERA5 data may better represent convection than 6-hourly ERA-Interim data. A parametrization to better represent unresolved convective up- and downdrafts in global simulations was implemented in MPTRAC, which is similar to the convection parametrization implemented in the Hybrid Single-Particle Lagrangian Integrated Trajectory (HYSPLIT) and Stochastic Time-Inverted Lagrangian Transport (STILT) models
The convection parametrization requires as input the convective available potential energy (CAPE) and the equilibrium level from the meteorological data. These data need to be interpolated to the horizontal position of each air parcel. If the interpolated CAPE value is larger than a threshold CAPE
The globally applied threshold CAPE
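The core of the parametrization, as described above, can be sketched as follows. This minimal illustration assumes a uniform redistribution of the parcel pressure between the surface and the equilibrium level; rand_uniform() is a placeholder for the model's random number generators (GSL on CPUs, cuRAND on GPUs).

```c
/* Placeholder RNG returning a uniform deviate in [0,1). */
double rand_uniform(void);

/* Extreme convection sketch: if the CAPE interpolated to the parcel
   position exceeds the threshold cape0 and the parcel is below the
   equilibrium level (pressure pel), mix it vertically by drawing a new
   pressure between the surface pressure ps and pel (assumed scheme). */
void convect(double *p, double cape, double cape0, double pel, double ps) {
  if (cape >= cape0 && *p > pel)
    *p = pel + rand_uniform() * (ps - pel);
}
```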
The extreme convection parametrization implemented here is arguably a rather simple approach, as it relies only on CAPE and the equilibrium level from the meteorological input data. Nevertheless, first tests showed that this parametrization significantly improves transport patterns in the free troposphere. We also conducted experiments to further improve the parametrization by considering additional parameters, such as the convective inhibition (CIN), to better constrain the onset of convection. More sophisticated convection parametrizations have been developed and implemented in Lagrangian transport models during recent years
In order to take into account the gravitational settling of particles, the sedimentation velocity
Figure
Sedimentation velocities of spherical particles in the Stokes–Cunningham regime for different particle sizes, atmospheric pressure levels, and midlatitude mean temperatures.
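For illustration, the settling velocity in the Stokes–Cunningham regime can be computed as sketched below. The viscosity and mean-free-path formulas and the slip-correction coefficients are common textbook choices and not necessarily identical to those used in MPTRAC.

```c
#include <math.h>

#define G0 9.80665                /* standard gravity [m/s2] */

/* Sketch of Stokes settling with Cunningham slip correction.
   p [Pa], T [K], rp particle radius [m], rhop particle density [kg/m3]. */
double sedi(double p, double T, double rp, double rhop) {

  /* dynamic viscosity of air (Sutherland's law) [kg/(m s)] */
  double eta = 1.458e-6 * pow(T, 1.5) / (T + 110.4);

  /* mean free path of air molecules [m] */
  double lambda = 6.6e-8 * (101325.0 / p) * (T / 293.15);

  /* Knudsen number and Cunningham slip-flow correction */
  double Kn = lambda / rp;
  double Cc = 1.0 + Kn * (1.257 + 0.4 * exp(-1.1 / Kn));

  /* terminal settling velocity [m/s] */
  return 2.0 * rp * rp * rhop * G0 * Cc / (9.0 * eta);
}
```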
Wet deposition causes the removal of trace gases and aerosol particles from the atmosphere within or below clouds by mixing with suspended water and subsequent washout through rain, snow, or fog. In the first step, it is determined whether an air parcel is located below a cloud top. The cloud-top pressure
In the second step, the wet deposition parametrization determines an estimate of the subgrid-scale precipitation rate
In the third step, it is inferred whether the air parcel is located within or below the cloud because scavenging coefficients will be different under these conditions. The position of the air parcel within or below the cloud is determined by interpolating the cloud water content to the position of the air parcel and by testing whether the interpolated values are larger than zero.
In the fourth step, the scavenging coefficient
Henry's law constants for water as a solvent. See
Finally, once the scavenging coefficient
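Reduced to its final step, washout is an exponential decay of the particle mass. A minimal sketch, assuming a power-law dependence of the scavenging coefficient on the precipitation rate, with a and b as placeholders for the species- and in-/below-cloud-dependent coefficients:

```c
#include <math.h>

/* Wet-deposition sketch: scavenging coefficient lambda [1/s] from a
   power law of the precipitation rate Ip [mm/h], followed by an
   exponential loss of the parcel mass over the time step dt [s]. */
void wet_depo(double *mass, double Ip, double a, double b, double dt) {
  double lambda = a * pow(Ip, b);
  *mass *= exp(-lambda * dt);
}
```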
Dry deposition leads to a loss of mass of aerosol particles or trace gases by gravitational settling or chemical and physical interactions with the surface in the dry phase. In the parametrization implemented in MPTRAC, dry deposition is calculated for air parcels located in the lowermost
For aerosol particles, the deposition velocity
For both particles and gases, the loss of mass is calculated based on the deposition velocity
In this section, we discuss the MPTRAC module that is used to simulate the loss of mass of a chemical species by means of reaction with the hydroxyl radical. The hydroxyl radical (OH) is an important oxidant in the atmosphere, causing the decomposition of many gas-phase species. The oxidation of different gas-phase species with OH can be classified into two main categories, bimolecular reactions (e.g., reactions of CH
For bimolecular reactions, the rate constant is calculated from the Arrhenius law.
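In its common form, with pre-exponential factor $A$ and activation temperature $E_\mathrm{a}/R$ (we assume this standard form here; the fitted coefficients for the relevant reactions are listed in the table below), the Arrhenius law reads

\[ k(T) = A \, \exp\!\left( -\frac{E_\mathrm{a}}{R\,T} \right). \]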
Bimolecular reaction rate coefficients for the hydroxyl radical. See
Termolecular reactions require an inert component
Termolecular reaction rate coefficients for the hydroxyl radical. See
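For completeness, a widely used falloff expression for termolecular rate constants follows the JPL recommendations; we assume here that MPTRAC adopts this or a similar form, given the JPL-style coefficients in the table above. It combines the low- and high-pressure limits $k_0(T)$ and $k_\infty(T)$ with the air number density $[\mathrm{M}]$:

\[ k([\mathrm{M}],T) = \frac{k_0(T)\,[\mathrm{M}]}{1 + k_0(T)\,[\mathrm{M}]/k_\infty(T)} \; 0.6^{\left\{ 1 + \left[ \log_{10}\!\left( k_0(T)\,[\mathrm{M}]/k_\infty(T) \right) \right]^{2} \right\}^{-1}}. \]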
Based on the bimolecular reaction rate
A rather generic module was implemented in MPTRAC, to simulate the loss of mass of an air parcel over a model time step
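Assuming a constant e-folding lifetime $\tau$ given as a control parameter, the mass update over a model time step $\Delta t$ takes the familiar exponential form

\[ m(t + \Delta t) = m(t) \, \exp\!\left( -\frac{\Delta t}{\tau} \right). \]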
Finally, a module was implemented to impose constant boundary conditions on particle mass or volume mixing ratio. At each time step of the model, all particles that are located within a given latitude and pressure range can be assigned a constant mass or volume mixing ratio as defined by a control parameter of the model. Next to specifying fixed pressure ranges in the free troposphere, stratosphere, or mesosphere to define the boundary conditions, it is also possible to specify a fixed pressure range with respect to the surface pressure, in order to define a near-surface layer.
The boundary condition module can be used to conduct synthetic tracer simulations. For example, the synthetic tracer E90
Another example of a synthetic tracer is ST80, which is defined by a constant volume mixing ratio of 200 ppb above 80 hPa and a uniform, fixed 25 d
For diagnostic purposes, it is often necessary to obtain meteorological data along the trajectories. However, typically only a limited subset of the meteorological variables is of interest. Therefore, we introduced the concept of “quantities”, allowing the user to select which variables should be calculated and stored in the output data files. Next to selecting the desired types of model output, a list of the specific quantities to be provided as output needs to be specified via control parameters. In total, about 50 output variables are implemented in MPTRAC (Table
List of quantities defined in MPTRAC.
A dedicated module was implemented in MPTRAC to sample the meteorological data along the trajectories. This module can become demanding in terms of computing time, as it requires the interpolation of up to 10 3-D variables and 16 2-D variables of meteorological data that are either read from the meteorological input files or calculated during the preprocessing of the meteorological data. However, the computing time needs of this module are typically strongly reduced by the fact that meteorological data along the trajectories are usually not required at every model time step but only at user-defined output intervals.
Next to sampling the meteorological data along the trajectories, we also implemented tools that provide direct output of the meteorological data. This includes tools to extract maps, zonal mean cross sections, and vertical profiles on pressure or isentropic levels. These tools allow for time averaging over multiple meteorological data files. Another tool is available to sample the meteorological data based on a list of given times and positions, without involving any trajectory calculations. Altogether, these tools provide great flexibility in exploiting meteorological data in many applications.
At present, MPTRAC offers seven output options, referred to as “atmospheric output”, “grid output”, “CSI output”, “ensemble output”, “profile output”, “sample output”, and “station output”. The most comprehensive output of MPTRAC is the atmospheric output. Atmospheric output files can be generated at user-defined time intervals, which need to be integer multiples of the model time step
As the atmospheric output can easily become too large for further analyses, in particular if many air parcels are involved, the output of gridded data was implemented. This output will be generated by integrating over the mass of all parcels in regular longitude–latitude–altitude grid boxes.
Another type of output that we used in several studies
Another option to condense comprehensive particle data is provided by means of the ensemble output. This type of output requires a user-defined ensemble index value to be assigned to each air parcel. Instead of the individual air parcel data, the ensemble output will contain the mean positions as well as the means and standard deviations of the quantities selected for output for each set of air parcels having the same ensemble index. The ensemble output is of interest, for instance, if tracer dispersion from multiple point sources needs to be quantified by means of a single model run.
The profile output of MPTRAC is similar to the grid output as it creates vertical profiles from the model data on a regular longitude–latitude grid.
The sample output of MPTRAC was implemented most recently. It allows the user to extract model information at a list of given locations and times by calculating the column density and volume mixing ratio of all parcels located within a user-specified horizontal search radius and vertical height range. For large numbers of sampling locations and air parcels, this type of output can become rather time-consuming. It requires an efficient implementation and parallelization, because at each model time step it needs to be tested whether each air parcel is located within a sampling volume. The numerical effort scales linearly with both the number of air parcels and the number of sampling volumes. The sample output was first applied in the study of
Finally, the station output collects the data of air parcels that are located within a search radius around a given location (latitude, longitude). The vertical position is not considered here; i.e., the information of all air parcels within the vertical column over the station is collected. In order to avoid double counting of air parcels over multiple time steps, the quantity STAT (Table
By default, all output functions of MPTRAC create data files in an ASCII table format. This type of output is usually simple to understand and usable with many tools for data analysis and visualization. However, in the case of large-scale simulations, it is desirable to use more efficient file formats. Therefore, an option was implemented to write particle data to binary output files. Likewise, reading particle data from a binary file is much more efficient than from an ASCII file. Binary input and output provide an efficient way to save or restore the state of the model during intermediate steps of a workflow. Another output option is to pipe the data directly from the model to a visualization tool, which keeps the output data in memory and forwards it directly from MPTRAC to the visualization tool. This option has been successfully tested for the particle and the grid output in combination with the graphing utility
For diagnostic purposes, it is often helpful to calculate trajectory statistics. The tool
The tool
Next to AHTDs and AVTDs, the tool
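For reference, for two sets of $N$ trajectories with horizontal positions $(x_i, y_i)$ and $(\tilde{x}_i, \tilde{y}_i)$ (measured as great-circle distances) and vertical positions $z_i$ and $\tilde{z}_i$, the absolute transport deviations are commonly defined as (we assume MPTRAC follows this standard definition)

\[ \mathrm{AHTD}(t) = \frac{1}{N} \sum_{i=1}^{N} \sqrt{ \left[ x_i(t) - \tilde{x}_i(t) \right]^{2} + \left[ y_i(t) - \tilde{y}_i(t) \right]^{2} }, \qquad \mathrm{AVTD}(t) = \frac{1}{N} \sum_{i=1}^{N} \left| z_i(t) - \tilde{z}_i(t) \right|. \]

The relative deviations (RHTDs and RVTDs) normalize these measures, typically by the lengths of the trajectory paths.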
Lagrangian particle dispersion models are well suited for parallel computing as large sets of air parcel trajectories can be calculated independently of each other. The workload is “embarrassingly parallel” or “perfectly parallel” as little to no effort is needed to separate the problem into a number of independent parallel computing tasks. In this section, we discuss the MPI–OpenMP–OpenACC hybrid parallelization implemented in MPTRAC. The term “hybrid parallelization” refers to the fact that several parallelization techniques are employed in a simulation at the same time.
MPI is a communication protocol and standardized interface that enables point-to-point and collective communication between computing tasks. MPI provides high performance, scalability, and portability and is considered a leading approach for high-performance computing. The OpenMP application programming interface supports multi-platform shared-memory multiprocessing for various programming languages and computing architectures. It consists of a set of compiler directives, library routines, and environment variables that are used to distribute the workload over a set of computing threads on a compute node. An application built with an MPI–OpenMP hybrid approach can run on a compute cluster, such that OpenMP exploits the parallelism of all hardware threads within a multi-core node while MPI provides parallelism between the nodes. The open accelerator (OpenACC) standard is a programming model that enables parallel programming of heterogeneous CPU–GPU systems. As in OpenMP, the source code of the program is annotated with “pragmas” to identify the areas that should be accelerated by using GPUs. In combination with MPI and OpenMP, OpenACC can be used to conduct multi-GPU simulations.
The GPU parallelization is a new feature since the release of version 2.0 of MPTRAC. Various concepts for employing GPUs are available for scientific high-performance computing. The simplest option is to replace standard computing libraries with GPU-enabled versions, which are provided by the vendors of the GPU hardware, such as the NVIDIA company. A prominent example of such a library, which was ported to GPUs, is the Basic Linear Algebra Subprograms (BLAS) library for matrix–matrix, matrix–vector, and vector–vector operations. However, as MPTRAC does not employ the BLAS library, this simple parallelization technique was not considered helpful for porting our model to GPUs. At the other end of the options for GPU computing, the Compute Unified Device Architecture (CUDA) is a dedicated application programming model for GPUs. The CUDA programming model is often considered the most flexible and allows for the most detailed control over the GPU hardware. However, CUDA requires solid knowledge of GPU technology, and rewriting legacy code in CUDA can be error-prone and time-consuming
OpenACC helps to overcome the practical difficulties of CUDA and reduces the coding workload for developers significantly. We considered OpenACC the more practical choice for porting MPTRAC to GPUs, as the code of the model should remain understandable and maintainable by students and domain scientists who may not be familiar with the details of more complex GPU programming concepts such as CUDA. By using OpenACC, we were able to maintain the same code base of the MPTRAC model rather than having to develop a CPU and a GPU version of the model independently of each other. Besides ease of maintenance and portability of the code, the common code base is a significant advantage when checking for bit-level agreement of CPU and GPU computations and reproducibility of simulation results.
Algorithm 1 illustrates the GPU porting of the C code of MPTRAC by means of OpenACC pragmas. Two important aspects need to be considered. First, the pragma
The second aspect of GPU porting by means of OpenACC concerns data management. In principle, the data required for a calculation need to be copied from CPU to GPU memory before the calculation and copied back from GPU to CPU memory after the computation has finished. Although NVIDIA's “CUDA Unified Memory” technique ensures that these data transfers are performed automatically, frequent data transfers can easily become a bottleneck of the code. Therefore, we implemented additional pragmas to instruct the compiler when a data transfer is actually required.
The pragma
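A self-contained toy example of this pattern is sketched below; the array, its contents, and the loop body are illustrative placeholders rather than the actual MPTRAC data structures. A persistent device copy of the particle data is created once, the per-particle loop is offloaded to the GPU, and data are copied back to the host only when output is due.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {

  const int np = 1000000;
  double *lon = malloc((size_t) np * sizeof(double));
  for (int ip = 0; ip < np; ip++)
    lon[ip] = 0.0;

  /* create a persistent copy of the particle data in GPU memory */
#pragma acc enter data copyin(lon[0:np])

  for (int step = 0; step < 100; step++) {

    /* offload the per-particle loop; the data stay resident on the GPU */
#pragma acc parallel loop present(lon[0:np])
    for (int ip = 0; ip < np; ip++)
      lon[ip] += 0.01;          /* placeholder for the physics modules */

    /* copy particle data back to the host only when output is due */
    if (step % 10 == 0) {
#pragma acc update host(lon[0:np])
      printf("step %d: lon[0] = %g\n", step, lon[0]);
    }
  }

#pragma acc exit data delete(lon[0:np])
  free(lon);
  return 0;
}
```

Compiled without OpenACC support, the pragmas are ignored and the same code runs on the CPU, which reflects the single-code-base approach described above.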
In total, about 60 pragmas had to be implemented in the code to parallelize the computational loops and to handle the data management to facilitate the GPU porting of MPTRAC by means of OpenACC. Code parts that were ported to GPUs are highlighted in the call graph shown in Fig.
The model verification and performance analysis described in this section was conducted on the JUWELS system
The JUWELS Booster consists of 936 compute nodes, each equipped with four NVIDIA A100 Tensor Core GPUs. Each NVIDIA A100 GPU comprises 6912 INT32 and 6912 FP32 compute cores as well as 3456 FP64 compute cores, the latter being most relevant for us because most calculations in MPTRAC are conducted at double precision. It also features 432 tensor cores, providing
The JUWELS Booster GPUs are hosted by AMD EPYC Rome 7402 CPUs. Each compute node of the Booster comprises two CPU sockets, four non-uniform memory access (NUMA) domains per CPU, and six compute cores per NUMA domain, which provide two-way simultaneous multithreading. Up to 48 physical threads or 96 virtual threads can be executed on the CPUs of a compute node. The compute nodes are equipped with 512 GB DDR4-3200 RAM. They are connected by a Mellanox HDR200 InfiniBand ConnectX 6 network in a DragonFly+ topology. CPUs, GPUs, and network adapters are connected via two PCIe Gen 4 switches with 16 PCIe lanes going to each device. A 350 GB s
The software environment on JUWELS is based on the CentOS 8 Linux distribution. Compute jobs are managed by the Slurm batch system with ParTec's ParaStation resource management. The JUWELS Booster software stack comprises the CUDA-aware ParTec ParaStation MPI and the NVIDIA HPC Software Development Kit (SDK), as used in this work. In particular, we used the PGI C Compiler (PGCC) version 21.5, which more recently was rebranded and integrated as NVIDIA C compiler (NVC) in NVIDIA's HPC SDK, and the GNU C compiler (GCC) version 10.3.0 to compile the GPU and CPU code, respectively. We applied the compile flag for strong optimization (
In this section, we discuss comparisons of kinematic trajectory calculations conducted with the CPU and GPU versions of the MPTRAC model. For the CPU simulations, we considered binaries created with two different C compilers. The CPU code compiled with GCC is considered here as a reference, as this compiler has been used in most of the previous work and earlier studies with MPTRAC. The second version of the CPU binaries as well as the GPU binaries were compiled with PGCC, which is the recommended compiler for GPU applications on the JUWELS Booster.
We first focus on comparisons of kinematic trajectories that were calculated with horizontal wind and vertical velocity fields of the ERA5 reanalysis. The calculations were conducted without subgrid-scale wind fluctuations, turbulent diffusion, or convection, as these parametrizations rely on different random number generators implemented in the cuRAND and GSL libraries and utilize a different parallelization approach for random number generation for the CPU and GPU code, respectively. Individual trajectories calculated with these modules being activated are therefore not directly comparable to each other. They can only be compared in a statistical sense, as in Sect.
We globally distributed the trajectory seeds in the pressure range from the surface up to 0.1 hPa (about 0–64 km of log-pressure altitude). The starting time of the trajectories was 1 January 2017, 00:00 UTC, and the calculations cover 60 d of trajectory time. To evaluate the trajectories, we calculated AHTDs, AVTDs, RHTDs, and RVTDs at 6-hourly time intervals (see Sect.
The absolute transport deviations between the GPU and CPU versions of MPTRAC with the GCC and PGCC compilers are shown in Fig.
Absolute
Note that the CPU and GPU binaries created by PGCC (colored curves in Fig.
With our current code, we cannot directly compare individual trajectories including the effects of turbulent diffusion and subgrid-scale winds as well as convection because these modules add stochastic perturbations to the trajectories that are created by means of the different random number generators of the cuRAND library for the GPUs and the GSL library for the CPUs. In order to compare CPU and GPU simulations considering these modules, we conducted global transport simulations of synthetic tracers with ERA5 data. In these simulations, we included the tropospheric synthetic tracers (E90 and NH50) as well as the stratospheric synthetic tracer (ST80) (Sect.
As a representative example, Fig.
Monthly mean zonal means of the synthetic tracers ST80, E90, and NH50 in July 2017 from GPU calculations
Figure
Figure
Some parts of the MPTRAC code cannot be or have not been ported to GPUs. This includes the file input and output operations, which are directed via the hosting CPUs and the CPU memory, and the functions that are used for preprocessing of the meteorological data, such as the codes to calculate geopotential heights, potential vorticity, convective available potential energy, the planetary boundary layer, the tropopause height, and others. For those parts of the MPTRAC code that have not been ported to GPUs, it is important to study the OpenMP scalability on the host devices. In particular, good OpenMP scaling up to 12 threads on the JUWELS Booster needs to be demonstrated, as this is the number of physical cores associated with each GPU device on a JUWELS compute node.
Here, we focus on an OpenMP strong scaling test conducted by means of binaries compiled with the PGI compiler, as this is the recommended compiler for GPU applications on JUWELS Booster. For this test, the OpenMP static scheduling strategy was applied. The pinning of the threads to the CPU cores was optimized compared to the default settings of the system by using a fixed assignment of the threads to the sockets of the nodes (by setting the environment variables
The results of the OpenMP strong scaling test are shown in Fig.
Runtime
In this section, we discuss the GPU scaling of MPTRAC simulations conducted on JUWELS Booster nodes with respect to the problem size, i.e., the number of particles or trajectories that are calculated. The analysis of the scaling behavior with respect to the problem size is of interest because the NVIDIA A100 GPUs provide a particularly large number of parallel compute elements (Sect.
Figure
Scaling analysis showing
The analysis of individual contributions shows that reading of the ERA5 input data (60 to 64 s, referred to as INPUT in Fig.
The same scaling test was used to estimate the speed-up achieved by applying the GPU device over a simulation that was conducted on CPUs only. Note that defining a GPU-over-CPU speed-up is a difficult task, in general, as the results will depend not only on the GPU device but also on the individual compute capacities of both the CPU and GPU devices. Nevertheless, to estimate this speed-up, we conducted corresponding CPU-only simulations and runtime measurements applying 12 cores of a JUWELS Booster compute node. Twelve cores were chosen here, as this corresponds to the number of physical cores sharing a GPU device on a Booster node. We think that this approach of estimating the GPU-over-CPU speed-up provides a fairer and more realistic comparison than comparing the GPU compute capacity to just a single CPU core, for example.
The estimated speed-up due to applying the GPU devices of a JUWELS Booster compute node is shown in Fig.
Finally, we conducted a more detailed runtime profile analysis of the GPU simulations based on about 40 individual timers, which are implemented directly into the code. Table
Runtime profile of an MPTRAC simulation using ERA5 input data with
Next to the physics modules, other parts of the code required about 33 % of the overall runtime of this large-scale simulation. Reading the input data required about 15 % of the runtime, where most time was needed to read the 3-D level data of the meteorological input data. A similar amount of time (about 15 %) was required for the processing of the meteorological data, with most time spent on calculating CAPE (5.4 %), geopotential heights (4.9 %), and the tropopause (2.1 %). The runtime required for output was 2.4 % in this simulation. However, it needs to be considered that output was written only 6-hourly here. The runtime required for output would scale accordingly for more frequent output. The runtime required for data transfers between GPU and CPU memory was also about 1 %, with most time spent on transferring the input meteorological data from CPU to GPU memory. Although this profile analysis did not reveal any major bottlenecks, there is room for further improving the code of MPTRAC. Most attention should be devoted to further optimizing the advection module, as it requires most of the runtime for large-scale simulations.
For a more detailed analysis of the large-scale MPTRAC simulation comprising
Timeline analysis of a 24 h MPTRAC simulation comprising
The black bars on top of Fig.
The blue bars in the middle of Fig.
The colored bars at the bottom of Fig.
Overall, the timeline analysis indicates that the code provides a high utilization rate of the GPU devices, which is a basic requirement for using the GPU devices efficiently. However, the timeline analysis also reveals optimization opportunities. Overlapping file I/O and GPU computations can hide the file I/O costs at the expense of a slight increase in memory usage and reduce the wall-clock time significantly. As the meteorological data are ultimately required on the GPU devices to conduct the Lagrangian transport simulations, it could also be beneficial to port the meteorological data processing from the CPU to the GPU devices to accelerate the processing.
Finally, a more detailed inspection of the physics calculations on the GPUs shows that about 70 % of the time is spent in the advection module, which is used to solve the trajectory equation. A more detailed analysis indicates that this part of the code is memory bound; i.e., the runtime is limited by the GPU's global memory bandwidth. Future work should focus on optimizing this specific bottleneck of the GPU code by improving data locality and memory access patterns on the GPU devices.
MPTRAC provides a hybrid MPI–OpenMP–OpenACC parallelization to fully exploit the compute capacities of the JUWELS Booster. Each compute node of the booster is equipped with four NVIDIA A100 Tensor Core GPU devices. In order to utilize the capacities of the compute nodes, a multi-GPU approach is required. In MPTRAC, multiple GPU devices can be utilized by means of the MPI parallelization. With the current approach, each MPI task is assigned a single GPU device to offload computations; i.e., four MPI tasks will be operating on each compute node in the multi-GPU mode. At the same time, each MPI task can access up to 12 physical CPU cores of each node by means of the OpenMP parallelization. This scheme is particularly suited for ensemble simulations, i.e., to run many distinct MPTRAC simulations as parallel MPI tasks.
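The device assignment can be sketched as follows; this reflects a common multi-GPU OpenACC idiom under the assumptions stated in the comments, not necessarily the exact MPTRAC code.

```c
#include <mpi.h>
#include <openacc.h>

int main(int argc, char *argv[]) {

  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* map each MPI task to one of the GPUs on its node; with four tasks
     per node and four GPUs per JUWELS Booster node, each task gets
     exclusive access to one device */
  int ndev = acc_get_num_devices(acc_device_nvidia);
  if (ndev > 0)
    acc_set_device_num(rank % ndev, acc_device_nvidia);

  /* ... each task now runs its own simulation, offloading to its GPU,
     while its OpenMP threads use the CPU cores assigned to the task ... */

  MPI_Finalize();
  return 0;
}
```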
Figure
MPI weak scaling test for MPTRAC simulations on the JUWELS Booster compute nodes. Each task utilized one GPU device and shared 12 CPU cores. The simulations cover a 6 h time period starting on 1 January 2017, 00:00 UTC and used ERA5 input data. The simulations were initialized with a fixed number of
We measured the total runtime as well as the runtime for the physics calculations, meteorological data processing, file input, and file output for each MPI task. Figure
In this paper, we provide a comprehensive description of the Lagrangian transport model, MPTRAC version 2.2. We give an overview on the main features of the model and briefly introduce the code structure. Requirements for the model input data, i.e., global meteorological reanalysis or forecast data as well as particle data, are discussed. MPTRAC provides the functionality to calculate various additional meteorological data from basic prognostic variables, such as the 3-D fields of pressure, temperature, winds, water vapor, and cloud ice and liquid water content. This includes functions to calculate geopotential heights and potential vorticity, to calculate additional cloud properties, such as the cloud-top pressure and the total column cloud water, to estimate the convective available potential energy, and to determine the boundary layer and the tropopause height level. Some evaluation of the results of the meteorological data processing code of MPTRAC with data provided along with ECMWF's ERA5 reanalysis is presented in the Supplement of this paper.
As its main component, MPTRAC provides an advection module to calculate the trajectories of air parcels based on the explicit midpoint method using given wind fields. Individual stochastic perturbations can be added to the trajectories to account for the effects of diffusion and subgrid-scale wind fluctuations. Additional modules are implemented to simulate the effects of unresolved convection, wet and dry deposition, sedimentation, hydroxyl chemistry, and other types of exponential loss processes of particle mass. MPTRAC provides a variety of output options, including the particle data itself, gridded data, profile data, station data, sampled data, ensemble data, and verification statistics and other measures of the model skills. Additional tools are provided to further analyze the particle output, including tools to calculate statistics of particle positions and transport deviations over time.
Next to providing a detailed model description, the focus of this study was to assess the potential for accelerating Lagrangian transport simulations by exploiting GPUs. We ported the Lagrangian transport model MPTRAC to GPUs by means of the OpenACC programming model. The GPU porting mainly comprised (i) creating a large data region over most of the code to keep the particle data in GPU memory during the entire simulation, (ii) implementing data transfers of the meteorological input data from CPU to GPU memory, (iii) implementing data transfers of the particle data from GPU to CPU memory for output, (iv) parallelization and offloading of the compute loops for the trajectories and other physical processes of the particles, and (v) removing calls to the GNU Scientific Library, which is not available for GPUs, and adding calls to the cuRAND library for random number generation on the GPUs. Next to various minor changes and fixes, about 60 OpenACC pragmas had to be introduced into the code to manage the data transfers and to offload the compute loops. With the OpenACC approach used here, it was possible to maintain the same code base for both the CPU and the GPU version of MPTRAC.
We verified and evaluated the GPU version of MPTRAC on the JUWELS Booster at the Jülich Supercomputing Centre, providing access to the latest generation of NVIDIA A100 Tensor Core GPUs. The verification mostly focuses on comparisons of CPU- and GPU-based simulations. A direct comparison of kinematic trajectories showed negligible deviations between the CPU and GPU code for up to 30 d of simulation time. The relative deviations slowly increase to about 0.3 % in the horizontal domain and 0.4 % in the vertical domain after 60 d. To evaluate the impact of additional processes, i.e., diffusion and subgrid-scale wind fluctuations, convection, and exponential loss of particle mass, we conducted CPU and GPU simulations of synthetic tracers. The simulations for the tropospheric tracers E90 and NH50 as well as the stratospheric tracer ST80 showed that the differences between the monthly mean zonal means from the CPU and GPU simulations are well below
The performance and scaling behavior of the MPI–OpenMP–OpenACC hybrid parallelization of MPTRAC was also assessed on the JUWELS Booster. The OpenMP strong scaling analysis showed satisfactory scalability of those parts of the code that were not ported to GPUs. In particular, the code for meteorological data preprocessing showed an OpenMP parallel efficiency of about 85 % to 90 % for up to 12 physical CPU cores of a compute node. The GPU scaling analysis showed that positive speed-up in the GPU simulations compared to CPU-only simulations is achieved for
We identified several opportunities for potential improvements of the GPU code of the MPTRAC model in future work. For large-scale simulations, comprising
MPTRAC is made available under the terms and conditions of the GNU General Public License (GPL) version 3. Release version 2.2 of MPTRAC, described in this paper, has been archived on Zenodo
MPTRAC performance analysis data obtained with NVIDIA Nsight Systems are made available via an open data repository (
The supplement related to this article is available online at:
LH developed the concept for this study and conducted most of the model development, GPU porting, verification, and performance analyses. PFB and KHM provided expertise on GPU usage and contributed to the performance analysis and optimization of the code. JC, ZC, SG, YH, ML, XW, and LZ are developers and users of the MPTRAC model and provided many suggestions for improving the model. NT and BV provided expertise on Lagrangian transport modeling and model development. GG and OS were responsible for downloading and preparing the ERA5 data for this study. LH wrote the manuscript with contributions from all co-authors.
The contact author has declared that neither they nor their co-authors have any competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The work described in this paper was supported by the Helmholtz Association of German Research Centres (HGF) through the projects Pilot Lab Exascale Earth System Modelling (PL-ExaESM) and Joint Lab Exascale Earth System Modelling (JL-ExaESM) as well as the Center for Advanced Simulation and Analytics (CASA) at Forschungszentrum Jülich. We acknowledge the Jülich Supercomputing Centre for providing computing time and storage resources on the JUWELS supercomputer and for selecting MPTRAC to participate in the JUWELS Booster Early Access Program. We also acknowledge mentoring and support provided during the 2019 Helmholtz GPU hackathon. We are thankful to Sebastian Keller, CSCS, who helped us to develop the first version of MPTRAC that was capable of running on GPUs.
Xue Wu was supported by the National Natural Science Foundation of China (grant nos. 41975049 and 41861134034). The article processing charges for this open-access publication were covered by the Forschungszentrum Jülich.
This paper was edited by Christoph Knote and reviewed by two anonymous referees.