Lagrangian particle tracking schemes allow a wide range of flow and transport processes to be simulated accurately, but a major challenge is implementing the inter-particle interactions in a numerically efficient manner. This article develops a multi-dimensional, parallelized domain decomposition (DDC) strategy for mass-transfer particle tracking (MTPT) methods, in which particles exchange mass dynamically. We show that this exchange can be parallelized efficiently by employing large numbers of CPU cores to accelerate run times. To validate the approach and our theoretical predictions, we focus our efforts on a well-known pure-diffusion benchmark problem for which analytical solutions are established in any number of dimensions. We investigate different procedures for “tiling” the domain in two and three dimensions (2-D and 3-D), as this type of formal DDC construction is currently limited to 1-D. Because each tiling produces distinct results in both accuracy and run time, we prescribe an optimal tiling based on the physical problem parameters and the number of available CPU cores. We further extend the most efficient technique to 3-D for comparison, leading to an analytical discussion of the effect of dimensionality on strategies for implementing DDC schemes. Increasing computational resources (cores) within the DDC method produces a trade-off between inter-node communication and on-node work. For an optimally subdivided diffusion problem, the 2-D parallelized algorithm achieves nearly perfect linear speedup, in comparison with the serial run, up to around 2700 cores, reducing a 5 h simulation to 8 s, while the 3-D algorithm maintains appreciable speedup up to 1700 cores.

Numerical models are used to represent physical problems that may be difficult to observe directly (such as groundwater flow) or that may be tedious, expensive, or even impossible to currently study via other methods. In the context of groundwater flow, for example, these models allow us to portray transport in heterogeneous media and biochemical species interactions, which are imperative to understanding a hydrologic system's development (e.g.,

Lagrangian methods for simulating reactive transport continue to evolve, providing both increased accuracy and accelerated efficiency over their Eulerian counterparts by eliminating numerical dispersion (see

Early efforts to directly simulate bimolecular reactions with RWPT algorithms

The objective of this study is to develop efficient, multi-dimensional parallelization schemes for MTPT-based reactive transport schemes.
We conduct formal analyses to provide cost benchmarks and to predict computational speedup for the MTPT algorithm, both of which have, to date, been only loosely explored in the 1-D case

An equation for a chemically conservative, single component system experiencing local mean velocity and Fickian diffusion-like dispersion is

We define a general and well-established benchmark test problem to facilitate the analysis of speedup and computational efficiency. Based on the chosen tiling method, the global domain is subdivided into equi-sized subdomains, and each core knows its own local, non-overlapping domain limits. The particles are then load balanced between the cores and randomly scattered within the local domain limits. To represent the initial mass distribution, we use a Heaviside function in an

Unless otherwise stated, all 2-D simulations will be conducted with the following computational parameters: the

Panel

We choose parameters in an attempt to construct computationally intense problems that do not exceed available memory. A roofline analysis plot

The simulations in this paper were performed on a heterogeneous high-performance computing (HPC) cluster called Mio, which is housed at the Colorado School of Mines. Each node in the cluster contains 8–24 cores, with clock speeds ranging from 2.50 to 3.06 GHz and memory capacity ranging from 24 to 64 GB. Mio uses a network of Infiniband switches to prevent performance degradation in multi-node simulations. We use the compiler gfortran with optimization level

The MTPT method simulates diffusion by weighted transfers of mass between nearby particles. These weights are defined by the relative proximity of particles, which is determined by constructing and searching a k-D tree
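As a concrete illustration, one serial mass-transfer step can be sketched as follows. This is a hedged, brute-force Python stand-in for the paper's Fortran/MPI implementation: the Gaussian kernel of variance 4DΔt and the symmetric normalization below are one common MTPT choice, not necessarily the exact weights used here, and the all-pairs search replaces the k-D tree purely for clarity.

```python
import math

def mass_transfer_step(x, m, D, dt, cutoff=3.0):
    """One serial MTPT step: particles at positions x exchange mass m with
    neighbors inside a search radius, weighted by a Gaussian diffusion
    kernel (brute-force search stands in for the k-D tree)."""
    n = len(x)
    var = 4.0 * D * dt                  # kernel variance for a single step
    r_max = cutoff * math.sqrt(var)     # search radius
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = x[i] - x[j]
            if abs(d) <= r_max:
                w[i][j] = math.exp(-d * d / var)
    s = [sum(row) for row in w]         # row sums, used for normalization
    dm = [0.0] * n
    for i in range(n):
        for j in range(n):
            if w[i][j] > 0.0:
                # symmetric normalization keeps p_ij == p_ji, so every
                # pairwise exchange cancels in the global sum
                p = 2.0 * w[i][j] / (s[i] + s[j])
                dm[i] += 0.5 * p * (m[j] - m[i])
    return [mi + di for mi, di in zip(m, dm)]
```

Because the transfer coefficient is symmetric in i and j, total mass is conserved to machine precision, and diffusion appears as progressive smoothing of the particle masses.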

With the foundation of the algorithm established, we focus on comparing alternative tiling strategies within the domain decomposition method and their subsequent performance.
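As a schematic of the two strategies compared in the following sections, the map from a core's rank to its local, non-overlapping subdomain limits can be sketched as below. These are hypothetical pure-Python helpers standing in for the paper's Fortran/MPI code; equal-sized subdomains and row-major rank ordering are assumed.

```python
def slices_limits(rank, ncores, xlim, ylim):
    """Slices: cut the domain along x only; each core owns a full-height strip."""
    x0, x1 = xlim
    dx = (x1 - x0) / ncores
    return (x0 + rank * dx, x0 + (rank + 1) * dx), ylim

def checkerboard_limits(rank, nx, ny, xlim, ylim):
    """Checkerboard: an nx-by-ny grid of tiles; rank is unwrapped row-major."""
    ix, iy = rank % nx, rank // nx
    dx = (xlim[1] - xlim[0]) / nx
    dy = (ylim[1] - ylim[0]) / ny
    return ((xlim[0] + ix * dx, xlim[0] + (ix + 1) * dx),
            (ylim[0] + iy * dy, ylim[0] + (iy + 1) * dy))
```

The slices map uses one index per core, while the checkerboard map factors the core count into a two-index grid; this difference drives the contrasting surface-to-volume behavior analyzed later.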

The first approach extends the 1-D technique by slicing the 2-D domain along a single dimension, depending on how many cores are available for use. For example, depending on the number of computational cores allocated for a simulation, we define the width of each region as

General schematics of

In addition to the slices method, we consider decomposing the domain in a tiled or checkerboard manner. Given a

The algorithm does not incur noteworthy changes in error, as a function of

In MTPT algorithms, nearby particles must interact with each other. Specifically, a particle will exchange mass with all nearby particles within the search radius in Eq. (

The process we describe here differs depending on the decomposition method. For example, the slices method gives nearby neighbors only to the left and to the right (Fig.

All particles within a buffer width of

The communication portion of the algorithm becomes more complicated as spatial dimensions increase. In 3-D, we decompose the domain by prescribing a tiling similar to the 2-D case, but the number of sends and receives to nearby cores increases significantly. For example, the 2-D algorithm must pass information to eight nearby cores, whereas the 3-D algorithm must pass information to 26 nearby cores: eight neighboring cores on the current plane and nine on each of the planes above and below.
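These neighbor counts follow directly from enumerating the nonzero offsets of a tile in d dimensions, i.e., 3^d − 1 neighbors; a quick sketch:

```python
from itertools import product

def neighbor_offsets(dim):
    """All offsets to adjacent tiles (faces, edges, and corners) in `dim`
    dimensions: the nonzero points of {-1, 0, 1}^dim, giving 3^dim - 1
    neighbors (8 in 2-D, 26 in 3-D)."""
    return [off for off in product((-1, 0, 1), repeat=dim) if any(off)]
```

In 3-D the 26 offsets split exactly as described above: the eight with a zero third component lie on the current plane, and nine each lie on the planes above and below.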

The red band represents all particles that will be received by the local subdomain (yellow) from neighboring regions for mass transfer. The number

In this section, we characterize and predict the amount of work performed within each portion of the algorithm. The general discussion of work and cost here refers to the run times required within distinct steps of the algorithm. We profile the code that implements the MTPT algorithm using the built-in, Unix-based profiling tool gprof

We begin with an analysis of the MT work. First, in the interest of tractability, we will consider only regular domains, namely a square domain with sides of length

Plots of run time in the MT portion of benchmark runs. Note the similar behavior in both 2-D

Note that changing the total number of particles within a simulation should not change the scaling relationship as Eq. (

Varying the total particle number

Varying the total particle number

The plots of MT run time display an approximately linear decrease as

The halo exchange depicted in Sect.

We observe expected wall time scaling for the on-node communication shown here, demonstrating that the MPI function calls do, in fact, incur communication costs between cores within a shared-memory space.

The communication portion of the algorithm includes three processes on each core: evaluating logical statements to determine which particles to send to each neighboring core, sending the particles to the neighboring cores, and receiving particles from all neighboring cores. The total wall time for these three processes makes up the “MPI time (s)” in Figs.
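The first of these three processes, the logical selection of outgoing particles, can be sketched in isolation. The following is a hedged Python stand-in for the Fortran/MPI code, assuming a 2-D tile with a hypothetical buffer width `pad`; a particle near a face is sent across that face, and a particle near a corner additionally goes to the diagonal neighbor.

```python
def outgoing_ghosts(particles, xlo, xhi, ylo, yhi, pad):
    """Bin local particles by the neighbor offset(s) they must be sent to.

    Returns a dict mapping neighbor offset (dx, dy) to the list of local
    particle indices that fall within `pad` of the corresponding face,
    edge, or corner of the subdomain [xlo, xhi] x [ylo, yhi].
    """
    sends = {}
    for k, (x, y) in enumerate(particles):
        ox = (-1 if x < xlo + pad else 0) + (1 if x > xhi - pad else 0)
        oy = (-1 if y < ylo + pad else 0) + (1 if y > yhi - pad else 0)
        # a corner particle covers three neighbors: two faces and the diagonal
        for dx in ({ox, 0} if ox else {0}):
            for dy in ({oy, 0} if oy else {0}):
                if (dx, dy) != (0, 0):
                    sends.setdefault((dx, dy), []).append(k)
    return sends
```

In the parallel code, each bin would then be packed and posted to the corresponding neighboring core, with matching receives; interior particles incur no communication at all.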

As

In this section, we discuss the advantages and limitations of each method by evaluating the manner in which the decomposition strategies accelerate run times. We employ the quantity “speedup” to compare the results of our numerical experiments. The speedup of a parallelized process is commonly formulated as
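Assuming the standard formulation, speedup is the serial run time divided by the parallel run time, S(P) = T1/TP, with parallel efficiency S(P)/P; in code:

```python
def speedup(t_serial, t_parallel):
    """Standard speedup: serial run time divided by parallel run time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, ncores):
    """Parallel efficiency: speedup per core (1.0 is perfect linear scaling)."""
    return speedup(t_serial, t_parallel) / ncores
```

For example, the abstract's 5 h serial run reduced to 8 s on roughly 2700 cores corresponds to a speedup of 18000/8 = 2250, or a parallel efficiency of about 0.83.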

We may also construct a theoretical prediction of the expected speedup from the run time analysis of the preceding section. First, assume that the subdomains are ideally configured as squares in 2-D or cubes in 3-D. In this case, the MT run times always exceed the MPI times. For smaller values of

This gives the user a couple of options before running a simulation. The first option is to choose a desired minimum efficiency and obtain a suggested number of cores to use based on the inequality in Eq. (

The prediction curves give the user a concrete guideline to determine how many cores to allocate for a simulation before performance degrades. The curves in

Using Eq. (

The 2-D and 3-D benchmark simulations used in previous sections allow us to calculate both the empirical (observed) and theoretical speedups, and the overlays in Figs.

Observed (diamonds) and theoretical speedup for 2-D simulations. Each chosen number of cores is a perfect square so that the checkerboard method gives square subdomains. With the chosen parameters

Observed (diamonds) and theoretical speedup for 3-D simulations. Each chosen number of cores is a perfect cube so that the checkerboard method gives cubic subdomains. With the chosen parameters

Finally, we briefly consider the slices method, as it has drastic limitations in 2-D and 3-D. Increasing the number of cores used in a simulation while

Speedup for the slices method plateaus quickly, as the ghost regions increase in proportion to the local subdomain's area.

Given some fixed number of cores (hence subdomains), it is clear that using a subdomain tiling that is as close as possible to a perfect square (or cube) maximizes efficiency.
This occurs when the factors for subdivisions in each dimension are chosen to most closely resemble the aspect ratio of the entire domain (shown in 2-D in Eq.
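One way to realize this rule in code: given a fixed core count, search its factor pairs for the tiling whose subdomains are closest to square. This is a hypothetical helper, not part of the paper's code; the log of the subdomain aspect ratio is used so that a ratio r and its reciprocal 1/r are penalized equally.

```python
import math

def best_tiling(ncores, lx, ly):
    """Choose (nx, ny) with nx * ny == ncores so that the subdomain
    aspect ratio (lx / nx) / (ly / ny) is as close to 1 as possible."""
    best, best_err = None, float("inf")
    for nx in range(1, ncores + 1):
        if ncores % nx:
            continue
        ny = ncores // nx
        err = abs(math.log((lx / nx) / (ly / ny)))
        if err < best_err:
            best, best_err = (nx, ny), err
    return best
```

On a square domain this recovers the perfect-square grids used in the benchmarks, while on an elongated domain it places proportionally more cuts along the longer axis.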

Poorly chosen core numbers may result in severely non-square tilings that can degrade speedup performance, despite employing more computational resources.

We can loosely describe the standard definition of speedup as the quantitative advantage a simulation performed with

A speedup reference point of

Speedup reference points of

The performance in Fig.

The checkerboard decomposition for the parallelized DDC algorithm provides significant speedup to Lagrangian MTPT simulations in multiple dimensions. For a range of simulations, we find that the mass-transfer step is the dominant cost in terms of run time. The approximate linearity of run time with

Given that we assume a purely diffusive, non-reactive system in this paper, a necessary extension of this will be an investigation of the performance of these DDC techniques upon adding advection, reactions, or both to the system.
The benchmarks we establish in this work provide accurate expectations for properly load-balanced problems as we begin to add these complexities into the simulations.
In particular, in future efforts we expect to handle variable velocity fields with an adaptable, moving DDC strategy similar to the technique that we implement in this paper. This initial study is an essential prerequisite for such advances.
Further, using local averaging and interpolation of the corresponding velocity-dependent dispersion tensors, we retain accurate representation of small-scale spreading and mixing despite the subdivided simulation domain.
We expect that adequately load-balanced, advective-reactive simulations will exhibit scaling as good as or better than what we have observed in these current benchmarks.
Moving particles via advection is a naturally parallel computation and will only incur minor computational cost as particles move across subdomain boundaries.
Moreover, since the general form of our algorithm simulates complex chemical reactions on particles after mass transfer takes place for all species, we change the most computationally expensive process (the reactions) into a naturally parallel process.
Simulating these extra physical phenomena strictly increases computational complexity, but that cost is additive to the computations we carry out here.
Given that we are building on these computations, we fully intend to preserve the predictive ability in more complex simulations moving forward.
Finally, another natural extension could address various computational questions by exploring how parallelized MTPT techniques might be employed using shared-memory parallelism, such as OpenMP, CUDA, or architecture-portable parallel programming models like Kokkos, RAJA, or YAKL

In summary, the checkerboard method in 2-D (and 3-D) not only allows simulations to be conducted using large numbers of cores before violating the maximum recommended core condition given in Eq. (

Roofline model analysis provides an understanding of the baseline computational performance of an algorithm by observing the speed and intensity of its implementation on a specific problem and comparing it to the hardware limitations. Thus, this analysis depicts the speed of an algorithm relative to the difficulty of the associated problem. We include this analysis to demonstrate that our single-core baseline result (orange circle in Fig.

This figure displays the full roofline plot with an enlarged inset to accentuate the data trends. The data points correspond to various particle densities while domain size is held constant. As arithmetic intensity fluctuates, we observe that the algorithm's performance is limited by the hardware's bandwidth. We use the chosen parameter set (orange circle) for our speedup results since it (1) is the most efficient, (2) displays high accuracy from its particle count, and (3) is not memory-bound and does not require exceedingly long simulation times.

Fortran/MPI codes for generating all results in this paper are held in the public repository at

LS was responsible for writing the original draft and creating the software. LS, SDP, DAB, and DB were responsible for conceptualization. LS, NBE, and MJS were responsible for methodology. DAB and SDP were responsible for formal analysis. All authors were responsible for reviewing and editing the paper.

The contact author has declared that none of the authors has any competing interests.

This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Sandia National Laboratories is a multi-mission laboratory managed and operated by the National Technology and Engineering Solutions of Sandia, L.L.C., a wholly owned subsidiary of Honeywell International, Inc., for the DOE's National Nuclear Security Administration under contract DE-NA0003525. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231.

The authors and this work were supported by the US Army Research Office (grant no. W911NF-18-1-0338) and by the National Science Foundation (grant nos. CBET-2129531, EAR-2049687, EAR-2049688, DMS-1911145, and DMS-2107938).

This paper was edited by David Ham and reviewed by two anonymous referees.