Parallelized domain decomposition for multi-dimensional Lagrangian random walk mass-transfer particle tracking schemes

Schauer, Lucas; Schmidt, Michael J.; Engdahl, Nicholas B.; Pankavich, Stephen D.; Benson, David A.; Bolster, Diogo

doi:https://doi.org/10.5194/gmd-16-833-2023

Articles | Volume 16, issue 3

Development and technical paper

03 Feb 2023

Abstract

Lagrangian particle tracking schemes allow a wide range of flow and transport processes to be simulated accurately, but a major challenge is numerically implementing the inter-particle interactions in an efficient manner. This article develops a multi-dimensional, parallelized domain decomposition (DDC) strategy for mass-transfer particle tracking (MTPT) methods in which particles exchange mass dynamically. We show that these methods can be efficiently parallelized by employing large numbers of CPU cores to accelerate run times. To validate the approach and our theoretical predictions, we focus our efforts on a well-known benchmark problem with pure diffusion, where analytical solutions in any number of dimensions are well established. In this work, we investigate different procedures for “tiling” the domain in two and three dimensions (2-D and 3-D), as this type of formal DDC construction is currently limited to 1-D. An optimal tiling is prescribed based on physical problem parameters and the number of available CPU cores, as each tiling provides distinct results in both accuracy and run time. We further extend the most efficient technique to 3-D for comparison, leading to an analytical discussion of the effect of dimensionality on strategies for implementing DDC schemes. Increasing computational resources (cores) within the DDC method produces a trade-off between inter-node communication and on-node work. For an optimally subdivided diffusion problem, the 2-D parallelized algorithm achieves nearly perfect linear speedup in comparison with the serial run up to around 2700 cores, reducing a 5 h simulation to 8 s, while the 3-D algorithm maintains appreciable speedup up to 1700 cores.

Received: 11 Aug 2022 – Discussion started: 22 Aug 2022 – Revised: 13 Jan 2023 – Accepted: 13 Jan 2023 – Published: 03 Feb 2023

1 Introduction

Numerical models are used to represent physical problems that may be difficult to observe directly (such as groundwater flow) or that may be tedious, expensive, or even impossible to currently study via other methods. In the context of groundwater flow, for example, these models allow us to portray transport in heterogeneous media and bio-chemical species interaction, which are imperative to understanding a hydrologic system's development (e.g., Dentz et al., 2011; Perzan et al., 2021; Steefel et al., 2015; Scheibe et al., 2015; Tompson et al., 1998; Schmidt et al., 2020b; Li et al., 2017; Valocchi et al., 2019). Since geological problems frequently require attention to many separate, yet simultaneous processes and corresponding physical properties, such as local mean velocity (advection), velocity variability (dispersion), mixing (e.g., dilution), and chemical reaction, we must apply rigorous methods to ensure proper simulation of these processes. Recent studies (e.g., Benson et al., 2017; Bolster et al., 2016; Sole-Mari et al., 2020) have compared classical Eulerian (e.g., finite-difference or finite-element) solvers to newer Lagrangian methods and have shown the relative advantages of the latter. Therefore, in this paper we explore several approaches to parallelize a Lagrangian method that facilitate the simulation of the complex nature of these problems. Given that all of the complex processes noted above must ultimately be incorporated and that this is the first rigorous study of this kind, we focus on well-established and relatively simple benchmark problems with analytical solutions to derive a rigorous approach to this parallelization.

Lagrangian methods for simulating reactive transport continue to evolve, providing both increased accuracy and accelerated efficiency over their Eulerian counterparts by eliminating numerical dispersion (see Salamon et al., 2006) and allowing direct simulation of all subgrid processes (Benson et al., 2017; Ding et al., 2017). Simulation of advection and dispersion (without reaction) in hydrogeological problems began with the Lagrangian random walk particle tracking (RWPT) algorithm that subjects an ensemble of particles to a combination of velocity and diffusion processes (LaBolle et al., 1996; Salamon et al., 2006). Initially, chemical reactions were added in any numerical time step by mapping particle masses to concentrations via averaging over Eulerian volumes, then applying reaction rate equations, and finally mapping concentrations back to particle masses for RWPT (Tompson and Dougherty, 1988). This method clearly assumes perfect mixing within each Eulerian volume because subgrid mass and concentration perturbations are smoothed (averaged) prior to reaction. The subsequent over-mixing was recognized to induce a scale-dependent apparent reaction rate that depended on the Eulerian discretization (Molz and Widdowson, 1988; Dentz et al., 2011), thus eliminating some of the primary benefits of the Lagrangian approach. In response, a method that would allow reactions directly between particles was devised and implemented (Benson and Meerschaert, 2008).

Early efforts to directly simulate bimolecular reactions with RWPT algorithms (Benson and Meerschaert, 2008; Paster et al., 2014) were originally founded on a birth–death process that calculated two probabilities: one for particle-particle collocation and a second for reaction and potential transformation or removal given collocation (i.e., particles that do not collocate cannot react, thus preserving incomplete mixing). The next generation of these methods featured a newer particle-number-conserving reaction scheme. This concept, introduced by Bolster et al. (2016) and later generalized (Benson and Bolster, 2016; Schmidt et al., 2019; Sole-Mari et al., 2019), employs kernel-weighted transfers for moving mass between particles, where the weights are equivalent to the abovementioned collision probabilities under certain modeling choices. These algorithms preserve the total particle count, and we refer to them as mass-transfer particle tracking (MTPT) schemes. These particle-conserving schemes address low-concentration resolution issues that arise spatially when using particle-killing techniques (Paster et al., 2013; Benson et al., 2017). Furthermore, MTPT algorithms provide a realistic representation of solute transport with their ability to separate mixing and spreading processes (Benson et al., 2019). Specifically, spreading processes due to small-scale differential advection may be simulated with standard random walk techniques (LaBolle et al., 1996), and true mixing-type diffusive processes may be simulated by mass transfers between particles. MTPT techniques are also ideally suited to, and provide increased accuracy for, complex systems with multiple reactions (Sole-Mari et al., 2017; Engdahl et al., 2017; Benson and Bolster, 2016; Schmidt et al., 2020b), but they are computationally expensive because nearby particles must communicate. This notion of nearness is discussed in detail in Sect. 3.

The objective of this study is to develop efficient, multi-dimensional parallelization schemes for MTPT-based reactive transport schemes. We conduct formal analyses to provide cost benchmarks and to predict computational speedup for the MTPT algorithm, both of which to date were only loosely explored in the 1-D case (Engdahl et al., 2019). Herein, we focus on an implementation that uses a multi-CPU environment that sends information between CPUs via Message Passing Interface (MPI) directives within Fortran code. In particular, we focus on the relative computational costs of the inter-particle mass transfer versus message passing algorithms because the relative costs of either depend upon the manner in which the computational domain is split among cores. These mass-transfer methods may be directly compared to smoothed-particle hydrodynamics (SPH) methods and are equivalent when a Gaussian kernel is chosen to govern the mass transfers (Sole-Mari et al., 2019). Specifically, this work shares similarities with previous investigations of parallelized SPH methods (Crespo et al., 2011; Gomez-Gesteira et al., 2012; Xia and Liang, 2016; Morvillo et al., 2021) but is novel as it tackles nuances that arise specifically for MTPT approaches. A substantial difference within this work is that the kernels are based on the local physics of diffusion rather than a user-defined function chosen for attractive numerical qualities like compact support or controllable smoothness. This adherence to local physics allows for increased modeling fidelity, including the simulation of diffusion across material discontinuities or between immobile (solid) and mobile (fluid) species (Schmidt et al., 2020a, 2019). In general, the parallelization of particle methods depends on assigning groups of particles to different processing units.
Multi-dimensional domains present many options on how best to decompose the entire computational domain in an attempt to efficiently use available computing resources. Along these lines, we compare two different domain decomposition (DDC) approaches. In the one-dimensional case (Engdahl et al., 2019), the specified domain is partitioned into smaller subdomains so that each core is only responsible for updating the particles' information inside of a fixed region, though information from particles in nearby subdomains must be used. Hence, the first two-dimensional method we consider is a naive extension from the existing one-dimensional technique (Engdahl et al., 2019) that decomposes the domain into vertical slices along the x axis of the xy plane. This method is attractive for its computational simplicity but limits speedup for large numbers of cores (see Sect. 7). Our second method decomposes the domain into a “checkerboard” consisting of subdomains that are as close to squares (or cubes) as is possible given the number of cores available. RWPT simulations without mixing often require virtually no communication across subdomain boundaries because all particles act independently in the model. However, MTPT techniques require constant communication along local subdomain boundaries at each time step, which leads to challenges in how best to accelerate these simulations without compromising the quality of solutions. This novel, multi-dimensional extension of parallelized DDC techniques for the MTPT algorithm will now allow for the simulation of realistic, computationally expensive systems in seconds to minutes rather than hours to days. Further, based on given simulation parameters, we provide formal run time prediction analysis that was only hypothesized in previous work and will allow future users to optimize parallelization prior to executing simulations. 
This paper rigorously explores the benefits of our parallelized DDC method while providing guidelines and cautions for efficient use of the algorithm.
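The two tilings described above can be illustrated with a minimal Python sketch. It is not the authors' Fortran/MPI implementation: the function names (`slices_tiling`, `checkerboard_tiling`, `owner`, `interface_length`) are hypothetical, and the total interior interface length is used only as a crude geometric proxy for how much boundary requires neighbor communication, which is an assumption made here rather than the cost model developed later in the paper.

```python
import math


def slices_tiling(n_cores):
    """Vertical slices along x: n_cores columns and a single row."""
    return (n_cores, 1)


def checkerboard_tiling(n_cores):
    """Factor pair (nx, ny), nx * ny == n_cores, that is closest to square."""
    ny = math.isqrt(n_cores)
    while n_cores % ny != 0:  # walk down to the largest divisor <= sqrt(n_cores)
        ny -= 1
    return (n_cores // ny, ny)


def owner(x, y, L, nx, ny):
    """Index of the subdomain containing (x, y) in an L x L domain.

    Subdomains are numbered row by row; points on the upper domain
    edge are clamped into the last row/column.
    """
    ix = min(int(x / (L / nx)), nx - 1)
    iy = min(int(y / (L / ny)), ny - 1)
    return iy * nx + ix


def interface_length(L, nx, ny):
    """Total length of interior subdomain boundaries (illustrative proxy only)."""
    return (nx - 1) * L + (ny - 1) * L


if __name__ == "__main__":
    L = 1000.0
    for n_cores in (12, 16, 7):
        for name, tiling in (("slices", slices_tiling(n_cores)),
                             ("checkerboard", checkerboard_tiling(n_cores))):
            print(n_cores, name, tiling, interface_length(L, *tiling))
```

When the core count is prime (e.g., 7), the only available factor pair is 7 x 1, so in this sketch the checkerboard tiling degenerates to slices.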

2 Model description

An equation for a chemically conservative, single component system experiencing local mean velocity and Fickian diffusion-like dispersion is

\begin{matrix} (1) & \frac{\partial C}{\partial t} + \nabla \cdot (v C) = \nabla \cdot (D \nabla C), \quad x \in \Omega \subseteq \mathbb{R}^{d}, \quad t > 0, \end{matrix}

where C(x,t) [mol L⁻ᵈ] is the concentration of a quantity of interest, v(x,t) [L T⁻¹] is a velocity field, and D(v) [L² T⁻¹] is a given diffusion tensor. Advection–diffusion equations of this form arise within a variety of applied disciplines relating to fluid dynamics (Bear, 1972; Tennekes and Lumley, 1972; Gelhar et al., 1979; Bear, 1961; Aris, 1956; Taylor, 1953). Depending on the physical application under study, various forms of the diffusion tensor may result. Often, it can be separated into two differing components, with one representing mixing between nearby regions of differing concentrations and the other representing spreading from the underlying flow (Tennekes and Lumley, 1972; Gelhar et al., 1979; Benson et al., 2019). This decomposition provides a general splitting of the tensor into

\begin{matrix} (2) & D = D_{mix} (v) + D_{spread} (v) . \end{matrix}

Lagrangian numerical methods, such as those developed herein, can then be used to separate the simulation of these processes into mass-transfer algorithms that capture the mixing inherent to the system and random walk methods that represent the spreading component (see, e.g., Ding et al., 2017; Benson et al., 2019). As our focus here is mainly driven by the novel implementation of diffusive processes in MTPT algorithms, we will for now assume a purely diffusive system so that v(x)=0. This assumption results in an isotropic diffusion tensor that reduces to

\begin{matrix} (3) & D = D I_{d}, \end{matrix}

where I_d is the d×d identity matrix. The remaining scalar diffusion coefficient can also be separated into mixing and spreading components, according to

\begin{matrix} (4) & D = D_{mix} + D_{spread} . \end{matrix}

Despite the assumption of zero advection, we simulate spreading via random walks as an eventual necessity for moving particles within our DDC scheme. Stationary particles do not provide computational complexity for the mass-transfer algorithm as distances between particles remain constant.
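As a hedged illustration of the spreading component, the sketch below applies one standard Brownian random-walk update, in which each coordinate moves by a Gaussian increment with variance 2 D_spread Δt. The helper names (`split_diffusion`, `random_walk_step`) and the `mix_fraction` parameter are hypothetical; how D is apportioned between mixing and spreading is a modeling choice that this section does not specify.

```python
import math
import random


def split_diffusion(D, mix_fraction):
    """Split a scalar D into (D_mix, D_spread) with D = D_mix + D_spread.

    mix_fraction is a hypothetical user choice in [0, 1].
    """
    d_mix = mix_fraction * D
    return d_mix, D - d_mix


def random_walk_step(positions, d_spread, dt, rng):
    """Move each particle by an isotropic Gaussian step.

    positions: list of coordinate tuples (any dimension d).
    Each coordinate receives an increment with variance 2 * d_spread * dt.
    """
    sigma = math.sqrt(2.0 * d_spread * dt)
    return [tuple(c + sigma * rng.gauss(0.0, 1.0) for c in p) for p in positions]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the sketch is reproducible
    d_mix, d_spread = split_diffusion(1.0, 0.5)
    particles = [(rng.uniform(0.0, 1000.0), rng.uniform(0.0, 1000.0))
                 for _ in range(5)]
    print(random_walk_step(particles, d_spread, 0.1, rng))
```

The mixing component D_mix would be handled separately by the mass-transfer step, which is not sketched here.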

2.1 Initial conditions and analytic solution

We define a general and well-established benchmark test problem to facilitate the analysis of speedup and computational efficiency. Based on the chosen tiling method, the global domain is subdivided into equally sized subdomains, and each core knows its own local, non-overlapping domain limits. The particles are then load balanced between the cores and randomly scattered within the local domain limits. To represent the initial mass distribution, we use a Heaviside step function in an L^d-sized domain, which assigns mass M=1 to all particles with position $x \geq L / 2$ and no mass to particles with position $x < L / 2$. This initial condition allows us to assess the accuracy of simulations because, for an infinite domain (i.e., simulated processes occur away from boundaries for all time), it admits the exact analytical solution

\begin{matrix} (5) & C (x, t) = \frac{1}{2} \mathrm{erfc} \left[ - \frac{x - x^{'}}{\sqrt{4 D t}} \right], \end{matrix}

where $x^{'} = L / 2$ and t is the elapsed time of the simulation. The existence of an analytical solution is beneficial to our ability to rigorously test our proposed schemes. We compare simulated results to this solution using the root-mean-squared error (RMSE). Note that all dimensioned quantities are unitless for the analysis we conduct and all references to run times are measured in CPU wall clock time.
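The accuracy check described here can be sketched in a few lines of Python. The sketch uses the standard complementary-error-function solution for a diffusing step, with √(4Dt) as the length scale in the denominator, and a plain root-mean-squared error; the function names are hypothetical, and in practice the "simulated" values would be the particle masses from the MTPT run rather than the placeholder used below.

```python
import math


def analytic_concentration(x, t, D, L):
    """Exact solution for a Heaviside initial condition centered at x' = L / 2."""
    x_prime = L / 2.0
    return 0.5 * math.erfc(-(x - x_prime) / math.sqrt(4.0 * D * t))


def rmse(simulated, exact):
    """Root-mean-squared error between two equal-length sequences."""
    if len(simulated) != len(exact):
        raise ValueError("sequences must have the same length")
    n = len(simulated)
    return math.sqrt(sum((s - e) ** 2 for s, e in zip(simulated, exact)) / n)


if __name__ == "__main__":
    L, D, t = 1000.0, 1.0, 10.0
    xs = [480.0, 490.0, 500.0, 510.0, 520.0]
    exact = [analytic_concentration(x, t, D, L) for x in xs]
    # Placeholder "simulation": the un-diffused step, for illustration only.
    step = [1.0 if x >= L / 2.0 else 0.0 for x in xs]
    print(exact)
    print(rmse(step, exact))
```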

2.2 Simulation parameters

Unless otherwise stated, all 2-D simulations will be conducted with the following computational parameters: the L×L domain is fixed with L=1000; the time step is fixed to Δt=0.1; the number of particles is N=10⁷; and the diffusion constant is chosen to be D=1. The total time to be simulated is fixed as T=10, which results in 100 time steps during each simulation.
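A few quantities follow directly from these parameters and can be checked with simple arithmetic. The sketch below is only a sanity check; the derived names (`density`, `diffusion_length`, `boundary_margin`) are illustrative and do not appear in the paper. Comparing the length scale √(4DT) with the distance from the initial front at x = L/2 to the domain edge gives a rough sense of why the infinite-domain solution is a reasonable reference for this setup.

```python
import math

# Parameters stated in the text for the 2-D runs.
L = 1000.0   # domain side length
dt = 0.1     # time step
N = 10**7    # number of particles
D = 1.0      # diffusion coefficient
T = 10.0     # total simulated time

n_steps = round(T / dt)                          # number of time steps
density = N / L**2                               # mean particles per unit area
diffusion_length = math.sqrt(4.0 * D * T)        # length scale in the erfc solution
boundary_margin = (L / 2.0) / diffusion_length   # front-to-edge distance in those units

if __name__ == "__main__":
    print(n_steps, density, diffusion_length, boundary_margin)
```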

Figure 1. Panel (a) displays the computed particle masses at final simulation time T=10, and panel (b) provides a computed vs. analytical solution comparison at the corresponding time. The parameters for this run are N=10⁷, Δt=0.1, and D=1.
