An explicit GPU-based material point method solver for elastoplastic problems (ep2-3De v1.0)

We propose an explicit GPU-based solver within the material point method (MPM) framework on a single graphics processing unit (GPU) to resolve elastoplastic problems under two- and three-dimensional configurations (i.e., granular collapses and slumping mechanics). Modern GPU architectures, including Ampere, Turing and Volta, provide a computational framework that is well suited to the locality of the material point method in view of high-performance computing. For computationally intense and nonlocal aspects (i.e., the back-and-forth mapping between the nodes of the background mesh and the material points), we use straightforward atomic operations (the scattering paradigm). We select the generalized interpolation material point method (GIMPM) to resolve the cell-crossing error, which typically arises in the original MPM because of the C0 continuity of its linear basis functions. We validate our GPU-based in-house solver by comparing numerical results for granular collapses with the available experimental data sets. Good agreement is found between the numerical and experimental results for the free surface and failure surface. We further evaluate the performance of our GPU-based implementation for the three-dimensional elastoplastic slumping mechanics problem. We report i) a maximum performance gain of ×200 between a CPU- and a GPU-based implementation, provided that ii) the hardware limit (i.e., the peak memory bandwidth) of the device is reached. We finally showcase an application to slumping mechanics and demonstrate the importance of a three-dimensional configuration coupled with heterogeneous properties to resolve complex material behaviour.


Introduction
Graphics processing units, or GPUs, have revolutionized the entire field of high-performance computing (HPC) in the last decade. GPUs are many-core processors that were originally developed by the gaming industry in the mid-1990s to accelerate graphics and video rendering. Currently, GPUs are widely employed hardware accelerators used in various applications, including artificial intelligence (AI) and machine learning. GPUs are also increasingly used for high-performance scientific computing (see Dong et al. 2015b; Omlin et al. 2018; Räss et al. 2018; Zhang et al. 2021; Alkhimenkov et al. 2021). The majority of scientific algorithms on many-core (e.g., GPU) hardware accelerators are memory-bounded, meaning that data transfers (reading and writing) limit the performance of a solver. This is in contrast to compute-bounded algorithms, for which arithmetic floating point calculations are the main limiting factor in solver performance.

In this section, we briefly describe the governing equations implemented in the MPM solver. We use a linear elastoplastic rheology. Large deformations are carried out via a rate-dependent formulation with the Jaumann stress rate.

Governing equations
The conservation of linear momentum is given by (using the Einstein summation convention)

$$\rho\,\frac{\partial v_k}{\partial t} = \frac{\partial \sigma_{kl}}{\partial x_l} + \rho\,g_k,$$

where $\rho$ is the material density, $\sigma_{kl}$ is the Cauchy stress tensor, $v_k = \partial u_k/\partial t$ is the velocity, $u_k$ is the displacement, $g_k$ is the body force, and $k, l = 1..3$.
The conservation of angular momentum is given by $\sigma_{kl} = \sigma_{lk}$. Dirichlet and Neumann boundary conditions are $u_k = \hat{u}_k$ on $\partial\Omega_u$ and $\sigma_{kl}\,n_l = \hat{\tau}_k$ on $\partial\Omega_\tau$, where $\hat{u}_k$ and $\hat{\tau}_k$ are prescribed displacements and tractions, and $n_k$ is a unit normal vector pointing outward from the boundary $\partial\Omega$ of the domain $\Omega$. Following the standard FEM procedure, we use the updated Lagrangian framework; thus, the weak form of Eq. 1 is written in the current spatial configuration. The weak form of Eq. 1 can be obtained by multiplying it with a test function $\phi$ and then applying integration by parts and the divergence theorem, leading to

$$\int_\Omega \rho\,\frac{\partial v_k}{\partial t}\,\phi\,\mathrm{d}\Omega = -\int_\Omega \sigma_{kl}\,\frac{\partial\phi}{\partial x_l}\,\mathrm{d}\Omega + \int_\Omega \rho\,g_k\,\phi\,\mathrm{d}\Omega + \int_{\partial\Omega_\tau} \hat{\tau}_k\,\phi\,\mathrm{d}S,$$

where $\partial v_k/\partial t = a_k$ is the acceleration, $\phi$ is any test function that vanishes on $\partial\Omega_u$, and $\hat{\tau}_k$ is the external traction applied on the boundary $\partial\Omega_\tau$, $k = 1..3$. However, in our MPM implementation, tractions on the boundary are not used. Eq. 4 can be solved using a finite element approach, leading to the following compact form:

$$m_n\,a_{k,n} = f^{\mathrm{ext}}_{k,n} - f^{\mathrm{int}}_{k,n},$$

where $m_n = \sum_{p=1}^{n_p}\phi_n(\boldsymbol{x}_p)\,m_p$ is the lumped mass at node $n$, which avoids an expensive matrix inversion (Sulsky et al., 1994; Bardenhagen and Kober, 2004; González Acosta et al., 2020). The external $f^{\mathrm{ext}}_{k,n}$ and internal $f^{\mathrm{int}}_{k,n}$ forces at node $n$ are then defined by

$$f^{\mathrm{ext}}_{k,n} = \sum_{p=1}^{n_p}\phi_n(\boldsymbol{x}_p)\,m_p\,g_k, \qquad f^{\mathrm{int}}_{k,n} = \sum_{p=1}^{n_p} v_p\,\sigma_{kl,p}\,\frac{\partial\phi_n(\boldsymbol{x}_p)}{\partial x_l},$$

where $m_p$ is the material point's mass, $v_p$ is the material point's volume and $\sigma_{kl,p}$ is the material point's Cauchy stress tensor.

Solving Eq. 5 for the acceleration $a_{k,n}$, the updated velocity is obtained via a forward-Euler scheme, $v^{t+\Delta t}_{k,n} = v^{t}_{k,n} + \Delta t\,a_{k,n}$, where the velocity is given by $v^{t}_{k,n} = m_n^{-1}\sum_{p=1}^{n_p}\phi_n(\boldsymbol{x}_p)\,m_p\,v_{k,p}$ and $v_{k,p}$ is the material point's velocity. Boundary conditions are enforced on the boundary nodes. The material point velocity $v_{k,p}$ and coordinates $x_{k,p}$ are defined by a mapping (i.e., an interpolation) between the updated solution on the mesh and the material points, i.e., $x^{t+\Delta t}_{k,p} = x^{t}_{k,p} + \Delta t\sum_{n=1}^{n_n}\phi_n(\boldsymbol{x}_p)\,v^{t+\Delta t}_{k,n}$, where $n_n$ is the number of nodes $n$ associated with a material point $p$. The remaining tasks are i) to update the material point volume and ii) to solve for the constitutive stress-strain relationship.
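As an illustration, a minimal CUDA C sketch of this nodal forward-Euler update could read as follows; the kernel and array names are hypothetical, with nodal quantities stored in flattened arrays of size $3\,n_{no}$:

__global__ void updateNodes(float *vn, const float *mn, const float *fext,
                            const float *fint, const int *bc, float dt, int nno)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index (node, component)
    if (i < 3 * nno && mn[i / 3] > 0.0f) {
        float a = (fext[i] - fint[i]) / mn[i / 3];   // a_{k,n} = m_n^{-1}(f_ext - f_int)
        vn[i] += dt * a;                             // forward-Euler velocity update
        if (bc[i] == 1) vn[i] = 0.0f;                // enforce fixed boundary nodes
    }
}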

Rate formulation
The large deformation framework necessitates a suitable stress-strain formulation. Some studies prefer the finite deformation framework and employ a linear relationship between Kirchhoff stresses and logarithmic strains (Charlton et al., 2017; Gaume et al., 2018; Coombs et al., 2020). In the present work, we adopt a rate-dependent framework by applying the Jaumann rate (e.g., Huang et al. 2015; Wang et al. 2016b, c; Bandara et al. 2016), which yields an objective stress rate measure.
The Jaumann rate of the Cauchy stress is given by

$$\frac{\mathcal{D}\sigma_{ij}}{\mathcal{D}t} = C_{ijkl}\,\dot{\epsilon}_{kl},$$

where $C_{ijkl}$ is the 4th-rank tangent stiffness tensor. Thus, the Jaumann stress derivative may be written as

$$\frac{\mathcal{D}\sigma_{ij}}{\mathcal{D}t} = \frac{D\sigma_{ij}}{Dt} - \sigma_{ik}\,\omega_{jk} - \sigma_{jk}\,\omega_{ik},$$

where $\omega_{ij} = (\partial_i v_j - \partial_j v_i)/2$ is the vorticity tensor and $D\sigma_{ij}/Dt$ corresponds to the material derivative

$$\frac{D\sigma_{ij}}{Dt} = \frac{\partial\sigma_{ij}}{\partial t} + v_k\,\frac{\partial\sigma_{ij}}{\partial x_k}.$$

By rearranging the Jaumann stress derivative in Eq. 12, we obtain

$$\frac{D\sigma_{ij}}{Dt} = C_{ijkl}\,\dot{\epsilon}_{kl} + \sigma^{R}_{ij}, \qquad \sigma^{R}_{ij} = \sigma_{ik}\,\omega_{jk} + \sigma_{jk}\,\omega_{ik},$$

where $\sigma^{R}_{ij}$ represents the rotation of the Cauchy stress tensor, which satisfies the stress objectivity for the rate-dependent formulation.
Let us expand $\sigma^{R}_{ij}$ in Eq. 14 using the identities $\sigma_{ij} = \sigma_{ji}$, $\omega_{ij} = -\omega_{ji}$ and $\omega_{kk} = 0$. The Cauchy stress tensor is written using the so-called Voigt notation (as a vector $\sigma = \{\sigma_{xx}, \sigma_{yy}, \sigma_{zz}, \sigma_{xy}, \sigma_{yz}, \sigma_{xz}\}$). After expanding, collecting and rearranging terms, the objective stress terms $\sigma^{R}_{ij}$ for a three-dimensional configuration are

$$\sigma^{R}_{xx} = 2\,(\sigma_{xy}\,\omega_{xy} + \sigma_{xz}\,\omega_{xz}),$$
$$\sigma^{R}_{yy} = 2\,(-\sigma_{xy}\,\omega_{xy} + \sigma_{yz}\,\omega_{yz}),$$
$$\sigma^{R}_{zz} = -2\,(\sigma_{xz}\,\omega_{xz} + \sigma_{yz}\,\omega_{yz}),$$
$$\sigma^{R}_{xy} = \omega_{xy}\,(\sigma_{yy} - \sigma_{xx}) + \sigma_{xz}\,\omega_{yz} + \sigma_{yz}\,\omega_{xz},$$
$$\sigma^{R}_{yz} = \omega_{yz}\,(\sigma_{zz} - \sigma_{yy}) - \sigma_{xy}\,\omega_{xz} - \sigma_{xz}\,\omega_{xy},$$
$$\sigma^{R}_{xz} = \omega_{xz}\,(\sigma_{zz} - \sigma_{xx}) - \sigma_{xy}\,\omega_{yz} + \sigma_{yz}\,\omega_{xy},$$

and, for a two-dimensional configuration assuming plane strain conditions, Eqs. 15, 16 and 18 reduce to

$$\sigma^{R}_{xx} = 2\,\sigma_{xy}\,\omega_{xy}, \qquad \sigma^{R}_{yy} = -2\,\sigma_{xy}\,\omega_{xy}, \qquad \sigma^{R}_{xy} = \omega_{xy}\,(\sigma_{yy} - \sigma_{xx}).$$

Elastoplastic deformation
A nonassociated Drucker-Prager model (D-P model) with a tension cutoff is used in this study, similar to Huang et al. (2015); Liu et al. (2020); Nguyen et al. (2020); Zuo et al. (2020), because of its straightforward implementation within explicit numerical solvers. The D-P model has been established as an approximation of the Mohr-Coulomb (M-C) model (Krabbenhoft et al., 2012; Alejano and Bobet, 2012), i.e., a conical yield surface that approximates the M-C yield surface in the principal stress space. The former can be adjusted by its parameters so that it passes either through the outer or the inner edges of the M-C yield surface (Jiang and Xie, 2011; De Borst et al., 2012).
The D-P yield function $f$ (see Fig. 1) is typically defined in terms of invariants: the first invariant of the Cauchy stress tensor $I_1 = \sigma_{kk}$ and the second invariant $J_2 = \frac{1}{2}\tau_{ij}\tau_{ji}$ of its deviatoric part $\tau_{ij}$, where the deviatoric part of the Cauchy stress is $\tau_{ij} = \sigma_{ij} + \delta_{ij}\,p$ with the pressure $p = -\frac{1}{3}\sigma_{kk}$. The D-P yield surface is made of two surfaces (i.e., representing the shear and tensile yield criteria), delimited by

$$f_s(\sigma_m, \tau) = \tau + q_\phi\,\sigma_m - k_\phi, \qquad f_t(\sigma_m, \tau) = \sigma_m - \sigma_t,$$

where $\tau = \sqrt{J_2}$ is the effective shear stress, $\sigma_m = -p$ is the mean stress, $q_\phi$ and $k_\phi$ are material parameters defined by the internal friction angle $\phi$, $\sigma_t$ is the tensile strength and $c$ is the cohesion. The cohesion varies with the accumulated plastic strain $\bar{\epsilon}_p$ when considering a strain-softening material, i.e., $c = f(\bar{\epsilon}_p)$. These two surfaces define two plastic regions (see Fig. 1) corresponding to either the shear or the tensile failure mode. We use a nonassociated plastic flow law for shear and tensile failures; thus, the plastic potential functions are written as

$$g_s(\sigma_m, \tau) = \tau + q_\psi\,\sigma_m, \qquad g_t(\sigma_m, \tau) = \sigma_m,$$

where $q_\psi$ is a material parameter estimated with the dilation angle $\psi$.
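For illustration, a minimal CUDA C device function evaluating both yield surfaces from the Cauchy stress of a single material point (stored in the Voigt order given above) could read as follows; the function name and signature are hypothetical:

__device__ void dpYield(const float s[6], float q_phi, float k_phi, float sig_t,
                        float *fs, float *ft)
{
    float p   = -(s[0] + s[1] + s[2]) / 3.0f;             // pressure p = -sigma_kk / 3
    float sm  = -p;                                       // mean stress sigma_m
    float txx = s[0] + p, tyy = s[1] + p, tzz = s[2] + p; // deviatoric diagonal
    float J2  = 0.5f * (txx * txx + tyy * tyy + tzz * tzz)
              + s[3] * s[3] + s[4] * s[4] + s[5] * s[5];  // J2 = 0.5 tau_ij tau_ji
    float tau = sqrtf(J2);                                // effective shear stress
    *fs = tau + q_phi * sm - k_phi;                       // shear yield function f_s
    *ft = sm - sig_t;                                     // tensile yield function f_t
}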
The line segment $h(\sigma_m, \tau) = 0$ represents the diagonal line between $f_s(\sigma_m, \tau) = 0$ and $f_t(\sigma_m, \tau) = 0$ in the $(\sigma_m, \tau)$ plane, i.e., $h$ is the boundary between the shear and tensile failure modes. The function $h(\sigma_m, \tau)$ is given by

$$h(\sigma_m, \tau) = \tau - \tau^{P} - \alpha^{P}\,(\sigma_m - \sigma_t),$$

with the constants $\tau^{P} = k_\phi - q_\phi\,\sigma_t$ and $\alpha^{P} = (1 + q_\phi^2)^{1/2} - q_\phi$. We consider an inner adjustment of the D-P yield surface with respect to the M-C yield surface (de Souza Neto et al., 2011), and the model parameters used in Eqs. 24 & 26 are given by

$$q_\phi = \frac{6\,\sin\phi}{\sqrt{3}\,(3 + \sin\phi)}, \qquad q_\psi = \frac{6\,\sin\psi}{\sqrt{3}\,(3 + \sin\psi)}, \qquad k_\phi = \frac{6\,c\,\cos\phi}{\sqrt{3}\,(3 + \sin\phi)}.$$

In the following, we briefly detail the return mapping strategy used to return the trial Cauchy stress $\sigma^{\mathrm{tr}}_{ij}$ (i.e., assuming pure elastic deformation only) onto the yield surfaces considering $\psi = 0$. A complete description of such return mapping can be found in Huang et al. (2015). Shear failure is declared when i) $f_s(\sigma^{\mathrm{tr}}_m, \tau^{\mathrm{tr}}) > 0$ and $\sigma^{\mathrm{tr}}_m < \sigma_t$ or when ii) $h(\sigma^{\mathrm{tr}}_m, \tau^{\mathrm{tr}}) > 0$ and $\sigma^{\mathrm{tr}}_m \geq \sigma_t$. The corrected Cauchy stress tensor then reads

$$\sigma_{ij} = \frac{k_\phi - q_\phi\,\sigma^{\mathrm{tr}}_m}{\tau^{\mathrm{tr}}}\,\tau^{\mathrm{tr}}_{ij} + \sigma^{\mathrm{tr}}_m\,\delta_{ij},$$

with $\delta_{ij}$ the Kronecker tensor. Tensile failure is declared when $h(\sigma^{\mathrm{tr}}_m, \tau^{\mathrm{tr}}) \leq 0$ and $\sigma^{\mathrm{tr}}_m \geq \sigma_t$. The corrected Cauchy stress tensor then reads

$$\sigma_{ij} = \tau^{\mathrm{tr}}_{ij} + \sigma_t\,\delta_{ij}.$$

3 GIMPM implementation under a GPU architecture

We propose an explicit generalized interpolation material point method (GIMPM) implementation (Dong and Grabe, 2018; Wang et al., 2020) in a three-dimensional configuration on a GPU, taking advantage of the efficient vectorized algorithmic structure (Wyser et al., 2020a, b). We select an explicit GIMPM implementation because it remains valid for a wider variety of problems than more recent variants such as CPDI or CPDI2q (Wang et al., 2019; Coombs et al., 2020). Additionally, we use a double-mapping approach (MUSL, see Nairn 2003; Buzzi et al. 2008), which consists of updating the stress at the end of the time step. We implement the following domain-update methods: a) no update of the material point domain, further labelled uGIMPM, and b) a domain update controlled by the determinant of the deformation gradient, i.e., $\det(F_{ij})$, further labelled cpGIMPM. These domain-update methods are commonly used in the literature (Baumgarten and Kamrin, 2019; Tran and Sołowski, 2019). Their limitation is that they are not ideally suited for specific deformation modes, namely simple stretching and compression (Coombs et al., 2020).

3.1 Implementation on a graphical processing unit (GPU)

Graphical processing units (GPUs) are many-core processors originally designed to refresh screen pixels (e.g., for computer games) independently. A schematic representation of the main architectural differences between a CPU and a GPU is depicted in Fig. 2. On the GPU chip, most of the physical space is dedicated to arithmetic logical units, whereas on a CPU, most of the physical space is dedicated to scheduling and control microsystems. GPUs feature many more cores, a lower thread-scheduling cost and a higher memory bandwidth than CPUs. The programming model is based on a parallel principle called single instruction, multiple data (SIMD), i.e., the same instruction is executed in parallel on different data. GPUs feature a hierarchical structure: the lowest computational unit is the thread, and threads are organized into blocks of threads, the whole constituting a hierarchical grid of blocks of threads. A GPU typically launches thousands of threads, which execute the same instruction in parallel, thus achieving massive parallelism. Additionally, the most recent GPUs offer a high throughput (close to 1 TB per second of peak memory throughput).
Currently, most algorithms are memory-bounded, meaning that memory transfers limit the performance, in contrast to compute-bounded algorithms, where floating point (arithmetic) operations limit the performance. Thus, for an efficient implementation of an algorithm, one must consider a) limiting the memory transfers to the bare minimum and b) avoiding complex data structures (Räss et al., 2019a) to benefit from the high throughput capabilities of GPUs. A GPU is particularly well suited to efficiently execute a large number of local operations in parallel, i.e., single instruction, multiple data (SIMD) programming. In the case of a GIMPM implementation, this includes the calculation of shape functions and the update of various quantities at the material point level (i.e., stresses, domain lengths, material point volumes, etc.). Below, we present key aspects of our GPU-based implementation using the Compute Unified Device Architecture (CUDA C) language of the Nvidia Corporation, which is a syntax extension of the C programming language.
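As an example of such a local operation, the following minimal CUDA C kernel (hypothetical names) updates the volume of each material point from the determinant of its deformation gradient, one thread per material point and with no interaction between threads:

__global__ void updateVolume(float *vp, const float *vp0, const float *F, int nmp)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per material point
    if (p < nmp) {
        const float *Fp = &F[9 * p];                // 3x3 row-major F of point p
        float detF = Fp[0] * (Fp[4] * Fp[8] - Fp[5] * Fp[7])
                   - Fp[1] * (Fp[3] * Fp[8] - Fp[5] * Fp[6])
                   + Fp[2] * (Fp[3] * Fp[7] - Fp[4] * Fp[6]);
        vp[p] = detF * vp0[p];                      // v_p = det(F) v_p^0, purely local
    }
}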
Figure 2. Schematic chip representation for both the central processing unit (CPU) and the graphical processing unit (GPU) architecture (Nvidia, 2007). The latter is made of thousands of arithmetic logical units (ALUs). The CPU architecture is primarily dedicated to control units and cache memory, and the physical space allotted to ALUs is considerably reduced compared to a GPU architecture.

Algorithm workflow
In our implementation, MATLAB acts as an architect (see Fig. 3). It 1) defines the problem geometry (i.e., the background mesh, material point locations and related quantities, etc.), which can be tedious to initialize in a CUDA C environment. It also calls an external MATLAB script, which compiles the necessary source codes, i.e., gpu.cu or cpu.cu. It further 2) calls either the CUDA C or plain C executable, i.e., gpu.exe or cpu.exe, within a Windows OS to solve the numerical problem. This is a powerful combination of a high-level language such as MATLAB and a performant low-level language such as CUDA C or plain C. It is also easy to invoke system commands directly from MATLAB, i.e., to compile source codes and/or run executables using the built-in command system('...'). We rely on OS-agnostic scripting in MATLAB using built-in commands (i.e., isunix or ispc) to ensure that the workflow performs well under all operating system (OS) architectures. In addition, such a workflow can easily be extended to other high-level languages such as Python.

Kernels and launch configuration
We briefly describe our GPU-based implementation (gpu_main.cu) while focusing mainly on the computational aspects of the implementation. Implementing an explicit GIMPM solver in the CUDA C language requires dispatching computational activities into several kernels, i.e., similar to classic functions in a serial C implementation. Each kernel is operated by the GPU only, and kernel launch configuration parameters must be defined for its proper execution. Among them, one must define the number of active threads per block (i.e., the block size) and the number of blocks (i.e., the grid size).
A typical kernel is executed N times in parallel by N distinct threads organized into blocks of threads, i.e., a grid of blocks of threads. The principal hardware limitation is the total number of threads within a block, which cannot exceed 1024; one must ensure that the block size respects this limit.
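A minimal sketch of such a launch configuration, reusing the hypothetical updateVolume kernel above and assuming device arrays allocated elsewhere, could read:

int N     = nmp;                       // e.g., one thread per material point
int block = 256;                       // block size, <= 1024 threads per block
int grid  = (N + block - 1) / block;   // ceil(N / block) blocks cover all items
updateVolume<<<grid, block>>>(d_vp, d_vp0, d_F, N);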

The computational activities are handled by multiple GPU kernels; 11 kernels are successively launched over a computational cycle. An overall description is given in Fig. 4. A while loop is used to perform the computational cycles, and an MPM step is solved at every cycle. The number of accesses to the GPU global memory, $n_{IO}$, is reported in Fig. 4 for each kernel and is estimated by a careful examination of the relevant operations within the kernels. Note that all calculations are performed on the GPU, except the calculation of the adaptive time step, which is serially executed by the CPU.
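A schematic host-side sketch of this computational cycle is given below; the launcher names are hypothetical placeholders for the 11 kernels of Fig. 4, and only the adaptive time step is evaluated serially on the CPU:

extern void  launchBasisFunctions();               // GIMPM basis functions (GPU)
extern void  launchAccumulate();                   // scatter: points -> nodes (atomics)
extern void  launchSolveMomentum(float dt);        // nodal solution, boundary conditions
extern void  launchUpdateMaterialPoints(float dt); // MUSL double mapping, stress update
extern float cflTimestepCPU();                     // serial CFL evaluation on the CPU

void computeCycles(float T)
{
    float t = 0.0f;
    while (t < T) {                  // one MPM step per computational cycle
        float dt = cflTimestepCPU(); // requires a GPU-to-CPU memory transfer
        launchBasisFunctions();
        launchAccumulate();
        launchSolveMomentum(dt);
        launchUpdateMaterialPoints(dt);
        t += dt;
    }
}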

In our GPU-based implementation, we define two distinct types of kernel launch parameters: 1) those used for the back-and-forth mapping between material points and background nodes (i.e., accumulations and projections between material points and their associated nodes) and 2) those used for local calculations at the material point or node level (i.e., the update of material point stresses or the solution of the momentum balance equations on the Eulerian background mesh).

Adaptive time step
An adaptive time step is implemented. For three-dimensional configurations, the maximum wave speeds of the material (Anderson, 1987; Zhang et al., 2017) read as

$$c_x = \max_p\,(c_{el} + |v_{x,p}|), \qquad c_y = \max_p\,(c_{el} + |v_{y,p}|), \qquad c_z = \max_p\,(c_{el} + |v_{z,p}|),$$

where $c_{el} = ((K + 4G/3)/\rho)^{1/2}$ is the elastic wave speed of the material, $K$ and $G$ are the bulk and shear moduli, respectively, $\rho$ is the material density, and $v_{x,p}$, $v_{y,p}$ and $v_{z,p}$ are the material point velocity components. The time step $\Delta t$ is then restricted by the CFL condition

$$\Delta t = \alpha\,\min\!\left(\frac{\Delta x}{c_x},\ \frac{\Delta y}{c_y},\ \frac{\Delta z}{c_z}\right),$$

where $\alpha \in [0;1]$ is the time step multiplier, and $\Delta x$, $\Delta y$ and $\Delta z$ are the background mesh resolutions.
This requires evaluating the maximum velocity over all material points at the beginning of each calculation cycle. We choose to sequentially find the maximum velocity using the CPU instead of a parallel implementation on the GPU. This results in systematic memory transfers between the GPU global memory and the random access memory (RAM) of the CPU. However, we report a low performance loss due to these transfers, i.e., a maximal loss of 2-5 % in performance, which is acceptable.
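A minimal sketch of this serial evaluation (hypothetical names; h_v is a host buffer and d_v the device array of material point velocities) could read:

#include <math.h>
#include <cuda_runtime.h>

float cflTimestepCPU(const float *d_v, float *h_v, int nmp,
                     float K, float G, float rho, float alpha,
                     float dx, float dy, float dz)
{
    // copy material point velocities from GPU global memory to the CPU (RAM)
    cudaMemcpy(h_v, d_v, 3 * nmp * sizeof(float), cudaMemcpyDeviceToHost);
    float vmax[3] = {0.f, 0.f, 0.f};
    for (int p = 0; p < nmp; ++p)             // serial search for the maxima
        for (int k = 0; k < 3; ++k)
            vmax[k] = fmaxf(vmax[k], fabsf(h_v[3 * p + k]));
    float cel = sqrtf((K + 4.f * G / 3.f) / rho);  // elastic wave speed c_el
    float dtx = dx / (cel + vmax[0]);
    float dty = dy / (cel + vmax[1]);
    float dtz = dz / (cel + vmax[2]);
    return alpha * fminf(dtx, fminf(dty, dtz));    // CFL restriction
}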

Back-and-forth mapping between material points and their associated nodes
The GPU-based algorithm relies heavily on the use of the arrays p2e and e2n (Wyser et al., 2020a). Elements are numbered with an increasing index, and the associated nodes are numbered in a similar manner. The array e2n, of dimension $n_{el} \times n_n$, where $n_{el}$ is the total number of elements and $n_n$ is the number of nodes associated with an element $e$, describes the topological relation between the elements and the nodes of the mesh. Similarly, the array p2e describes the topological relation between the material points and the element in which they are located. These two arrays provide an intuitive definition of the relations between i) the material points and the nodes they are associated with (i.e., p2n) and ii) the elements and their nodes (i.e., e2n). It is then computationally straightforward to identify which nodes $n$ are associated with a material point $p$ occupying an element $e$.
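In practice, these lookups reduce to simple index arithmetic, as in the following sketch (hypothetical flattened layout of e2n):

__device__ void nodesOfPoint(const int *p2e, const int *e2n, int nn,
                             int p, int *nodes)
{
    int e = p2e[p];                     // element occupied by material point p
    for (int i = 0; i < nn; ++i)
        nodes[i] = e2n[e * nn + i];     // i-th node associated with element e
}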
The GPU-based implementation relies on the built-in function atomicAdd() in CUDA C. It performs atomic operations, which avoid data races when multiple threads, from the same or different blocks, update the same memory location (a minimal sketch of such an accumulation is given below).

To mitigate volumetric locking, stresses can be averaged over each element. However, we propose averaging only the volumetric part of the stress tensor, i.e., the pressure $p = -\frac{1}{3}\sigma_{kk}$, while its deviatoric part $\tau_{ij} = \sigma_{ij} + p\,\delta_{ij}$ remains unchanged. This results in the following:

$$p_e = \frac{\sum_p p_p\,v_p}{\sum_p v_p},$$

where $p_p$ and $v_p$ are the material point's pressure and volume, respectively. This gives a constant distribution of the pressure field over an element because of its zero-order reconstruction (Lei et al., 2020). The Cauchy stress tensor $\sigma_{ij,p}$ of a material point $p$ occupying an element $e$ is corrected as

$$\sigma_{ij,p} = \tau_{ij,p} - \delta_{ij}\,(p_e)_p,$$

where $\delta_{ij}$ is the Kronecker delta and $(p_e)_p$ is the averaged pressure within an element $e$ and assigned to a material point $p$.
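As an illustration of the scattering paradigm, the following minimal kernel (hypothetical names) accumulates the mass of each material point onto its associated nodes; atomicAdd() guarantees race-free accumulation when several threads target the same nodal memory location:

__global__ void accumulateMass(float *mn, const float *mp, const float *phi,
                               const int *p2e, const int *e2n, int nn, int nmp)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < nmp) {
        int e = p2e[p];                                 // element of material point p
        for (int i = 0; i < nn; ++i) {
            int n = e2n[e * nn + i];                    // associated node index
            atomicAdd(&mn[n], phi[p * nn + i] * mp[p]); // race-free scatter
        }
    }
}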

Available computational resources
The CPU- and GPU-based simulations are performed on a modern workstation running a Windows 10 operating system, on which the latest CUDA version installed is v11.0; the supercomputer Octopus is operated under a CentOS 6.9 environment. To summarize the computational resources in use, Table 1 presents the main characteristics of the GPUs used in this study.

Measuring computational performance on a GPU
Following Omlin (2017), to determine the effective memory throughput, one must estimate (or quantify) the overall set of memory operations (read-and-write or read only), i.e., $n_{IO}$, needed to resolve a given problem. Consequently, we carefully estimate the minimum number of memory operations for a GIMPM-based implementation. This results in the following effective memory throughput:

$$\mathrm{MTP}_{\mathrm{eff}} = \frac{n_{IO}\,n_p\,n_{iter}}{1024^3\,t_{\mathrm{GPU}}}\ \ \left[\mathrm{GB\,s^{-1}}\right],$$

where $n_p$ is the arithmetic precision (i.e., 4 bytes for the single-precision floating-point format FP32 or 8 bytes for the double-precision floating-point format FP64) and $t_{\mathrm{GPU}}$ is the wall-clock time in seconds to complete the $n_{iter}$ iterations needed to solve the numerical problem. For three-dimensional problems, we estimate the minimal number of memory operations per cycle of an explicit GIMPM implementation as a weighted sum of material point, node and element accesses (the per-kernel counts are reported in Fig. 4), where $n_{mp}$ is the number of material points, $n_n$ is the number of associated nodes for an element (i.e., $n_n = 16$ in 2D and $n_n = 64$ in 3D), $n_{no}$ is the number of nodes, and $n_{el}$ is the number of elements. Additionally, we also report the number of calculation cycles per second of the GPU, i.e., it·s$^{-1}$, as well as the wall-clock time. These two metrics give an intuitive sense of the time-to-solution, which is convenient for potential application purposes.
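For illustration, the metric can be evaluated as follows (hypothetical variable names), a sketch assuming $n_{IO}$ has been estimated as described above:

double effectiveThroughput(double n_io, double n_p, double n_iter, double t_gpu)
{
    // MTP_eff in GB/s: n_io accesses of n_p bytes each, over n_iter cycles
    // completed in t_gpu seconds of wall-clock time
    return n_io * n_p * n_iter / (t_gpu * 1024.0 * 1024.0 * 1024.0);
}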

Results
In this section, we present two numerical models using the solver ep2-3De v1.0, namely,

1. Model 1, the granular collapse, which serves as a validation case against experimental data, and

2. Model 2, the three-dimensional earth slump (Varnes, 1958, 1978), which serves as a performance benchmark and a showcase application.

Settings for Models 1a & 1b

We investigate the granular collapse of an aluminium-bar assemblage (Bui et al., 2008) under three-dimensional or plane strain configurations. The geometry of the problem is shown in Fig. 5, and its variables are summarized in Table 2 for both the three-dimensional and plane strain configurations. Note that for Model 1a, we use the same number of elements along the x−direction, $n_{el,x} = 80$, as in Huang et al. (2015). As a direct comparison for Model 1b under a plane strain configuration, Huang et al. (2015) used $n_{el} = 15360$, $\Delta x = \Delta z = 2.5$ mm and $n_{mp} = 25600$.
We consider a noncohesive granular material of density ρ = 2650 kg·m−3, with a bulk modulus K = 0.7 MPa and a Poisson's ratio ν = 0.3, as in Huang et al. (2015). The cohesion is c = 0 Pa, and the internal friction angle is φ = 19.8° with a dilatancy angle ψ = 0, according to Bui et al. (2008). However, the density and stiffness properties have negligible effects on the granular flow dynamics, as reported by Nguyen et al. (2020). We introduce local damping D (see Wang et al. 2016b) to obtain numerical results that are compatible with the experimental results of Bui et al. (2008). We find that D = 0.025 results in the most compatible dynamics. The reasons for the introduction of local damping can be found in Appendix C. Fully fixed boundary conditions (i.e., no slip) are enforced at the bottom, and rollers are enforced on the sidewalls. The total simulation time is 1.0 s, considering a time step multiplier α = 0.5.

Figure 5. Geometry of the granular collapse: the background Eulerian mesh and, in red, the volume of granular material, which is discretized by 8 material points per element. The total number of background elements $n_{el}$ depends on the number of elements in the x−direction $n_{el,x}$ used to discretize the granular material.

Table 2. Parameters used in Models 1a & 1b for the granular collapse. $n_{el,i}$ is the number of elements used to discretize the granular material along the i-th direction, $n_{el}$ and $n_{no}$ are the total numbers of elements and nodes of the background mesh, $n_{pe}$ is the number of material points per element and $n_{mp}$ is the total number of material points. Note that the mesh resolution is $\Delta x = \Delta y = \Delta z = 2.5$ mm.

Model 1a: the three-dimensional granular collapse

To validate the numerical implementation under a GPU architecture, we first compare it against the well-known granular collapse experiments initially performed by Bui et al. (2008). Here, we present and compare numerical results without focusing on the performance of the GPU-based implementation. All the simulations are performed on a consumer electronics RTX 3090 GPU with double arithmetic precision (i.e., $n_p = 8$ bytes).
The results from the numerical simulation under a three-dimensional configuration are shown in Fig. 6. A direct and visual comparison demonstrates excellent agreement between the numerical solver and the experiments of Bui et al. (2008). We observe a slightly longer run-out distance, but the overall geometry of both the failure surface and the free surface is very close to the experimental data. We also report an angle of repose of ≈ 13°, consistent with the experimental value (Bui et al., 2008).

The equivalent accumulated plastic strain $\epsilon_{\mathrm{eqv}}^{p}$ is shown in Fig. 7. We observe a coherent deformation of the granular material with a large shear zone that propagates backward from the base of the material to its top. The mobilized granular material flows along a principal failure surface. However, the overall deformation pattern is rather coarse, i.e., fine structures or local shear bands are not yet observed, even though slight deformation heterogeneities can be seen. This coarse shear banding behaviour is also consistent with previous studies (see Huang et al. 2015; Chalk et al. 2020; Zhang et al. 2021) and is mainly due to the background mesh resolution used in the numerical simulation. We further investigate shear banding using a higher background mesh resolution under a plane strain configuration in Model 1b.

Model 1b: the plane strain granular collapse
We investigate granular collapse under a plane strain configuration, as this allows an increase in the number of elements, resulting in an even finer background mesh (see Table 2). As for Model 1a, the numerical solution agrees with the experimental work of Bui et al. (2008) regarding both the free surface and the failure surface (see Fig. 8). This demonstrates that the three-dimensional and plane strain configurations are in agreement with each other. An interesting feature of granular collapse is the equivalent accumulated plastic strain (see Fig. 9). The GPU-based implementation allows us to increase both the background mesh resolution and the total number of material points. This results in finer plastic strain localizations, as demonstrated by the various shear bands and their complex interactions during the collapse.
Such detailed shear bands are almost impossible to obtain at lower resolutions, which demonstrates the importance of a GPU-based implementation in overcoming the hardware limitations of a CPU-based implementation, i.e., mainly prohibitively long wall-clock times.

Figure 9. A shallow granular flow clearly appears, as suggested by the higher values of $\epsilon_{\mathrm{eqv}}^{p}$. This supports evidence of shallower granular avalanches during collapses. Deeper structures, which result in lower accumulated plastic strains, probably highlight slower deformation modes along well-defined and persistent shear bands.

Model 2

Settings for Models 2a & 2b
Here, we select a cohesive elastoplastic isotropic material (i.e., a homogeneous or heterogeneous peak cohesion field) with no dilatancy behaviour. It is modelled with a pressure-sensitive Drucker-Prager model with linear strain-softening behaviour. It is well known that numerical solutions (as in FEM) are mesh-dependent when considering the strain-softening behaviour of the material. We did not implement techniques to address this issue, but the use of nonlocal plasticity (Galavi and Schweiger, 2010; Burghardt et al., 2012) or viscoplastic formulations (Duretz et al., 2019) are possible ways to address this specific task.
We have chosen an arbitrary geometry (see Fig. 11 and Table 4), which represents an idealized three-dimensional setting, to observe elastoplastic slumps (i.e., earth slumps according to the original classification proposed by Varnes (1958, 1978), which are now classified as rotational slides in the recent update of the Varnes classification proposed by Hungr et al. (2014)). The geometrical setting differs from the one typically used in the literature, as in Zhang et al. (2021). However, it promotes the compression of the toe, which is an expected feature we want to reproduce. The size of the physical domain $l_z \times l_x \times l_y$ is, at most, 12 m × 64 m × 1024 m for Model 2a, whereas it is 12 m × 64 m × 16 m for Model 2b.

We assume this setting features the principal first-order characteristics of a typical rotational earth slump (Varnes, 1958, 1978), i.e., a complex zone of scarps (minor and major) delimiting a crown-like structure, followed by a transition (or depletion) zone in which the material flows homogeneously along internal shear zones due to severe plastic strain localizations and, finally, a compression zone at the toe of the slump.

We select material properties (i.e., bulk and shear moduli K and G, friction angle φ and peak and residual cohesions $c_{\mathrm{peak}}$ and $c_{\mathrm{res}}$) that result in severe deformation processes and strain localizations. The material properties are presented in Table 3. They are close to the values commonly used in the literature (Wang et al., 2016a, b; Bandara et al., 2016; Zhang et al., 2021). To increase deformations even more, we also introduce a weak layer of thickness $0.3\,l_z$ at the base of the material with a lower friction angle $\phi_{\mathrm{weak}}$. A time step multiplier α = 0.5 is selected, i.e., $\Delta t_{\min} = 1.56\cdot10^{-2}$ s is obtained over the whole simulation according to the CFL condition for both Models 2a & 2b. As in Zhang et al. (2021), elastic loading under dynamic relaxation is applied for a period of t = 8 s (i.e., Models 2a & 2b), and the elastoplastic behaviour is then activated for an additional 7 s, resulting in a total simulation time t = 15 s (i.e., Model 2b only).
Gaussian random fields (see Appendix B) are used to initialize the peak cohesion field $c_{\mathrm{peak}}$, which is parametrized by an average cohesion $\bar{c}_{\mathrm{peak}}$ and its standard deviation σ (see Table 4), along with the residual cohesion $c_{\mathrm{res}} = c_{\mathrm{peak}}/4$. This allows us to account for heterogeneities within the material, which lead to complex and heterogeneous displacement fields. We first investigate the computational performance of the solver (Model 2a) before analysing the resolved slumping mechanics (Model 2b).

Model 2a: relative performances
Here, we investigate the computational performance of the solver ep2-3De v1.0 under a three-dimensional configuration on a variety of GPUs with recent architectures: Ampere, Turing and Volta. Furthermore, we restrict our performance analysis to the elastic loading phase (i.e., 8 s of simulation) because it is more complex to determine the exact number of material points that are yielding during each computational cycle (see Fig. 4) and, thus, to infer the exact effective memory throughput.
All the numerical simulations are performed on the computational resources and GPU hardware presented in Table 1 under double arithmetic precision (i.e., $n_p = 8$ bytes in Eq. 38). As a reference baseline, we use the performance obtained for a CPU-based single-threaded implementation of ep2-3De v1.0 on an i9-10900K CPU (i.e., a latest-generation Intel CPU chip). However, this is not representative of a highly optimized multithreaded implementation under a CPU architecture. We report the effective memory throughput MTP_eff of the solver ep2-3De v1.0 on various GPUs and CPUs (see Fig. 12). An increase in the effective memory throughput is observed as the number of material points increases. All GPUs reach a maximum effective throughput; the Tesla V100 scores a maximum effective throughput of ≈ 650 GB·s−1, which corresponds to 88 % of its peak throughput (for the GPU's hardware limit, see Table 1). We report a similar observation for the RTX 2080 ti.
The overall results suggest, as in Räss et al. (2019b), that the most recent GPUs, such as the data-centre Tesla V100 (Volta), offer significant performance compared to entry-level consumer electronics GPUs, such as the GTX 1650. In terms of absolute performance, the more recent the GPU is, the higher its performance. A demonstration is given by the absolute effective throughput of the RTX 2080 ti and the RTX 3090: the latter achieves an additional 20 % throughput compared to the former. We strongly suspect the hardware itself to be the main reason for this. We further compare the performance of the most recent data-centre GPU, i.e., the A100 (Ampere architecture), with that of its predecessor, the V100 (Volta architecture). The A100 reaches ≈ 1100 GB·s−1, which yields a performance gain of 1.6× with respect to the Tesla V100. When compared to the maximum memory throughput in Table 1, this corresponds to 97 % of the hardware limit.

Figure 13. a) Wall-clock time reported for various computing architectures (GPUs and CPU). The differences in the maximal number of material points $n_{mp}$ are due to the on-board memory limit. A significant difference in terms of wall-clock time is observed between the CPU and GPUs, even for the entry-level consumer electronics GTX 1650, i.e., a performance gain of ≈ 10×. b) Performance gains of GPUs relative to the CPU, i.e., 1× as a baseline. We add the CPU and GTX 1650 wall-clock times for an easier comparison.
Finally, we report the wall-clock time for various computing architectures (see Fig. 13a). As expected from the maximum effective memory throughput, the A100 delivers the fastest solution, regardless of the number of material points $n_{mp}$. The A100 GPU resolves a geometry of $n_{mp} \approx 3.2\cdot10^{6}$ in less than a minute (29 seconds), whereas the i9-10900K CPU resolves the same problem in more than an hour (5949 seconds). This corresponds to a 200× performance gain (a 123× performance gain for the V100, see Fig. 13b) compared to the CPU-based implementation of ep2-3De v1.0. The RTX 2080 ti and the RTX 3090 reach 60× and 77× performance gains, respectively. However, the entry-level GTX 1650 is only ten times faster than the i9-10900K. As already shown in Fig. 12a, these performance gains are only expected when the different GPUs reach their maximum effective memory throughput. In terms of runtime, the performance gains (Fig. 13b) are in agreement with the memory throughputs reported in Fig. 12a.

Model 2b: slumping mechanics

In the following, we present the main results for the three peak cohesion fields (see Table 4), and we discuss the main characteristics obtained for typical slumping mechanics.

Homogeneous peak cohesion field
The homogeneous solution gives preliminary but interesting results (see Fig. 14). The first-order characteristics of a slump can be observed, even though their magnitude is modest compared to a real slump. The most striking feature is the development of one major shear zone, along which the material flows (i.e., depletion) towards the toe of the slump, resulting in a compression zone (i.e., thrusting and folding deformations). The crown-like structure develops linearly along the y−direction and is highly localized at the surface of the slump (at x ≈ 20 m in Fig. 14). However, the material flows homogeneously along the x−direction (see the vertical profile in Fig. 14), as shown by the displacement field. The lateral variation of the displacement field (along the y−direction) is almost nonexistent, which is mainly due to the spatial homogeneity of the peak cohesion field.

Isotropic Gaussian covariance
When considering heterogeneities with a Gaussian covariance function for the cohesion field, the displacement field starts to resolve a differential behaviour (see Fig. 15). Locally higher (or lower) values of the peak cohesion field yield smaller (or larger) displacements. This is especially obvious in the transition zone, where this differential behaviour is observable. In addition, the compression zone also starts to resolve spatial variations due to weaker and stronger cohesion values.
A striking difference is the shear zone itself (see Fig. D2): the shear zone exhibits a more complex spatial pattern, whereas only one major shear zone is observed in the homogeneous case (Fig. D1). Retrogressive shear banding appears during the time evolution of the slump, which suggests the development of a secondary shear zone within the slump. Moreover, the crown-like structure is now curved and no longer linear along the y−direction. Its spatial extent is larger, and it is not as localized as in the homogeneous case. Nevertheless, a more complex arrangement of major and minor scarps within the crown-like structure has not yet been observed. Such a structure is more evident if one observes the accumulated equivalent plastic strain $\epsilon_{\mathrm{eqv}}^{p}$ in Fig. D2 in Appendix D.

Figure 14. Displacement field obtained after t = 15 s for a homogeneous peak cohesion field. One can see an overall homogeneous displacement field with some of the first-order characteristics of a slump, i.e., a rotational displacement with a compression zone at the toe, a transition zone delimited by one principal shear zone and a major scarp at the top of the material.

The high magnitude of the displacement field in the areas x ∈ [20; 40] m and y ≥ 8 m is due to a weaker zone in the peak cohesion field (see Fig. D2). This shows a strong influence of the heterogeneous peak cohesion field on the final displacement field. A lower shear strength of the material yields faster strain-softening behaviour, promoting a faster response of shear banding.

Isotropic exponential covariance
Shear banding activity becomes even more complex when an exponential covariance function is used, relative to Fig. 14 and, to some extent, even Fig. 15. The spatial distribution of the peak cohesion (see Fig. D3) resolves finer heterogeneities with a smaller length scale compared to when a Gaussian covariance is used. Principal differences are observed at the top and toe of the slump, where the crown-like structure turns into a complex zone made of minor and major scarps (see Fig. 16). The displacement field becomes highly heterogeneous, particularly at the toe and the top of the slump. However, it is also more homogeneous when compared with Fig. 15.

The difference between the Gaussian and exponential covariance of the peak cohesion suggests the following. Heterogeneous displacement fields could be influenced by larger and/or coarser fluctuations of the shear strength within the material. By extrapolation, this could imply that the magnitude of the heterogeneity might be related to the fluctuation scales of the peak cohesion field. Locally rather homogeneous fluctuations of the peak cohesion (i.e., Gaussian covariance) seem to promote an increasingly heterogeneous displacement field at the surface. The characteristic length scale of spatial fluctuations could have important implications for highly heterogeneous displacements within landslides. The same assumption could hold for understanding the more complex crown-like structure of slumps (see Fig. D3).

Discussion

GIMPM suitability
We investigated granular collapses in both three-dimensional and plane strain configurations. Our numerical results demonstrated the suitability of GIMPM to correctly reproduce experimental granular collapses. They also demonstrated that the results do not significantly differ between the two spatial configurations, i.e., both approaches give similar numerical solutions.

Collapse limitation
For Model 1a, the principal hardware limit is the on-board memory of the GPU. Even though the RTX 3090 is equipped with 24 GB of memory, it is physically impossible to achieve, in three dimensions, the resolution used for the plane strain granular collapse; this would require more than 24 GB of on-board memory. However, a multi-GPU implementation using the message passing interface (MPI) could overcome this hardware limitation of a single GPU, as demonstrated by Räss et al. (2019a) and Alkhimenkov et al. (2021).

Regarding the performance for Model 1b, the wall-clock time is $t_{\mathrm{GPU}}$ = 1470.5 s (25 min), and the number of iterations per second is 85.5 it·s−1 for $n_{mp}$ = 819200. As a preliminary comparison, the same numerical model was performed by Wyser et al. (2020a), who reported 19.98 it·s−1 for $n_{mp}$ = 12800. Proportionally, this corresponds to a performance gain factor of 275 for the GPU-based implementation (ep2-3De v1.0) over the MATLAB-based implementation (fMPMM-solver v1.1) (Wyser et al., 2020a).

Performance
The performance analysis we carried out in Model 2a demonstrated that, even though the algorithm relies heavily on atomic operations to accumulate material point quantities on the nodes, the effective memory throughput reaches up to 88 % of the peak value (for the Tesla V100). We expected a much lower throughput due to the use of these atomic operations, since they were previously reported to undermine the computational performance of implementations under older GPU architectures (e.g., Kepler) (Dong et al., 2015a; Dong and Grabe, 2018; Gao et al., 2018). Our current understanding (at least for a GPU-based implementation of GIMPM) is that the latest GPU architectures (Ampere and Turing) are now efficient when dealing with atomic operations and that the need to use a complex data layout for scattering is not as important as before. Furthermore, we identify the memory throughput as the main bottleneck: at most, an additional 12 % performance improvement is available on the V100 before reaching the hardware limit of the memory bandwidth. The A100 results show that the solver reaches the hardware limit, with an effective memory throughput very close (i.e., 97 %) to the actual maximum memory throughput. Beyond memory bandwidth, the other limiting factor of our implementation is the amount of GPU on-board memory. Further development efforts should therefore be directed towards an MPI implementation of ep2-3De v1.0.

Slumping mechanics
We show the application of the GIMPM solver ep2-3De v1.0 to slumping mechanics. We have presented various slump results and demonstrated the significant influence of heterogeneities within the peak cohesion field on the displacement field and the equivalent plastic strain. However, we have deliberately selected values that result in severe deformations of the material, which we wanted to highlight to demonstrate the potential of the solver. Further efforts should now be oriented towards numerical models that are closer to real and well-documented cases, such as in Tran and Sołowski (2019); Ying et al. (2021). Despite the simplifications we made, we have reported three-dimensional simulations that resolve all the first-order characteristics of slumps, including complex major and minor scarps, different shear zones of varying activity and a complex arrangement within the compression zone. The use of a three-dimensional GIMPM implementation under a GPU architecture should greatly benefit future studies in the field, allowing faster and more detailed numerical simulations of heterogeneous and complex strain localization problems.

Code portability
Our numerical models showed the efficient computing capabilities of modern GPUs under the latest Nvidia GPU architectures.
An important concern is code portability. CUDA C is only applicable to Nvidia GPUs and is not compatible with GPUs from other vendors, such as AMD. As such, an extension of the ep2-3De v1.0 solver towards an OpenCL-based implementation would ensure better code portability in the future.

Conclusions
We developed ep2-3De v1.0, an explicit GPU-based implementation of the generalized interpolation material point method that exploits the capabilities of the most recent GPU architectures (Ampere, Turing and Volta). We achieved fast execution times on a single GPU with a scattering approach that relies on extensive usage of atomic operations. We report, at most, an effective memory throughput of 88 % relative to the maximal hardware capabilities of the GPUs. We achieve, at most, a performance gain of 200× compared to a single-threaded CPU-based implementation of the solver. On entry-level consumer electronics GPUs, we report a performance gain of ≈ 10×. We also report that the memory bandwidth is the main limiting performance factor. We validated our solver against the well-known experimental results of the granular collapse problem in a three-dimensional configuration. Furthermore, we showed applications of the solver to model slumping mechanics considering different material heterogeneities.
Recent GPU architectures (Ampere, Turing and Volta) have certainly been optimized by Nvidia, increasing the efficiency of native atomic operations (i.e., scattering), as suggested by previous studies. This is encouraging, and future GPU implementations might be more straightforward: they could now rely on atomic operations to avoid race conditions between threads, instead of complex parallel data layouts, without suffering a dramatic loss of computational performance.
The single-GPU implementation we propose here should be complemented in the future by taking advantage of the message passing interface (MPI) and the computing power of a multi-GPU implementation to overcome the other limiting factor of ep2-3De v1.0, the GPU on-board memory. Such an implementation would provide a computationally powerful explicit GIMPM solver to address complex elastoplastic problems in the field of geomechanics.
Appendix A: The generalized interpolation material point method

To solve for the cell-crossing instability of the standard MPM (sMPM), Bardenhagen and Kober (2004) introduced the generalized interpolation material point method (GIMPM). Whereas the material point is treated as a point in sMPM, Bardenhagen and Kober (2004) assigned a spatial extent, or a domain, to the material point. Alternative basis functions are constructed, i.e., to consider the material point domain, as follows:

$$\phi_n(\boldsymbol{x}_p) = \frac{1}{v_p}\int_{\Omega_p} \chi_p(\boldsymbol{x})\,N_n(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x},$$

where $v_p$ is the material point volume, $\Omega_p$ denotes the material point domain, $\chi_p(\boldsymbol{x})$ is the particle characteristic function, $N_n(\boldsymbol{x})$ is the basis function (or shape function) for the mapping between the material point $p$ and its associated nodes $n$, and $x = x_p - x_n$ are the local coordinates between node $n$ and material point $p$.
The particle characteristic function must satisfy the partition of unity property, i.e., $\sum_p \chi_p(\boldsymbol{x}) = 1$ (Bardenhagen and Kober, 2004). The simplest particle characteristic function is given by the hat function, i.e.,

$$\chi_p(\boldsymbol{x}) = \begin{cases} 1 & \text{if } \boldsymbol{x} \in \Omega_p,\\ 0 & \text{otherwise}. \end{cases}$$

The GIMPM basis functions and derivatives are constructed analytically (Coombs et al., 2020; Charlton et al., 2017) in one dimension from a convolution of the standard finite element basis functions and the material point characteristic function (Steffen et al., 2008), i.e.,

$$\phi_n(x_p) = \begin{cases} 1 - \dfrac{4x^2 + l_p^2}{4hl_p} & \text{if } |x| < \dfrac{l_p}{2},\\[2mm] 1 - \dfrac{|x|}{h} & \text{if } \dfrac{l_p}{2} \leq |x| < h - \dfrac{l_p}{2},\\[2mm] \dfrac{\left(h + \frac{l_p}{2} - |x|\right)^2}{2hl_p} & \text{if } h - \dfrac{l_p}{2} \leq |x| < h + \dfrac{l_p}{2},\\[2mm] 0 & \text{otherwise}, \end{cases}$$

where $l_p$ is the length of the material point domain, $h$ is the mesh resolution, and $x = x_p - x_n$, where $x_p$ is the coordinate of a material point and $x_n$ is the coordinate of its associated node $n$. The two-dimensional basis function of a node $n$ with its material point $p$ is constructed as $\phi_{np} \equiv \phi_n(\boldsymbol{x}_p) = \phi_n(x_p)\,\phi_n(y_p)$, for which the gradient is defined as $\nabla\phi_{np} \equiv \nabla\phi_n(\boldsymbol{x}_p) = \left(\partial_x\phi_n(x_p)\,\phi_n(y_p),\ \phi_n(x_p)\,\partial_y\phi_n(y_p)\right)$.
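A minimal CUDA C sketch of the analytical one-dimensional basis function above (hypothetical function name) could read:

__device__ float gimpBasis(float x, float h, float lp)
{
    float ax = fabsf(x);                 // |x| = |x_p - x_n|
    if (ax < 0.5f * lp)
        return 1.0f - (4.0f * x * x + lp * lp) / (4.0f * h * lp);
    if (ax < h - 0.5f * lp)
        return 1.0f - ax / h;            // standard linear FEM branch
    if (ax < h + 0.5f * lp) {
        float d = h + 0.5f * lp - ax;    // quadratic tail near the domain edge
        return d * d / (2.0f * h * lp);
    }
    return 0.0f;                         // outside the support
}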
Appendix B: Gaussian random fields

The peak cohesion field is generated as a Gaussian random field with a covariance function characterised by fluctuation scales in a vector format, i.e., $\lambda = (\lambda_x, \lambda_y, \lambda_z)$. In regard to numerical modelling, the principal requirement is that both small and large scales are simultaneously resolved over the computational mesh to ensure physically meaningful solutions.
Recently, Räss et al. (2019b) presented an efficient implementation based on a spectral representation of Gaussian random fields for geophysical applications using either Gaussian or exponential covariance functions. The numerical codes, named GRFS, were made available by Räss et al. (2019b) in both native MATLAB and CUDA C languages. However, a sufficiently large number of harmonics should be used to obtain convergent Gaussian random fields, as stated in Räss et al. (2019b).
Similar to the random material point method (RMPM; see Wang et al. 2016a; Liu et al. 2019; Remmerswaal et al. 2021), which transfers to MPM the random field approach initially proposed by Fenton and Vanmarcke (1990) for the random finite element method (RFEM), we combined this approach with the codes proposed by Räss et al. (2019b) to generate an isotropic peak cohesion field and to demonstrate its influence on the mechanical behaviour.

Appendix C: Volumetric locking and damping corrections
In Huang et al. (2015), no volumetric locking mitigation strategy was introduced, even though low-order elements were used. This should result in severe volumetric locking issues and an overall stiffer response of the granular material. In addition, Huang et al. (2015) used the standard (or original) material point method (instead of the generalized interpolation material point method), which is well known to introduce spurious oscillations of internal forces (González Acosta et al., 2020).
When implementing the proposed volumetric locking mitigation strategy, we observed a) larger deformations of the granular material with a stronger vertical compaction (i.e., a larger vertical displacement) and b) slightly longer run-out distances when compared to the experimental data. The softer mechanical response of the granular material had to be compensated somehow, which can be achieved by the introduction of a small local damping parameter.

We reproduced the numerical setting used in Huang et al. (2015) with the same mesh resolution, i.e., $\Delta x = \Delta y = 2.5$ mm, and a similar number of material points $n_{mp}$ = 28800 with an initial number of material points per initially filled element $n_{pe}$ = 9. The material parameters used for this preliminary investigation are presented in §4.1.1.

Figure C1 a) & b) shows the major differences between the locking-free and locking-prone solutions and the experimental results. As mentioned before, a slightly longer run-out distance is obtained for the locking-free solution. As a result, the free surface predicted by the locking-free solution is underestimated. However, the most noticeable difference is the failure surface. Whereas the failure surface predicted by the locking-prone solution fits the experiment of Bui et al. (2008), it is underestimated by the locking-free solution compared to the experimental results. This is due to the softer response of the granular material when volumetric locking is mitigated, which promotes greater vertical compaction and a longer run-out distance at the same time. Even though the introduction of local damping better resolves the experimental results, one can argue that the locking-free solution without local damping still agrees with the experiment of Bui et al. (2008). The overall response of the numerical granular collapse is still very close to the actual physical experiment, and the differences between the numerical and experimental results can still be considered acceptable.
Figure D2. Heterogeneous cohesion field with a Gaussian covariance function: time evolution of the equivalent plastic strain $\epsilon_{\mathrm{eqv}}^{p}$. Unlike Fig. D1, heterogeneous behaviour is observed, i.e., the appearance of a second shear zone highlights a more complex deformation pattern. Moreover, a crown-like structure starts to develop at the top of the material, where an initial weak zone is located.