Assessing the robustness and scalability of the accelerated pseudo-transient method towards exascale computing

The development of highly efficient, robust and scalable numerical algorithms lags behind the rapid increase in massive parallelism of modern hardware. We address this challenge with the accelerated pseudo-transient iterative method and present here a physically motivated derivation. We analytically determine optimal iteration parameters for a variety of basic physical processes and confirm the validity of theoretical predictions with numerical experiments. We provide an efficient numerical implementation of pseudo-transient solvers on graphical processing units (GPUs) using the Julia language. We achieve a parallel efficiency of over 96% on 2197 GPUs in distributed-memory weak scaling benchmarks. 2197 GPUs allow for unprecedented terascale solutions of 3D variable-viscosity Stokes flow on 4995³ grid cells involving over 1.2 trillion degrees of freedom. We verify the robustness of the method by handling contrasts up to 9 orders of magnitude in material parameters such as viscosity, and arbitrary distributions of viscous inclusions for different flow configurations. Moreover, we show that this method is well suited to tackle strongly nonlinear problems such as shear-banding in


Introduction
The recent development of multi-core devices has led to the democratisation of parallel computing. Since the "memory wall" in the early 2000s (Wulf and McKee, 1995), the continuous increase of the ratio between computation speed and memory access speed has resulted in a shift from compute-bound to memory-bound algorithms. Currently, multi-core devices such as graphical processing units (GPUs) feature floating-point arithmetic processing rates outperforming memory access rates by over an order of magnitude.

https://doi.org/10.5194/gmd-2021-411 Preprint. Discussion started: 13 January 2022. (c) Author(s) 2022. CC BY 4.0 License.

... finding (Barnes, 1999). The DR (dynamic relaxation) terminology is still in use in the finite-element method (FEM) community (Rezaiee-Pajand et al., 2011).
Interestingly, Richardson developed his iterative approach without being aware of the work of Gauss and Seidel, their method being named the Liebmann method when applied to solving PDEs. Early developments of iterative algorithms such as one-dimensional projection methods and Richardson iterations depend on the current iterate only; they were well suited for early low-memory computers, but lacked efficient convergence rates. The situation changed in 1950, when Frankel introduced second-order iterations as an extension of the Richardson and Liebmann methods, adding a dependency on the previous iterate (Frankel, 1950), resulting in the second-order Richardson and extrapolated Liebmann methods, respectively. These methods feature enhanced convergence rates (Young, 1972) and perform on par, the former being slightly more attractive as it is fully local (Riley, 1954). By analogy with the explicit solution of time-dependent PDEs, Frankel introduced additional "physically motivated" terms in his iterative scheme. Since the Chebyshev iteration can be recovered for constant parameters, second-order or extrapolated methods are also termed semi-iterative. Note that one challenge related to Chebyshev semi-iterative methods lies in the need for an accurate estimate of the extremal eigenvalues defining the interval in which the residual is minimised.
The review by Saad (2020) provides further interesting development insights.
The accelerated PT method for elliptic equations is mathematically equivalent to the second-order Richardson rule (Frankel, 1950; Riley, 1954; Otter et al., 1966). The convergence rate of PT methods is very sensitive to the choice of iteration parameters.
For the simplest problems, e.g., stationary heat conduction in a rectangular domain described by Laplace's equation, these parameters can be derived analytically based on the analysis of the damped wave equation (Cox and Zuazua, 1994).
In the general case, the values of these parameters are associated with the maximum eigenvalue of the stiffness matrix. The eigenvalue problem is computationally intensive, and for practical purposes the eigenvalues are often approximated based on Rayleigh's quotient or Gershgorin's theorem (Papadrakakis, 1981). Thus, the effective application of PT methods relies on an efficient method to determine the iteration parameters. In the last decades, several improvements were made to the stability and convergence rate of DR methods (Cassell and Hobbs, 1976; Rezaiee-Pajand et al., 2011; Alamatian, 2012). Determining a general and efficient procedure for estimating the iteration parameters still remains an active area of research.
Among current challenges for iterative methods, we identify three important ones, namely (1) ensuring that the iteration count scales linearly with increasing numerical resolution, ideally independently of material parameter contrasts and nonlinearities, (2) achieving minimal per-device main memory access redundancy at maximal access speed, and (3) achieving a parallel efficiency close to 100% on multi-device, distributed-memory systems. In this study, we address (1) by presenting the accelerated PT method, resolving several types of basic physical processes. We consider (2) and (3) as challenges partly related to scientific software design and engineering; we address them using the emerging Julia language (Bezanson et al., 2017), which solves the "two-language problem", provides the missing tool towards making prototype and production code become one, and breaks up the technically imposed hard division of the software stack into domain science tasks (higher levels of the stack) and computer science tasks (lower levels of the stack). The Julia applications featured in this study rely on recent Julia package developments undertaken by the authors to empower domain scientists to write architecture-agnostic high-level code for parallel high-performance stencil computations on massively parallel hardware such as the latest GPU-accelerated supercomputers. In this work, we present the results of the analytical analysis of the pseudo-transient equations for (non)linear diffusion and incompressible visco-elastic Stokes flow problems. We motivate our selection of particular physical processes by the fact that a broad range of natural processes categorise mathematically as either diffusive, wave-like or mechanical processes, and thus constitute the main building blocks of multi-physics applications. We derive approximations of the iteration parameters from continuous, non-discretised formulations, with emphasis on the analogy between these parameters and the non-dimensional numbers arising in the mathematical modelling of physical processes. Such a physics-inspired numerical optimisation approach has the advantage of providing a framework that builds upon solid classical knowledge and for which various analytical approaches exist to derive or optimise the parameters of interest. We assess the algorithmic and implementation performance and scalability of the 2D and 3D numerical Julia (multi-)GPU (non)linear diffusion and visco-elastic Stokes flow implementations. We report scalability beyond a terascale number of DoFs on up to 2197 Nvidia Tesla P100 GPUs on the Piz Daint supercomputer at the Swiss National Supercomputing Centre, CSCS. We demonstrate the versatility and robustness of our approach in handling nonlinear problems by applying the accelerated pseudo-transient method to resolve spontaneous strain localisation in elasto-viscoplastic media in 2D and 3D, and by comparing the time to solution with direct-sparse solvers in 2D. We further demonstrate that the convergence of the method is mostly insensitive to arbitrary distributions of viscous inclusions with viscosity contrasts up to 9 orders of magnitude in the incompressible viscous Stokes flow limit.

The pseudo-transient method
At the core of the pseudo-transient method lies the idea of considering stationary processes, often described by elliptic PDEs, as the limit of some transient processes described by parabolic or hyperbolic PDEs.
Pseudo-transient methods have a long history, being present in the literature since the 1950s (Frankel, 1950). However, the equations describing the processes under consideration are usually analysed in discretised form with little physical motivation.
We here provide examples of pseudo-transient iterative strategies relying on physical processes as a starting point, for both diffusion and incompressible visco-elastic Stokes problems. We further discuss how the choice of transient physical processes influences the performance of iterative methods, and how to select optimal iteration parameters upon analysing the equations in their continuous form.
In the following, we make two assumptions:

1. The computational domain is a cube.

2. This domain is discretised with a uniform grid of cells. The number of grid cells is the same in each spatial dimension and is equal to n_x.

However, in practice, this solution strategy is not restricted to cubic meshes with identical resolution in each dimension.

Diffusion
Let us first consider the diffusion process:

ρ ∂H/∂t = −∂q_k/∂x_k,  (1)

q_k = −D ∂H/∂x_k,  (2)

where H is some quantity, q_k is its flux, D is the diffusion coefficient, ρ is a proportionality coefficient and t is the physical time.
By substituting Eq. (2) into (1) we obtain an equation for H:

ρ ∂H/∂t = ∂/∂x_k ( D ∂H/∂x_k ),  (3)

which in the case of D = const is, e.g., the standard parabolic heat equation. Equation (3) must be supplemented with initial conditions at t = 0 and two boundary conditions for each spatial dimension, at x_k = 0 and x_k = L. We here assume that Dirichlet boundary conditions are specified. The choice of the boundary condition type affects only the values of the optimal iteration parameters and does not limit the generality of the method.
Firstly, we consider a stationary diffusion process, described by Eq. (3) with ∂H/∂t → 0:

∂/∂x_k ( D ∂H/∂x_k ) = 0.  (4)

Solving Eq. (4) numerically using conventional numerical methods would require assembling a coefficient matrix and relying on a direct or iterative sparse solver. Such an approach may be preferred for 1D and some 2D problems, but since our aim is large-scale 3D modelling, we are interested in matrix-free iterative methods. In the following, we describe two such methods, both based on transient physics.

The first-order pseudo-transient method
The solution to Eq. (4) is achieved as the limit of the solution to the transient Eq. (3) at t → ∞. Therefore, the natural iteration strategy is to integrate the system numerically in time until convergence, i.e., until changes in H, measured in some metric, are smaller than a predefined tolerance.
The simplest pseudo-transient method is to replace the physical time t in Eq. (3) by a numerical pseudo-time τ, and the physical parameter ρ with a numerical parameter ρ̃:

ρ̃ ∂H/∂τ = ∂/∂x_k ( D ∂H/∂x_k ).  (5)

We refer to τ as the "pseudo-time" because we are not interested in the distributions of H at particular values of τ; τ is relevant only for numerical purposes. The numerical parameter ρ̃ can be chosen arbitrarily.
The number of iterations, i.e., the number of steps in pseudo-time required to reach convergence of the simplest method described by Eq. (5), is proportional to n_x² (see Sect. A1 in the Appendix). This quadratic scaling makes the simplest pseudo-transient method impractical for large problems.
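The quadratic scaling can be illustrated with a minimal 1D sketch. We use Python here purely for illustration (the paper's implementations are in Julia); the function name and test setup are ours, not from the paper.

```python
import numpy as np

def first_order_pt(nx, tol=1e-8, max_iter=10**6):
    """First-order pseudo-transient solve of d/dx(D dH/dx) = 0 on [0, L]
    with Dirichlet BCs H(0) = 0, H(L) = 1 and D = 1 (exact solution: H = x/L)."""
    L, D = 1.0, 1.0
    dx = L / nx
    H = np.zeros(nx + 1)
    H[-1] = 1.0                           # Dirichlet boundary conditions
    dtau = dx**2 / D / 2.1                # explicit parabolic stability limit
    for it in range(1, max_iter + 1):
        q = -D * np.diff(H) / dx          # flux on cell faces, cf. Eq. (2)
        dHdtau = -np.diff(q) / dx         # residual of Eq. (4) at interior nodes
        H[1:-1] += dtau * dHdtau          # pseudo-time update, cf. Eq. (5)
        if np.max(np.abs(dHdtau)) < tol:  # converged to the stationary solution
            return H, it
    return H, max_iter
```

Doubling nx roughly quadruples the iteration count, which motivates the accelerated variant of Sect. 2.1.2.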
One possible way to circumvent the poor scaling of this first-order method would be to employ an unconditionally stable pseudo-time integration scheme. However, that would require solving systems of linear equations, making the cost of one iteration equal to the cost of solving the original steady-state problem. We are thus interested in a method that is not significantly more computationally expensive than the first-order scheme but that offers improved scalability.

The accelerated pseudo-transient method
One of the known extensions to the classical model of diffusion incorporates inertial terms in the flux definition (Chester, 1963). That addition makes it possible to describe wave propagation in otherwise diffusive processes. Those inertial terms are usually neglected because the time of wave propagation and relaxation is small compared to the characteristic time of the process (Maxwell, 1867). The modified definition of the diffusive flux, originally derived by Maxwell from the kinetic theory of ideal gases, takes the following form:

θ_r ∂q_k/∂τ + q_k = −D ∂H/∂x_k,  (6)

where θ_r is the relaxation time.
A notable difference between the flux definitions in Eq. (6) and Eq. (2) is that the resulting system switches from parabolic to hyperbolic type, and describes not only diffusion but wave propagation phenomena as well. Combining Eq. (1), with t replaced by τ and ρ by ρ̃, and Eq. (6) to eliminate q_k yields

θ_r ρ̃ ∂²H/∂τ² + ρ̃ ∂H/∂τ = ∂/∂x_k ( D ∂H/∂x_k ),  (7)

which for D = const is a damped wave equation that frequently occurs in various branches of physics (Pascal, 1986; Jordan and Puri, 1999). Contrary to the parabolic Eq. (5), the information signal in the damped wave equation propagates at the finite speed V_p = √(D/ρ̃/θ_r).
Equation (7) includes two numerical parameters, ρ̃ and θ_r. The choice of these parameters significantly influences the performance and stability of the pseudo-transient method. Converting Eq. (7) to non-dimensional form reduces the number of free parameters to a single non-dimensional quantity, Re = ρ̃ V_p L/D, which can be interpreted as a Reynolds number.
Another restriction on the values of the iteration parameters arises from the numerical stability conditions of the explicit time integration. The numerical pseudo-time step ∆τ is related to the wave speed V_p via the following stability condition:

∆τ ≤ C ∆x / V_p,  (8)

where ∆x = L/n_x is the spatial grid step and C is a non-dimensional number determined for the linearised problem using a von Neumann stability analysis. For the damped wave equation (Eq. (7)) considered here, C ≈ 1/√n_d, where n_d is the number of spatial dimensions (Alkhimenkov et al., 2021a).
We choose the parameters ρ̃ and θ_r so that the stability condition (Eq. (8)) is satisfied for an arbitrary ∆τ. We introduce the numerical velocity Ṽ = C̃ ∆x/∆τ, where C̃ ≤ C is an empirically determined parameter. We conclude from numerical experiments that using C̃ ≈ 0.95C is usually sufficient for stable convergence; however, for significantly nonlinear problems, lower values of C̃ may need to be specified. Expressions for ρ̃ and θ_r are obtained by taking into account the definition of Re and solving for V_p = Ṽ:

θ_r = L / (Re Ṽ),  (9)

ρ̃ = Re D / (Ṽ L).  (10)

Depending on the value of the parameter Re, the pseudo-transient process described by the damped wave equation (Eq. (7)) will be more or less diffusive. In the case Re → ∞, diffusion dominates, and the accelerated pseudo-transient method becomes equivalent to the first-order method described in Sect. 2.1.1, regaining the undesired quadratic scaling of the convergence rate. If, instead, Re → 0, the system is equivalent to the undamped wave equation, resulting in a never-converging method, because waves do not attenuate. An optimal value of Re, leading to the fastest convergence, exists between these two limits.
To estimate the optimal value of Re, we analyse the spectral properties of Eq. (7). The solution to the damped wave equation is decomposed into a superposition of plane waves with particular amplitudes, frequencies, and decay rates. Substituting a plane wave solution into the equation yields the dispersion relation connecting the decay rate of the wave to its frequency and the value of Re. Considering the solutions to this dispersion relation, it is possible to determine the optimal value of Re, denoted here as Re_opt. For near-optimal values of Re, the number of iterations required for the method to converge exhibits a linear instead of quadratic dependence on the numerical grid resolution n_x, which is a substantial improvement over the first-order pseudo-transient method.
We present detailed explanations and derivations of the dispersion analysis for the different problems in Appendix A, leading to the optimal value of Re for the stationary diffusion problem:

Re_opt = 2π.  (11)

We quantify the convergence rate by the number of iterations n_iter required to reduce the maximal deviation of the solution of the pseudo-transient equation from the true solution of the corresponding stationary problem by a factor of e (the base of the natural logarithm), divided by the number of grid cells n_x. Results of the dispersion analysis for the 1D stationary diffusion problem show n_iter ≈ 0.3 n_x given optimal values of Re (Fig. 1a). We estimate that reducing the residual by 14 orders of magnitude requires only ∼10 n_x iterations.
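For concreteness, here is a minimal 1D sketch of the accelerated scheme (again Python for illustration, Julia being the paper's language). The parameter choices follow the text with C = 1/√n_d = 1 in 1D and C̃ = 0.95; the implicit treatment of the flux update and all names are our own choices, not the paper's:

```python
import numpy as np

def accelerated_pt(nx, Re=2 * np.pi, tol=1e-8, max_iter=10**6):
    """Accelerated pseudo-transient solve of d/dx(D dH/dx) = 0 on [0, L],
    Dirichlet BCs H(0) = 0, H(L) = 1, D = 1; damped-wave iteration, cf. Eq. (7)."""
    L, D = 1.0, 1.0
    dx = L / nx
    H = np.zeros(nx + 1)
    H[-1] = 1.0
    q = np.zeros(nx)                  # face-centred flux
    Vt = 1.0                          # numerical velocity (arbitrary scale, cancels)
    dtau = 0.95 * dx / Vt             # C~ = 0.95 C, with C = 1 in 1D
    theta_r = L / (Re * Vt)           # relaxation time, cf. Eq. (9)
    rho_t = Re * D / (Vt * L)         # numerical density, cf. Eq. (10)
    for it in range(1, max_iter + 1):
        # flux relaxation, cf. Eq. (6), treating q implicitly for robustness
        q += -(dtau / (theta_r + dtau)) * (q + D * np.diff(H) / dx)
        r = -np.diff(q) / dx          # residual of the stationary problem
        H[1:-1] += (dtau / rho_t) * r
        if np.max(np.abs(r)) < tol:
            return H, it
    return H, max_iter
```

Near Re_opt, the iteration count grows roughly linearly with nx instead of quadratically.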
For simplicity, we only consider the case D = const in the dispersion analysis. In that case, both ρ̃ and θ_r are constant. If the physical properties vary in space, i.e., D = D(x_k), the iteration parameters are no longer constant and must be defined locally by the value corresponding to each grid point. If the distribution of D is smooth, this approximation works well in practice and the number of iterations is close to the theoretically predicted value. However, particular care is needed when the distribution of D is discontinuous in space; otherwise, significantly reduced values of C̃ may be required for convergence, ultimately leading to a much higher number of pseudo-transient iterations. We found that taking a local maximum of D between neighbouring grid cells in the definitions of the iteration parameters, Eq. (9) and (10), is sufficient to ensure optimal convergence. The per-grid-point local maximum selection thus effectively acts as a preconditioning technique.
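A sketch of the local-maximum rule in 1D (Python for illustration; the function name is ours):

```python
import numpy as np

def local_max(D):
    """Per grid point, the maximum of D over the point and its two neighbours (1D),
    used in place of D in the iteration parameter definitions, Eq. (9) and (10)."""
    Dm = np.copy(D)
    Dm[:-1] = np.maximum(Dm[:-1], D[1:])   # compare with right neighbour
    Dm[1:] = np.maximum(Dm[1:], D[:-1])    # compare with left neighbour
    return Dm
```

For a smooth D this leaves the parameters nearly unchanged, while across a discontinuity it locally stiffens the iteration parameters instead of forcing a globally reduced C̃.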
For nonlinear and/or complex flow problems, the corresponding optimal values of the iteration parameters, such as the numerical Reynolds number Re, may be determined by systematic numerical experiments. In practice, optimal values of the iteration parameters do not differ significantly from the theoretical predictions derived for the linear case (see Sect. 2.1.3). Most importantly, the linear scaling of the method is still preserved. Also, the accelerated pseudo-transient method admits explicit numerical integration and can be implemented with minimal modifications to the simplest pseudo-transient method.

Diffusion-reaction
The next example addresses stationary diffusion processes coupled with a reaction. Here, we assume that the reaction is described by the first-order kinetics law:

ρ ∂H/∂t = −∂q_k/∂x_k − ρ (H − H_eq)/θ_k,  (12)

where H_eq is the value of H at equilibrium, and θ_k is the characteristic time of the reaction. The flux q_k in Eq. (12) is governed by either Eq. (2) or Eq. (6). The addition of a source term does not change the type of PDE involved in the method formulation, and the iteration strategy based on the discretisation of Eq. (2) still exhibits quadratic scaling. We therefore focus only on the analysis of the accelerated pseudo-transient method. Equations (12), with t replaced by τ and ρ∂H/∂t by ρ̃∂H/∂τ, and (6) reduce to the following equation governing the evolution of H:

θ_r ρ̃ ∂²H/∂τ² + (ρ̃ + θ_r ρ/θ_k) ∂H/∂τ = ∂/∂x_k ( D ∂H/∂x_k ) − ρ (H − H_eq)/θ_k.  (13)

Equation (13) differs from the damped wave equation (Eq. (7)) in that it includes the source term and the additional physical parameters θ_k and ρ. In non-dimensional form, all parameters of Eq. (13) can be reduced to two non-dimensional numbers: Re, defined as in the stationary diffusion case, and Da = ρL²/(D θ_k), a new parameter which can be interpreted as a Damköhler number characterising the ratio of the characteristic diffusion time to the reaction time scale. Contrary to the numerical parameter Re, Da depends only on the physical parameters and cannot be specified arbitrarily.
We present the detailed dispersion analysis for the stationary diffusion-reaction problem in Sect. A3 of the Appendix. The parameters ρ̃ and θ_r are defined according to Eq. (9) and (10), respectively, by analogy with the stationary diffusion case. Optimal values of Re now depend on the parameter Da. We report the results of the dispersion analysis for the diffusion-reaction case as the number of iterations required for an e-fold residual reduction, n_iter, per grid point n_x, as a function of Re and Da, highlighting Re_opt as a function of Da by a dashed line (Fig. 1b). In the limit Da → 0, i.e., when the characteristic time of the reaction is infinitely large compared to the characteristic time of diffusion, Re_opt → 2π, which is the optimal value for the stationary diffusion problem discussed in Sect. 2.1.2. In that limit, the number of iterations required for an e-fold residual reduction n_iter is also equivalent to that of the stationary diffusion problem.
However, as Da increases, the process becomes progressively more reaction-dominated and the pseudo-transient iterations converge accordingly faster.

Transient diffusion
It is possible to apply the pseudo-transient method not only to the solution of stationary problems, but also to problems including physical transient terms. This method is known in the literature as the "dual-time" or "dual-time-stepping" method (Gaitonde, 1998; Mandal et al., 2011).
According to the dual-time method, both physical and pseudo-time derivatives are present in the equation:

ρ̃ ∂H/∂τ + ρ ∂H/∂t = ∂/∂x_k ( D ∂H/∂x_k ).  (15)

Discretising the physical time derivative ∂H/∂t in Eq. (15) using a first-order backward Euler scheme leads to:

ρ̃ ∂H/∂τ = ∂/∂x_k ( D ∂H/∂x_k ) − ρ (H − Ĥ)/∆t,  (16)

where Ĥ is the distribution of H at the explicit layer of the integration scheme, and ∆t is the physical time step. Comparing Eq. (16) and (12) shows the two equations to be mathematically identical. Therefore, the optimal iteration parameters given by Eq. (A15) apply for the transient diffusion as well. The Da parameter thus equals ρL²/(D ∆t) and characterises the fraction of the domain traversed by a particle transported by a diffusive flux during the time ∆t. The optimal value of Re is then given by Eq. (A15) with this value of Da. Frequently, the modelling of certain processes requires relatively small time steps in order to capture important physical features, e.g., shear-heating-induced strain localisation (Duretz et al., 2019a) or spontaneous flow localisation in porous media (Räss et al., 2019a). In that case, values of Da can be very large. Also, every step of a numerical simulation serves as a good initial approximation to the next simulation step, thereby reducing the error amplitude E₁ in Eq. (A16).

Incompressible viscous shear-driven Couette flow
Before considering the incompressible Stokes equations, we present an illustrative example of shear-driven flow to demonstrate the similarity between the generalised diffusion cases discussed above and viscous fluid flow.
We here consider stationary fluid flow between two parallel plates separated by a distance L. We assume the absence of pressure gradients in the directions parallel to the plates. In that case, the Stokes equations reduce to the following system:

∂τ_xy/∂y = 0,  (18)

τ_xy = μ_s ∂v_x/∂y,  (19)

where τ_xy is the deviatoric shear stress, v_x is the velocity parallel to the plates, and μ_s is the shear viscosity.
The steady-state process described by Eq. (18) and (19) can be converted to a pseudo-transient process similar to the one presented in Sect. 2.1.2 by considering an inertial term in the momentum equation (Eq. (18)) and a Maxwell visco-elastic rheology as the constitutive relation for the viscous fluid (Eq. (19)):

ρ̃ ∂v_x/∂τ = ∂τ_xy/∂y,  (20)

(1/G̃) ∂τ_xy/∂τ + τ_xy/μ_s = ∂v_x/∂y.  (21)

Here, ρ̃ and G̃ are numerical parameters, interpreted as a density and an elastic shear modulus, respectively. The system of equations (Eq. (20) and (21)) is mathematically equivalent to the system of equations (Eq. (1) and Eq. (6)) describing pseudo-transient diffusion of the velocity field v_x. The relaxation time θ_r represents in that case the Maxwell relaxation time and is equal to θ_r = μ_s/G̃.

Incompressible viscous Stokes equation
The next example addresses the incompressible creeping flow of a viscous fluid, described by the Stokes equations:

∂(τ_ij − p δ_ij)/∂x_j = f_i,  (22)

∂v_k/∂x_k = 0,  (23)

τ_ij = 2 μ_s ε̇_ij,  ε̇_ij = (1/2)(∂v_i/∂x_j + ∂v_j/∂x_i),  (24)

where τ_ij is the deviatoric stress, p is the pressure, δ_ij is the Kronecker delta, f_i is the body force, v_i is the velocity, and ε̇_ij is the deviatoric strain rate.
Similarly to the shear-driven flow described in Sect. 2.2, a solution to the system (Eq. (22)-(24)) can be achieved by the pseudo-transient time integration described by:

ρ̃ ∂v_i/∂τ = ∂(τ_ij − p δ_ij)/∂x_j − f_i,  (25)

(1/K̃) ∂p/∂τ + ∂v_k/∂x_k = 0.  (26)

Equations (25) and (26) now both include pseudo-time derivatives, of velocity and pressure, and become an inertial and an acoustic approximation to the momentum and mass balance equations, respectively. The additional parameter K̃ arising in Eq. (26) can be interpreted as a numerical or pseudo-bulk modulus.
We use the primary-wave (P-wave) velocity V_p as a characteristic velocity scale for the Stokes problem (Eq. (27)). In addition to the non-dimensional numerical Reynolds number, here defined as Re = ρ̃ V_p L/μ_s, we introduce the ratio between the bulk and shear elastic moduli, r = K̃/G̃.
By analogy with the previous cases, substituting V_p = Ṽ and solving for the numerical parameters ρ̃, G̃ and K̃ yields the expressions Eq. (28)-(30). Similarly to the diffusion-reaction problem studied in Sect. 2.1.3, there are two non-dimensional numbers controlling the process.
However, in the Stokes equations, both parameters are purely numerical and can be tuned to achieve an optimal convergence rate. The dispersion analysis for the 1D linear Stokes equations is detailed in Sect. A4 of the Appendix. We provide the resulting optimal values for converging the 2D Stokes problem (Eq. (31)-(32)), because they differ from the 1D values, and because we consider the 2D and 3D Stokes formulations in the remainder of this study.
In the numerical experiments, we consistently observe faster convergence with slightly higher values of r ≈ 1, likely because some of the assumptions made for the 1D dispersion analysis do not transfer to the 2D formulation (Fig. 1c). Thus, the values presented in Eq. (31-32) should only be regarded as an estimate of the optimal iteration parameters.

Incompressible visco-elastic Stokes equation
The last example addresses the incompressible Stokes equations accounting for a physical visco-elastic Maxwell rheology:

ε̇_ij = (1/(2G)) ∂τ_ij/∂t + τ_ij/(2μ_s),  (33)

where G is the physical shear modulus.
As in the transient diffusion case presented in Sect. 2.1.4, the problem can be augmented by pseudo-transient time integration using a dual-time-stepping approach. Collecting the terms in front of τ_ij, and ignoring the stress from the previous physical time step, τ̂_ij, because it does not change between successive pseudo-transient iterations, one can reduce the visco-elastic Stokes problem to the previously discussed viscous Stokes problem by replacing the viscosity in Eq. (28) with the effective "visco-elastic" viscosity:

μ_ve = ( 1/μ_s + 1/(G ∆t) )⁻¹.

The conclusions and optimal parameter values presented in Sect. 2.3 thus remain valid for the visco-elastic rheology as well.
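The effective viscosity can be sketched as a small helper (Python for illustration; the harmonic-mean form is the standard backward Euler discretisation of the Maxwell rheology, Eq. (33), and the function name is ours):

```python
def mu_viscoelastic(mu_s, G, dt):
    """Effective "visco-elastic" viscosity for a backward Euler step of the
    Maxwell rheology: harmonic combination of mu_s and the elastic term G*dt."""
    return 1.0 / (1.0 / mu_s + 1.0 / (G * dt))
```

For G·∆t much larger than μ_s the viscous limit μ_s is recovered, while for G·∆t much smaller than μ_s the behaviour is essentially elastic.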

Performance and scaling
Assessing the performance of iterative stencil-based applications is twofold; we here report it in terms of algorithmic and implementation efficiency.
The accelerated pseudo-transient method provides an iterative approach that ensures linear scaling of the iteration count with increasing numerical grid resolution n_x (Sect. 2), i.e., the algorithmic scalability or performance. The major advantage in the design of such an iterative approach is its concise implementation, extremely similar to explicit time integration schemes. Explicit stencil-based applications, such as, e.g., elastic wave propagation, can show optimal performance and scaling on multi-GPU configurations because they keep memory accesses to the strict minimum, leverage data locality and only require point-to-point communication (Podladtchikov and Podladchikov, 2013). We here follow a similar strategy.
We here introduce two metrics, the effective memory throughput T_eff (early formulations of the effective memory throughput analysis are found, e.g., in Omlin et al. (2015a), Omlin et al. (2015b), and Omlin (2017)) and the parallel efficiency E (Kumar et al., 1994; Gustafson, 1988; Eager et al., 1989). The effective memory throughput permits assessing the single-processor (GPU or CPU) performance and indicates potential room for improvement. The parallel efficiency permits assessing the distributed-memory scalability, which may be hindered by inter-process communication, congestion of shared filesystems and other practical considerations when scaling on large supercomputers. We perform single-GPU problem size scaling benchmarks to assess the optimal local problem size based on the T_eff metric. We further use the optimal local problem size in weak scaling benchmarks to assess the parallel efficiency E(N, P).

The effective memory throughput
Many-core processors such as GPUs are throughput-oriented systems that use their massive parallelism to hide latency. On the scientific application side, most algorithms perform few floating-point operations per number, or byte, accessed from main memory, and are thus significantly memory bound. The fact that the FLOPS metric is no longer the most adequate for reporting application performance in a majority of cases (e.g. Fuhrer et al., 2018) motivated us to develop a memory-throughput-based performance evaluation metric, T_eff, to evaluate the performance of iterative stencil-based PDE solvers.
The effective memory access, A_eff [GB], is the sum of twice the memory footprint of the unknown fields, D_u (fields that depend on their own history and that need to be read from and written to every iteration), and the known fields, D_k, that do not change every iteration. The effective memory access divided by the execution time per iteration, t_it [s], defines the effective memory throughput, T_eff [GB/s]:

A_eff = 2 D_u + D_k,  T_eff = A_eff / t_it.

The upper bound of T_eff is T_peak, as measured, e.g., by McCalpin et al. (1995) for CPUs, or a GPU analogue. In defining the T_eff metric, we assume that (i) we evaluate an iterative stencil-based solver, (ii) the problem size is much larger than the cache sizes, and (iii) the usage of time blocking is not feasible or advantageous (which is a reasonable assumption for real-world applications). An important concept is not to include in the effective memory access fields that do not depend on their own history (e.g. fluxes); such fields can be re-computed on the fly or stored on-chip. Defining a theoretical upper bound for T_eff that is closer to the real upper bound is work in progress (Omlin et al., 2021b).
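As a sketch (Python for illustration; names ours), T_eff for given field footprints and iteration time:

```python
def T_eff(D_u, D_k, t_it):
    """Effective memory throughput [GB/s]: A_eff = 2*D_u + D_k [GB], the unknown
    fields being read and written every iteration, the known fields only read."""
    A_eff = 2.0 * D_u + D_k
    return A_eff / t_it
```

For example, 1 GB of unknown fields and 0.5 GB of known fields processed in 10 ms per iteration yield an effective throughput of 250 GB/s.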

The parallel efficiency
We employ the parallel efficiency metric to assess the scalability of the iterative solvers when targeting distributed-memory configurations, such as multi-GPU settings. In a weak scaling configuration, i.e., where the global problem size and the computing resources increase proportionally, the parallel efficiency E(N, P) is the ratio between the execution time of a single process, T(N, 1), and the execution time of P processes performing the same number of iterations on a P-fold larger problem, T(N · P, P), where N is the local problem size and P is the number of parallel processes:

E(N, P) = T(N, 1) / T(N · P, P).

Distributed parallelisation permits overcoming the limitations imposed by the available main memory of a GPU or CPU. This is particularly relevant for GPUs, which have significantly less main memory available than CPUs. Distributing work amongst multiple GPUs, using, e.g., the message passing interface (MPI), permits overcoming these limitations and requires parallel computing and supercomputing techniques. Parallel efficiency is a key metric for assessing the overall application performance as it ultimately ensures the scalability of the pseudo-transient method.
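As a sketch (Python for illustration; the timings below are illustrative values, not measurements from the paper):

```python
def parallel_efficiency(T_single, T_parallel):
    """Weak-scaling parallel efficiency E(N, P) = T(N, 1) / T(N*P, P):
    single-process time divided by the time of P processes on a P-fold problem."""
    return T_single / T_parallel
```

A run taking 10.0 s on one GPU and 10.4 s on P GPUs for a P-fold larger problem corresponds to a parallel efficiency of about 96%.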

The numerical experiments
We design a suite of numerical experiments to verify the scalability of the accelerated pseudo-transient method, targeting diffusive processes and mechanics. We consider three distinct diffusion problems in one, two and three dimensions, with a diffusion coefficient that is (i) linear, (ii) a step function with a four-orders-of-magnitude contrast, and (iii) a cubic power-law relation. We then consider mechanical processes using a velocity-pressure formulation to explore various limits, including the variable-viscosity incompressible viscous flow limit, accounting for a Maxwell visco-elastic shear rheology. To demonstrate the versatility of the approach, we tackle the nonlinear mechanical problem of strain localisation in two and three dimensions considering an elasto-viscoplastic rheology (Sect. 8). Finally, we verify the robustness of the accelerated PT method by considering two parametric studies featuring different viscous Stokes flow patterns, and demonstrate the convergence of the method for viscosity contrasts up to 9 orders of magnitude.

Diffusive processes
We first consider time-dependent (transient) diffusion processes defined by Eq. (1) and (2), with the proportionality coefficient ρ = 1. Practical applications often exhibit at least one diffusive component, which can be either linear or nonlinear. Here, we consider linear and nonlinear cases representative of challenging configurations common to a broad variety of forward numerical diffusion-type models:

1. The first case exhibits a linear, constant (scalar) diffusion coefficient.

2. The second case exhibits a spatially variable diffusion coefficient with a contrast of 4 orders of magnitude, where L is the domain extent in a specific dimension and L_D the coordinate at which the transition occurs. Large contrasts in material parameters (e.g. permeability or heat conductivity) are common challenges solvers need to handle when targeting real-world applications.
3. The third case exhibits a nonlinear power-law diffusion coefficient, D = H^n with n = 3, a characteristic value in, e.g., soil and poro-mechanical applications to account for the Carman-Kozeny porosity-permeability relation (Costa, 2006), leading to the formation of solitary porosity waves. The shallow ice approximation or nonlinear viscosity in power-law creep Stokes flow are other applications whose effective diffusion coefficients are defined by power-law relations.

Mechanics
We secondly consider steady-state mechanical problems, defined by Eq. (22) and (23). In practice, we employ a velocity-pressure formulation, which also allows handling the incompressible flow limit. The rheological model builds upon an additive decomposition of the deviatoric strain rate tensor (Maxwell's model), given by Eq. (33). In the following Sect. 5.2, the mechanical problem is solved in the incompressible limit, assuming a linear visco-elastic deviatoric rheology.
In the subsequent application (Sect. 8), the mechanical problem is solved in the compressible elasto-viscoplastic limit.
Hence, the deviatoric rheological model neglects viscous flow and includes viscoplastic flow, where λ̇ and Q stand for the rate of the plastic multiplier and the plastic flow potential, respectively. A similar decomposition is assumed for the divergence of velocity in Eq. (23), which is no longer equal to zero in order to account for elastic and plastic bulk deformation, where K stands for the physical bulk modulus.
In the inclusion parametric study described in Sect. 5.2, we consider the incompressible viscous Stokes flow limit.
5 The model configurations

The diffusion model
We perform the three different diffusion experiments (see Sect. 4.1) on 1D, 2D and 3D computational domains (Fig. 2a, b and c, respectively). The only difference between the numerical experiments lies in the definition of the diffusion coefficient D.
The non-dimensional computational domains are defined for 1D, 2D and 3D, respectively. The domain extent is L_x = L_y = L_z = 10. The initial condition H_0 consists of a Gaussian distribution of amplitude and standard deviation equal to one located in the domain's centre; in the 1D case, x_c is the vector containing the discrete 1D coordinates of the cell centres. The 2D and 3D cases are defined by analogy and contain the respective terms for the y and z directions. We impose Dirichlet boundary conditions such that H = 0 on all boundaries. We simulate a total non-dimensional physical time of 1, performing 5 implicit time steps of ∆t = 0.2.
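The 1D initial condition can be sketched as follows; the exact exponential form (a factor 1/2 in the exponent for a unit standard deviation) is our assumption, while the domain size and cell-centred sampling follow the text:

```python
import numpy as np

# 1D Gaussian initial condition of amplitude 1 and standard deviation 1,
# centred in a domain of extent Lx = 10, sampled at cell centres.
Lx, nx = 10.0, 64
dx = Lx / nx
xc = dx / 2 + dx * np.arange(nx)          # discrete cell-centre coordinates
H0 = np.exp(-0.5 * (xc - Lx / 2) ** 2)    # amplitude 1, standard deviation 1
```

During the solve, Dirichlet conditions H = 0 are additionally imposed on the boundaries; the Gaussian tails are already numerically close to zero there.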

The Stokes flow model
We perform the visco-elastic Stokes flow experiments (see Sect. 4.2) on 2D and 3D computational domains (Fig. 3a and b, respectively). The non-dimensional computational domains are defined for 2D and 3D, respectively. The domain extent is L_x = L_y = L_z = 10. We define as initial condition a circular (2D) or spherical (3D) inclusion of radius r = 1 centred at L_x/2, L_y/2 (2D) and L_x/2, L_y/2, L_z/2 (3D), featuring a 3 orders of magnitude lower shear viscosity µ_s^inc = 10⁻³ compared to the background value µ_s^0 = 1 (Fig. 3). We then perform 10 explicit diffusion steps on the viscosity field µ_s to account for the smoothing introduced by commonly employed advection schemes (e.g. marker-in-cell, semi-Lagrangian or weighted ENO). We define a uniform and constant elastic shear modulus G = 1 and choose the physical time step ∆t = µ_s^0/G/ξ to satisfy a visco-elastic Maxwell relaxation time of ξ = 1. We impose pure shear boundary conditions; we apply compression in the x-direction and extension in the vertical (y in 2D, z in 3D) direction with a background strain rate ε_BG = 1. For the 3D case, we apply no inflow/outflow in the y-direction (Fig. 3b). All model boundaries are free to slip. We perform a total of 5 implicit time steps to resolve visco-elastic stress build-up.
We further perform a series of viscous Stokes numerical experiments in 2D (see Sect. 4.2) to analyse the dependence of the optimal iteration parameters on the material viscosity contrast and the volume fraction of the material with lower viscosity.
We define as initial condition a number n_inc of circular inclusions semi-uniformly distributed in the non-dimensional computational domain. The viscosity in the inclusions is µ_s^inc, with a fixed background viscosity. In the parametric study, we vary the number of inclusions n_inc, the inclusion viscosity µ_s^inc, and the iteration parameter Re. We consider a uniform 3D grid of parameter values, numerically calculating for each combination the steady-state distribution of stresses and velocities.
We consider two different problem setups that correspond to important edge cases. The first setup addresses shear-driven flow, where the strain rates are applied externally via boundary conditions. This benchmark may serve as a basis for calculating effective media properties. The second setup addresses gravity-driven flow with buoyant inclusions. This benchmark is relevant for geophysical applications, e.g. modelling magmatic diapirism or melt segregation, where the volumetric effect of melting leads to the development of either the Rayleigh-Taylor instability or the compaction instability, respectively. In the first setup, we specify pure-shear boundary conditions similarly to the singular inclusion case described in Sect. 5.2.
The body forces f i are set to zero in Eq. ( 22).
In the second setup, we specify free-slip boundary conditions, which corresponds to setting the background strain rate ε_BG to 0. We model buoyancy using the Boussinesq approximation: the density differences are accounted for only in the body forces. We set f_x = 0 and f_y = −ρg, with ρg^0 = 1 in the background and ρg^inc = 0.5 in the inclusions.

Discretisation
We discretise the systems of partial differential equations (Sect. 4) using the finite-difference method on a regular Cartesian staggered grid. For the diffusion process, the quantity being diffused and the fluxes are located at cell centres and at cell interfaces, respectively. For the Stokes flow, pressure, normal stresses and material properties (e.g. viscosity) are located at cell centres while velocities are located at cell interfaces. Shear stress components are located at cell vertices. The staggering relies on second-order conservative finite-differences (Patankar, 1980; Virieux, 1986; McKee et al., 2008), also ensuring that the Stokes flow is inherently devoid of oscillatory pressure modes (Shin and Strikwerda, 1997).
The diffusion process and the visco-elastic Stokes flow include physical time evolution. We implement backward Euler time integration within the pseudo-transient solving procedure (see Sect. 2) and do not assess higher-order schemes, as such considerations go beyond the scope of this study.
In all simulations, we converge the scaled and normalised L2-norm of the residuals, ||R||_L2 / √n_R, where n_R stands for the number of entries of R, to a nonlinear absolute tolerance of tol_nl = 10⁻⁸ at each physical time step within the iterative pseudo-transient procedure (absolute and relative tolerances are comparable given the non-dimensional form of the examples we consider here).
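The convergence criterion can be written compactly; a minimal sketch:

```python
import numpy as np

def converged(R: np.ndarray, tol_nl: float = 1e-8) -> bool:
    """Scaled and normalised L2-norm check: ||R||_2 / sqrt(n_R) < tol_nl,
    with n_R the number of entries of the residual array R."""
    return bool(np.linalg.norm(R) / np.sqrt(R.size) < tol_nl)
```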
The 46-line code fragment (Fig. 4) illustrates the concise implementation of the accelerated PT algorithm, here for the 1D nonlinear power-law diffusion case (D = H³). Besides the initialisation part (lines 3-22), the core of the algorithm is contained in no more than 20 lines (lines 23-43). The algorithm is implemented as two nested (pseudo-)time loops, referred to as "dual-time"; the outer loop advances in physical time, the inner loop converges the implicit solution in pseudo-time. The nonlinear term, here the diffusion coefficient D, is explicitly evaluated within the single-loop iterative procedure, removing the need to perform nonlinear iterations on top of a linear solve (e.g. Brandt, 1977; Trottenberg et al., 2001; Hackbusch, 1985). This single-loop local linearisation shows, in practice, a lower iteration count when compared against global linearisation (nested loops). Note that a relaxation of nonlinearities can be implemented in a straightforward fashion in case the nonlinear term hinders convergence (see implementation details in, e.g., Räss et al. (2019a, 2020); Duretz et al. (2019b)). The iteration parameters are evaluated locally, which ensures scalability of the approach and removes the need to perform global reductions, which are costly in parallel implementations. Note that the numerical iteration parameters ρ and θ_r, arising from the finite-difference discretisation of the pseudo-time derivatives in Eq. (16) and (6), where k is the current pseudo-time iteration index, always occur in combination with ∆τ. Since we are not interested in the evolution of pseudo-time or the particular values of the iteration parameters, it is possible to combine them in the implementation.
We therefore introduce the variables dτ_ρ = ∆τ/ρ and θr_dτ = θr/∆τ. Using the new variables helps to avoid specifying the value of ∆τ, which could otherwise be specified arbitrarily. The two lines of physics, namely the PT updates, are here evaluated in an explicit fashion. Alternatively, one could solve for qHx and H assuming their values in the residual (the terms contained in the right-most parentheses) to be new instead of current, resulting in an implicit update. The advantages are enhanced stability (the CFL on line 10 could be set to 1) and removing the need to define a small number (ε in the iteration parameter definitions) to prevent division by 0. The implicit approach is implemented as an alternative in the full code available online.
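The dual-time structure can be sketched in a few lines of Python for the 1D nonlinear case (D = H³). Note this is a hedged stand-in, not the paper's Fig. 4: instead of the two-field update with dτ_ρ and θr_dτ, it uses an equivalent damped ("momentum") PT update, and the damping factor and pseudo-time step are our assumptions, not the paper's optimal parameters.

```python
import numpy as np

# Dual-time accelerated PT sketch: backward Euler in physical time,
# damped pseudo-time iterations converging each implicit solve.
Lx, nx = 10.0, 128
dx     = Lx / nx
xc     = dx / 2 + dx * np.arange(nx)
H      = np.exp(-0.5 * (xc - Lx / 2) ** 2)   # Gaussian initial condition
H[0] = H[-1] = 0.0                           # Dirichlet H = 0 boundaries
dt, nt, tol = 0.2, 5, 1e-8
damp   = 1.0 - 4.0 / nx                      # assumed damping factor

for _ in range(nt):                          # outer loop: physical time
    H_old, dHdtau = H.copy(), np.zeros(nx)
    for _ in range(100 * nx):                # inner loop: pseudo-time
        D   = np.maximum(H, 0.0) ** 3        # nonlinear coefficient, local linearisation
        qHx = -0.5 * (D[1:] + D[:-1]) * np.diff(H) / dx   # fluxes at interfaces
        R   = np.zeros(nx)
        R[1:-1] = -(H[1:-1] - H_old[1:-1]) / dt - np.diff(qHx) / dx
        err = np.linalg.norm(R) / np.sqrt(nx)
        if err < tol:
            break
        # pseudo-time step from an (assumed) explicit stability estimate,
        # including the backward Euler 1/dt contribution
        dtau   = 1.0 / (1.0 / dt + 2.1 * max(D.max(), 1e-10) / dx ** 2)
        dHdtau = R + damp * dHdtau           # "temporal" damping of the iteration
        H[1:-1] += dtau * dHdtau[1:-1]       # boundary values stay at zero
```

With the paper's locally evaluated optimal parameters the normalised iteration count per time step is reported to be close to n_x; the simple damping above typically needs more iterations but illustrates the dual-time structure.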

The Julia multi-XPU implementation
We use the Julia language (Bezanson et al., 2017) to implement the suite of numerical experiments. Julia's high-level abstractions, multiple dispatch and meta-programming capabilities make it amenable to portability between backends (e.g. multi-core CPUs and Nvidia or AMD GPUs). Also, Julia solves the two-language problem, making it possible to fuse prototype and production applications into a single code that is both high-level and performance-oriented, ultimately increasing productivity.
We use the ParallelStencil.jl (Omlin et al., 2021b) and ImplicitGlobalGrid.jl (Omlin et al., 2021a) Julia packages we developed as building blocks to implement the diffusion and Stokes numerical experiments. ParallelStencil.jl permits writing architecture-agnostic high-level code for parallel high-performance stencil computations on GPUs and CPUs, here referred to as XPUs. Performance similar to native CUDA C/C++ (Nvidia GPUs) or HIP (AMD GPUs) can be achieved. ParallelStencil.jl seamlessly composes with ImplicitGlobalGrid.jl, which allows for distributed parallelisation of stencil-based XPU applications on a regular staggered grid. In addition, ParallelStencil.jl enables hiding communication behind computation, where the communication package used can, a priori, be any package that lets the user control when communication is triggered. The communication and computation overlap approach splits the local domain calculations into two regions, the boundary regions and the inner region, the latter containing the majority of the local domain's grid cells. After successful completion of the boundary region computations, the halo update (using e.g. point-to-point MPI) overlaps with the inner-point computations. Selecting the appropriate width of the boundary region permits fine-tuning for optimal hiding of the MPI communication (Räss et al., 2019c; Alkhimenkov et al., 2021b).
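The overlap pattern described above can be sketched schematically; the functions below are pure Python stand-ins (hypothetical names, trivial "stencil" updates, and a no-op in place of the MPI halo exchange), intended only to show the ordering of the boundary, communication and inner-region steps:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def compute_boundary(A, w):
    """Stand-in stencil update on the two boundary regions of width w."""
    A[:w] *= 2.0
    A[-w:] *= 2.0

def compute_inner(A, w):
    """Stand-in stencil update on the inner region, the bulk of the local domain."""
    A[w:-w] *= 2.0

def halo_update(A):
    """Stand-in for a point-to-point MPI halo exchange of boundary values."""
    pass

def step_hide_comm(A, w=2):
    compute_boundary(A, w)                 # 1. boundary regions first, ready to send
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(halo_update, A)  # 2. trigger halo update in the background ...
        compute_inner(A, w)                # 3. ... overlapped with inner-point work
        fut.result()                       # 4. wait for communication to complete
    return A
```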

Results
We here report the performance of the accelerated pseudo-transient Julia implementation of the diffusion and Stokes flow solvers targeting Nvidia GPUs using ParallelStencil.jl's CUDA backend. For both physical processes, we analyse the iteration count as a function of the number of grid cells (i.e. the algorithmic performance), the effective memory throughput T_eff [GB/s] (performing a single-device (GPU) problem size scaling), and the parallel efficiency E (multi-GPU weak scaling).
We report the algorithmic performance as the iteration count per number of physical time steps normalised by the number of grid cells in the x-direction. We do not normalise by the total number of grid cells in order to report the one-dimensional scaling even for 2D and 3D implementations. This choice permits a more accurate comparison to the analytically derived results; we leave it to the reader to appreciate the actual quadratic and cubic dependence of the normalised iteration count if using the total number of grid cells in 2D and 3D configurations, respectively.

Solving the diffusion equation
We report a normalised iteration count per total number of physical time steps n_t per number of grid cells in the x-direction n_x, iter_tot/n_t/n_x, for the 1D, 2D and 3D implementations of the diffusion solver for the linear, step function and nonlinear cases (Fig. 5a, b and c, respectively), relating to the spatial distribution of H after 5 implicit time steps (Fig. 6a-c, d-f and g-i, respectively). All three configurations exhibit a normalised number of iterations per time step per number of grid cells close to 1 for the lowest resolution of n_x = 64 grid cells. The normalised iteration count drops with increasing numerical resolution (increasing number of grid cells), suggesting a super-linear scaling. We observe similar behaviour when increasing the number of spatial dimensions while solving the identical problem; 3D calculations are more efficient on a given number of grid cells n_x compared to the corresponding 1D or 2D calculations.
It is interesting to note that the diffusion solver with the nonlinear (power-law) diffusion coefficient reports the lowest normalised iteration count for all three spatial dimension implementations, reaching the lowest normalised iteration count (> 0.4) in the 3D configuration (Fig. 5c). A possible explanation for the lower iteration counts of the nonlinear problem is that, by the nature of the solution, the distribution of the diffused quantity H at t = 1 is much closer to the initial profile than in the linear case. Therefore, at each time step, the values of H are closer to the values of H at the next time step and thus serve as a better initial approximation. Both the diffusion with linear (Fig. 5a) and step function diffusion coefficients (Fig. 5b) show a similar trend in their normalised iteration count, with values decreasing while increasing the number of spatial dimensions.
(Figure 6 caption fragment: the upper (a-c), middle (d-f) and lower (g-i) panels refer to the diffusion with linear, linear step and nonlinear (power-law) diffusion coefficients, respectively. For the 1D case, the blue and red lines represent the initial and final distributions, respectively. The colormap (2D and 3D) relates to the y-axis (1D).)

Solving visco-elastic Stokes flow
We further report the normalised iteration count per total number of physical time steps n_t per number of grid cells in the x-direction.

Performance
We use the effective memory throughput T_eff [GB/s] and the parallel efficiency E(N, P) to assess the implementation performance of the accelerated pseudo-transient solvers, as motivated in Sect. 3. We perform the single-GPU problem size scaling and the multi-GPU weak scaling tests on different Nvidia GPU architectures, namely the "data-centre" GPUs Tesla P100 (Pascal, PCIe), Tesla V100 (Volta, SXM2) and Tesla A100 (Ampere, SXM2). We run the weak scaling multi-GPU benchmarks on the Piz Daint supercomputer, featuring up to 5704 Nvidia Tesla P100 GPUs, at the Swiss National Supercomputing Centre (CSCS); on the Volta node of the Octopus supercomputer, featuring 8 Nvidia Tesla V100 GPUs with high-throughput (300 GB/s) SXM2 interconnect, at the Swiss Geocomputing Centre, University of Lausanne; and on the Superzack node, featuring 8 Nvidia Tesla A100 GPUs with high-throughput (300 GB/s) SXM2 interconnect, at the Laboratory of Hydraulics, Hydrology and Glaciology (VAW), ETH Zurich.
We assess the performance of the 2D and 3D implementations of the nonlinear diffusion solver (power-law diffusion coefficient) and the visco-elastic Stokes flow solver, respectively. We perform single-GPU problem size scaling tests for both the 2D and 3D solver implementations, and multi-GPU weak scaling tests for the 3D solver implementations only. We report the mean performance out of 5 executions, if applicable.

Single-GPU problem size scaling and effective memory throughput
The 2D and 3D nonlinear diffusion solver single-GPU problem size scaling benchmarks achieve similar effective memory throughput on the targeted GPU architectures relative to their respective peak values T_peak; the values of T_eff for the 2D implementation are slightly higher than the 3D ones for the Volta and Pascal architectures, but similar for the Ampere one. This discrepancy is expected and may be partly explained by an increase in cache misses when accessing z-direction neighbours, which are n_x · n_y grid cells away in main memory. We achieve in 2D T_eff ≈ 920 GB/s, 590 GB/s and 400 GB/s on the Tesla A100, Tesla V100 and Tesla P100, respectively (Fig. 9a). In 3D, we achieve T_eff ≈ 920 GB/s, 520 GB/s and 315 GB/s on the Tesla A100, Tesla V100 and Tesla P100, respectively (Fig. 9b).
For the analogous visco-elastic Stokes flow single-GPU problem size scaling tests, we also report higher T_eff values for the 2D compared to the 3D implementation on all three targeted architectures. We achieve in 2D T_eff ≈ 930 GB/s, 500 GB/s and 320 GB/s on the Tesla A100, Tesla V100 and Tesla P100, respectively (Fig. 10a). In 3D, we achieve T_eff ≈ 730 GB/s, 350 GB/s and 230 GB/s on the Tesla A100, Tesla V100 and Tesla P100, respectively (Fig. 10b). Increased neighbouring access and overall more derivative evaluations may explain the slightly lower effective memory throughput of the visco-elastic Stokes flow solver when compared to the nonlinear diffusion solver.
For reference, the dashed lines (Figs. 9 and 10) represent the peak memory throughput T peak [GB/s] one can achieve on a specific GPU architecture, which is a theoretical upper bound of T eff .
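The T_eff metric itself is straightforward to evaluate; the sketch below assumes a simple accounting (read plus write for every updated field, read-only for the others), while the paper's Sect. 3 gives the authoritative definition:

```python
def t_eff_gbs(n_cells: int, n_updated: int, n_read: int,
              time_per_iter: float, word_size: int = 8) -> float:
    """Effective memory throughput [GB/s]: minimal data volume that must move
    between main memory and the chip per iteration, divided by the time per
    iteration. The field accounting (2 accesses per updated field, 1 per
    read-only field) is an assumption of this sketch."""
    bytes_moved = n_cells * word_size * (2 * n_updated + n_read)
    return bytes_moved / 1e9 / time_per_iter

# e.g. a 512^3 grid, 2 updated fields, 1 read-only field, 10 ms per iteration
t = t_eff_gbs(512**3, 2, 1, 0.01)  # ≈ 537 GB/s
```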

Weak scaling and parallel efficiency
We assess the parallel efficiency of the 3D nonlinear diffusion and visco-elastic Stokes flow solver multi-GPU implementations by performing a weak-scaling benchmark. We use a local (per GPU) problem size of 512³ for the nonlinear diffusion, and 383³ and 511³ for the visco-elastic Stokes flow on the Pascal and on the Volta/Ampere architectures, respectively. Device RAM limitations prevent solving a larger local problem in the latter case. The 3D nonlinear diffusion solver achieves a parallel efficiency E of 97% on 8 Tesla A100 and V100 SXM2 and 98% on 2197 Tesla P100 GPUs (Fig. 11a). The visco-elastic Stokes flow solver achieves a parallel efficiency E of 99% on 8 Tesla A100 SXM2, and 96% on 8 Tesla V100 SXM2 and on 2197 Tesla P100 GPUs (Fig. 11b), respectively. 2197 GPUs represent a 3D Cartesian topology of 13³, resulting in global problem sizes of 6632³ and 4995³ grid cells, respectively. We emphasise that we follow a strict definition of parallel efficiency, where the runtimes of the multi-XPU implementations are compared against the best known single-XPU implementation. As a result, the reported parallel efficiency is below 100% also on a single GPU, correctly showing that the implementation used for distributed parallelisation performs slightly worse than the best known single-GPU implementation. This small performance loss emerges from the splitting of the computation into boundary and inner regions required by the hide-communication feature. Parallel efficiency close to 100% is important to ensure weak scalability of numerical applications when executed on a growing number of distributed memory processes P, which is the path to leveraging current and future supercomputers' exascale capabilities.

Multiple inclusions parametric study
We perform a multiple inclusions benchmark to assess the robustness of the developed accelerated PT method. We vary the viscosity contrast from 1 to 9 orders of magnitude to demonstrate the successful convergence of the iterations even for extreme cases that might arise in geophysical applications such as strain localisation. Further, we vary the number of inclusions from 1 to 46 to verify the independence of convergence from the "internal geometry" of the problem. For each combination of viscosity ratio and number of inclusions, we perform a series of simulations varying the iteration parameter Re to assess the influence of the problem configuration on its optimal value and to verify whether the analytical prediction obtained by the dispersion analysis remains valid over the considered parameter range. For this parametric study, we considered a computational grid consisting of n_x × n_y = 2048² cells. At lower resolutions the convergence deteriorates, resulting in non-convergence for configurations with large viscosity contrasts. The high grid resolution is thus necessary for resolving the small-scale details of the flow. We also adjust the nonlinear tolerance of the iterations to 10⁻⁵ and 10⁻³ for momentum and mass balance, respectively, given our interest in the relative dependence of iteration counts on the iteration parameter Re.
Figure 12 depicts the results for the shear-driven flow case. For a single inclusion (Fig. 12a), the optimal value of the iteration parameter Re does not differ significantly from the one reported by Eq. (31). Moreover, the theoretical prediction for Re remains valid for all viscosity contrasts considered in the study.
For problem configurations involving 14 and 46 inclusions (Fig. 12b and c, respectively), the minimal number of iterations is achieved for values of Re close to the theoretical prediction only for a viscosity contrast of 1 order of magnitude. For larger viscosity contrasts, the optimal value of Re appears to be lower than theoretically predicted, and the overall iteration count is significantly higher. These iteration counts reach 40n_x at the minimum among all values of Re for a given viscosity ratio, and exceed 50n_x for non-optimal values of Re.
For buoyancy-driven flow (Fig. 13), the convergence of the iterations is less sensitive to both the number of inclusions and the viscosity ratio. The observed drift in the optimal value of Re could be partly attributed to the lack of a good preconditioning technique. In this study, we specify the local iteration parameters in each grid cell based on the values of the material parameters, which can be regarded as a form of diagonal (Jacobi) preconditioning. This choice is motivated by the scalability requirements of GPU and parallel computing. Even without employing more advanced preconditioners, our method remains stable and successfully converges for viscosity contrasts up to 9 orders of magnitude, though at the cost of an increased iteration count. In both the shear-driven and the gravity-driven problem setups, the convergence is significantly slower than for the single centred inclusion case. This slowdown could be explained by the complicated internal geometry, involving non-symmetric inclusion placement and large viscosity contrasts, which results in a stiff system.

Applications to nonlinear mechanical problems with elasto-viscoplastic rheology
To demonstrate the versatility of the approach, we tackle the nonlinear mechanical problem of strain localisation in two and three dimensions. In the following applications we consider an elasto-viscoplastic (E-VP) rheological model; thus, the serial viscous damper is deactivated and the flow is compressible. We assume a small-strain approximation. Hence, the deviatoric strain rate tensor may be decomposed in an additive manner in Eq. (42). A similar decomposition is assumed for the divergence of velocity in Eq. (43). The plastic model is based on consistent elasto-viscoplasticity, and the yield function F is defined in terms of the second stress invariant τ_II, the friction angle φ, the viscoplastic viscosity η_vp and the cohesion c. At the trial state, F_trial is evaluated assuming no plastic deformation (λ̇ = 0). Cohesion strain softening is applied, the rate of c being expressed through a hardening/softening modulus h. Viscoplastic flow is non-associated, with a potential function Q expressed in terms of the dilatancy angle ψ. If F ≥ 0, viscoplastic flow takes place and the rate of the plastic multiplier λ̇ is positive. The initial model configuration assumes a random initial cohesion field. Pure shear kinematics are imposed at the boundaries of the domain (see Sect. 5.2). The reader is referred to Duretz et al. (2019a) for the complete set of material parameters. We only slightly modify the value of η_vp with respect to Duretz et al. (2019a), such that η_vp = 9 × 10¹⁸ Pa·s in the present study.
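The relations named above can be summarised as follows; this is a hedged reconstruction using standard Drucker-Prager forms consistent with the symbols in the text, and the paper's exact expressions may differ in detail:

```latex
% Hedged reconstruction of the E-VP relations (standard Drucker--Prager forms):
F       = \tau_{II} - P\sin\varphi - c\cos\varphi - \eta_{vp}\,\dot{\lambda} % yield function
\dot{c} = h\,\dot{\lambda}                                                   % cohesion softening/hardening
Q       = \tau_{II} - P\sin\psi                                              % non-associated flow potential
% If F^{trial} \ge 0, the positive rate \dot{\lambda} follows from the
% consistency condition F = 0.
```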

Performance benefits for desktop-scale computing
Besides the potential to tackle nonlinear multi-physics problems at supercomputer scale, another critical aspect lies in the ability to solve smaller-scale nonlinear problems. Here, we investigate wall-times for the simulation of the previously described problem, comparing the PT scheme against a direct-iterative (DI) scheme relying on the Cholesky factorisation of the symmetrised Jacobian (Räss et al., 2017). In the 2D pseudo-transient solver, written in Julia using the ParallelStencil.jl package, the evaluation of nonlinearities is embedded in the pseudo-time integration loop.
The timings reported for the DI and the PT schemes were produced on a 2.9 GHz Intel Core i5 processor and on a single Nvidia Tesla V100 GPU, respectively. Each simulation resolves 100 physical (implicit) time steps. Models were run at resolutions involving 62² to 510² and up to 1022² grid cells for the DI and PT schemes, respectively. As expected for 2D computations, the reported wall-times are lower using the DI scheme at very low resolutions. However, it is interesting to observe that the GPU-accelerated PT scheme can deliver comparable wall-times at already relatively low resolutions (n_x ≈ 254). The employed CPU and GPU can be considered common devices in current scientific desktop machines. We can thus conclude that the use of GPU-accelerated PT schemes is a viable and practical approach to solve nonlinear mechanical problems on a desktop-scale computer. Moreover, the proposed PT scheme already turns out to be beneficial over common approaches (DI schemes) at relatively low resolutions.

High-resolution 3D results
We here present preliminary 3D results of the spontaneous development of visco-plastic shear bands in pure shear deformation from an initial random cohesion field (Fig. 15). These 3D results further demonstrate the versatility of the pseudo-transient approach, enabling the seamless port of the 2D E-VP algorithm to 3D and extending the recent work by Duretz et al. (2019a) to three-dimensional configurations. We generate the 3D initial condition, a Gaussian random field with an exponential covariance function, following the approach described in Räss et al. (2019b), available through the ParallelRandomFields.jl Julia package (Räss and Omlin, 2021). We perform 100 implicit loading steps using the accelerated pseudo-transient method and execute the parallel Julia code, relying on ParallelStencil.jl and ImplicitGlobalGrid.jl, on 8 Tesla V100 GPUs on the Volta node.
Both the 2D and 3D elasto-viscoplastic algorithms require only minor modifications of the visco-elastic Stokes solver discussed throughout this manuscript to account for brittle failure, the deactivation of the serial viscous damper and the viscoplastic regularisation, without significantly affecting the convergence rate provided by the second-order method. These results support the robustness of the approach, as predicting elasto-plastic deformation and capturing brittle failure categorises as a rather "stiff" problem that challenges numerical solvers accordingly.

Discussion
The continuous development of many-core devices, GPUs at the forefront, increasingly shapes the current and future computing landscape. The fact that GPUs and the latest multi-core CPUs turn classical workstations into personal supercomputers is exciting; tackling previously impossible numerical resolutions or multi-physics solutions becomes feasible thanks to technical progress.
However, the current chip design challenges legacy serial and non-local or sparse matrix-based algorithms for solving partial differential equations. Naturally, solution strategies designed to specifically target efficient large-scale computations on supercomputers perform most efficiently on GPUs and recent multi-core CPUs, as the algorithms used are typically local and minimise memory accesses. Moreover, efficient strategies will rely not at all or only modestly on global communication and, as a result, exhibit close to optimal scaling.
We introduced the PT method in light of, mostly, iterative methods such as dynamic relaxation and semi-iterative algorithms (see, e.g., Saad, 2020, for additional details). These classical methods, as well as the here presented accelerated PT method, implement "temporal" damping by considering higher-order derivatives with respect to pseudo-time. This contrasts with multi-grid or multi-level methods, which build upon a "spatial" strategy based on space discretisation properties to damp the low-frequency error modes. Multi-grid or multi-level methods are widely used to achieve numerical solutions in settings analogous to those described here (Brandt, 1977; Zheng et al., 2013). Also, multi-grid methods may achieve convergence in O(n_x) iterations upon employing an optimal relaxation algorithm (Bakhvalov, 1966).
Besides their scalable design, most iterative methods are challenged by configurations including heterogeneities and large contrasts in material parameters, as motivated by typical applications to a variety of geodynamics problems (e.g. Baumgardner, 1985; Tackley, 1996; May et al., 2015; Kaus et al., 2016). Well-tuned robust multi-grid solvers may overcome these limitations at the cost of more complex implementations. The results of our systematic investigation (Sect. 5.2) suggest, however, that the PT method performs promisingly well, without any specific additional design effort with respect to the basic accelerated PT method implementation.
The ease of implementation lists among the main advantages of the accelerated PT method over more complex ones such as, e.g., multi-grid. Particularly, all nonlinearities can be relaxed within a unique iteration loop, as reported in the nonlinear diffusion results (Sect. 7.1). Often, due to the properties of the problem, the number of iterations does not exceed, or is even less than, that in the case of constant coefficients. Other methods, in which nonlinear iterations are performed separately from the linear solver, cannot take advantage of the details of the physical process to reduce iteration counts. Also, for significantly nonlinear problems, e.g. associated with plastic strain localisation, thermo-mechanical shear heating or two-phase fluid focusing, the physical processes occur on multiple spatial scales. Thus, ensuring an accurate description of these multi-scale solutions requires a high-resolution computational grid; it may be challenging for a coarse-level multi-grid solve to provide sufficient information to accurately resolve small-scale details only resolvable on the fine grid. Note that for analogous reasons, multi-grid methods are often used as a preconditioner for other iterative algorithms rather than as the solution method. Besides the optimal convergence of the algorithm, the implementation efficiency also favours the accelerated PT method; the approach is simple but efficient, making it possible to further implement advanced optimisations such as explicit shared memory usage and register handling. The choice of a regular Cartesian grid allows for static and structured memory access patterns, resulting in access optimisation possibilities and balanced inter-process communication. Also, the absence of global reductions in the algorithm avoids severe bottlenecks. Finally, the amount of data transferred in the accelerated PT method is minimal, which allows achieving near-ideal scaling on distributed-memory systems, as reported in Sect. 7.3.2.
The main task in designing PT methods is the estimation of optimal iteration parameters. For that purpose, the spectral radius of the finite-difference operator is often approximated based on the Gershgorin circle theorem (Papadrakakis, 1981). In this paper, we propose an alternative derivation of the PT algorithm based entirely on a physical analogy. The convergence rate analysis can then be carried out within the framework of the spectral analysis of the continuous equations, rather than of the finite-dimensional space discretisation. The advantage of this approach lies in the availability of a physical interpretation of the iteration parameters, as well as in a clear separation of physics and numerics. For example, we show that for viscoelastic Stokes flow (Sect. 2.4), the pseudo-transient iteration parameters depend on the Reynolds number and the bulk-to-shear elastic modulus ratio. The stability of the iterative process is ensured by a choice of pseudo-physical material properties consistent with the conditions obtained from a von Neumann stability analysis.
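Such a stability condition can also be checked numerically, mode by mode. The sketch below (an illustration with generic parameter choices, not the paper's code) assembles, for each Fourier mode of the 1D Laplacian, the 2×2 iteration matrix of a damped pseudo-transient update, g ← β g + µ H and H ← H + Δτ g, and verifies that its spectral radius stays below one:

```python
import numpy as np

def pt_spectral_radius(nx=128, D=1.0, lx=1.0):
    """Max spectral radius over the Fourier modes of the damped PT update
    g <- beta*g + mu*H ; H <- H + dtau*g  for 1D diffusion (Dirichlet BCs)."""
    dx = lx / nx
    dtau = dx**2 / (2.1 * D)             # parabolic pseudo-time step
    beta = 1.0 - 2.0 * np.pi / nx        # residual damping parameter
    k = np.arange(1, nx)                 # discrete modes
    mu = -4.0 * D / dx**2 * np.sin(k * np.pi / (2 * nx))**2
    rho = 0.0
    for m in mu:
        # per-mode amplification matrix acting on (g, H)
        M = np.array([[beta, m], [dtau * beta, 1.0 + dtau * m]])
        rho = max(rho, np.max(np.abs(np.linalg.eigvals(M))))
    return rho
```

A spectral radius below (but close to) one indicates a stable iteration whose slowest mode governs the overall convergence rate.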
The determination of the optimal iteration parameters is thereby reduced to the search for the optimal values of the dimensionless physical numbers that describe the properties of the underlying physical process. The addition of new physical processes, such as heat conduction, two-phase flow or chemical reactions, will lead to the natural emergence of new dimensionless parameters. Since many physical processes have a similar or even identical mathematical description, the derivation of the accelerated PT method for such processes is expected to proceed analogously to those already developed. In this paper, such a derivation is provided for several important processes, namely linear and nonlinear diffusion, diffusion-reaction, non-stationary diffusion, and the visco-elastic Stokes problem. The efficiency of the accelerated PT method is demonstrated for strongly nonlinear problems, as well as for problems with large contrasts in material properties.
Recently, Wang et al. (2021) studied fluid flow in deformable porous media using an analogous numerical integration scheme.
They show that under certain assumptions, the equations governing the dynamics of such two-phase flows reduce to a "double damped wave equation" system, which is mathematically equivalent to Eqs. (12) and (6) describing the diffusion-reaction process (Sect. 2.1.3). They also report the optimal parameters obtained by dispersion analysis of these equations. Wang et al. (2021) formulate these parameters in terms of dimensional physical quantities; through appropriate rescaling it is possible to recover the non-dimensional form presented in Sect. 2.1.3. We believe that our derivation in terms of non-dimensional variables helps to reveal the analogy between seemingly different physical processes and facilitates reusing the derived iteration parameters for various applications. We provide the analysis for a variety of physical processes, including incompressible Stokes flow, in a unified manner, filling some of the gaps left by previous studies.

Single-GPU problem size scaling results for different GPU architectures further permit extrapolating these wall-times to, e.g., the Nvidia Ampere or Volta architectures, as the time per iteration is directly proportional to the effective memory throughput T_eff. T_eff increases by 3.0× on the A100 and by 1.5× on the V100 compared to the P100 architecture, resulting in wall-times in the order of 2.5 min and 2.4 h on an A100-powered system, and of 5 min and 4.8 h on a V100-powered system, for the nonlinear diffusion and the visco-elastic Stokes solver, respectively.
In practical applications, flow patterns may change drastically throughout a simulation owing to spontaneous flow localisation or to the evolution of interfaces between immiscible phases with significantly different properties. A numerical method is required to be robust with respect to such changes. The iterative algorithm is expected to converge even in extreme cases, e.g., in the presence of sharp gradients in material properties, and the iteration parameters should be insensitive to arbitrary changes in the internal geometry. We present a parametric study to assess the robustness of the accelerated PT method for typical building blocks of geophysical applications. We consider shear- and buoyancy-driven flows with multiple randomly distributed inclusions in a viscous matrix as proxies for more realistic problem formulations.
We show that our method is capable of modelling flows with viscosity contrasts of up to 9 orders of magnitude. The optimal iteration parameters obtained by means of systematic simulation runs do not change significantly over a wide range of material properties and internal configurations of the computational domain. We observe a significant slowdown in convergence for viscosity contrasts larger than 5 orders of magnitude in some of the considered cases. These results are expected given the ill-conditioned problem and motivate the development of a scalable preconditioner suitable for massively parallel GPU workloads. The application of a robust preconditioner, with reference to the previous discussion, may help to partly alleviate the slow convergence. However, for viscosity contrasts of 6 orders of magnitude and above, a significant increase in the number of iterations may be legitimate (May et al., 2015).
The numerical application resolving shear-banding in elasto-viscoplastic media in 3D supports the versatility and robustness of the presented approach, with emphasis on successfully handling complex rheology. These examples complement recent studies employing the accelerated pseudo-transient method to resolve spontaneous localisation owing to multi-physics couplings.

Conclusions
The current HPC landscape redefines the rules governing applications' performance; the massive parallelism of multi-core processors makes most applications memory-bound. Our work shows that the simplest dynamic relaxation schemes, which integrate parabolic systems, achieve remarkably low iteration counts and scale super-linearly with increasing resolution once elegantly transformed into hyperbolic systems using appropriate, physically motivated pseudo-transient (numerical) additions.
Moreover, the conciseness of the accelerated pseudo-transient approach permits applications to execute at effective memory throughput rates approaching the memory copy rates (a theoretical upper bound) of the latest GPUs. Further, hiding communication behind computation permits parallel efficiencies over 96% on various distributed-memory systems on up to 2197 GPUs. The physics we selected for the numerical experiments represent key building blocks for further tackling various multi-physics couplings, usually combinations of mechanical and diffusive processes. Our systematic results on the multi-inclusion setup with large viscosity contrasts provide a preliminary assessment of the robustness of the accelerated PT method, which we further employ to resolve shear band formation in 3D resulting from plastic yielding in elasto-viscoplastic materials.
Our study paves the way for resolving coupled and nonlinear multi-physics applications in the natural sciences and engineering at extremely high resolution on the next generation of exascale-capable supercomputers, revamping elegant iterative techniques and implementing them in the portable Julia language.
Introducing the auxiliary parameter R̃e = Da/Re + Re, it can be seen that the dispersion relation (A13) becomes similar to the one reported in Eq. (A8), and the solution for λ_k follows in the same way. The resulting optimal value is R̃e_opt = 2√(π² + Da), and the optimal value for Re is obtained by solving R̃e(Re) = R̃e_opt for Re, which yields Re_opt = π + √(π² + Da). As expected, in the limit Da → 0, i.e., when the process becomes diffusion-dominated, the optimal value of the parameter Re determined by Eq. (A15) coincides with the value given by Eq. (A10) for a purely diffusive process.
The number of iterations required to converge to a tolerance tol follows from the convergence rate (Eq. A16). Interestingly, the iteration count estimate given by Eq. (A16) decreases in inverse proportion to √(π² + Da). Therefore, the number of iterations for a diffusion-reaction process is always lower than for a pure diffusion process whenever Da > 0.
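The dependence of the optimal parameter on Da can be made concrete in a few lines of Python. This is a sketch based on the expression reconstructed above, Re_opt = π + √(π² + Da); the limits follow directly from the formula:

```python
import math

def re_opt(Da):
    """Optimal numerical Reynolds number for the diffusion-reaction process,
    Re_opt = pi + sqrt(pi^2 + Da), where Da >= 0 is the Damkoehler number."""
    return math.pi + math.sqrt(math.pi**2 + Da)
```

In the diffusion-dominated limit Da → 0 the value reduces to 2π, the optimum for pure diffusion, while the associated convergence rate √(π² + Da) grows with Da, reducing the iteration count.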

A4 Incompressible viscous Stokes
The system of equations (25)–(27) can be reduced to a single equation for the velocity v_x (Eq. A17). Following the established procedure, we reformulate Eq. (A17) in terms of the deviation from the exact solution and consider typical terms in its Fourier series expansion (Eq. A18). Substituting Eq. (A18) into Eq. (A17), we obtain the dispersion relation (A19). Depending on the values of the coefficients, the dispersion relation (A19) has either one real root and two complex conjugate roots, or three real roots.
For the analysis, it is useful to recast the dispersion relation (A19) in the depressed form x³ + px + q = 0 by applying a change of variables.

It is useful to draw an analogy between the presented analysis and some previous studies, namely the pseudo-transient continuation method of Räss et al. (2018, 2019a, 2020) and Duretz et al. (2019b), and the early work of Frankel (1950). For this comparison, we consider only the stationary diffusion process with D = const.
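The reduction to depressed form is the standard Tschirnhaus substitution x = t − b/(3a). The short sketch below (purely illustrative, not tied to the specific coefficients of Eq. A19) computes the coefficients p and q and can be checked by comparing the shifted roots of the depressed cubic against the roots of the original one:

```python
import numpy as np

def depress_cubic(a, b, c, d):
    """Coefficients (p, q) of the depressed cubic t^3 + p t + q = 0 obtained
    from a x^3 + b x^2 + c x + d = 0 via the substitution x = t - b/(3a)."""
    p = (3 * a * c - b**2) / (3 * a**2)
    q = (2 * b**3 - 9 * a * b * c + 27 * a**2 * d) / (27 * a**3)
    return p, q
```

The roots of the original cubic are then the roots of the depressed one shifted by −b/(3a), which is what makes the sign of the discriminant of t³ + pt + q decide between one real plus two complex conjugate roots and three real roots.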
We reformulate the damped wave equation, Eq. (7), as a first-order system by introducing R, the pseudo-time derivative of H (Eqs. B1 and B2). All of the aforementioned studies considered numerical discretisations of Eqs. (B1) and (B2). Let f^k be the finite-difference operator approximating the right-hand side of Eq. (B2) at time step k; a forward Euler scheme for the time integration then yields Eqs. (B3) and (B4). Letting g_i^k = θ_r R_i^k/Δτ, rearranging Eqs. (B3) and (B4) into update rules for H_i^k and g_i^k (Eqs. B5 and B6), and using the definitions of θ_r and Δτ reported in Eqs. (10) and (8) (Eq. B7), it is evident that, combined, Eqs. (B5), (B6) and (B7) are equivalent to the formulation of the pseudo-transient continuation method presented in Räss et al. (2018). Substituting Eq. (B6) into Eq. (B5) and expressing g_i^{k−1} in terms of H_i^{k−1} and H_i^{k−2} yields an update equivalent to the second-order Richardson method proposed by Frankel (1950).
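This equivalence can be verified numerically. The sketch below (illustrative Python with generic parameter choices β and Δτ, not taken from the paper) advances the same 1D stationary diffusion problem with both the two-variable update, g ← β g + f(H), H ← H + Δτ g, and the three-level Frankel form, H^{k+1} = H^k + β (H^k − H^{k−1}) + Δτ f(H^k), and the iterates coincide:

```python
import numpy as np

def f(H, dx, D=1.0):
    """Right-hand side: discrete D * d2H/dx2 on interior points (Dirichlet BCs)."""
    R = np.zeros_like(H)
    R[1:-1] = D * (H[2:] - 2 * H[1:-1] + H[:-2]) / dx**2
    return R

nx = 128
dx = 1.0 / (nx - 1)
dtau = dx**2 / 2.1             # pseudo-time step (generic parabolic limit)
beta = 1.0 - 2 * np.pi / nx    # damping parameter

H1 = np.linspace(0.0, 1.0, nx)**2    # arbitrary initial guess; H(0)=0, H(1)=1
g = np.zeros(nx)
H2, H2_old = H1.copy(), H1.copy()

for k in range(300):
    # two-variable (pseudo-transient continuation) form
    g = beta * g + f(H1, dx)
    H1 = H1 + dtau * g
    # three-level (second-order Richardson / Frankel) form
    H2, H2_old = H2 + beta * (H2 - H2_old) + dtau * f(H2, dx), H2
```

Both sequences converge towards the linear steady state H(x) = x while remaining identical to rounding error, which is the content of the algebraic substitution above.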

Figure 1. Number of iterations per grid point required for e-fold residual reduction. Panels (a), (b) and (c) correspond to the stationary diffusion, stationary reaction-diffusion, and incompressible 2D Stokes problems, respectively.

Figure 4. Numerical Julia implementation of the 1D nonlinear diffusion case D = H³. Lines marked with # [...] refer to skipped lines. See the diff_1D_nonlin_simple.jl script located in the scripts folder of PseudoTransientDiffusion.jl on GitHub for the full code.

Figure 5. Iteration count scaled by the number of time steps nt and by the number of grid cells in the x-direction nx as a function of nx, comparing 1D, 2D and 3D models for the three different diffusion configurations: (a) linear diffusion, (b) linear step diffusion and (c) nonlinear diffusion, respectively.

Figure 6. Model output for the (a, d, g) 1D, (b, e, h) 2D and (c, f, i) 3D time-dependent diffusion of a quantity H. The upper (a–c), centre [...]

[...] for the 2D and 3D implementations of the visco-elastic Stokes solver (Fig. 7), relating to the spatial distribution of vertical velocity (deviation from the background) ΔV_vertical, pressure P and shear stress τ_shear after 5 implicit time steps (Fig. 8a–b, c–d and e–f, respectively). Both the 2D and 3D visco-elastic Stokes flows exhibit a normalised number of iterations per time step per number of grid cells close to 10 at the lowest resolution of nx = 63 grid cells. The normalised iteration count drops with increasing numerical resolution (increasing number of grid cells), suggesting super-linear scaling. We observe similar behaviour when increasing the number of spatial dimensions from 2D to 3D while solving the identical problem; 3D calculations are more efficient for a given number of grid cells nx than the corresponding 2D calculations, in accordance with the results for the various diffusion solver configurations.

Figure 7. Iteration count scaled by the number of time steps nt and by the number of grid cells in the x-direction nx as a function of nx, comparing 2D and 3D visco-elastic Stokes flow containing an inclusion featuring a viscosity contrast (µ0/µi = 10³).

Figure 12. Pure shear-driven flow. The left column reports the number of iterations for different values of the viscosity ratio µ_s^0/µ_s^inc and of the ratio of the numerical Reynolds number Re to the theoretically predicted value Re_opt; connected white dots indicate the value of Re at which the minimal number of iterations is achieved. The right column depicts the distribution of pressure and velocity streamlines. Panels (a), (b), and (c) correspond to problem configurations involving 1, 14, and 46 inclusions, respectively.

Figure 13. Buoyancy-driven flow. The left column reports the number of iterations for different values of the viscosity ratio µ_s^0/µ_s^inc and of the ratio of the numerical Reynolds number Re to the theoretically predicted value Re_opt; connected white dots indicate the value of Re at which the minimal number of iterations is achieved. The right column depicts the deviation of pressure from the hydrostatic distribution and velocity streamlines. Panels (a), (b), and (c) correspond to problem configurations involving 1, 14, and 46 inclusions, respectively.

Figure 14. Performance comparison between the pseudo-transient and direct-iterative methods resolving 2D shear band formation out of a random noise cohesion field. (a) Wall-time in seconds as a function of numerical grid resolution in the x-direction (√n_dof). Panels (b) and (c) depict the total accumulated plastic strain ε_II^pl and the pressure [MPa], respectively, from the 1022 × 1022 resolution 2D PT GPU simulation.

[...] elasto-viscoplastic shear band formation in 2D. We compare a Matlab-based direct-iterative and a Julia GPU-based PT solver, respectively. The M2Di Matlab solver routines (Räss et al., 2017) rely on a Newton linearisation of the nonlinear mechanical problem (Duretz et al., 2019a) and use a combined direct-iterative (DI) solver to compute the Newton steps. The solver is based on the combination of outer Powell-Hestenes and inner Krylov iterations (Global Conjugate Residual) that are preconditioned

Figure 15. Total accumulated plastic strain ε_II^pl distribution for a multi-GPU replica of the 2D calculation (Fig. 14b) resolving 3D shear band formation out of a random noise cohesion field. The numerical resolution is 378 × 378 × 192 grid cells. Panels depict elasto-plastic shear band formation after (a) 60 steps (corresponding to 19.1 kyr and a strain of 0.03), (b) 80 steps (corresponding to 25.4 kyr and a strain of 0.04), and (c) 100 steps (corresponding to 31.6 kyr and a strain of 0.05).

Table 1. Wall-time prognosis for resolving the nonlinear diffusion and the visco-elastic Stokes 3D Julia multi-GPU applications on 2197 (13³) Nvidia Tesla P100 GPUs on the Piz Daint supercomputer at CSCS.

The scalability of the accelerated PT method as a function of numerical resolution permits predicting the total iteration count, here for the nonlinear diffusion and the visco-elastic Stokes solvers in 3D. The weak scaling benchmark results provide the time per iteration as a function of the numerical resolution. Combining this information, it is possible to predict the time-to-solution or wall-time (Table 1) needed to resolve nonlinear diffusion and visco-elastic Stokes flow on 6632³ and 4995³ grid cells, respectively, on 2197 Nvidia Tesla P100 GPUs on the Piz Daint supercomputer at CSCS.