These authors contributed equally to this work.

The development of highly efficient, robust and scalable numerical algorithms lags behind the rapid increase in massive parallelism of modern hardware.
We address this challenge with the accelerated pseudo-transient (PT) iterative method and present a physically motivated derivation.
We analytically determine optimal iteration parameters for a variety of basic physical processes and confirm the validity of theoretical predictions with numerical experiments.
We provide an efficient numerical implementation of PT solvers on graphics processing units (GPUs) using the Julia language. We achieve a parallel efficiency of more than 96 % on 2197 GPUs in distributed-memory parallelisation weak-scaling benchmarks.
The 2197 GPUs allow for unprecedented tera-scale solutions of 3D variable viscosity Stokes flow on

The recent development of multi-core devices has led to the democratisation of parallel computing. Since the “memory wall” in the early 2000s

The current computing landscape challenges scientific computing applications that seek solutions to partial differential equations (PDEs), and in particular their legacy implementations that rely on non-local methods, one example being matrix factorisation-based solvers. The main reason these applications do not perform optimally on modern hardware is their prohibitive memory utilisation, which increases nonlinearly with the number of degrees of freedom (DoFs), itself proportional to the global problem size or the spatial numerical resolution. As a result, the usage of sparse direct solvers in high-performance computing (HPC) is only possible for relatively small-scale problems due to excessive memory and computational resource requirements, which inherently limit parallel scalability. Even storing the sparse matrix structure and nonzero elements in a compressed form is often not possible due to the limited amount of available memory. This situation naturally increases the attraction of iterative matrix-free algorithms for solving large-scale problems.

Pseudo-transient (PT) or dynamic relaxation (DR) methods have seen renewed development over the last decades. The PT methods are matrix-free and build on a transient physics analogy to establish a stationary solution. Unlike Krylov-type methods such as the conjugate gradient or generalised minimal residual (GMRES) methods, PT methods build on a fixed-point iteration, in which the update of each grid point is entirely local and does not require global reductions (and thus global communication) at each step of the algorithm. Given the locality of the algorithm, software implementations can achieve very high per-node performance and near-ideal scaling on distributed-memory systems with accelerators such as GPUs. For Krylov-type methods, some work has been done to limit global communication

The PT methods build on a physical description of a process. It therefore becomes possible to model strongly nonlinear processes and achieve convergence starting from nearly arbitrary initial conditions. Conventional linearisation methods such as the Newton–Raphson method may fail to converge if the initial approximation is not close enough to the solution. Examples include problems of resolving strain localisation owing to plastic yielding

The implementation conciseness constitutes another advantage of PT methods compared to matrix-based solvers. PT algorithms are concise and short as the explicit pseudo-time integration preserves similarity to the mathematical description of the system of PDEs. Conciseness supports efficient and thus faster development and significantly simplifies the addition of new physics, a crucial step when investigating multi-physics couplings. Also, the similarity between mathematical and discretised code notation makes PT methods an attractive tool for research and education.

The PT method originated as a dynamic-relaxation method in the 1960s, when it was applied for calculating the stresses and displacements in concrete pressure vessels

Interestingly, Richardson developed his iterative approach without being aware of the work by Gauss and Seidel, their method being named the Liebmann method when applied to solving PDEs. Early iterative algorithms such as 1D projection methods and Richardson iterations depend on the current iterate only. They were well suited to early low-memory computers but lacked efficient convergence rates. The situation changed in 1950, when Frankel introduced second-order iterations as an extension of the Richardson and Liebmann methods, adding dependency on the previous iterate

The accelerated PT method for elliptic equations is mathematically equivalent to the second-order Richardson rule

Among the current challenges for iterative methods, we identify three as particularly important: (1) ensuring that the iteration count scales linearly with increasing numerical resolution, ideally independently of contrasts in material parameters and of nonlinearities, (2) achieving minimal per-device main memory access redundancy at maximal access speed, and (3) achieving a parallel efficiency close to 100 % on multi-device (distributed-memory) systems. In this study, we address (1) by presenting the accelerated PT method and resolving several types of basic physical processes. We consider (2) and (3) as challenges partly related to scientific software design and engineering; we address them using the emerging Julia language

In this work, we present the results of an analytical study of the PT equations for (non-)linear diffusion and incompressible visco-elastic Stokes flow problems.
We select these particular physical processes because a broad range of natural processes can be categorised mathematically as diffusive, wave-like or mechanical processes, which thus constitute the main building blocks of multi-physics applications.
We derive approximations of the iteration parameters from continuous, non-discretised formulations, with emphasis on an analogy between these parameters and non-dimensional numbers arising from mathematical modelling of physical processes. Such a physics-inspired numerical optimisation approach has the advantage of providing a framework that builds on solid classical knowledge and for which various analytical approaches exist to derive or optimise parameters of interest.
We assess the algorithmic and implementation performance and scalability of the 2D and 3D numerical Julia (multi-)GPU (non-)linear diffusion and visco-elastic Stokes flow implementations. We report scalability beyond tera-scale number of DoFs on up to 2197 Nvidia Tesla P100 GPUs on the

The latest versions of the open-source Julia codes used in this study are available from GitHub within the PTsolvers organisation at

At the core of the PT method lies the idea of considering stationary processes, often described by elliptic PDEs, as the limit of some transient processes described by parabolic or hyperbolic PDEs.
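As a brief illustration of this idea, consider diffusion of a scalar field. The symbols below (field $H$, diffusion coefficient $D$, pseudo-time $\tau$) are chosen for this sketch and are assumptions about notation, not a restatement of a specific equation of this paper. Integrating the parabolic problem in pseudo-time until the pseudo-time derivative vanishes recovers the elliptic, stationary problem:

```latex
\frac{\partial H}{\partial \tau} = \nabla \cdot \left( D \, \nabla H \right)
\quad \xrightarrow{\; \tau \to \infty \;} \quad
\nabla \cdot \left( D \, \nabla H \right) = 0 .
```

Under these assumptions, the pseudo-time $\tau$ carries no physical meaning; it only serves as an iteration counter that drives the residual of the stationary equation to zero.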

The PT methods have been present in the literature since the 1950s

In the following, we make two assumptions:

The computational domain is a cube

This domain is discretised with a uniform grid of cells. The number of grid cells is the same in each spatial dimension and is equal to

Let us first consider the diffusion process:

By substituting Eq. (

Firstly, we consider a stationary diffusion process, which is described by Eq. (

The solution to Eq. (

The simplest PT method is to replace physical time

The number of iterations, i.e. the number of steps in pseudo-time required to reach convergence of the simplest method described by Eq. (

One possible solution to circumvent the poor scaling properties of this first-order method would be to employ an unconditionally stable pseudo-time integration scheme. However, that would require solving systems of linear equations, making the solution cost of one iteration equal to the cost of solving the original steady-state problem. We are thus interested in a method that is not significantly more computationally expensive than the first-order scheme, but that offers an improved scalability.
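The poor scaling of the first-order method can be reproduced with a few lines of code. The following is a minimal Python sketch (not the Julia implementation used in this study) that solves a 1D stationary diffusion problem with a constant source by explicit pseudo-time stepping. The function name `first_order_pt`, the test problem, the safety factor 2.1 and the tolerance are assumptions made for illustration only.

```python
import numpy as np


def first_order_pt(nx, D=1.0, lx=1.0, f=1.0, tol=1e-8, max_iter=2_000_000):
    """Solve D*H'' + f = 0 with H(0) = H(lx) = 0 by explicit pseudo-time stepping.

    Returns the nodal values and the number of pseudo-time steps taken.
    """
    dx = lx / nx
    H = np.zeros(nx + 1)        # nodal values; the two boundary nodes stay at zero
    dtau = dx**2 / D / 2.1      # slightly below the explicit stability limit dx^2 / (2 D)

    def residual(H):
        # Residual of the stationary equation at the interior nodes
        return D * np.diff(H, n=2) / dx**2 + f

    r = residual(H)
    it = 0
    while np.max(np.abs(r)) > tol * abs(f):
        H[1:-1] += dtau * r     # one explicit step in pseudo-time
        r = residual(H)
        it += 1
        if it >= max_iter:
            raise RuntimeError("first-order PT iteration did not converge")
    return H, it


if __name__ == "__main__":
    for nx in (16, 32, 64):
        _, it = first_order_pt(nx)
        print(f"nx = {nx:3d}  iterations = {it:6d}  iterations / nx^2 = {it / nx**2:.2f}")
```

In this sketch, the iteration count divided by the square of the number of grid cells stays roughly constant, so doubling the resolution roughly quadruples the number of iterations.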

One of the known extensions to the classical model of diffusion incorporates inertial terms in the flux definition

A notable difference between the flux definition from Eqs. (

Equation (

Another restriction on the values of iteration parameters arises from the conditions for the numerical stability of the explicit time integration. The numerical pseudo-time step

We choose parameters

Depending on the value of the parameter

To estimate the optimal value of

Number of iterations per grid point required for e-fold residual reduction. Panels

We present detailed explanations and derivations of the dispersion analysis of different problems in the Appendix

We quantify the convergence rate by the number of iterations

For simplicity we only consider the case

For nonlinear and/or complex flow problems, the corresponding optimal values of iteration parameters such as numerical Reynolds number

The next example addresses stationary diffusion processes coupled with reaction. Here, we assume that reaction is described by the first-order kinetics law:

Equations (

Equation (

We present the detailed dispersion analysis for the stationary diffusion–reaction problem in Sect.

It is possible to apply the PT method not only to the solution of stationary problems, but also to problems including physical transient terms. This method is known in the literature as the “dual-time”, or “dual time stepping” method

According to the dual-time method, both physical and pseudo-time derivatives are present in the equation:

The discretisation of the physical time derivative

Frequently, modelling of certain processes requires relatively small time steps in order to capture important physical features, e.g. shear-heating induced strain localisation

Before considering incompressible Stokes equations, we present an illustrative example of shear-driven flow to demonstrate a similarity between already discussed cases addressing generalised diffusion and viscous fluid flow.

Here we consider stationary fluid flow between two parallel plates separated by a distance

The steady-state process described by Eqs. (

The next example addresses the incompressible creeping flow of a viscous fluid, described by Stokes equations:

Similar to the shear-driven flow described in Sect.

We use the primary, or P-wave velocity, as a characteristic velocity scale for the Stokes problem:

In addition to the non-dimensional numerical Reynolds number, here defined as

By analogy to previous cases, substituting

Similar to the diffusion–reaction problem studied in Sect.

The dispersion analysis for 1D linear Stokes equations is detailed in Sect.

In the numerical experiments, we consistently observe faster convergence with slightly higher values of

The last example addresses the incompressible Stokes equations accounting for a physical visco-elastic Maxwell rheology:

As in the transient diffusion case presented in Sect.

Assessing the performance of iterative stencil-based applications is twofold and reported here in terms of algorithmic and implementation efficiency.

The accelerated PT method provides an iterative approach that ensures linear scaling of the iteration count with an increase in numerical grid resolution

We introduce two metrics: the effective memory throughput (

Many-core processors such as GPUs are throughput-oriented systems that use their massive parallelism to hide latency. On the scientific application side, most algorithms perform only a few floating-point operations relative to the amount of numbers or bytes accessed from main memory, and thus are significantly memory bound. The floating-point operations per second (FLOPS) metric, no longer being the most adequate for reporting the application performance

The effective memory access,

We employ the parallel efficiency metric to assess the scalability of the iterative solvers when targeting distributed-memory configurations, such as multi-GPU settings. In a weak-scaling configuration, i.e. where the global problem size and computing resources increase proportionally, the parallel efficiency

Distributed parallelisation permits overcoming limitations imposed by the available main memory of a GPU or CPU. It is particularly relevant for GPUs, which have significantly less main memory available than CPUs. Distributing work amongst multiple GPUs, using e.g. the message passing interface (MPI), permits overcoming these limitations and requires parallel computing and supercomputing techniques. Parallel efficiency is a key metric for assessing the overall application performance as it ultimately ensures scalability of the PT method.
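The two metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the measurement code used in this study. The effective memory access is assumed to count every unknown field twice (one read and one write per iteration) and every known field once (one read). The weak-scaling parallel efficiency is assumed to be the ratio of the time per iteration on the reference configuration to the time per iteration on the larger configuration. The function names and all numbers in the usage part are hypothetical.

```python
def effective_memory_throughput(n_cells, n_unknown, n_known, t_it, bytes_per_number=8):
    """Effective memory throughput in GB/s for one solver iteration.

    Assumes each unknown field is read and written once per iteration
    and each known field is read once.
    """
    if t_it <= 0:
        raise ValueError("time per iteration must be positive")
    a_eff = (2 * n_unknown + n_known) * n_cells * bytes_per_number  # bytes moved
    return a_eff / t_it / 1e9


def weak_scaling_efficiency(t_ref, t_n):
    """Parallel efficiency in a weak-scaling test.

    t_ref: time per iteration on the reference configuration (e.g. one GPU)
    t_n:   time per iteration on N devices with N times the global problem size
    """
    if t_ref <= 0 or t_n <= 0:
        raise ValueError("timings must be positive")
    return t_ref / t_n


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only
    t_eff = effective_memory_throughput(n_cells=512**3, n_unknown=1, n_known=1, t_it=0.01)
    eff = weak_scaling_efficiency(t_ref=0.0100, t_n=0.0104)
    print(f"T_eff = {t_eff:.1f} GB/s, parallel efficiency = {100 * eff:.1f} %")
```

Comparing the effective memory throughput with the memory copy rate of the device gives a hardware-independent measure of how close an implementation is to the memory-bound limit.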

We design a suite of numerical experiments to verify the scalability of the accelerated PT method, targeting diffusive processes and mechanics. We consider three distinct diffusion problems in one, two and three dimensions, that exhibit a diffusion coefficient being (i) linear, (ii) a step function with a contrast of 4 orders of magnitude and (iii) a cubic power-law relation. We then consider mechanical processes using a velocity–pressure formulation to explore various limits, including the variable-viscosity incompressible viscous flow limit, accounting for a Maxwell visco-elastic shear rheology. To demonstrate the versatility of the approach, we tackle the nonlinear mechanical problem of strain localisation in two and three dimensions considering an E-VP rheology (Sect.

We first consider time-dependent (transient) diffusion processes defined by Eqs. (

The first case exhibits a linear constant (scalar) diffusion coefficient:

The second case exhibits a spatially variable diffusion coefficient with a contrast of 4 orders of magnitude:

The third case exhibits a nonlinear power-law diffusion coefficient:

Practically, we implement the transient diffusion using the accelerated PT method, solving Eqs. (

Initial distribution of

Second, we consider steady-state mechanical problems, defined by Eqs. (

In Sect.

In the subsequent application (Sect.

In the inclusion parametric study described in Sect.

We perform the three different diffusion experiments (see Sect.

Initial shear viscosity (

We perform the visco-elastic Stokes flow experiments (see Sect.

We further perform a series of viscous Stokes numerical experiments in 2D (see Sect.

In the parametric study, we vary the number of inclusions

We consider two different problem setups that correspond to important edge cases. The first setup addresses the shear-driven flow where the strain rates are assumed to be applied externally via boundary conditions. This benchmark might serve as a basis for the calculation of effective-media properties. The second setup addresses the gravity-driven flow with buoyant inclusions. This benchmark is relevant for geophysical applications, e.g. modelling magmatic diapirism or melt segregation, where the volumetric effect of melting leads to the development of either the Rayleigh–Taylor instability or compaction instability, respectively.

In the first setup, we specify pure-shear boundary conditions similar to the singular inclusion case described in Sect.

In the second setup, we specify the free-slip boundary conditions, which correspond to setting the background strain rate

We discretise the systems of partial differential equations (Sect.

The diffusion process and the visco-elastic Stokes flow include physical time evolution. We implement a backward Euler time integration within the PT solving procedure (see Sect.

In all simulations we converge the scaled and normalised L2-norm of the residuals,

Numerical Julia implementation of the 1D nonlinear diffusion case

The 46-line code fragment (Fig.

We use the Julia language

Iteration count scaled by the number of time steps (

We use the ParallelStencil.jl

In the present study, we focus on using ParallelStencil.jl with the CUDA.jl back-end to target Nvidia GPUs

We here report the performance of the accelerated PT Julia implementation of the diffusion and the Stokes flow solvers targeting Nvidia GPUs using ParallelStencil.jl's CUDA back-end. For both physical processes, we analyse the iteration count as a function of the number of grid cells (i.e. the algorithmic performance), the effective memory throughput

We report the algorithmic performance as the iteration count per number of physical time steps normalised by the number of grid cells in the

We report a normalised iteration count per total number of physical time steps

We observe similar behaviour when increasing the number of spatial dimensions while solving the identical problem. For example, in the 3D calculations we actually resolve

It is interesting to note that the diffusion solver with nonlinear (power-law) diffusion coefficient reports the lowest normalised iteration count for all three spatial dimension implementations, reaching the lowest number (

Model output for the 1D

We further report the normalised iteration count per total number of physical time steps

The visco-elastic Stokes flow scaling results confirm the trend reporting a decrease of the normalised iteration count with an increase in the numerical resolution (number of grid cells). It is interesting to note that the accelerated PT implementation of the 3D visco-elastic Stokes flow featuring 3 orders of magnitude viscosity contrast (

Iteration count scaled by the number of time steps (

Model output for the 2D

We use the effective memory throughput

We assess the performance of the 2D and 3D implementations of the nonlinear diffusion solver (power-law diffusion coefficient) and the visco-elastic Stokes flow solver, respectively. We perform single-GPU scaling tests for both the 2D and 3D solver implementations, and multi-GPU weak-scaling tests for the 3D solver implementations only. We report the mean performance over five executions, if applicable.

In order to compute the effective memory throughput

Number of unknown (

The 2D and 3D nonlinear diffusion solver single-GPU scaling benchmarks achieve similar effective memory throughput on the targeted GPU architectures relative to their respective peak values

For the analogous visco-elastic Stokes flow single-GPU scaling tests, we also report higher

Effective memory throughput

Effective memory throughput

We assess the parallel efficiency of the 3D nonlinear diffusion and visco-elastic Stokes flow solver multi-GPU implementation performing a weak-scaling benchmark. We use (per GPU) a local problem size of

We emphasise that we follow a

Parallel efficiency

We perform a multiple-inclusions benchmark to assess the robustness of the developed accelerated PT method. We vary the viscosity contrast from 1 to 9 orders of magnitude to demonstrate the successful convergence of iterations, even for extreme cases, that might arise in geophysical applications such as strain localisation. Further, we vary the number of inclusions from 1 to 46 to verify the independence of convergence on the “internal geometry” of the problem. For each combination of viscosity ratio and number of inclusions, we perform a series of simulations varying the iteration parameter

For this parametric study, we considered a computational grid consisting of

Figure

Pure shear-driven flow. The left column reports the number of iterations for different values of viscosity ratio

Buoyancy-driven flow. The left column reports the number of iterations for different values of viscosity ratio

For buoyancy-driven flow (Fig.

In both shear-driven and gravity-driven problem setups, the convergence is significantly slower than that of the single-centred inclusion case. This slowdown could be explained by the complicated internal geometry involving non-symmetrical inclusion placement featuring huge viscosity contrasts which results in a stiff system.

In geoscientific models that resolve e.g. flow fields for the evolution of ice sheets and glaciers, lithospheric deformation, or atmospheric and oceanic circulation, there are usually orders of magnitude differences between horizontal and vertical scales. Such domain configurations feature a large aspect ratio that may challenge the solvers because the number of grid cells in the horizontal dimensions exceeds that in the vertical dimension by orders of magnitude. Here we systematically investigate the convergence of the 2D visco-elastic Stokes flow while varying the aspect ratio defined as

Iteration count scaled by the number of time steps (

To demonstrate the versatility of the approach, we tackle the nonlinear mechanical problem of strain localisation in 2D and 3D. In the following applications we consider an E-VP rheological model, thus the serial viscous damper is deactivated and the flow includes effects of compressibility and plastic dilatancy. We assume a small-strain approximation. Hence, the deviatoric strain rate tensor may be decomposed in an additive manner in Eq. (

The initial model configuration assumes a random initial cohesion field. Pure shear kinematics are imposed at the boundaries of the domain (see Sect.

Besides the potential to tackle nonlinear multi-physics problems at supercomputer-scale, the ability to solve smaller-scale nonlinear problems remains an important aspect. Here we investigate wall times for the simulation of the previously-described E-VP shear-band formation in 2D (Fig.

Performance comparison between the pseudo-transient (PT) and direct-iterative (DI) method resolving 2D shear-band formation out of a random noise cohesion field.

We present preliminary 3D results of the spontaneous development of visco-plastic shear bands in pure shear deformation from an initial random cohesion field (Fig.

Both the 2D and 3D E-VP algorithms require only minor modifications of the visco-elastic Stokes solver discussed throughout this paper to account for brittle failure, deactivation of the serial viscous damper and viscoplastic regularisation without significantly affecting the convergence rate provided by the second-order method. These results support the robustness of the approach, predicting elasto-plastic deformation and capturing brittle failure categorised as a rather “stiff” problem which challenges the numerical solvers accordingly.

The continuous development of many-core devices, with GPUs at the forefront, increasingly shapes the current and future computing landscape. The fact that GPUs and the latest multi-core CPUs turn classical workstations into personal supercomputers is exciting. Tackling previously impossible numerical resolutions or multi-physics solutions becomes feasible as a result of technical progress. However, the current chip design challenges legacy serial and non-local or sparse matrix-based algorithms used to solve partial differential equations. Naturally, solution strategies designed specifically to target efficient large-scale computations on supercomputers perform most efficiently on GPUs and recent multi-core CPUs, as the algorithms used are typically local and minimise memory accesses. Moreover, efficient strategies rely little or not at all on global communication and, as a result, exhibit close-to-optimal scaling.

We introduced the PT method in light of mostly iterative methods such as dynamic relaxation and semi-iterative algorithms

Total accumulated plastic strain

Besides their scalable design, most iterative methods are challenged by configurations including heterogeneities and large contrasts in material parameters, motivated by typical applications to a variety of geodynamics problems

The ease of implementation is listed among the main advantages of the accelerated PT method over other more complex ones, e.g. multi-grid.
Particularly, all nonlinearities can be relaxed within a unique iteration loop, as reported in the nonlinear diffusion results (Sect.

Besides seeking optimal convergence of the algorithm, the implementation efficiency also favours the accelerated PT method; the approach is simple but efficient, making it possible to further implement advanced optimisations such as explicit shared memory usage and register handling. The choice of a Cartesian regular grid allows static and regular memory access patterns, resulting in access optimisation possibilities and balanced interprocess communications. Additionally, the absence of global reduction in the algorithm avoids severe bottlenecks. Finally, the amount of data transferred in the accelerated PT method is minimal, which allows achieving near-ideal scaling on distributed-memory systems, as reported in Sect.

The main task in the design of PT methods is the estimation of optimal iteration parameters. For that, the spectral radius of the finite-difference operator is often approximated based on the Gershgorin circle theorem

The determination of the optimal iterative parameters is thereby reduced to the search for the optimal values of the dimensionless physical numbers that describe the properties of the underlying physical process. The addition of new physical processes, such as heat conduction, two-phase flow and chemical reactions will lead to the natural emergence of new dimensionless parameters. Since many physical processes have a similar or even identical mathematical description, it is expected that the derivation of the accelerated PT method for such processes can be carried out similarly to those already developed. In this paper, such a derivation is provided for several important processes, namely, the linear and nonlinear diffusion, diffusion–reaction, non-stationary diffusion, and the visco-elastic Stokes problem. The efficiency of the accelerated PT method is demonstrated for essentially nonlinear problems, as well as for the problems with large contrasts in the material properties.
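For the simplest of these processes, stationary linear diffusion in 1D, the role of the dimensionless parameter can be sketched in Python (again, not the Julia implementation used in this study). The sketch adds an inertial term to the flux and expresses both iteration parameters through a numerical Reynolds number `Re`. The value `Re = 2π` is assumed here because it critically damps the slowest mode of this particular linear problem. The function name `accelerated_pt`, the test problem, the CFL factor 0.95 and the tolerance are assumptions made for illustration only.

```python
import numpy as np


def accelerated_pt(nx, D=1.0, lx=1.0, f=1.0, tol=1e-8, cfl=0.95,
                   Re=2.0 * np.pi, max_iter=1_000_000):
    """Solve D*H'' + f = 0 with H(0) = H(lx) = 0 by a damped (second-order) PT iteration.

    Returns the nodal values and the number of pseudo-time steps taken.
    """
    dx = lx / nx
    H = np.zeros(nx + 1)                 # nodal values; the two boundary nodes stay at zero
    q = np.zeros(nx)                     # fluxes, located between neighbouring nodes
    V_dtau = cfl * dx                    # pseudo wave speed times pseudo-time step
    theta_dtau = lx / (Re * V_dtau)      # flux relaxation time divided by pseudo-time step
    dtau_rho = V_dtau * lx / (D * Re)    # pseudo-time step divided by numerical density

    def residual(H):
        # Residual of the stationary equation at the interior nodes
        return D * np.diff(H, n=2) / dx**2 + f

    r = residual(H)
    it = 0
    while np.max(np.abs(r)) > tol * abs(f):
        # Flux update with inertia: the damping term is treated implicitly
        q[:] = (q * theta_dtau - D * np.diff(H) / dx) / (1.0 + theta_dtau)
        # Field update from the flux divergence and the source term
        H[1:-1] += dtau_rho * (-np.diff(q) / dx + f)
        r = residual(H)
        it += 1
        if it >= max_iter:
            raise RuntimeError("accelerated PT iteration did not converge")
    return H, it


if __name__ == "__main__":
    for nx in (32, 64, 128, 256):
        _, it = accelerated_pt(nx)
        print(f"nx = {nx:3d}  iterations = {it:5d}  iterations / nx = {it / nx:.2f}")
```

In this sketch, the iteration count divided by the number of grid cells stays roughly constant. Changing `Re` away from the assumed value leaves the scheme stable for `cfl` no larger than 1 but increases the iteration count, which is one way to explore the sensitivity to this parameter.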

Recently,

The scalability of the accelerated PT method as a function of numerical resolution permits the prediction of the total iteration count, here for the nonlinear diffusion and the visco-elastic Stokes in 3D. The weak-scaling benchmark results provide the time per iteration as a function of the numerical resolution. Combining this information, it is possible to predict the time-to-solution or wall time (Table

Wall-time prognostic for resolving the nonlinear diffusion and the visco-elastic Stokes 3D Julia multi-GPU applications on 2197 (

In practical applications, the patterns of the flow may change drastically throughout the simulation owing to spontaneous flow localisation or the evolution of the interface between immiscible phases with significantly different properties. It is a requirement for the numerical method to be robust with respect to such changes. The iterative algorithm is expected to converge even in extreme cases, e.g. in the presence of sharp gradients in material properties, and the iteration parameters should be insensitive to arbitrary changes in the internal geometry. We present a parametric study to assess the robustness of the accelerated PT method for typical building blocks of geophysical applications. We considered shear- and buoyancy-driven flows with multiple randomly distributed inclusions in a viscous matrix as proxies for more realistic problem formulations. We show that our method is capable of modelling flows with viscosity contrasts up to 9 orders of magnitude. The values of optimal iteration parameters obtained by means of systematic simulation runs do not change significantly for a wide range of material properties and internal configurations of the computational domain. We observe a significant slowdown in convergence for viscosity contrasts larger than 5 orders of magnitude in some of the considered cases. These results are expected, given the ill-conditioned problem, and thus motivate the development of a scalable preconditioner suitable for massively parallel GPU workloads. The application of a robust preconditioner, with reference to previous discussions, may help to partly alleviate slow convergence. However, for viscosity contrasts of 6 orders of magnitude and more, a significant increase in the number of iterations may be legitimate

The numerical application to resolve shear-banding in elasto-viscoplastic media in 3D supports the versatility and the robustness of the presented approach putting emphasis on successfully handling complex rheology. These examples complement recent studies employing the accelerated PT method to resolve spontaneous localisation owing to multi-physics coupling

The current HPC landscape redefines the rules governing application performance, as the massive parallelism of multi-core processors imposes a memory-bound situation. Our work shows that simple dynamic relaxation schemes can be employed efficiently to solve parabolic and nonlinear systems, upon transforming the equations into hyperbolic expressions. The transformation includes the addition of physics-inspired terms, which make it possible to retrieve optimal parameters that minimise the iteration counts. The optimisation procedure includes von Neumann stability and dispersion analysis for the pseudo-physical parameters. Moreover, the conciseness of the accelerated PT approach permits the applications to execute at effective memory throughput rates approaching the memory copy rates (a theoretical upper bound) of the latest GPUs. Further, hiding communication behind computations permits the achievement of a parallel efficiency of more than 96 % on various distributed-memory systems and up to 2197 GPUs. The physics we selected for the numerical experiments represent key building blocks to further tackle various multi-physics couplings, usually the combination of mechanical and diffusive processes. Our systematic study of the multi-inclusion setup with huge viscosity contrasts provides preliminary results assessing the robustness of the accelerated PT method, which we further employ to resolve shear-band formation in 3D as a result of plastic yielding in E-VP materials. Our study paves the way for resolving coupled and nonlinear multi-physics applications in natural sciences and engineering at extremely high resolutions on the next generation of exascale-capable supercomputers, revamping elegant iterative techniques and implementing them with the portable Julia language.

Let the total pseudo-time required to reach convergence be

We study the convergence rate of the PT method by performing the dispersion analysis of Eq. (

We are interested in the exponential decay rate values

The stability analysis of Eq. (

Similar to the previous case, we reformulate the problem in terms of deviation from the exact solution and consider the typical term from the Fourier series expansion:

Solving Eq. (

Total pseudo-time

The stability analysis of the damped wave equation in discretised form suggests that in a certain range of values of

Reformulating Eq. (

Introducing the auxiliary parameter

The number of iterations required to converge to a tolerance

The system of Eqs. (

Following the established procedure, we reformulate Eq. (

Substituting Eq. (

Depending on the values of the coefficients, the dispersion relation in Eq. (

For analysis, it is useful to recast the dispersion relation in Eq. (

The discriminant of the depressed cubic equation is

Equation (

By solving Eq. (

When the depressed cubic equation

Roots of the dispersion relation

Substituting Eq. (

It is useful to provide an analogy between the presented analysis and some previous studies, namely, PT continuation model by

We reformulate the damped wave equation, Eq. (

In all mentioned studies the numerical discretisation of Eqs. (

The various software developed and used in the scope of this study is licensed under the MIT License. The latest versions of the code are available from GitHub at:

LR designed the original study, developed codes and algorithms, realised the Julia at scale implementation, performed scaling and performance benchmarks, created figures and edited the manuscript. IU developed codes and algorithms, performed the dispersion analysis and the parametric study, created figures and edited the manuscript. TD contributed to the original study design, realised the desktop-scale computation benchmark, performed the E-VP model investigations, created figures and edited the manuscript. SO developed the building blocks for implementing Julia at scale and the effective memory throughput metric, deployed Julia on Piz Daint, provided scalable architecture-agnostic PDE solver implementation and edited the manuscript. YYP provided early work on accelerated PT methods, helped out with the dispersion analysis, contributed to methodological advances to allow PDE solver implementations on GPUs.

The contact author has declared that none of the authors has any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We thank Boris Kaus, Stefan Schmalholz and Mauro Werder for stimulating discussions. We thank Lawrence H. Wang, Boris Kaus and an anonymous reviewer for their suggestions to improve the manuscript.
We acknowledge the Swiss Geocomputing Centre, University of Lausanne, for computing resources on the

This research has been supported by the Swiss University Conference and the Swiss Council of Federal Institutes of Technology through the Platform for Advanced Scientific Computing (PASC) program. This research has been supported by the Swiss National Supercomputing Centre (CSCS) (project ID c23). This research has been supported by the Russian Science Foundation (grant no. 19-47-02015). Publisher's note: the article processing charges for this publication were not paid by a Russian or Belarusian institution.

This paper was edited by James Kelly and reviewed by Lawrence Hongliang Wang, Boris Kaus, and one anonymous referee.