Modelling thermomechanical ice deformation using a GPU-based implicit pseudo-transient method (FastICE v1.0)

Accurate predictions of future sea level rise require numerical models that capture the complex thermomechanical feedbacks in rapidly deforming ice. Shear margins, grounding zones and the basal sliding interface are locations of particular interest where the stress field is complex and fundamentally three-dimensional. These transition zones are prone to thermomechanical localisation, which can be captured numerically only with high temporal and spatial resolution. Thus, better understanding the coupled physical processes that govern these boundaries of localised strain necessitates a non-linear, full Stokes model that affords high resolution and scales well in three dimensions. This paper's goal is to contribute to the growing toolbox for modelling thermomechanical deformation in ice by leveraging the parallel scalability of GPU accelerators. We propose a numerical model that relies on pseudo-transient iterations to solve the implicit thermomechanical coupling between ice motion and temperature, involving shear-heating and a temperature-dependent ice viscosity. Our method is based on a finite-difference discretisation, and we implement the pseudo-time integration in a matrix-free way. We benchmark the mechanical Stokes solver against the finite-element code Elmer/Ice and report good agreement among the results. We showcase a parallel version of the solver that runs on GPU-accelerated distributed memory machines, reaching a parallel efficiency of 93%. We show that our model is particularly useful for improving our process-based understanding of flow localisation in the complex transition zones bounding rapidly moving ice.


Introduction
The fourth IPCC report (Solomon et al., 2007) revealed that existing ice sheet flow models do not accurately describe polar ice sheet discharge (e.g., Gagliardini et al., 2013; Pattyn et al., 2008), owing to their inability to simultaneously model slow and fast ice flow (Gagliardini et al., 2013; Bueler and Brown, 2009). This issue results from the fact that many ice flow models are based on simplified approximations of the non-linear Stokes equations, such as shallow ice models (Bueler and Brown, 2009; Bassis, 2010; Schoof and Hindmarsh, 2010; Goldberg, 2011; Egholm et al., 2011; Pollard and DeConto, 2012; Perego et al., 2012; Tezaur et al., 2015). Shallow ice models are computationally more tractable and describe the motion of large homogeneous portions of ice as a function of the basal friction. However, this category of models fails to capture the coupled multi-scale processes that govern the behaviour of the boundaries of streaming ice, including shear margins, grounding zones and the basal interface. Community benchmark efforts for ice flow models include EISMINT (Huybrechts and Payne, 1996) and ISMIP (Pattyn et al., 2008). There is only one model inter-comparison that investigates the coupled thermomechanical dynamics, EISMINT 2. Unfortunately, experiments in EISMINT 2 are usually performed using a coupled thermomechanical first-order shallow ice model (Payne and Baldwin, 2000; Saito et al., 2006; Hindmarsh, 2006; Bueler et al., 2007; Hindmarsh, 2009; Brinkerhoff and Johnson, 2015), making the comparison to our full Stokes implementation less immediate. Although thermomechanically coupled Stokes models exist (Zwinger et al., 2007; Leng et al., 2014; Schäfer et al., 2014; Gilbert et al., 2014; Zhang et al., 2015; Gong et al., 2018), very few studies have investigated key aspects of the implemented model, such as convergence under grid refinement and the impact of one-way vs. two-way coupling.
We start by providing an overview of the mathematical model describing ice dynamics and its numerical implementation. We then discuss GPU capabilities and explain our GPU implementation. We further report model comparisons against a selection of benchmark studies, followed by the results and performance measurements. Finally, we discuss pros and cons of the method, and highlight glaciological contexts in which our model could prove useful. Code examples based on the PT method in both the MATLAB and CUDA C programming languages are available for download from Bitbucket at https://bitbucket.org/lraess/ice/ and from http://wp.unil.ch/geocomputing/software/ice.

2 The mathematical model
We capture the flow of an incompressible, non-linear, viscous fluid, including a temperature-dependent rheology. Since ice is approximately incompressible, the equation for conservation of mass reduces to
\[ \frac{\partial v_i}{\partial x_i} = 0, \]
where $v_i$ is the velocity component in the spatial direction $x_i$.

Neglecting inertial forces, the flow of ice is driven by gravity and resisted by internal deformation and basal stress:
\[ -\frac{\partial P}{\partial x_i} + \frac{\partial \tau_{ij}}{\partial x_j} + F_i = 0, \]
where $F_i = \rho g \sin(\alpha)\,[1, 0, -\cot(\alpha)]$ is the external force. Ice density is denoted by $\rho$, $g$ is the gravitational acceleration, and $\alpha$ is the characteristic bed slope. $P$ is the isotropic pressure and $\tau_{ij}$ is the deviatoric stress tensor, obtained by decomposing the Cauchy stress tensor $\sigma_{ij}$ into deviatoric stress $\tau_{ij}$ and isotropic pressure $P$.

In the absence of phase transitions, the temporal evolution of temperature in deforming, incompressible ice is governed by advection, diffusion and shear-heating:
\[ \rho c \left( \frac{\partial T}{\partial t} + v_i \frac{\partial T}{\partial x_i} \right) = \frac{\partial}{\partial x_i}\left( k \frac{\partial T}{\partial x_i} \right) + \tau_{ij}\dot{\varepsilon}_{ij}, \]
where $T$ represents the temperature deviation from the initial temperature $T_0$, $c$ is the specific heat capacity, $k$ is the spatially varying thermal conductivity and $\dot{\varepsilon}_{ij}$ is the strain-rate tensor. The term $\tau_{ij}\dot{\varepsilon}_{ij}$ represents the shear-heating, a source term that emerges from the mechanical model.
Shear-heating can locally raise the temperature in the ice to the pressure melting point. Once ice has reached the melting point, any additional heating is converted to latent heat, which prevents further temperature increase. Thus, we impose a temperature cap at the pressure melting point, following Suckale et al. (2014), by describing the melt production using a Heaviside function, where $T_m$ stands for the ice melting temperature. We balance the heat produced by shear-heating with a sink term in regions where the melting temperature is reached. The volume of produced meltwater can be calculated in a similar way as proposed by Suckale et al. (2014).
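To make this concrete, consider the following minimal MATLAB sketch (with illustrative variable names and values, not the actual FastICE code), in which a Heaviside-type switch diverts the shear-heating source into melt production once the melting temperature is reached:

% Sketch of the temperature cap at the melting point (illustrative values).
Tm    = 1.0;                     % non-dimensional melting temperature
T     = linspace(0, 1.2, 7);     % example temperatures, some above Tm
Qsh   = 0.5*ones(size(T));       % example shear-heating source tau_ij*eps_ij
below = double(T < Tm);          % Heaviside switch: 1 below Tm, 0 at/above
dTdt  = below.*Qsh;              % heating raises T only below melting point
Qmelt = (1 - below).*Qsh;        % excess heat is consumed by melt production
disp([T; dTdt; Qmelt])           % above Tm, heating feeds melt, not warming

The sink term thus exactly balances the shear-heating once the cap is active, and integrating the melt heat over time yields the meltwater production.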
We approximate the rheology of ice through Glen's flow law (Glen, 1952; Nye, 1953):
\[ \dot{\varepsilon}_{ij} = a_0 \exp\left( -\frac{Q}{RT} \right) \tau_{II}^{\,n-1}\, \tau_{ij}, \]
where $a_0$ is the pre-exponential factor, $R$ is the universal gas constant, $Q$ is the activation energy, $n$ is the stress exponent, and $\tau_{II}$ is the second invariant of the stress tensor, defined by $\tau_{II} = \sqrt{1/2\,\tau_{ij}\tau_{ij}}$. Glen's flow law posits an exponent of $n = 3$.
At the ice top surface $\Gamma_t(t)$, we impose the upper surface boundary condition $\sigma_{ij} n_j = -P_{\mathrm{atm}} n_i$, where $n_i$ denotes the normal unit vector at the ice surface boundary, and $P_{\mathrm{atm}}$ the atmospheric pressure. Because the atmospheric pressure is negligible relative to the pressure within the ice column, we can also use the standard stress-free simplification of the upper surface boundary condition, $\sigma_{ij} n_j = 0$. On the bottom ice-bedrock interface, we can impose two different boundary conditions. For the parts of the ice-bedrock interface $\Gamma_0(t)$ where the ice is frozen to the ground, we impose zero velocity, $v_i = 0$, and thus a no-sliding boundary condition. On the parts of the ice-bedrock interface $\Gamma_s(t)$ where the ice is at the melting point, we impose a Rayleigh friction (sliding) condition, $\sigma_{ij} n_j t_i = -\beta^2 v_i t_i$, combined with the impermeability condition $v_i n_i = 0$, where the parameter $\beta^2$ denotes a given sliding coefficient, $n_i$ denotes the normal unit vector at the ice-bedrock interface, and $t_j$ denotes any unit vector tangential to the bottom surface. On the side or lateral boundaries, we impose either Dirichlet boundary conditions if the velocities are known, or periodic boundary conditions, mimicking an infinitely extended domain.

For numerical purposes and for ease of generalisation, it is often preferable to use non-dimensional variables. This allows one to limit truncation errors (especially relevant for single-precision calculations) and to scale the results to various initial configurations. Here, we use two different scale sets, depending on whether we solve the purely mechanical part of the model or the thermomechanically coupled system of equations.
In the case of an isothermal model, we use the ice thickness $H$ as the characteristic length and the gravitational driving stress as the characteristic stress to non-dimensionalise the governing equations, where $A_0$ is the isothermal deformation rate factor and $\alpha$ is the mean bed slope. We can then rewrite the governing equations in their non-dimensional form (Eq. 8). The model parameters are the mean bed slope $\alpha$ and the domain size in each horizontal direction, i.e. $L_x$ and $L_y$.
Reducing the thermomechanically coupled equations to a non-dimensional form requires not only length and stress scales, but also temperature and time scales. We choose the characteristic scales such that the coefficients in front of the diffusion and shear-heating terms reduce to unity (Eq. 9). These choices entail that the velocity scale in the thermomechanical model is $v = L/t$. We obtain the non-dimensional (primed) variables by using the characteristic scales given in Eq. (9), which leads to Eq. (10), where $F_i$ is now defined as $F_i = F\,[1, 0, -\cot(\alpha)]$ and $F = \rho g \sin(\alpha) L/\tau$. The model parameters are the non-dimensional initial temperature $T_0$, the stress exponent $n$, the non-dimensional force $F$, the mean bed slope $\alpha$, the non-dimensional domain height $L_z$, and the horizontal domain sizes $L_x$ and $L_y$ (Figure 3). We motivate the chosen characteristic scales by their usage in other studies of thermomechanical strain localisation (e.g., Kiss et al., 2019). In the interest of a simple notation, we will omit the prime symbols on all non-dimensional variables in the remainder of the paper.

A simplified 1-D semi-analytical solution

We consider a specific 1-D mathematical case where all horizontal derivatives vanish ($\partial/\partial x = \partial/\partial y = 0$). The only remaining shear stress component $\tau_{xz}$ and the pressure $P$ are determined by analytical integration and are constant in time for a fixed domain (Figure 3). We assume that stresses vanish at the surface, and we set both horizontal and vertical basal velocity components to 0. We then integrate the 1-D mechanical equation in the vertical direction and substitute it into the temperature equation, which leads to the reduced temperature evolution equation (Eq. 11).

Notably, the velocity and shear-heating terms (Eq. 11) are now a function only of temperature and, thus, of depth and time. To obtain a solution of the coupled system, one only needs to numerically solve for the temperature evolution profile, while the velocity can then be obtained diagnostically by a simple numerical integration.
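To illustrate the procedure, the following MATLAB sketch integrates a 1-D temperature equation of the form of Eq. (11) and then diagnoses the velocity by numerical integration. It is a minimal example under assumptions of our own: unit non-dimensional coefficients and a Frank-Kamenetskii-type rate factor exp(T); the actual scales and rate factor of Eq. (11) should be substituted for quantitative results.

% 1-D semi-analytical procedure: integrate temperature, diagnose velocity.
n   = 3;                           % stress exponent (Glen's flow law)
nz  = 201; Lz = 1;                 % grid points, non-dimensional thickness
dz  = Lz/(nz-1); z = (0:nz-1)'*dz; % z = 0 at the base, z = Lz at the surface
tau = Lz - z;                      % analytically integrated shear stress,
                                   % vanishing at the surface
T   = zeros(nz,1);                 % initial temperature deviation
dt  = 0.2*dz^2;                    % explicit diffusion stability limit
for it = 1:5e4
    S      = 2*exp(T).*tau.^(n+1);    % shear-heating source (assumed form)
    d2T    = [0; diff(T,2); 0]/dz^2;  % second derivative in the interior
    T      = T + dt*(d2T + S);        % explicit diffusion-reaction update
    T(1)   = T(2);                    % zero heat flux at the base
    T(end) = 0;                       % fixed surface temperature
end
exz = exp(T).*tau.^n;              % strain rate from temperature and stress
vx  = cumtrapz(z, 2*exz);          % diagnostic velocity, vx = 0 at the base

Because the stress profile is fixed in time, only the temperature equation needs time stepping; the velocity follows from a single quadrature at any time of interest.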

The numerical implementation
We discretise the coupled thermomechanical Stokes equations (Eq. 10) using the finite-difference method on a staggered Cartesian grid. Among the many numerical methods currently used to solve partial differential equations, the finite-difference method is widely used and has been successfully applied to solving similar sets of equations for geophysical problems in geodynamics (Harlow and Welch, 1965; Ogawa et al., 1991; Gerya, 2009). The staggering of the grid provides second-order accuracy of the method (Virieux, 1986; Patankar, 1980; Gerya and Yuen, 2003; McKee et al., 2008), avoids oscillatory pressure modes (Shin and Strikwerda, 1997), and produces simple yet highly compact stencils. The different physical variables are located at different locations on the staggered grid. Pressure nodes and normal components of the strain-rate tensor are located at the cell centres. Velocity components are located at the cell mid-faces (Figure 1), while shear stress components are located at the cell vertices in 2-D (e.g., Harlow and Welch, 1965). The resulting algorithms are well suited to take advantage of modern many-core parallel accelerators, such as graphical processing units (GPUs) (Omlin, 2017; Räss et al., 2018; Duretz et al., 2019; Räss et al., 2019a). Efficient parallel solvers utilising modern hardware provide a viable solution to resolve the computationally challenging coupled thermomechanical full Stokes calculations in 3-D.

The power-law viscous ice rheology (Eq. 5) exhibits a non-linear dependence of the shear viscosity on both temperature and strain rate:
\[ \eta = \frac{1}{2}\, a_0^{-1/n} \exp\left( \frac{Q}{nRT} \right) \dot{\varepsilon}_{II}^{\,(1-n)/n}, \]
where $\dot{\varepsilon}_{II}$ is the square root of the second invariant of the strain-rate tensor, $\dot{\varepsilon}_{II} = \sqrt{1/2\,\dot{\varepsilon}_{ij}\dot{\varepsilon}_{ij}}$. We regularise the strain-rate- and temperature-dependent viscosity $\eta$ to prevent non-physical values at negligible strain rates, $\eta_{\mathrm{reg}} = 1/(\eta^{-1} + \eta_0^{-1})$. The harmonic mean yields a naturally smooth transition to the background viscosity value $\eta_0$ at negligible strain rates. We define temperature at the cell centres of our staggered grid. We discretise the advection term of the temperature equation using a first-order upwind scheme, and perform the physical time integration using either an implicit backward Euler or a Crank-Nicolson (Crank and Nicolson, 1947) scheme.

We rely on a pseudo-transient (PT) continuation or relaxation method to solve the system of coupled non-linear partial differential equations (Eq. 10) in an iterative and matrix-free way (Frankel, 1950; Cundall et al., 1993; Poliakov et al., 1993; Kelley and Keyes, 1998; Kelley and Liao, 2013). To this end, we reformulate the thermomechanical Eq. (10) in residual form (Eq. 13). The right-hand-side terms ($f_p$, $f_{v_i}$, $f_T$) are the non-linear continuity, momentum and temperature residuals, respectively, and quantify the magnitude of the imbalance of the corresponding equations.
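As a concrete illustration of the advection discretisation mentioned above, the following minimal MATLAB sketch (illustrative names; 1-D, uniform velocity) implements the first-order upwind scheme by selecting the one-sided difference on the upstream side of each node:

% First-order upwind advection of a temperature anomaly in 1-D (sketch).
nz = 101; dz = 1/(nz-1); z = (0:nz-1)'*dz;
T  = exp(-(z - 0.3).^2/0.01);          % initial temperature anomaly
vz = 0.8*ones(nz,1);                   % advection velocity (here uniform)
dt = 0.5*dz/max(abs(vz));              % CFL-limited explicit time step
for it = 1:100
    dTb = [0; (T(2:end) - T(1:end-1))/dz];   % backward one-sided difference
    dTf = [(T(2:end) - T(1:end-1))/dz; 0];   % forward one-sided difference
    adv = max(vz,0).*dTb + min(vz,0).*dTf;   % pick the upstream side
    T   = T - dt*adv;                        % explicit upwind update
end

The scheme is diffusive but monotone, which is why it is a common choice for advecting temperature on staggered grids.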
We augment the steady-state equations with PT terms using the analogy of physical transient processes, such as the bulk compressibility or the inertial terms within the momentum equations. This formulation enables us to integrate the equations forward in pseudo-time $\tau$ until we reach the steady state (i.e. the pseudo-time derivatives vanish). Relying on transient physics within the iterative process provides well-defined (maximal) iterative time step limiters. We reformulate Eq. (10) as
\[ \frac{\partial P}{\partial \tau_p} = f_p, \qquad \frac{\partial v_i}{\partial \tau_{v_i}} = f_{v_i}, \qquad \frac{\partial T}{\partial \tau_T} = f_T, \]
where we introduced the pseudo-time derivatives $\partial/\partial\tau$ for the continuity ($\partial P/\partial\tau_p$), momentum ($\partial v_i/\partial\tau_{v_i}$), and temperature ($\partial T/\partial\tau_T$) equations.
For every non-linear iteration $k$, we update the effective viscosity $\eta_{\mathrm{eff}}^{[k]}$ in logarithmic space by taking a fraction $\theta_\eta$ of the actual physical viscosity $\eta^{[k]}$, evaluated using the current strain-rate and temperature solution fields, and a fraction $(1-\theta_\eta)$ of the effective viscosity calculated in the previous iteration, $\eta_{\mathrm{eff}}^{[k-1]}$:
\[ \eta_{\mathrm{eff}}^{[k]} = \exp\left( \theta_\eta \ln \eta^{[k]} + (1-\theta_\eta) \ln \eta_{\mathrm{eff}}^{[k-1]} \right), \]
where $\theta_\eta$ ($0 \le \theta_\eta \le 1$) is a viscosity relaxation factor. This relaxation of the non-linearity allows the effective viscosity to iteratively approach its physical value within the pseudo-transient iterations. A similar non-linear viscosity relaxation approach was successfully implemented in the ice sheet model development by Tezaur et al. (2015).
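The following MATLAB sketch (illustrative names and values; the exp(-T) factor stands in for the Arrhenius temperature dependence of Eq. (5)) combines the three ingredients discussed above: the power-law viscosity, the harmonic regularisation, and the log-space relaxation:

% Non-linear viscosity treatment within the PT iterations (sketch).
n       = 3;           % stress exponent
eta0    = 1e3;         % background viscosity at negligible strain rate
theta   = 0.5;         % viscosity relaxation factor, 0 <= theta <= 1
eII     = 1e-6;        % strain-rate invariant (example value)
T       = 0.1;         % current non-dimensional temperature
eta     = exp(-T)*eII^((1-n)/n);   % power-law viscosity (illustrative
                                   % temperature dependence)
eta_reg = 1/(1/eta + 1/eta0);      % harmonic-mean regularisation
eta_eff = eta_reg;                 % initialise the effective viscosity
for k = 1:10                       % non-linear (PT) iterations
    eta_eff = exp(theta*log(eta_reg) + (1-theta)*log(eta_eff));
end

In the actual solver, eta would be re-evaluated at each iteration from the current temperature and strain-rate fields; the log-space blend damps the oscillations that a direct update of a strongly non-linear viscosity would otherwise trigger.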
The pseudo-time integration of Eq. (14) leads to the definition of pseudo-time steps $\Delta\tau_p$, $\Delta\tau_{v_i}$ and $\Delta\tau_T$ for the continuity, momentum and temperature equations, respectively. Transient physical processes such as compressibility (continuity equation) or acceleration (momentum equation) dictate the maximal explicit pseudo-time step allowed in the transient process. Using the largest stable steps allows one to minimise the iteration count required to reach the steady state (Eq. 15), where $n_{\mathrm{dim}}$ is the number of dimensions, and $\Delta x_i$ and $n_i$ are the grid spacing and the number of grid points in the $i$ direction ($i = x$ in 1-D; $x, z$ in 2-D; $x, y, z$ in 3-D), respectively. The physical time step $\Delta t$ advances the temperature in time.
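The sketch below illustrates the CFL-type reasoning behind these limiters in MATLAB; the numerical factors and values here are illustrative assumptions only, and the exact definitions used in FastICE are those of Eq. (15):

% CFL-type pseudo-time step limiters (illustrative factors and values).
ndim    = 2;                           % number of spatial dimensions
dx      = 1/255; dz = 1/127;           % grid spacing
eta_eff = 1.0;                         % current effective viscosity
kappa   = 1.0;                         % thermal diffusivity k/(rho*c)
vmax    = 1.0;                         % maximum advection velocity
dmin    = min(dx, dz);
dtau_v  = dmin^2/eta_eff/(2.1*ndim);   % momentum: viscous 'diffusion' step
dtau_p  = 2.1*ndim*eta_eff;            % pressure: ~ inverse of dtau_v
dtau_T  = 1/(2*ndim*kappa/dmin^2 + vmax/dmin);  % advection-diffusion step

Note how dtau_v shrinks and dtau_p grows with the local effective viscosity, realising the inverse proportionality between the pressure and velocity updates described below.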
The pseudo-time step $\Delta\tau_T$ is an explicit Courant-Friedrichs-Lewy (CFL) time step that combines temperature advection and diffusion. Similarly, $\Delta\tau_{v_i}$ is the explicit CFL time step for viscous flow, representing the diffusion of strain rates with viscosity as the diffusion coefficient. It is modified to account for the numerical equivalent of a bulk viscosity, $\eta_b$. We choose $\Delta\tau_p$ to be the inverse of $\Delta\tau_{v_i}$ to ensure that the pressure update is proportional to the effective viscosity, while the velocity update is sensitive to the inverse of the viscosity. This interdependence reduces the sensitivity of the iterative method to variations in the ice viscosity. The iterations thus proceed through compressible states and eventually reach the incompressible solution. The relaxation of the incompressibility constraint is analogous to the penalisation of pressure pioneered by Chorin (1967, 1968), and built on extensively since. Compared to projection-type methods, it has the advantage that no pressure boundary condition is necessary that would lead to numerical boundary layers (Weinan and Liu, 1995). We use the parameter $\eta_b$ to balance the divergence-free formulation of the strain rates in the evaluation of the normal stress components, where it is multiplied with the pressure residual $f_p$. Upon convergence of the method, the pressure residual $f_p$ vanishes and the incompressible form of the normal stresses is recovered.
Combining the residual notation introduced in Eq. (13) with the pseudo-time derivatives in Eq. (14) leads to the update rules
\[ P^{[k+1]} = P^{[k]} + \Delta\tau_p\, f_p^{[k]}, \qquad v_i^{[k+1]} = v_i^{[k]} + \Delta\tau_{v_i}\, f_{v_i}^{[k]}, \qquad T^{[k+1]} = T^{[k]} + \Delta\tau_T\, f_T^{[k]}, \]
where the pressure, velocity and temperature iterative increments represent the current residual $[k]$ multiplied by the corresponding pseudo-time step. This straightforward update rule (Eq. 17) is based on a first-order scheme ($\partial/\partial\tau$). In 1-D, it implies that one needs $N^2$ iterations to converge to the stationary solution, where $N$ stands for the total number of grid points. This behaviour arises because the time step limiter $\Delta\tau_{v_i}$ implies a second-order dependence on the spatial derivatives for the strain rates. In contrast, a second-order scheme (Frankel, 1950), $\partial^2/\partial\tau^2 + \partial/\partial\tau$, invokes a wave-like transient physical process for the iterations.
The main advantage is the scaling of the limiter as $\Delta x$ instead of $\Delta x^2$ in the explicit pseudo-transient time step definition. We can reformulate the velocity update as
\[ \Delta v_i^{[k]} = f_{v_i}^{[k]} + \alpha\, \Delta v_i^{[k-1]}, \qquad v_i^{[k+1]} = v_i^{[k]} + \Delta\tau_{v_i}\, \Delta v_i^{[k]}, \]
where $\alpha$ can be expanded to $(1 - \nu/n_i)$ and acts as a damping term on the momentum residual. A similar damping approach is used for elastic rheology in the FLAC geotechnical software in order to significantly reduce the number of iterations needed for the algorithm to converge. The optimal value of the introduced parameter $\nu$ is found to lie in the range $1 \le \nu \le 10$ and is usually problem-dependent. This approach was successfully implemented in recent PT developments by Räss et al. (2018, 2019a) and Duretz et al. (2019).
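A minimal MATLAB sketch of the damped update on a 1-D model problem, d2v/dx2 + 1 = 0 with v = 0 at both ends, illustrates the scheme (in FastICE, f would be the full momentum residual f_vi; the values of nu and the step limiter here are illustrative):

% Damped (second-order) pseudo-transient update on a 1-D model problem.
nx    = 101; dx = 1/(nx-1);
nu    = 4.0;                       % damping parameter, 1 <= nu <= 10
alpha = 1 - nu/nx;                 % damping coefficient on the residual
dtau  = dx^2/2.1;                  % explicit pseudo-time step limiter
vx    = zeros(nx,1);               % velocity unknowns
g     = zeros(nx,1);               % damped residual memory
for k = 1:50*nx
    f          = zeros(nx,1);      % residual of d2v/dx2 + 1 = 0
    f(2:end-1) = (vx(1:end-2) - 2*vx(2:end-1) + vx(3:end))/dx^2 + 1;
    g          = f + alpha*g;      % accumulate the damped residual
    vx         = vx + dtau*g;      % wave-like (damped) update
    if max(abs(f)) < 1e-8, break, end
end

Setting alpha = 0 recovers the first-order rule and its iteration count scaling as N^2; with damping, the iteration count scales roughly linearly with N.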
Notably, the PT solution procedure leads to a two-way numerical coupling between temperature and deformation (mechanics). In addition to the coupling terms, the rheology is also treated implicitly, i.e. the shear viscosity $\eta$ is always evaluated using the current physical temperature $T$ and strain rate $\dot{\varepsilon}_{II}$. Our method is fully local: at no point during the iterative procedure does one need to perform a global reduction, nor to access values that are not directly collocated. These considerations are crucial when designing a solution strategy that targets parallel hardware such as many-core GPU accelerators. We implemented the PT method in the MATLAB and CUDA C programming languages. Computations in CUDA C can be performed in both double- and single-precision arithmetic. The computations in CUDA C shown in the remainder of the paper were performed using double-precision arithmetic.
3 Leveraging hardware accelerators

3.1 Implementation on graphical processing units

Our GPU algorithm development effort is motivated by the aim to resolve the coupled thermomechanical system of equations (Eq. 12-13) with high spatial and temporal accuracy in 3-D. To this end, we exploit the low-level intrinsic parallelism of shared memory devices, particularly targeting GPUs. A GPU is a massively parallel device originally devoted to rendering the colour values of pixels on a screen independently of one another, where latency can be masked by high throughput (i.e. computing as many jobs as possible in a reasonable time). A schematic representation (Figure 2) highlights the conceptual difference between GPU and CPU: on the GPU chip, most of the area is devoted to arithmetic units, while on the CPU, a large area of the chip hosts scheduling and control units.
The development of GPU-based solvers requires that one devote time to the design of new algorithms that leverage the massively parallel potential of current GPU architectures. Considerations such as limiting memory transfers to the mandatory minimum, avoiding complex data layouts, and preferring matrix-free solvers with a low memory footprint and optimal parallel scalability over classical direct-iterative solver types (Räss et al., 2019a) are key to achieving optimal performance.

3.2 Multi-GPU implementation
We rely on a distributed memory parallelisation using the message passing interface (MPI) library to overcome the on-device memory limitation inherent to modern GPUs and to exploit the computing power of supercomputers. Access to a large number of parallel processes enables us to tackle larger computational domains or to refine the grid resolution. We rely on domain decomposition to split the global computational domain into local domains, each executing on a single GPU handled by an MPI process. Each local process has its boundary conditions defined by (a) the physics if located on the global boundary, or (b) information exchanged with the neighbouring process in the case of internal boundaries. We use CUDA-aware non-blocking MPI messages to exchange the internal boundaries among neighbouring processes. CUDA-awareness allows us to bypass explicit buffer copies in host memory by directly exchanging GPU pointers, resulting in enhanced workflow pipelining. Our algorithm implementation and solver require no global reduction. Thus, there is no need for global MPI communication, eliminating an important potential scaling bottleneck. Although the proposed iterative and matrix-free solver features high locality and should scale by construction, a growing number of MPI processes may degrade the parallel runtime performance by about 20% owing to the increasing number of messages and overall machine occupancy (Räss et al., 2019b). We address this limitation by overlapping MPI communication and the computation of the inner points of the local domains using streams, a native CUDA feature.

CUDA streams allow one to schedule kernel execution asynchronously and thus enable the overlap between communication and computation, resulting in optimal parallel efficiency.

Experiments 1 and 2 seek first to verify the implementation of the mechanical part of the Stokes solver, which is the computationally most expensive part (Eq. 8). For these experiments, we assume that the ice is isothermal and neglect temperature.
We compare our numerical solutions to the solutions obtained by the commonly used finite-element Stokes solver Elmer/Ice (Gagliardini et al., 2013), which has been thoroughly tested (Pattyn et al., 2008; Gagliardini and Zwinger, 2008). Experiment 3 is a thermomechanically coupled case. The model parameters are the stress exponent $n$, the mean bed slope $\alpha$ and the two horizontal extents $L_x$ and $L_y$ in their respective dimensions ($x$, $y$); they appear in Table 1. If a linear basal sliding law (Eq. 6) is prescribed, the respective 2-D and 3-D sliding coefficients are defined in terms of a chosen non-dimensional constant $\beta_0$. Differences may arise depending on the prescribed values of the parameters $\alpha$, $L_x$, $L_y$ and $\beta_0$. Experiment 2 represents the ISMIP experiments C and D for $L = 10$ km (Pattyn et al., 2008), but in our case using non-dimensional variables.
The mechanical part of Experiment 3 is analogous to Experiment 2. The boundary conditions are periodic in the $x$ and $y$ directions. The thermal problem requires additional boundary conditions in terms of temperature or fluxes. We set the surface temperature $T_0$ to 0. At the bottom, we set the vertical flux $q_z$ to 0 and, on the sides, we impose periodic boundary conditions. Comparing the two model solutions (Figure 5b, d, f), we find good agreement between them.

Experiment 2: Stokes flow with basal sliding
We now consider the case where ice slides at the base (ISMIP experiments C and D). We prescribe periodic boundary conditions in the horizontal directions and the basal sliding law (Eq. 6) in the vertical direction. Figure 6 shows the results of the 2-D simulation of Experiment 2, where we employed a numerical grid resolution of 511 × 127 grid points (≈ 1.95 × 10^5 DOF) for the PT GPU-based solver and computed the Elmer/Ice solution using a numerical grid resolution of 241 × 120 (≈ 8.7 × 10^4 DOF). We show both the $v_x$ and $v_z$ velocity components at the slab's surface. The two models' results agree well.

The 3-D simulation results for Experiment 2 appear in Figure 7. The upper panels (Figure 7a, c, e) show the spatial pattern of the three surface velocity components $v_x$, $v_y$ and $v_z$ computed with our PT GPU-based solver. The lower panels (Figure 7b, d, f) compare the three surface velocity components at $y \approx L_y/4$ computed by our PT GPU-based solver to Elmer/Ice. We find good agreement between the two numerical implementations, despite some discrepancies in the horizontal velocity component $v_y$. A potential explanation for the minor mismatch is that the finite-element grid does not exactly coincide with the location $y = L_y/4$ in Elmer/Ice, which may be resolved by specifically pinning nodes of the finite-element mesh.

Since the flow is mainly oriented in the $x$ direction, the $v_y$ velocity component is more than two orders of magnitude smaller than the $v_x$ velocity component. Numerical errors in $v_y$ are therefore more apparent than in the leading velocity component $v_x$. We report an order-of-magnitude increase in time-to-solution in Experiment 2 compared to the Experiment 1 configuration, owing to the periodicity of the lateral boundaries.

We ensure that all results collapse onto the semi-analytical 1-D model solution (Section 2.3), which we obtained by analytically integrating the velocity field and solving the decoupled thermal problem separately (Eq. 11). From a computational perspective, we solve Eq. (11) numerically with high spatial and temporal accuracy and thereby minimise the occurrence of numerical errors. We establish the 1-D reference solution for both the temperature and the velocity profile by solving Eq. (11) on a regular grid, reducing the physical time steps until we converge to a stable reference solution. The semi-analytical 1-D solution enables us to evaluate the influence of the numerical coupling method and time integration, and to quantify when and why high spatial resolution is required in thermomechanical ice flow simulations. We compare the 1-D semi-analytical reference solution (Eq. 11) to the results obtained with the 1-D PT-based solver for three spatial numerical resolutions ($n_z$ = 31, 95 and 201 grid points) at three non-dimensional times, 1 × 10^8, 2 × 10^8 and 2.9 × 10^8 (Figure 10). The grey area in Figure 10 highlights where the melting temperature is exceeded. Since our semi-analytical reference solution does not include phase transitions, we also neglect this component in the numerical results. During the early stages of the simulation, the thermomechanical coupling is still minor, and the solutions at all resolution levels are in good agreement with one another and with the reference. The low-resolution solution starts to deviate from the reference (Figure 10b) when the coupling becomes more pronounced, close to the thermal runaway point (Clarke et al., 1977). The high spatial resolution solution is satisfactory at all stages. We conclude that high spatial resolution is required to accurately capture the non-linear coupled behaviour in regimes close to thermal runaway, which is seldom the case in the models reported in the literature.

Experiment 3: Thermomechanically coupled Stokes flow without basal sliding
Thermomechanical strain localisation may significantly impact the long-term evolution of a coupled system. A recent study by Duretz et al. (2019) suggested that partial coupling may result in under-estimating the thermomechanical localisation. [Figure 10 caption: Comparison of the numerical solutions against the 1-D reference for the low- and high-resolution runs, respectively, at the non-dimensional times 1 × 10^8, 2 × 10^8 and 2.9 × 10^8. The shaded areas correspond to the part of the solution that is above the melting temperature, since we do not account for phase transitions in this benchmark.]
All three models start with identical initial conditions for the thermal problem, i.e. ∆T = 0 throughout the entire ice slab, and use an implicit backward Euler method for the time integration.
The differences between the models arise owing to the different stress distributions in 1-D, 2-D or 3-D. For instance, the additional stress components inherent to 2-D and 3-D are of the same order of magnitude as the 1-D shear stress for the considered aspect ratio, reducing the horizontal velocity $v_x$ in the 2-D and 3-D models. This also impacts the shear-heating term, reducing the source term in the temperature evolution equation. In the 1-D configuration, the single shear stress tensor component is a function only of depth. At the other end-member, the 3-D configuration allows for a spatially more distributed stress state, which lowers the strain rates and reduces the magnitude of shear-heating in higher dimensions. The spatially heterogeneous temperature and strain-rate fields require sufficiently high spatial numerical resolution in all directions in order to accurately resolve spontaneous localisation.
We did not consider phase transitions in the previous experiments for the sake of model comparison and because the analytical solution excludes this process. The existence of a phase transition caps the temperature at the pressure melting point in regions with pronounced shear-heating, as illustrated in 2-D in Figure 14. [Figure 14 caption: Experiment 3 including a phase transition owing to melting. We report the evolution in time of the non-dimensional temperature variation ∆T along a vertical profile picked at location x = Lx/2 within a 2-D run from Experiment 3, run for a duration of 2.9 × 10^9.] The simulation represents the thermomechanically coupled Experiment 3 with no sliding and heat-impermeable walls (similar to Figure 13). Meltwater production consumes the excess heat generated by shear-heating. Thus, melting provides a physical mechanism that prevents thermal runaway in shear-heating-dominated zones in the ice. The experiment duration in dimensional units is 70 years, and the maximal temperature increase is 10 °C upon reaching the melting point.

The computational performance
We used two metrics to assess the performance of the developed PT algorithm: the effective memory throughput (MTP_eff) and the wall-time. We first compare the effective memory throughput of the vectorised MATLAB CPU implementation and the single-GPU CUDA C implementation. We employ double-precision (DP) floating-point arithmetic in CUDA C for a fair comparison. Second, we employ the wall-time metric to compare the performance of our various implementations (MATLAB, CUDA C) and compare these to the time-to-solution of the Elmer/Ice solver.
We use two methods to solve the linear system in Elmer/Ice. In the 2-D experiments, we use a direct method and in 3-D, an iterative method. The direct method used in 2-D relies on the UMFPACK routines to solve the linear system. To solve the 3-D experiments, we employ the available bi-conjugate gradient stabilised method (BiCGStab) with ILU0 preconditioning.
We employ the configuration of Experiment 1 for all performance measurements. We use an Intel i7 4960HQ 2.6 GHz (Haswell) four-core CPU to benchmark all CPU-based calculations. For simplicity, we only ran single-core CPU tests, refraining from any CPU parallelisation of the algorithms. Thus, our MATLAB and Elmer/Ice single-core CPU results

are not representative of the CPU hardware capabilities and are only reported for reference. Memory bandwidth is the performance-limiting factor on modern multi-core processors, since current chip designs tend towards large flop-to-byte ratios. Over the past years and decades, the increase in memory bandwidth has been much slower than the increase in the rate of floating-point operations.
As shown by Omlin (2017) and Räss et al. (2019a), a relevant metric to assess the performance of memory-bounded algorithms is the effective memory throughput, MTP_eff (Eq. 21). The MTP_eff determines how efficiently data are transferred between the main memory and the arithmetic units and is inversely proportional to the execution time:
\[ \mathrm{MTP}_{\mathrm{eff}} = \frac{(n_x n_y n_z)\, n_{\mathrm{iter}}\, n_p\, n_{\mathrm{IO}}}{t_{nt}}, \]
where $(n_x n_y n_z)$ stands for the total number of grid points, $n_{\mathrm{iter}}$ is the total number of numerical iterations performed, $n_p$ is the arithmetic precision (single: 4 bytes, double: 8 bytes), $t_{nt}$ is the wall-time in seconds needed to compute the $n_{\mathrm{iter}}$ iterations, and $n_{\mathrm{IO}}$ is the number of memory accesses performed. The latter represents the minimum number of memory operations (read-and-write or read-only) required to solve a given physical problem.

The 3-D performance results obtained on various available Nvidia GPUs are summarised in Figure 16. We performed all calculations using double-precision arithmetic. We compare the MTP_eff and wall-time values as functions of the DOF.
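As a usage note, the metric is straightforward to evaluate from measured quantities; the MATLAB sketch below uses purely illustrative numbers (n_IO in particular must be counted for the specific solver at hand):

% Effective memory throughput MTP_eff in GB/s (illustrative numbers).
nx = 255; ny = 255; nz = 255;     % grid points in each direction
niter = 1e4;                      % total number of PT iterations
np    = 8;                        % bytes per number (double precision)
nIO   = 10;                       % assumed minimal memory accesses/point
tnt   = 120;                      % measured wall-time in seconds
MTPeff = nx*ny*nz*niter*np*nIO/tnt/1e9;
fprintf('MTP_eff = %.1f GB/s\n', MTPeff)

Comparing MTP_eff against the vendor's peak memory bandwidth indicates how close a memory-bound implementation runs to the hardware limit.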
We tested GPUs from various price ranges and chip generations, targeting entry-level GPUs such as the Nvidia Quadro P1000 (Pascal), high-end gaming cards such as the Nvidia Titan Black (Kepler) and the Nvidia Titan X (Maxwell), and data-centre-class GPU accelerators such as the Nvidia Tesla V100 PCIe (Volta). Recent data-centre GPUs such as the Tesla V100 (Volta) offer a significant (order of magnitude) performance increase over entry-level GPU accelerators such as the Quadro P1000.
We report the performance of the GPU-MPI implementation of our solver executing on distributed memory machines. We achieved a weak-scaling parallel efficiency of 93% on 128 Nvidia Titan X (Maxwell) GPUs on the octopus supercomputer at the Swiss Geocomputing Centre, University of Lausanne, Switzerland. As a baseline, we employed a non-MPI single-GPU calculation. We then repeated the experiment using 1 to 128 MPI processes (and thus GPUs) and report the normalised execution time (Figure 17). The effective drop in parallel efficiency is only 4% going from 1 to 128 MPI processes. We achieved this close-to-optimal parallel efficiency by overlapping MPI message communication and local domain stencil calculations. We specifically employed a CUDA stream in order to execute the communication and computation overlap asynchronously. We performed a similar experiment on the volta node, a computer node based on 8 Nvidia Tesla V100 32 GB (NVLink, Volta) GPUs (analogous to Nvidia's DGX-1 box), reporting a parallel efficiency of 98.5%, with a single MPI process running at 99% of the non-MPI reference.

Discussion
Numerically resolving thermomechanical processes in ice is vital for improving our understanding of the complex behaviour of ice sheets and glaciers. To date, very few studies have investigated the numerical aspects of thermomechanically coupled Stokes solvers (e.g., Duretz et al., 2019). Existing assessments (e.g., Zhang et al., 2015) usually employed low spatial resolution and did not address the influence of the numerical implementation of multi-physics coupling strategies or the role of numerical time integration. To avoid the significant computational expense of a thermomechanically coupled full Stokes model, many studies relied either on the computationally less expensive shallow ice approximations, on linear or linearised Stokes models, or on low spatial resolutions. None of these approaches resolve the multi-physics and multi-scale processes governing the boundaries of streaming ice, including shear margins, grounding zones and the basal interface.
To address these limitations, we have developed a new numerical model based on an iterative pseudo-transient finite-difference method. Our discretisation yields a concise matrix-free algorithm well suited to exploiting the intrinsic parallelism of modern hardware accelerators. [Figure 17 caption: Weak-scaling parallel efficiency on Nvidia Titan X and Tesla V100 32 GB NVLink data-centre GPUs, available via the octopus supercomputer and the volta node, respectively. Note that the execution-time baseline used to compute the parallel efficiency represents a non-MPI calculation.]
The significant temperature dependence of the shear viscosity of ice leads to pronounced spatial variations in viscosity, which affect the convergence rate of our iterative PT method. Resolving shear flow localisation is challenging in this context, since it requires the simultaneous minimisation of errors in regions of the computational domain that are governed by different characteristic time scales. Our PT approach allows us to capture the resulting spatial heterogeneity and offers a physically-motivated strategy to locally ensure stability of the iterative scheme using local pseudo-time steps, analogous to diagonal preconditioning in matrix-based approaches. The conciseness and simplicity of the implementation allow us to explore the influence of various coupling methods and time integrations in a straightforward way. Similar arguments suggest that the PT approach is an interesting choice for educational purposes.
We quantify the scalability of our approach through extensive performance tests, investigating both the time-to-solution and the efficiency in exploiting the current hardware capabilities at their maximal capacity. To verify the accuracy and coherence of the proposed results, we performed a set of benchmark experiments, obtaining excellent agreement with results from the widely used glacier flow model Elmer/Ice. Experiment 3 verifies that, under the assumption of periodic configurations, the 1-D, 2-D and 3-D models return matching results.
Further, we have tested the accuracy of our numerical solutions for different time integration schemes, including forward (explicit) and backward (implicit) Euler, and different physical time steps. The numerical time step must be chosen sufficiently small to resolve the relevant physical processes. We limited the maximal time step in the explicit time integration scheme by the CFL stability criterion for temperature diffusion. For high spatial numerical resolutions, the CFL-based time step restriction is sufficient to resolve the coupled thermomechanical process. However, this conclusion does not hold at coarse spatial resolution, where the stability criterion permits time step values larger than the non-dimensional time (2 × 10^8) needed to raise the temperature; in that case, we did not sufficiently resolve the physical process. An implicit scheme for the time integration remedies the stability issue, but does not guarantee accuracy.
Independent of the numerical time integration scheme used, the range of time step values that resolve the coupled physics is close to the explicit stability criterion.
Our multi-GPU implementation of the thermomechanical PT solver achieved a close-to-ideal parallel efficiency, featuring a runtime drop of only 4% compared to a single MPI process execution (a 7% deviation from a single non-MPI GPU runtime).
We achieve this optimal domain decomposition parallelisation by overlapping communication and computation using native CUDA streams. This CUDA feature enables asynchronous compute kernel execution. Similar implementation and parallel scaling results were recently achieved for hydro-mechanical couplings (Räss et al., 2019a, b).

Conclusions
We have developed an iterative solver to efficiently exploit the capabilities of modern hardware accelerators such as GPUs.
We report fast execution times on single GPUs by monitoring and optimising memory transfers. We achieved a close-to-ideal parallel efficiency (93%) in a weak-scaling test on up to 128 GPUs by overlapping MPI communication and computations. We implemented the coupled thermomechanical PDEs using our iterative PT approach in a way that follows straightforwardly from the mathematical model. These technical advances and the utilisation of GPU accelerators enabled us to investigate the thermomechanical coupling and to resolve the first-order physics governing ice flow in 3-D at high spatial and temporal resolution.
We benchmarked the mechanical solver of the coupled model against the community-standard model Elmer/Ice in a set of experiments specifically designed to test the mechanical solver. We further investigated explicit and implicit coupling and time integration strategies. We report that the physical time step must be chosen with care: sufficiently high temporal resolution is mandatory in order to accurately resolve the coupled physics. Although minor differences arise between uncoupled and coupled approaches, we observe less localisation in uncoupled models compared to fully coupled ones.
We established that relatively high spatial numerical resolution is necessary to resolve the non-linear and spontaneous localisation of thermomechanically coupled ice flow, requiring more than 100 grid points in the vertical direction. We stress that spatial variations in the horizontal plane can significantly impact the ice flow dynamics, justifying high spatial numerical resolution in all directions. We finally report that considering the full 3-D stress tensor can significantly slow down the process of thermal runaway, which can ultimately be prevented by considering phase transitions.
GPUs are compact, affordable and relatively programmable devices that offer high performance throughput (close to TB/s peak memory throughput) and a good price-to-performance ratio. GPUs offer an attractive alternative to conventional CPUs owing to their massively parallel architecture featuring thousands of cores. The presented models leverage this modern technology and enable us to gain further process-based understanding of ice flow localisation. Resolving the coupled processes at very high spatial and temporal resolutions provides future avenues to address current challenges in accurately predicting ice sheet dynamics.

Competing interests. The authors declare that they have no conflicts of interest.