A hierarchical mesh reﬁnement technique for global 3-D spherical mantle convection modelling

. A method for incorporating multi-resolution capabilities within pre-existing global 3-D spherical mantle convection codes is presented. The method, which we term “ge-ometric multigrid reﬁnement”, is based upon the application of a multigrid solver on non-uniform, structured grids and allows for the incorporation of local high-resolution grids within global models. Validation tests demonstrate that the method is accurate and robust, with highly efﬁcient solutions to large-scale non-uniform problems obtained. Significantly, the scheme is conceptually simple and straightforward to implement, negating the need to reformulate and restructure large sections of code. Consequently, although more advanced techniques are under development at the fron-tiers of mesh reﬁnement and solver technology research, the technique presented is capable of extending the lifetime and applicability of pre-existing global mantle convection codes.

Whilst the uniform discretizations and algorithms used in such codes have their advantages in terms of storage, data structure and parallelization, they do not exploit computer power to its full potential, since local variations in resolution are not possible.Consequently, despite the large supercomputing clusters available today, these codes have difficulty in resolving the important fine-scale physics (i.e.thermal boundary-layers, upwelling plumes and downwelling slabs) within a high Rayleigh number global mantle convection simulation.The development of efficient multi-resolution numerical methods for such problems has become a major goal of current research.

D. R. Davies et al.: Geometric multigrid refinement techniques for mantle convection
In computational engineering, non-uniform resolution is usually attained via unstructured grids, with solution accuracy and computational efficiency improved through errorguided grid adaptivity (e.g.Peraire et al., 1987;Hassan et al., 1995;Nithiarasu and Zienkiewicz, 2000).Such techniques (and similar adaptive techniques that are based upon hierarchical mesh refinement) have recently been applied within the mantle dynamics community (e.g.Davies et al., 2007Davies et al., , 2008;;Stadler et al., 2010;Leng and Zhong, 2011), leading to the development of several state-of-the-art computational frameworks for simulating global mantle convection.The most notable examples are: (i) Fluidity (Davies et al., 2011;Kramer et al., 2012); (ii) ASPECT (Kronbichler et al., 2012); and (iii) RHEA (Stadler et al., 2010;Burstedde et al., 2013).Such codes, which employ cutting-edge methods in mesh refinement, solver technology and parallelisation, open up a whole new class of problems for mantle dynamics research, as demonstrated by Stadler et al. (2010).However, although perhaps more limited in their applicability, more established codes, which are based upon older numerical methods, remain heavily utilised within the community (e.g.Nakagawa and Tackley, 2008;Schuberth et al., 2009;Davies and Davies, 2009;Nakagawa et al., 2009;Zhang et al., 2010;Wolstencroft and Davies, 2011;Tan et al., 2011;Davies et al., 2012;Miller and Becker, 2012;Bower et al., 2013).A means to extend the lifetime and applicability of such codes is therefore highly desirable.
In this paper we introduce a method, which we term "geometric multigrid refinement" (e.g.Brandt, 1977;Thompson et al., 1992;Albers, 2000), that offers a practical solution to the limitations of current codes.The approach maintains the key benefits of the current uniform discretizations, but allows for local variations in resolution.It is conceptually simple and, perhaps most importantly, straightforward to implement.In addition, it is suitable for finite element, finite difference and finite volume schemes and, thus, is applicable to several codes within the community.
The paper begins with a general introduction to the underlying methodology.The numerical issues involved in implementing the scheme are then outlined.The technique is subsequently validated, using the well-established 3-D spherical mantle convection code TERRA as a basis: model predictions are compared with analytical and benchmark solutions (e.g.Hager and O'Connell, 1981;Richards and Hager, 1984;Bercovici et al., 1989;Stemmer et al., 2006;Choblet et al., 2007;Tackley, 2008;Zhong et al., 2008).Results indicate that the proposed methodology is highly successful, generating accurate solutions at a reduced computational cost.Although a thorough validation of TERRA is beyond the scope of this study, results also demonstrate that TERRA is robust and accurate for the class of problems examined herein.

Geometric multigrid refinement
Geometric multigrid (e.g.Brandt, 1984;Briggs et al., 2000) is an amalgamation of ideas and techniques, combining iterative solution strategies and a hierarchy of grids, to form a powerful tool for the numerical solution of differential equations.The basic idea behind the technique is to work not with a single grid, but with a sequence of grids ("levels") of increasing coarseness, to improve the slow convergence of classical iterative/relaxation methods (see, for example, Brandt, 1984, for further details).Multigrid schemes can be applied in combination with any of the common numerical discretization techniques and, consequently, have been widely used within the geodynamical community (e.g.Baumgardner, 1985;Tackley, 1996Tackley, , 2008;;Bunge et al., 1997;Zhong et al., 2000Zhong et al., , 2008;;Kameyama et al., 2005;Choblet et al., 2007).
Excluding the recent examples cited above, the mantle convection modelling community has generally applied multigrid to programs with uniform discretizations at each grid level (Fig. 1a).This makes programming more straightforward and avoids the computational overhead of dealing with varying mesh spacing.However, as outlined above, in global mantle convection simulations, uniform grids lead to an excessive problem size and, hence, models that are computationally inefficient.Grid-refinement is needed, predominantly in the system's boundary layers, whilst coarser resolution is often sufficient away from the boundary layers, where the solution is smoother.Fortunately, geometric multigrid algorithms are not restricted to truly uniform discretizations.
The approach described here recovers the flexibility of non-uniform grids by exploiting the fact that the various grids used in the usual multigrid cycles need not extend over the whole domain (e.g.Brandt, 1977;Bai and Brandt, 1987;Thompson et al., 1992;Lopez and Casciaro, 1997;Albers, 2000).The finest levels may be confined to progressively smaller subdomains, thereby providing higher resolution where required.These "local patches" are treated identically to "global" grids in the multigrid algorithm, only that their boundary values are obtained via interpolation from coarser grids, where needed.In such a structure, the effective mesh-size in each region will be that of the finest grid covering it.It is the limited extent of the fine-grid that provides the benefits to the method.
To illustrate this concept, consider a simple domain, consisting of four grid levels that are discretized by quadrilateral elements (Fig. 1b).Suppose grid level one and two extend over the entire domain, as is standard practice for multigrid programs.However, grid level three is confined to a smaller region, in the domains lower-right-hand quadrant.Grid four complements grid three, with further localized element subdivision.Thus, the final non-uniform grid is made up of four distinctive grid levels.A standard bisection refinement rule is employed; each quadrilateral element is split into four elements at the next grid level; (b) a non-uniform grid and the uniform levels it is made of.In essence, a non-uniform grid is a union of uniform sub-grids.However, unlike the traditional grids utilized in a multigrid, the sub-grids do not necessarily extended over the same domain.
This structure is highly flexible, since local grid refinement (or coarsening) is done by extending (or contracting) uniform grids, which is relatively easy and inexpensive to implement.Recurrent operators can be used for both relaxation and transfer procedures and a simple data structure can be employed.Furthermore, the use of partial grids leads to a considerable saving in both computational memory and operations, especially when only a small region of the domain requires grid refinement (such as the boundary layers).There does, however, appear to be a certain waste in the proposed system, as one function value may be stored several times, when its grid point belongs to several levels.This is not the case.Firstly, the amount of such extra storage is small, being less than 2 −d of the total storage, for a d-dimensional problem (Brandt, 1977).Moreover, the stored values are exactly those needed for the multigrid solution process.
This method of local refinements is based upon the Full Approximation Storage (FAS) mode of multigrid processing, where the full approximation is stored at all grid levels (see Brandt, 1977Brandt, , 1984)); in parts of the domain not covered by the finer grid, the coarser grid must show the full solution, not just a correction, as occurs with the correction scheme (CS) mode of multigrid processing.

Implementation
The key aspects involved with implementing the multigrid refinement technique within a pre-existing 3-D spherical mantle convection code are covered in this section.The wellestablished code TERRA is utilized to illustrate and validate the key ideas, although, as noted previously, the methodology is equally applicable to other codes and, hence, the findings of this study will be of benefit to the wider geodynamical community.For completeness, a brief overview of TERRA is first presented.This is intended to: (i) provide the reader with a background to the code's fundamental building blocks; and (ii) summarize recent developments to the code.

TERRA
TERRA is a well-established finite element mantle convection code that was first developed by Baumgardner (1985) and has been further modified by Bunge et al. (1996Bunge et al. ( , 1997) ) and Yang and Baumgardner (2000).The code solves the equations governing mantle convection inside a 3-D spherical shell with appropriate boundary conditions.Assuming incompressibility and the Boussinesq approximation (e.g.McKenzie et al., 1974), these equations, expressed in their non-dimensional form, are: where, u is the fluid velocity vector, p denotes dynamic pressure, T temperature, t time, κ thermal diffusivity, g gravitational acceleration, µ dynamic viscosity and k the unit radial vector.Note that the above non-dimensional equations are obtained from the following characteristic scales: mantle depth d; time d 2 /κ; and temperature T .The spherical shell is discretized by an icosahedral grid (Baumgardner and Frederickson, 1985).By projecting the regular icosahedron onto a sphere, the spherical surface can be divided into twenty identical spherical triangles, or ten identical diamonds, each of which contains one of the ten triangles surrounding each pole.Triangles can subsequently be subdivided into four triangles by construction of great circle arcs between triangle side mid-points.This refinement process can be repeated to yield an almost uniform triangulation of the sphere at any desired resolution.Refinements to the grid and, hence, lateral resolution, are referenced by mtthe number of grid intervals along an icosahedral diamond edge.The number of nodes on a spherical surface is given by 10mt 2 + 2 (there are ten icosahedral diamonds on each surface and two polar nodes).The grid is extended radially by placing several of these spherical shells above one another, generating a mesh of triangular prisms (layers) with spherical ends.The number of radial layers, nr, is flexible, but is usually set to mt/2.The total number of nodes in the spherical shell (with the standard uniform grid) is thus given by (nr + 1)(10mt 2 + 2).TERRA utilizes the Galerkin finite element formulation.Excluding pressure, which is piecewise constant and discontinuous, all dynamic variables use local linear basis functions.The discretised form of Eqs. ( 1) and ( 2) are solved with a Uzawa type pressure correction approach, coupled to a conjugate gradient algorithm.The basis of this approach is that the velocity and pressure, determined by solving Eq. ( 2) alone, should be corrected until Eq. ( 1) is satisfied (Yang and Baumgardner, 2000).The algorithm was originally proposed by Verfuerth (1984) and is outlined in detail by both Atanga and Silvester (1992) and Ramage and Wathen (1994).The algorithm exploits a multigrid inner-solver and, hence, TERRA is ideal for investigating and validating the geometric multigrid refinement methodology.The discretised form of Eq. ( 3) is solved by means of a flux-form finite difference method (see Baumgardner, 1985, for further details), while time-stepping is accomplished through a fourth-order Runge-Kutta scheme (Davies and Davies, 2009).

Reference non-uniform discretization
A reference discretization is now introduced, which will be used to illustrate (and validate) the key implementation procedures.For simplicity, fine and coarse-grid regions are selected a priori (i.e. the procedure is non-adaptive).Additionally, whilst refinement is performed radially and laterally, it is done as a function of radius only (i.e.lateral resolution is constant for each individual radial layer, but can vary between layers).It should be noted however that the strategies employed are equally valid for full lateral refinement (i.e.variations in lateral spacing, within individual radial layers).
The following discussion will focus on the discretisation displayed in Fig. 2a.The spherical shell is separated into two distinct regions (fine and coarse); the upper half (fine) is discretized by one additional refinement level (i.e. the number of nodes in each radial layer increases by a factor of 4 and there are twice as many radial layers: note that the interface between fine and coarse regions can be placed at an arbitrary, user-defined radius).Such a configuration allows the multi-level processes to be illustrated via one-dimensional diagrams.Nonetheless, in spite of its simplicity, it overcomes many disadvantages of TERRA's conventional quasiuniform structure (termed uniform for the remainder of this paper).With the original uniform scheme: 1. Grid resolution can only be increased in fixed step sizes, with successive refinements requiring an ≈ 8-fold increase in the number of nodes (the number of nodes increases by a factor of 4 and 2, laterally and radially, respectively).The solution to a given problem therefore requires ≈ 8-times more RAM and ≈ 16-times more CPU-time at the next level of refinement, the increased factor in CPU-time resulting from the need to decrease the time-step, due to the CFL constraint.A local increase in resolution is not possible.
2. Element sizes and inter-nodal distances are dependent upon radius, with the grid points denser at the inner boundary than at the surface.As a consequence, the lower boundary layer is often better resolved than its surface counterpart, whilst the dynamically controlled time-step is restricted.
The non-uniform discretization presented in Fig. 2a overcomes these shortcomings.Upper mantle resolution is enhanced when compared to the original scheme and, hence, element sizes and inter-nodal distances show greater consistency over the entire domain.Thermal boundary layers can therefore be simulated at similar resolutions and the dynamically controlled time-step becomes better-suited to the problem under examination (it is not unnecessarily restricted by smaller elements at the base of the shell).Perhaps most importantly, resolution can be increased locally, which offers greater flexibility.
There are also significant benefits to this configuration from a geophysical perspective.The viscosity of Earth's mantle increases significantly with depth (e.g.Hager et al., 1985;Mitrovica and Forte, 2004).As a consequence, finescale features will likely dominate the upper mantle convective planform, with longer wavelength features more prevalent at depth.The reference discretization presented, with higher resolution in the upper mantle, is ideally suited to such a scenario.

Numerical issues
Two key numerical issues must be addressed when implementing the multigrid refinement scheme: 1. Non-conforming grids (i.e. the presence of irregular points, or "hanging nodes", at grid interfaces).
Figure 3a illustrates the radial location of all genuine solution nodes (s nodes) at the grid interface, along line A-B of Fig. 2a.A hanging node arises at this interface, where two fine-grid elements connect with one coarse-grid element.At this node, the usual nodal solution stencil is no longer applicable; it should be modified to involve both fine-and coarsegrid components, as is done in, for example, Burstedde et al. (2013).However, within the context of pre-existing codes, such topological and mathematical complexity would be inconvenient, requiring coding of new operators and subsequently, major changes in code structure.As a result, a different route is taken here.For computational convenience,   The problem, modified to show the r nodes which are introduced for computational convenience.These nodes are not unknowns in our system of equations, but dummy nodes that are introduced to allow consistent solution derivation at all genuine nodes.With just one boundary layer (the r nodes), fine-grid solution continuity will not be satisfied during inter-grid projection: the r nodes would act as boundary values during fine-grid calculations and would not be updated.Consequently, the values projected from the fine-grid to the encircled c nodes (coarse grid nodes utilized during the multigrid process that do not explicitly contribute to the final solution), during the multigrid solution process, would be derived from nodal solutions with both fine and coarse-grid accuracy.Accordingly, a second layer of dummy nodes, the t nodes, are introduced, as illustrated in part (c).Their inclusion and the subsequent updating of r nodes during fine-grid calculations, ensures solution continuity during inter-grid projection.In summary, s nodes are the genuine nodes, intrinsic to the final solution.r and t nodes are utilized during fine-grid calculations and inter-grid projection, whilst c nodes represent coarse-grid nodes, that are integral to the overall multigrid process, but do not explicitly contribute to the final solution.
www.  Thompson et al., 1992).This band contains two layers of nodes: r and t nodes (relaxation and transfer), which are displayed in Fig. 3b and c respectively.These are initialized via interpolation from the coarse-grid solution.t nodes act as boundary values during fine-grid calculations and remain unmodified, whilst s and r nodes, now separated from the interface, are updated.t nodes thus ensure that fine-grid accuracy is transferred to the coarse-grid during inter-grid projection (see Fig. 3b).By utilizing uniform grids at each level of refinement, TERRA's standard operators can be used for both relaxation and inter-grid transfer procedures.In addition, the radial refinement structure fits in perfectly with TERRA's existing parallelization and domain decomposition configuration, retaining the equal load-balance of the original scheme (see Sect. 4 for further details).This is of utmost importance for computational efficiency.The major benefit to this technique however, is the ease at which it can be implemented, which is of great practical importance.No significant revisions were made to TERRA.Only minor modifications were necessary, which are listed below: 1.The multigrid was converted from the CS to the FAS mode of multigrid processing.For linear problems, the CS and FAS modes are directly equivalent (Brandt, 1984).Conversion to the FAS mode was simply a case of storing the full current approximation, which is the sum of the correction and its base approximation, at each grid level, as opposed to the correction alone.
2. Inter-grid transfer routines, to and from the fine-grid, were localized: projection from the fine-grid was modified to involve only fine-grid s and r nodes.Interpolation to the fine-grid was modified to initialize s, r and t nodes.Pre-existing inter-grid transfer operators were utilized.
3. Fine-grid solution routines were localized, with t nodes acting as boundary values during calculations.
The whole multigrid transfer process for the modified multilevel scheme, in the context of a four level cycle, is presented in Fig. 2c.

Methodology validation
The accuracy of the multigrid refinement algorithm, in addition to TERRA, is examined by comparing results from the modified code with analytical solutions and previously published numerical predictions.It should be emphasized that the goal of this paper is not a thorough benchmark of TERRA.While we realize that further benchmark tests/comparisons are possible, our aim here is to demonstrate that the geometric multigrid refinement technique is valid (i.e. it does not degrade results in comparison to the uniform discretization/solution algorithms).Note that for all simulations presented in this paper, the interface between fine and coarse regions of the domain is placed at ≈ 750 km depth, with refinement restricted to one level only.In addition, we focus solely on isoviscous convection.Whilst TERRA's robustness at simulating variable viscosity convection has recently been improved (see Koestler, 2011, for further details), these developments have not yet been combined with the geometric multigrid refinement technology.

Stokes flow
The first set of problems examined exclusively test the solution of Eqs. ( 1) and ( 2).Comparisons are made with quasi-analytical solutions, derived via propagator matrix methods (e.g.Hager and O'Connell, 1981;Richards and Hager, 1984).We specifically examine the response of: (i) normalized poloidal velocity coefficients at the surface and CMB; (ii) surface and CMB topography; and (iii) the predicted geoid; to a spherical harmonic temperature perturbation at a specified depth in the spherical shell.Such analytical comparisons have previously been used to validate numerous global mantle convection codes (e.g.Choblet et al., 2007;Zhong et al., 2008;Burstedde et al., 2013).
The problem is set up as follows.The inner radius is set to mimic that of Earth's core mantle boundary, r b = 3480 km, while the outer radius is set to equal that of Earth's surface, r t = 6370 km.Free-slip mechanical boundary conditions are specified at both surfaces (note that we have modified the treatment of free-slip boundary conditions from the original version of TERRA, to more accurately account for surface curvature).The driving force is a delta function temperature perturbation in radius, defined as: Here, Y lm is the spherical harmonic function of degree l and order m.Cases are investigated at a range of spherical harmonic degrees (2, 4, 8, 16) and grid resolutions, for both uniform and non-uniform grids.
Results are presented in Fig. 4.These demonstrate that, in general, both uniform and non-uniform configurations agree well with analytical solutions.Furthermore, although there are exceptions, non-uniform configurations generally yield a better accuracy for a given number of nodal points (i.e. they generally plot on, or below, the uniform cases).For all diagnostics, results are convergent, with approximately second order convergence observed in the errors for poloidal velocity coefficients, as would be expected.The agreement between model predictions and analytical solutions does diminish as one goes to higher and higher harmonic degrees.However, this is to be expected and is consistent with the predictions of previous studies (e.g.Choblet et al., 2007;Zhong et al., 2008;Burstedde et al., 2013).

Low Rayleigh number convection
We next examine three cases of low Rayleigh number, symmetric, 3-D flows.These cases, which have also been examined by a range of other codes (e.g.Bercovici et al., 1989;Ratcliff et al., 1996;Yoshida and Kageyama, 2004;Stemmer et al., 2006;Choblet et al., 2007;Tackley, 2008;Zhong et al., 2008) test our solution strategy for all three governing equations.The first class of cases are for tetrahedral symmetry at Ra = 7 × 10 3 , while the second and third set of cases are for cubic symmetry at Ra = 3.5 × 10 3 and Ra = 1 × 10 5 respectively.Results are compared with those of previous investigations.We specifically examine Nusselt numbers at the surface (Nu t ) and base (Nu b ) of the shell, and the mean global non-dimensionalized RMS velocity ( V RMS ): In the above equations, r t , r b , Q t and Q b are the upper and lower radii and non-dimensional heat fluxes, respectively.Q is non-dimensionalized by k T /d, where k and T are the thermal conductivity and temperature contrast across the mantle depth, d = r t − r b .u is the non-dimensionalized velocity, non-dimensionalized via u = ud/κ, with κ denoting the thermal diffusivity.Nusselt numbers are determined by solving the time-dependent energy equation until the relative variation in the Nusselt number between five consecutive time-steps is < 10 −5 .RMS velocities are calculated once Nusselt numbers have converged.Results are presented in Table 1, with representative plots of the thermal fields in Fig. 5.  1.
When examining the results of fully uniform cases, we observe an excellent agreement with a range of other studies, demonstrating that TERRA is robust and accurate for this particular class of problem.Results for non-uniform cases are also consistent with previous studies, indicating that the revised methodology is valid.A comforting observation is the small difference between upper and lower boundary Nusselt numbers, indicating that the modified scheme is globally conservative.These results, along with those presented in Sect.3.1, demonstrate that the code and, hence, the new techniques, are valid for this class of problem.

Parallel efficiency and computational cost
The parallel performance and strong scaling of TERRA and the non-uniform extension is next examined.TERRA's parallel implementation is enabled by MPI (see Bunge and Baumgardner, 1996 for a full description).In brief, the spherical shell is decomposed into smaller subdomains and spread across a number of processes.The first step is to divide each icosahedral diamond into a series of subdiamonds/subdomains.As noted in Sect.(Bercovici et al., 1989), R96 (Ratcliff et al., 1996), Y04 (Yoshida and Kageyama, 2004), S06 (Stemmer et al., 2006), C07 (Choblet et al., 2007), T08 (Tackley, 2008), Z08 (Zhong et al., 2008).D13 is the present study.U/NU represents the grid configuration, with U being uniform and NU non-uniform.# Nodes denotes the total number of nodes, with a resolution of r × (θ × φ) in radial (and lateral) direction.For non-uniform cases, fine and [coarse] nodal resolutions are separated, using square brackets.grid intervals along an icosahedral diamond edge is referred to as mt.A second parameter is used to define the size of the subdomains: nt -the number of grid intervals along the edge of a local subdomain.The values of mt and nt must be such that mt is a power of 2 and nt is also a power of two less than or equal to mt.The next step in the decomposition is to select the number of subdomains to distribute to each process.This is defined by the parameter nd -the number of diamonds from which subdomains will be mapped onto the processes.nd can have a value of 5 or 10; if nd = 5 only Northern Hemisphere di-amonds are mapped onto the first half of the processes and Southern Hemisphere diamonds to the second half.If nd = 10 each process owns one subdomain from each of the ten diamonds (see Fig. 6).To finalize the domain decomposition, subdomains are extended throughout the radial dimension, from Earth's surface to the Core-Mantle-Boundary (CMB).This procedure is identical for all grid configurations (i.e.uniform and non-uniform).The number of processes is thus given by (mt/nt) 2 ×(10/nd).For parallel efficiency calculations we consider a symmetric, cubic flow, at Ra = 1 × 10 5 , identical to the final case examined in Section 3.2.We compute the problem for 100 time steps.Calculations are carried out at a variety of problem sizes (i.e.resolutions), using between 4 and 4096 cores (i.e.processes/CPU's) on HECToR, the UK national supercomputing service.Figure 7a illustrates the results, showing, as expected, faster execution with a larger number of CPUs.A reduction in elapsed time and, thus, a good improvement in speed ("speedup") is observed for all configurations.In addition, all configurations follow a similar pattern, thus demonstrating that the non-uniform configuration integrates well with TERRA's domain decomposition strategy, maintaining the equal load balance properties of the original scheme.A selection of results summarizing speedup and efficiency are displayed in Table 2.Note that results are generally consistent between different problem sizes (i.e.mt -see Fig. 7a) and, hence, only one set of results is presented fully.
The efficiency of the original, uniformly discretized configuration is first examined.If TERRA scaled perfectly, each case would show 100 % parallel efficiency (i.e. for a given problem size, increasing the number of processes by a factor of n would speed up the calculation by a factor of n).However, as expected, that is not the case.Assuming an efficiency of 100 % on 8 CPU's (this problem is too large to run on a single CPU), efficiency decreases to 58.76 % on 512 CPU's (Table 2).Such an observation is easily understood: for a given problem size, as the number of cores increases, there is a tendency for the number of pressure solve iterations to increase, leading to a reduction in computational efficiency.In addition, individual process subdomains extend throughout the radial dimension -they are long and thin, with a large surface area.As the number of CPU's increases, the ratio of surface area to subdomain size increases, leading to greater message passing, which, ultimately, restricts the performance and speedup of the code (Bunge and Baumgardner, 1996).We consistently observe that cases at nd = 5 are more efficient than those at nd = 10, since less inter-process communication is required (see Fig. 6).A reassuring point to note is that as the problem size increases, the amount of work per node at each time step remains reasonably consistent (see Fig. 7a for example, moving from a uniform mesh simulation on 4 CPU's at mt = 32, to a simulation on 32 CPU's at mt = 64, to a simulation on 256 CPU's at mt = 128).This demonstrates that the multigrid achieves its goal of attaining a convergence rate that is independent of the number of grid points.
Focussing now on the modified, non-uniform, discretizations, we see that the expended CPU-time decreases in comparison to uniform cases.This is despite the fact that the number of pressure solve iterations increases by, on average, 25 %.The observed speedup is therefore largely due to a reduction in the number of nodes (or degrees of freedom) and, hence, the number of calculations.In addition, there is an increase in parallel efficiency to 67.56 % on 512 CPU's with non-uniform cases, implying a better balance between communication and processing, when compared to uniform grids (see Table 2).If one neglects the aforementioned differences in parallel efficiency, it is possible to estimate a maximum theoretical speedup for non-uniform configurations, based on the ratio between the number of nodes involved in non-uniform and uniform calculations.The performance of the non-uniform grid configuration is displayed in Fig. 7b.Although results fall short of the maximum theoretical speedup, performance improves as the problem size increases.Such behaviour is to be expected: the implementation involves interpolation of values to and from ghost nodes and calculations across these ghost nodes.Consequently, computational overheads arise.However, as grid resolution increases, the boundary band of ghost layers makes up a smaller percentage of the computational domain (the number of radial layers in the calculation increases, but the number of ghost layers remains fixed) and, hence, the computational overhead decreases (see Fig. 7c).

Memory
The total memory requirements for uniform and non-uniform grid configurations are presented in Table 3.For uniform cases, as noted previously, the memory addressed should theoretically increase by a factor of ≈ 8 with successive refinements.However, the practical memory requirements vary from this idealized value.In moving from mt = 32 to mt = 64, the amount of memory addressed increases by a factor of ≈ 6. Moving from mt = 64 to mt = 128, from mt = 128 to mt = 256 and from mt = 256 to mt = 512 requires ≈ 7.2, ≈ 7.6 and ≈ 7.9 times more RAM, respectively.These variations are caused by fixed static memory allocation in a number of TERRA's arrays, which leads to larger overhead at coarser resolutions.
Table 2. CPU-time for different grid configurations across a range of cores, with the domain decomposed according to nt -the number of grid intervals along the edge of a local subdomain, and nd -the number of diamonds mapped to a local process.The speedup factor is calculated relative to the 8-core simulation (the problem was too large to run on a single core), whilst the efficiency is calculated from the following formula: speedup factor/(# cores/8).Note that whilst we only present the results for cases at mt= 128, the observed trends are consistent across different problem sizes.For nonuniform cases, mt denotes the lateral resolution in fine regions of the domain.We observe faster execution using a larger number of CPUs, as expected.Interestingly, variations are observed between different domain decompositions, with nd = 5 cases (circles) being more efficient than nd = 10 (stars).Due to a reduction in the number of nodes, the expended CPU-time decreases in moving from uniform to non-uniform configurations.Results for all configurations follow a similar pattern, illustrating that the non-uniform configurations integrate well with TERRA's parallel domain decomposition strategy; (b) the speedup attained when utilizing a non-uniform grid configuration, compared to an estimate of the maximum theoretical speedup, based purely upon a ratio between the number of nodes in each case.As grid resolution increases, a greater speedup is observed, converging towards the theoretical maximum; (c) the computational overhead of using a non-uniform grid configuration -the overhead decreases as the problem size increases.
To test the numerical implementation of the non-uniform cases, we compare the actual memory requirements with those predicted by simple scaling relationships.One can estimate that a non-uniform, mt = 512/256 case, incorporating lateral and radial refinement in the upper 25 % of the shell, should theoretically require ≈ 181.63 Gb of RAM (i.e. a fac-tor of ≈ 3 greater than a uniform mt = 256 case).The practical memory requirement of ≈ 197.45 Gb is therefore exceptional, demonstrating that the scheme has been implemented efficiently.The minor overheads are caused by ghost nodes at the fine/coarse interface.As discussed in the previous section, these overheads decrease with increasing problem size.

Conclusions
This paper has explored the potential of hierarchical geometric multigrid refinement techniques for 3-D spherical mantle convection codes.The methodology, based around the application of a multigrid solver on non-uniform, structured grids, yields highly efficient solutions to multi-resolution problems, providing significant benefits for global 3-D spherical mantle convection simulations: localized variations in resolution are possible, negating the need for complete global refinement.Consequently, computational resources are exploited more efficiently.The technique is conceptually simple and, perhaps most importantly, straightforward to implement within preexisting mantle convection codes.The proposed methodology has been validated and an excellent agreement is observed with analytical results and those from a wide-range of other studies.Results also demonstrate that TERRA, the code utilized in examining the multigrid refinement procedures, is robust and accurate for the class of problems examined herein.
It is important to emphasize that the refinement strategies presented allow simulations of global 3-D spherical mantle convection, with a lateral resolution of ≈ 14 km at both boundaries, on a system with ≈ 200 Gb of RAM.When compared to standard, uniform configurations, the memory footprint is therefore reduced by a factor of ≈ 3, whilst typically, simulations require a factor of ≈ 2.5 less CPU-time.Consequently, although the scheme may be less beneficial than the fully adaptive techniques currently under development (e.g.Davies et al., 2011;Kronbichler et al., 2012;Burstedde et al., 2013), it will allow pre-existing codes to examine more challenging problems than have previously been possible (indeed, it has already done so, as demonstrated by Davies and Davies, 2009).Given the amount of investment that has gone into these codes, a method such as that presented, which will extend their lifetime and applicability, is a worthwhile and significant development.Indeed, with such capabilities, global 3-D spherical mantle convection simulations, at Earthlike convective vigour, will no longer be restricted to individuals / institutions with the largest and most advanced computational facilities, as has previously been the case.
Fig. 1. (a)An example of the hierarchy of uniform grids used in regular geometric multigrid cycles.A standard bisection refinement rule is employed; each quadrilateral element is split into four elements at the next grid level; (b) a non-uniform grid and the uniform levels it is made of.In essence, a non-uniform grid is a union of uniform sub-grids.However, unlike the traditional grids utilized in a multigrid, the sub-grids do not necessarily extended over the same domain.

Fig. 2 .
Fig. 2. (a)The reference grid configuration implemented in this study.The final solution is derived from distinctive local grids, with high resolution in the upper half of the spherical shell and coarser resolution in the lower half.(b) A radial section, drawn along A-B, illustrating the final non-uniform solution grid.(c) The multigrid solution process, illustrated for a four level cycle (grids h − 8h).Grid level h is a local fine-grid that spans the upper half of the mantle.Nodes shaded in red are those intrinsic to the final solution (s nodes).r and t nodes are utilized during fine-grid calculations.They ensure solution accuracy on the fine-grid and solution continuity during inter-grid projection.n nodes would occur in the regular multigrid formulation, but do not exist in the modified formulation.Grid level 2h is a global grid, covering the whole mantle.As with grid level h, nodes shaded in red are part of the final solution (s nodes).Conversely, nodes shaded in grey (c nodes) are only utilized during the multigrid process; they do not explicitly contribute to the final solution.Grids 4h and 8h are further global grids, which are involved in the multigrid solver but do not explicitly contribute to the final solution.Black arrows denote inter-grid projection.These are reversed for coarse-to-fine grid interpolation, whilst dashed orange arrows are also applicable.Note, grid resolution is decimated for illustrative purposes.

Fig. 3 .
Fig. 3. (a)The problem: one coarse-grid element interfacing with two fine-grid elements.The location of genuine unknowns (i.e.unknowns that are associated with an approximation to the governing differential equation), s nodes, is shown for both fine and coarse grids.(b) The problem, modified to show the r nodes which are introduced for computational convenience.These nodes are not unknowns in our system of equations, but dummy nodes that are introduced to allow consistent solution derivation at all genuine nodes.With just one boundary layer (the r nodes), fine-grid solution continuity will not be satisfied during inter-grid projection: the r nodes would act as boundary values during fine-grid calculations and would not be updated.Consequently, the values projected from the fine-grid to the encircled c nodes (coarse grid nodes utilized during the multigrid process that do not explicitly contribute to the final solution), during the multigrid solution process, would be derived from nodal solutions with both fine and coarse-grid accuracy.Accordingly, a second layer of dummy nodes, the t nodes, are introduced, as illustrated in part (c).Their inclusion and the subsequent updating of r nodes during fine-grid calculations, ensures solution continuity during inter-grid projection.In summary, s nodes are the genuine nodes, intrinsic to the final solution.r and t nodes are utilized during fine-grid calculations and inter-grid projection, whilst c nodes represent coarse-grid nodes, that are integral to the overall multigrid process, but do not explicitly contribute to the final solution.

Fig. 4 .
Fig. 4.Relative errors in numerical predictions with respect to semi-analytical solutions (e.g.Hager and O'Connell, 1981;Richards and Hager, 1984) for: (a) normalized harmonic coefficients for poloidal velocity at the surface and CMB (combined); (b) surface and CMB topography (combined); and (c) total geoid; for a range of uniform (circles) and non-uniform (stars) grids, at spherical harmonic degrees 2, 4, 8, and 16.Note that continuous lines connect the results from uniform cases.

Fig. 6 .
Fig.6.Subdomain process mapping in TERRA for a case where mt/nt = 2.In (a) nd = 10, while in (b) nd = 5.The diamonds have been projected on to a flat surface and solid black lines define their boundaries, while dashed lines represent subdomain boundaries.The number within each subdomain denotes the MPI rank of the process to which the subdomain is mapped.
Fig. 7. (a)Parallel performance of TERRA; the elapsed computational time as a function of the number of CPU's, at a range of different scales, utilizing different grid configurations -red and blue lines denote uniform and non-uniform configurations respectively.For nonuniform cases, mt denotes the lateral resolution in fine regions of the domain.We observe faster execution using a larger number of CPUs, as expected.Interestingly, variations are observed between different domain decompositions, with nd = 5 cases (circles) being more efficient than nd = 10 (stars).Due to a reduction in the number of nodes, the expended CPU-time decreases in moving from uniform to non-uniform configurations.Results for all configurations follow a similar pattern, illustrating that the non-uniform configurations integrate well with TERRA's parallel domain decomposition strategy; (b) the speedup attained when utilizing a non-uniform grid configuration, compared to an estimate of the maximum theoretical speedup, based purely upon a ratio between the number of nodes in each case.As grid resolution increases, a greater speedup is observed, converging towards the theoretical maximum; (c) the computational overhead of using a non-uniform grid configuration -the overhead decreases as the problem size increases.

2013 1102 D. R. Davies et al.: Geometric multigrid refinement techniques for mantle convectionTable 1 .
Thermal amplitude benchmark comparisons for an isoviscous fluid.Abbreviations in the first column refer to the studies used for comparison: B89 2.2.1, the number of www.geosci-model-dev.net/6/1095/2013/Geosci.Model Dev., 6, 1095-1107, The respective discretization method (DM) is listed, where SP indicates spectral, FE finite element, FD finite differences and FV finite volume.V RMS denotes the mean non-dimensionalized RMS velocity, whilst Nu t and Nu b represent the upper and lower boundary Nusselt numbers, respectively.Note that, theoretically, Nu t and Nu b should be equal.

Table 3 .
Memory requirements for different grid configurations.Non-uniform cases incorporate refinement in the upper 750 km (or ≈ 25 %) of the spherical shell.Note that the theoretical RAM is calculated via the following formula, using the NU 64/32 case as an example: (nodes mt64/32/nodes mt32)×RAM mt32.