CUDA-C implementation of the ADER-DG method for linear hyperbolic PDEs

We implement the ADER-DG numerical method in the CUDA-C language to run the code on a Graphics Processing Unit (GPU). We focus on solving linear hyperbolic partial differential equations, where the method can be expressed as a combination of precomputed matrix multiplications, making it a good candidate for GPU hardware. Moreover, the method is arbitrarily high-order and involves intensive work on local data, a property that is also beneficial for the target hardware. We compare our GPU implementation against CPU versions of the same method, observing similar convergence properties up to a threshold where the error remains fixed. This behaviour is in agreement with the CPU version, but the threshold is larger than in the CPU case. We also observe a big difference between single and double precision, where in the first case the threshold error is significantly larger. Finally, we did observe a speed up factor in computational time, but this is relative to the specific test or benchmark problem.


Introduction
Many scientific research directions rely heavily on simulation-based knowledge gain, because experiments are either too costly or not possible. In those areas where mechanical or fluid-dynamical processes play a role, these simulations often consist of numerical solutions of partial differential equations (PDEs). In particular, the time required to obtain the solution is of large importance in the quest for ever more reliable and more accurate research results. While fast and accurate computations can be achieved by advancement of algorithmic approaches, new hardware has also helped to solve ever larger and more complex problems. Here we concentrate on the second approach and employ specialised hardware for reducing computational time. In particular, a General Purpose Graphics Processing Unit (GPGPU) (Owens et al., 2007) demands carefully optimised algorithms but offers large gains in computational performance.

The NVIDIA Tesla C2070 card, for example, has a peak double precision floating point performance of 515 Gflops. Graphics Processing Units (GPUs) were originally developed to assist the Central Processing Unit (CPU) of workstation type computers in processing all the data related to graphics. GPUs are based on the Single Instruction, Multiple Data (SIMD) programming paradigm, where one instruction is used to compute a large set of similarly structured but distinct data, creating an important source of parallelism and therefore reducing computational time. In recent years, the developers of GPUs realised that scientific computation could benefit from this hardware architecture to obtain faster numerical solutions at low cost and started producing the GPGPUs together with a more user-friendly software interface.
The availability of standardised software and hardware triggered a number of new GPU implementations of numerical methods solving the Navier-Stokes (Asouti et al., 2011), linear elasticity (Komatitsch et al., 2010; Rietmann et al., 2012; Mu et al., 2013), or shallow water (Brodtkorb et al., 2012; de la Asunción et al., 2012) equations, among others. The leading facilities in the current TOP500 list of supercomputers comprise GPGPU architectures (Meuer et al., 2012), and it is foreseeable that these types of machines will prevail in the coming years.
In this manuscript we present a numerical implementation of the ADER-DG (Käser and Dumbser, 2006; Castro et al., 2010) numerical method on the CUDA (formerly Compute Unified Device Architecture) parallel programming model from NVIDIA, in particular in C with CUDA extensions. Other language options are e.g. OpenCL, CUDA-Fortran and CUDA-Python; we focus on C-CUDA because it is open and commonly used. We use the current generation Fermi-based Tesla card C2070, which supports double precision IEEE 754 standard floating point arithmetic, to solve a hyperbolic linear partial differential equation with variable coefficients of the form

∂u(x, t)/∂t + ∂f(x, t)/∂x + ∂g(x, t)/∂y = 0 , (1)

with f(x, t) = a(x)u(x, t) and g(x, t) = b(x)u(x, t) the flux functions in x and y directions, respectively. In vectorial notation,

∂U(x, t)/∂t + ∂F(x, t)/∂x + ∂G(x, t)/∂y = 0 , (2)

with F(x, t) = A(x)U(x, t) and G(x, t) = B(x)U(x, t) the flux vectors in x and y direction, respectively. We use an arbitrarily high-order in space and time Discontinuous Galerkin numerical method; arbitrarily in the sense that the order of the method is an input parameter of the implementation and there is no theoretical limit for it. In practice, the maximum achievable order depends on machine accuracy and is strongly limited by memory and CPU time resources. PDE systems of the form (2) represent (among others) the advection of a vector U(x, t) as a consequence of a velocity field given by the entries of the matrices A(x) and B(x). This velocity field is space dependent and is not subject to a divergence-free constraint. Equation (2) is used for tracer advection or for linear elasticity in its velocity-stress formulation.
The mentioned ADER-DG numerical method has several desirable properties: it is high-order accurate in space and time, so numerical errors are minimized; it can be implemented on unstructured meshes, which allows us to consider complex geometries; it makes use of a reference spatial coordinate system, in which the basis functions are defined, so many integrals can be pre-computed; and it uses a small stencil to communicate with neighbour elements, requiring only direct neighbours for the flux computation independent of the applied order. In summary, the discrete version of the numerical method can be formulated as a set of matrix-matrix multiplication steps and is therefore a good candidate to be implemented on GPU hardware.
It should be noted that the presented method is capable of utilizing non-uniform and unstructured meshes. While in Sect. 4 we present results for triangular meshes, the method works for both triangular and quadrilateral elements. This is in contrast to many CUDA implementations that exploit the simplicity of structured, uniform rectangular meshes and the corresponding computational stencils.

The manuscript is organized as follows: in Sect. 2 we describe the numerical method; in Sect. 3 we briefly describe the hardware and explain the data structure that best fits the GPU architecture; in Sect. 4 we show numerical results where we compare and assess the new CUDA implementation; and finally in Sect. 5 we discuss the results and draw conclusions.

Numerical method
In this section we describe the derivation of the numerical method, which follows previous developments presented in the literature by Dumbser (2005); Käser and Dumbser (2006); Castro et al. (2010); Pelties et al. (2010); Hermann et al. (2011); Pelties et al. (2012), among others. The numerical method is constructed based on a spatial discretization of the physical domain Ω ⊂ R², considering a conforming discretization where we use super-index i to identify the element (triangle or quadrangle) E_i ∈ Ω. The p-th component of the unknown vector U(x, t) is approximated using a linear combination of time-dependent degrees of freedom (dof) û_pl(t) and space-dependent basis functions φ_l(ξ, η), which are defined in a reference coordinate system (ξ, η). See Appendix A for details on the basis functions and the mapping from the physical to the reference space coordinate system. For x ∈ E_i we approximate the p-th component of U(x, t) as

u_p(x, t) ≈ û_pl(t) φ_l(ξ, η) , (3)

with N_l the number of basis functions used to approximate the continuous function u_p(x, t), defined in two dimensions by

N_l = O(O + 1)/2 , (4)

where O is the approximation order of the numerical method. In what follows we drop the space and time dependences to simplify notation.
Taking the p-th component of Eq. (2), multiplying by the basis function φ_k, which is in the same space as φ_l, and integrating in space over the i-th element yields the weak form of the equation. Naming the Del operator ∇ = [∂_x, ∂_y] in Cartesian coordinates and ∇_ξ = [∂_ξ, ∂_η] in the reference coordinates, and using the divergence theorem, we obtain a formulation consisting of a boundary (flux) integral and a volume integral. Using Einstein notation we write h_p = (f_p, g_p) = (A_pq, B_pq) u_q. Expanding the derivatives in the second term of Eq. (7) with the chain rule and ordering them (Eq. 9), the flux written in the reference space coordinates becomes h*_p = (ξ_x A_pq + ξ_y B_pq, η_x A_pq + η_y B_pq) u_q = (A*_pq, B*_pq) u_q, where we use the * upper index to identify the matrices written in the reference space coordinates. From Eq. (2) we see that A(x) and B(x) are spatially dependent matrices, therefore we can approximate them in the same manner as in Eq. (3), writing A_pq(x) ≈ Â^i_pqm φ_m(ξ, η) and B_pq(x) ≈ B̂^i_pqm φ_m(ξ, η), where O_m ≤ O is the approximation order for the matrices. This means that we can change the approximation order used to represent the wind field for the advection equation or the material properties in the elastic wave equation.
See Castro et al. (2010) for sub-cell resolution approximation.
Assuming that the Jacobian of the mapping J = ∂(x, y)/∂(ξ, η) is constant inside element E_i (see Appendix A for details on the mapping) we can easily compute h*_p, whose terms are obtained from Eqs. (9) and (10). Now we substitute Eqs. (3) and (11) into Eq. (8) and express the volume integrals in the reference element E_R, with the determinant |J| appearing due to the change of the integration domain, with pre-computed matrices. The boundary integral in Eq. (12) is the flux contribution from the neighbour elements and it is computed using a numerical flux function.
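Since the volume contribution is a contraction of precomputed matrices with the element-local dofs, the structure of one volume-integral step can be sketched in plain C for the constant-coefficient case (O_m = 1). The function name, the flattened array layouts and the stiffness-matrix names `Kxi`/`Keta` are illustrative, not the paper's actual data structures.

```c
/* Minimal CPU sketch of the volume-integral step: contraction of
   precomputed stiffness-like matrices in reference coordinates
   (Kxi, Keta) and the star matrices A*, B* with the element dofs.
   Constant-coefficient case (O_m = 1) assumed; layouts are row-major. */
void volume_update(int nComp, int nDof,
                   const double *Astar,  /* [nComp x nComp]      */
                   const double *Bstar,  /* [nComp x nComp]      */
                   const double *Kxi,    /* [nDof x nDof]        */
                   const double *Keta,   /* [nDof x nDof]        */
                   const double *u,      /* dofs   [nComp x nDof] */
                   double *du)           /* output [nComp x nDof] */
{
    for (int p = 0; p < nComp; ++p)
        for (int l = 0; l < nDof; ++l) {
            double acc = 0.0;
            for (int q = 0; q < nComp; ++q)
                for (int m = 0; m < nDof; ++m)
                    acc += Kxi [l*nDof + m] * Astar[p*nComp + q] * u[q*nDof + m]
                         + Keta[l*nDof + m] * Bstar[p*nComp + q] * u[q*nDof + m];
            du[p*nDof + l] = acc;
        }
}
```

The nested loops make the matrix-matrix multiplication structure explicit; in the actual GPU kernels this contraction is what each thread block evaluates for its element.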

Numerical flux
The numerical flux for a linear PDE can be expressed as a linear function of the unknown vectors from the two adjacent elements, U_i and U_ij, with j counting the direct neighbour elements as indicated in Fig. 1, where λ⁺_A denotes the maximum absolute eigenvalue of the matrix A (Rusanov, 1970). The same procedure is used for the matrix B.
Using the Rusanov flux, the numerical flux function is expressed as a linear combination of the two states with

S⁺ = I |n_x λ⁺_A + n_y λ⁺_B| , (15)

and I the identity matrix. Writing Eq. (15) in tensor notation gives Eq. (16). Before substituting the Jacobian approximations (10) into Eq. (16), we choose to evaluate λ⁺_A and λ⁺_B in the numerical flux function based on the mean value of the corresponding matrices. As a consequence, S⁺_pq is not space dependent and only varies on the edge considered. For implementation purposes we still use a polynomial expansion of it, knowing that all but the first degree of freedom Ŝ⁺_pqm are zero. The boundary integral in Eq. (12), responsible for the flux computation, is then discretised as a sum over the edges, where |S_j| is the length of the j-th edge and N_s is the number of edges, depending on triangular or quadrilateral elements. The boundary integrals over the j-th edge are precomputed in the reference space coordinates using the parameter χ_j ∈ [0, 1]. Note that F^{j,0}_klm is responsible for the flux contribution from the element E_i itself, while the second integral considers the contribution of the neighbour element, as the neighbour degrees of freedom enter it through φ^{ij}_l(1 − χ_j). Now the semi-discrete scheme (20) is written emphasizing the remaining time dependence.

Integrating Eq. (20) in time, we obtain the discrete scheme to update the numerical solution û^{i,n+1} from time level t = t^n to time level t = t^{n+1} = t^n + ∆t. The time step ∆t is defined from the stability condition ∆t = CFL ∆x/S_max. The time integral of û^i_pl(τ) is obtained from the Cauchy-Kowalewski procedure explained in the following section.
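The stability condition above can be sketched as a small helper that returns the global stable time step; the per-element minimisation mirrors the CFL condition restated in the numerical results section. Function and array names are illustrative.

```c
#include <float.h>

/* Global stable time step from the CFL condition: dt = CFL * dx_i / S_i
   minimised over all elements, with dx_i the distance from the
   barycentre to the edges and S_i the maximum wave speed in element E_i. */
double stable_dt(int nElem, const double *dx, const double *smax, double cfl)
{
    double dt = DBL_MAX;
    for (int i = 0; i < nElem; ++i) {
        double dti = cfl * dx[i] / smax[i];
        if (dti < dt) dt = dti;
    }
    return dt;
}
```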

Time integration
This time integration scheme is based on a local space-time expansion that is valid inside the element E_i for one time step ∆t and is constructed as follows. Writing Eq. (2) in the reference space coordinates, multiplying by the basis function φ_k and integrating over the reference element, and then introducing the polynomial approximations defined in Eqs. (3) and (10), we find an expression for the first time derivative of the degrees of freedom. Applying this algorithm recursively, we can estimate time derivatives of any order. Now we can approximate the time evolution of the numerical solution for τ ∈ [0, ∆t] using a Taylor expansion

û^i_pl(τ) ≈ Σ_{k=0}^{O−1} (τ^k / k!) ∂^k_t û^i_pl(0) , (25)

where O is the order of the numerical method. The time integral of the degrees of freedom is easily computed from Eq. (25) to obtain Eq. (26). Using Eq. (26) we can update the numerical solution (21) in one single step from t = t^n to t = t^{n+1}.
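For a linear semi-discrete system the recursive derivative computation and the Taylor time integral can be sketched compactly: if the scheme is written as dU/dt = L U for some matrix L collecting the precomputed operators, repeated application of L gives the time derivatives, and integrating the Taylor series term by term gives the ADER time integral. The operator L and the fixed buffer size here are illustrative assumptions.

```c
#include <string.h>

/* Sketch of the Cauchy-Kowalewski/ADER time integral for dU/dt = L U:
     int_0^dt U(tau) dtau  ~=  sum_{k=0}^{order-1} dt^{k+1}/(k+1)! * L^k U0.
   'n' is the number of dofs (assumed <= 64 for the local buffers). */
void ader_time_integral(int n, int order, double dt,
                        const double *L,   /* [n x n] operator  */
                        const double *U0,  /* [n] dofs at t = 0 */
                        double *Uint)      /* [n] time integral */
{
    double deriv[64], next[64];
    memcpy(deriv, U0, n * sizeof(double));
    double coeff = dt;                     /* dt^{k+1}/(k+1)!, k = 0 */
    for (int i = 0; i < n; ++i) Uint[i] = coeff * deriv[i];
    for (int k = 1; k < order; ++k) {
        for (int i = 0; i < n; ++i) {      /* next = L * deriv */
            next[i] = 0.0;
            for (int j = 0; j < n; ++j) next[i] += L[i*n + j] * deriv[j];
        }
        memcpy(deriv, next, n * sizeof(double));
        coeff *= dt / (double)(k + 1);
        for (int i = 0; i < n; ++i) Uint[i] += coeff * deriv[i];
    }
}
```

For the scalar case L = 1 the result approaches exp(∆t) − 1, which gives a quick sanity check of the truncated series.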

Sub model
In the previous sections we developed the numerical method to solve Eq. (2), which represents, for example, the advection of a tracer driven by a velocity field that can accelerate and therefore is spatially dependent and not divergence-free. In geophysical science we find another very interesting partial differential equation that represents the propagation of seismic waves, written in the form of the linear elastic wave equation. In its velocity-stress formulation (Virieux, 1984, 1986) it is a linear hyperbolic problem which cannot be written in conservative form; instead it is formulated in the quasi-conservative form, as presented in Käser and Dumbser (2006); Castro et al. (2010),

∂U(x, t)/∂t + A(x) ∂U(x, t)/∂x + B(x) ∂U(x, t)/∂y = 0 . (27)

Here, the matrices A(x) and B(x) appear outside the respective spatial derivatives. In the limit case where A(x) ≡ A and B(x) ≡ B are constant, Eqs. (2) and (27) are equivalent. Otherwise, they represent a different problem and therefore have different solutions. One can move between both formulations making use of a source term; see LeVeque (2002) for more details on variable-coefficient linear equations. Here we briefly demonstrate the modifications required to accommodate solving Eq. (27). It is possible to write Eq. (27) in conservative form by adding the missing part of the conservative products and therefore introducing a new source term. The final equation, which is equivalent to Eq. (27), reads

∂U(x, t)/∂t + ∂F(x, t)/∂x + ∂G(x, t)/∂y = (∂A(x)/∂x + ∂B(x)/∂y) U(x, t) ,

with F(x, t) = A(x)U(x, t) and G(x, t) = B(x)U(x, t).
The introduction of the right hand side source term implies two modifications to our numerical method. First, the time integration scheme from Eq. (24) is replaced by its source-augmented counterpart. Second, the update Eq. (21) needs to be modified to incorporate the source contribution in û^{i,n+1}.

Full In a more detailed view the programming primitives are defined by grids, blocks and threads as shown in Fig. 3 where finally the CUDA-C functions, called kernels, run.
Besides this, we need to consider resources like global memory, shared memory and registers, among others, which are limited in size and bandwidth/latency; see NVidia (Consulted Nov 2012b) for more details.
Once an algorithm is written in a kernel, multiple instances of the same kernel are executed acting on different data (in accordance with the SIMD model). At runtime, each of these instances is called a thread, and together they define the thread block. A thread block is assigned to an SM for execution, and inside this block the shared memory is available to all threads. The collection of all blocks defines the grid, as shown in Fig. 3. In the same code line where a kernel is called, the CUDA extension symbols "<<<" and ">>>" are used to define the dimensions of the grid and the block, as shown in Fig. 5.

Data structure
In order to use the GPU architecture we need to design a data structure such that the numerical algorithms run efficiently. In this manuscript we assume that the main time loop of our computation can be entirely coded and executed in the GPU memory, without copying data between CPU and GPU memory at runtime. Moreover, we mainly consider locality of the data as our means of optimisation. Other, more sophisticated approaches can deal with data alignment, cache miss minimization and many more, as explained by NVidia (Consulted Nov 2012a). In Eq. (3) we introduced the polynomial expansion for the unknown vector U_p, where p stands for the p-th component of the PDE, approximating the solution with a linear combination of basis functions φ_l and degrees of freedom û_pl, with l representing the corresponding basis function. The numerical method makes intense use of the degrees of freedom in the update scheme of one element, as expressed in Eq. (21). In order to use the dofs inside our GPU algorithm, we store them in one single array of dimension 2. We call this array dof and its size is [nDof_l, nComp x nElem].
The first index corresponds to the number of degrees of freedom used to expand the unknown vector, while the second index is obtained from the number of components of the PDE times the number of elements in the mesh.
In this form, when this array is stored in GPU global memory, the degrees of freedom associated with element E_i occupy the positions nDof_l * nComp * (i-1) ... nDof_l * nComp * (i) and are therefore adjacent in memory. We follow the same approach for the degrees of freedom of the Jacobian matrices in Eq. (11).
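The layout described above can be expressed as a single index computation; a sketch with zero-based C indices (the description in the text is one-based), where the function name is illustrative:

```c
#include <stddef.h>

/* Flattened index into the dof array of size [nDof_l, nComp x nElem]:
   the basis-function index il runs fastest, so all dofs of one element
   are contiguous in memory, as described in the text. */
size_t dof_index(size_t il, size_t ip, size_t iElem,
                 size_t nDofL, size_t nComp)
{
    return il + nDofL * (ip + nComp * iElem);
}
```

With this layout one element's block of `nDofL * nComp` values is loaded with unit stride, which is the locality property the data structure is designed for.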

GPU algorithm
The CUDA algorithm used to implement the ADER-DG numerical method is organised as a shared library, so that we can recycle most of our existing Fortran code. The new CUDA code is structured in two parts. The first one is the driver, which is called by the Fortran code after all preprocessing is performed. The driver is responsible for preparing the data structure so as to optimise the GPU memory, and for uploading all required information into the GPU memory. Schematically this algorithm is represented in Fig. 5. We use two kernels to implement the numerical method. The first one is called TimeIntegrateDof and is responsible for the time integration step (26) and the volume integral. The second kernel is called FluxComputation, which depends on the previous one, and is responsible for the numerical flux computation, Eq. (18). The kernel definition is designed such that we map each element E_i to a thread block in a 2-D array. Each block is then defined to map the degrees of freedom û^i_lp inside each element, as graphically represented in Fig. 3. The parameters dimGrid and dimBlock (see Fig. 5) are defined during run time in order to optimize the kernel execution. Moreover, we use two constants that are hard coded in a C header file such that we can define the size of the shared memory used. These two parameters are the number of components of the PDE and the number of dof given by Eq. (4).

#define MAX_nCOMP 5
#define MAX_nDOF 21
In this case we define them for the seismic wave equation and a sixth order method in two spatial dimensions. To construct more general software one could consider using templates, but this is outside the scope of the current work.
Using this configuration, we use the C-CUDA intrinsic variables gridDim, blockIdx and threadIdx to identify the element (iElem), the component (ip) and the degree of freedom (il) inside each kernel:

iElem = blockIdx.y * gridDim.x + blockIdx.x;
ip = threadIdx.x;
il = threadIdx.y;

The reader can find the implementation of these kernels as complementary material to this manuscript.
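The mapping above can be checked on the CPU with a small emulation of the block coordinates; the `dim2` struct stands in for CUDA's built-in grid/block variables and is purely illustrative:

```c
/* CPU emulation of the element index mapping used inside the kernels:
   each element E_i is assigned to one block of a 2-D grid, and the
   linear element index is recovered from the block coordinates exactly
   as in the kernel code (iElem = blockIdx.y * gridDim.x + blockIdx.x). */
typedef struct { int x, y; } dim2;

int element_index(dim2 gridDim, dim2 blockIdx)
{
    return blockIdx.y * gridDim.x + blockIdx.x;
}
```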

Memory consumption
Because our scope is to run this implementation completely inside the GPU, we need to make sure that the available memory is capable of storing all the required information.
To this end we compute the theoretical memory consumption for a 5-component PDE (linear elasticity in 2-D, for example) considering double precision and orders of approximation from first to seventh, that is 1 to 28 degrees of freedom. In Fig. 6 we plot the results, where the horizontal axis is the number of elements and the vertical axis is memory. The maximum GPU memory of the Tesla C2070 cards is 6 GB, therefore we could easily fit up to 3 × 10⁶ elements with fifth (p4) order of accuracy.
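The estimate behind Fig. 6 can be sketched as a back-of-the-envelope calculation. The number of dof-sized buffers kept per element (`nArrays`) is an illustrative assumption here, not the paper's exact accounting:

```c
#include <stddef.h>

/* Rough per-run memory estimate for the element-local degrees of
   freedom: nDof = O(O+1)/2 basis functions in 2-D, nComp components,
   nArrays double-precision buffers per element (dofs, time-integrated
   dofs, update buffer, ...).  nArrays is an assumption. */
size_t dof_bytes(int order, int nComp, long nElem, int nArrays)
{
    size_t nDof = (size_t)order * (order + 1) / 2;
    return (size_t)nElem * (size_t)nComp * nDof * (size_t)nArrays
           * sizeof(double);
}
```

For instance, a fifth order (O = 5) run has 15 dofs per component, i.e. 600 bytes per element per 5-component double-precision array, before counting the Jacobian expansions and flux buffers.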

Numerical results
In this section we discuss the numerical results of our GPU implementation of the ADER-DG scheme. For all results obtained with the implemented algorithm we used a fixed time step obtained from the CFL condition, with S_i the maximum wave speed inside element E_i, ∆x_i measured as the minimum distance from the barycentre to the edges of the element, and CFL = 0.9.
To assess the proposed implementation we perform several test problems based on measuring the order of convergence. The empirically determined order of convergence (with respect to the mesh size) allows us to verify the expected accuracy of the numerical method and to identify code implementation errors. A convergence test is defined by running a simulation on a sequence of refined meshes k (see Fig. 7 for three of them) and comparing the numerical solution at the final time against an exact solution. Note that finer meshes are not self-refinements of coarser meshes.
We start by fixing the order of the numerical method and run the simulation for each mesh. At the final time we measure the error E_k over the complete domain Ω, comparing against an exact solution using a suitable norm, for example L₁, L₂ or L∞. Comparing these errors for different mesh spacings h_k gives us the order of convergence of the numerical scheme via the expression O_k = log(E_k / E_{k+1}) / log(h_k / h_{k+1}).

Convergence test 1 for advection equation: Constant coefficients
Here we test the convergence properties of the CUDA implementation solving the conservative form, Eq. (1), applied to the linear advection equation. We scale the periodic domain to Ω = [−0.5, 0.5] × [−0.5, 0.5] and define a constant speed, setting a_x(x) = 1.0 and a_y(x) = 1.0; see Appendix C for details of the PDE. The final time is t_end = 2.0 such that the numerical solution coincides with the initial condition. The initial condition is defined using a Gaussian function with a = 0.2, (x_0, y_0) = (−0.2, −0.2) and σ_x = σ_y = 0.05. In Fig. 8 we plot the value of u(x = 0.2, y = 0.2, t) in time for different meshes for the fourth order method. In black we have a reference solution computed with the seventh order method and mesh h = 0.025 (the right one in Fig. 7). We also show the difference between the reference solution and the numerical solution, multiplied by a factor of 10, obtained with the fourth order method using the three meshes depicted in Fig. 7. We clearly see that the solution with mesh size h = 0.05 is already very good, with an error of the order of 1 %.
In Fig. 9 we plot the error versus mesh spacing and versus computational time. We obtain the expected order of convergence up to seventh order (p6). We observe what seems to be a threshold around 10⁻¹¹ where the error does not decrease any further as we increase the order of the scheme or refine the mesh. We also observe that for any given error level, higher order schemes are more efficient. The segmented line is the result from the CPU code. The errors measured for the CPU and GPU implementations are equivalent up to machine accuracy and are therefore superimposed on the left side of Fig. 9. On the right side we see a big difference between these two lines, with the CPU implementation being one order of magnitude more expensive. However, from this comparison we cannot conclude any speed up factor, since no special care was taken to optimise the CPU code. A discussion of this fact is postponed to the conclusions section.


Convergence test 2 advection equation: Variable coefficients
This test problem has a spatially dependent coefficient for the velocity field and a periodic domain scaled to Ω = [−0.5, 0.5] × [−0.5, 0.5]. The velocity field is defined using an auxiliary coordinate system (x', y') rotated 45° counterclockwise from the original Cartesian coordinates. In this auxiliary coordinate system we define the velocity field with v_0 = 1.0, α = 0.2, λ = 2/√2 and d(x') the distance of a point to the y' axis. Then this velocity is projected back to the Cartesian coordinates, defining a_x(x) = v(x') cos(π/4) and a_y(x) = v(x') sin(π/4). In Fig. 10 we show the velocity field and the auxiliary coordinate system. The final time t_end is computed such that the solution coincides with the initial condition. As in test problem 1 we use a Gaussian function (33) to define the initial condition, using a = 0.2, (x_0, y_0) = (0, 0) and σ_x = σ_y = 0.05. In Fig. 11 we show the convergence results. We observe the expected order of convergence up to a threshold error level (around 10⁻⁹ for the double precision results) where the error reaches machine accuracy.
This threshold is in the order of 10 −5 for the single precision runs plotted with dashed lines (SP GPU p3m3 and SP GPU p4m4).

Convergence test 3 advection equation: A swirling deformation flow
The following test is adapted from LeVeque (1996) and consists of a velocity field that is space and time dependent. The main idea is to have a swirling deformation flow that after half the period changes direction to recover the initial condition, described by the following functions

a_x(x) = sin²(πx) sin(2πy) g(t) ,
a_y(x) = −sin²(πy) sin(2πx) g(t) . (36)

The function g(t) = cos(πt/T) is responsible for the time dependency, with T = 1.5 the period. The initial condition is a Gaussian function (33) with a = 0.5, (x_0, y_0) = (0.75, 0.5) and σ_x = σ_y = 0.05, and the computational domain is the unit square. We use the same meshes as presented in Fig. 7. For this test we observe the expected convergence measured in the L₂ and L∞ norms. In this case, as the wind field is space dependent, we also require a high-order representation of the matrix coefficients. We also plot in Fig. 7 the numerical solution obtained from a semi-Lagrangian code (Behrens, 1996) with a dashed line.

Convergence test elastic wave equation
This convergence test problem was obtained from Käser and Dumbser (2006). We use a square computational domain scaled to Ω = [−50, 50] × [−50, 50] with periodic boundaries and evolve the initial condition for one period, such that we can compare with the exact solution, in this case the initial condition. We run this test fixing the order of the method, for example second order (P1), over several meshes constructed with an increasing number of elements (decreasing mesh size). Figure 14 depicts the results for the seismic wave equation, obtained using the model presented in Sect. 2.3. On the left we plot error versus mesh size, while on the right error against computing time. We compare single (SP) and double (DP) precision runs of the GPU code and the CPU code from SeisSol (2013). The solid lines are the double precision runs from the second to the sixth order method, where we see that the expected convergence is reached (slope of the lines) and is equivalent to the error level obtained with SeisSol (segmented lines). Note also that at small error levels, in particular for the DP P5 line, a limit is reached where it seems that no further improvements are possible. The same behaviour is observed for the SeisSol code but with a smaller error. Something similar happens for the single precision runs (dotted lines), where the errors behave as expected, following the double precision results up to a threshold where the error remains the same even if we increase the order of the method or further refine the mesh. In the right part we plot the error level against the computational time. We can conclude that for a specific error, say 10⁻⁴, a higher order method is more efficient than a lower order method, as it requires less time to reach the same error level. We also observe that in the region where the single and double precision GPU code generate the same error level there is no speed up for single precision. When comparing computational time between our GPU implementation and the CPU one, we observe that the GPU is faster. This advantage decreases when higher order is used.
The overhead of the GPU is also visible. The run time increases in particular when the work load is low (coarse meshes with a low number of elements) and can even exceed the run time of the CPU version. However, this overhead decreases relatively when more elements have to be computed, so that the GPU is fully loaded and becomes more efficient than the CPU code.

Conclusions and future work
In this manuscript we presented a GPU implementation of the arbitrarily high-order ADER-DG numerical method considering unstructured meshes in two spatial dimensions. We presented a strategy to adapt the memory allocation to the specific hardware. The implemented code can easily be adapted to other linear hyperbolic equations by using the CUDA-C library approach.
We found that the expected order of accuracy is reached for the presented test cases.
We also observe that there is a threshold after which the error cannot be reduced further. This is particularly evident when running the GPU code in single precision mode. For computations where very small errors are required, we suggest using only double precision implementations. With respect to computational time, we observe that the GPU code gives us a speed up factor compared against a CPU implementation. The reduced computational time can be attributed to several components, for example the specialized hardware architecture of the GPUs, the implementation effort spent to optimize the code, and the benchmark used to measure it. With respect to the latter, we must keep in mind that a CPU implementation is subject to optimizations as well, and therefore the relative (CPU vs. GPU) speed up strongly depends on how much effort was invested in coding it. Finally, we note that during the GPU algorithm development and implementation some optimizations were necessary due to hardware constraints, in order to respect the available resources of the graphic cards.
As future work we are considering to implement a parallel GPU simulation and to extend this implementation to three spatial dimensions.

Appendix A

For triangular elements J and its determinant |J| are constant. On quadrangular elements, on the other hand, J and |J| are constant only for parallelograms.

Appendix B: L2 projection
We use the L2 projection at time t = 0 to project the initial condition onto the time-dependent degrees of freedom û^i_pl(t = 0) that represent the numerical solution inside element E_i, for x ∈ E_i. The number of dof used depends on the order of the numerical method that we want to use, with l = 1, . . ., N_d.
We project the Jacobian matrices A_pq and B_pq in the same manner. This is done once at the beginning of the computation for each element E_i and stored. The order of the approximation of these matrices is independent of the order of the numerical method, therefore we use m = 1, . . ., N_d^m basis functions. For x ∈ E_i we use Â^i_pqm and B̂^i_pqm. The star matrices defined in the volume integral (10) are constructed from Â^i_pqm and B̂^i_pqm. We remark that in this work we restrict ourselves to mesh topologies where the Jacobian mapping matrix J is constant inside each element. In practice this means that we can use triangular and quadrangular elements, with the latter needing to be at most parallelograms.

C1 The advection equation
The advection equation is a hyperbolic scalar partial differential equation that can represent the propagation of a tracer driven by a velocity field; see Chapter 2.1 of LeVeque (2002).
with f(x, t) = a_x(x)u(x, t) and g(x, t) = a_y(x)u(x, t). The space dependent velocity field is described by its two components a_x(x) and a_y(x). When the velocity is space dependent, we refer to this as the variable coefficient linear equation; see Chapter 9 of LeVeque (2002).

C2 The elastic wave equation
The elastic wave equation is a hyperbolic partial differential equation that can represent the propagation of seismic waves in an elastic medium. Its velocity-stress formulation is written in terms of the stress components and the velocities in x and y direction, together with the corresponding Jacobian matrices A and B.

Fig. 1. Element E_i and its direct j-th neighbour E_ij. The normal vector n_ij is defined pointing outward from the j-th edge of element E_i.

Fig. 6. Memory requirements versus accuracy order of the numerical method, for single and double precision representation of real numbers.

Fig. 8. Time signal for test 1. The black line is the reference solution obtained with mesh 0.025 and the seventh order method. The purple, blue and red lines were obtained using the fourth order method; we plot the difference against the reference solution, multiplied by a factor of 10.

Conclusions References
Tables Figures

Back Close
Full Discussion Paper | Discussion Paper | Discussion Paper | numerical method on the CUDA (formerly Compute Unified Device Architecture) parallel programming model from NVIDIA, in particular in C with CUDA extensions.Other language options are e.g.OpenCL, CUDA-Fortran, CUDA-Python.We will focus on C-CUDA because it is open and commonly used.We will use the current generation Fermi-based Tesla card C2070, which supports double precision IEEE 754 standard floating point arithmetic, to solve a hyperbolic linear partial differential equation with variable coefficients of the form ∂u(x, t) ∂t + ∂f (x, t) ∂x + ∂g(x, t) ∂y = Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | ); Castro et al. (2010); Pelties et al. (2010); Hermann et al. (2011); Pelties et al. (2012) among others.The numerical method is constructed based on a spatial discretization of the physical domain Ω ∈ R 2 considering a conforming discretization Discussion Paper | Discussion Paper | Discussion Paper | x A pq + ξ y B pq , η x A pq + η y B pq u q = A * pq , B * pq u q .
We insert Eqs. (3) and (11) into Eq. (8) and express the volume integrals in the reference element E_R, with the Jacobian determinant |J| accounting for the change of variables.

For the numerical flux we use λ⁺_A and λ⁺_B based on the mean value of the corresponding matrices. As a consequence, S⁺_pq is not space dependent and only varies with the edge considered; the edge integrals are precomputed in the reference space coordinates using the parameter χ_j ∈ [0, 1].

If the coefficients are constant, the two formulations are equivalent. Otherwise, they represent a different problem and therefore have different solutions. One can move between both formulations making use of a source term; see LeVeque (2002) for more details on variable-coefficient linear equations. Here we briefly demonstrate the modifications required to accommodate solving Eq. (27). It is possible to write Eq. (27) in conservative form by adding the missing part of the conservative products and therefore introducing a new source term.

GPU characteristics, grid, block and thread: in this implementation we use the NVIDIA Tesla C2070 graphic card, which has 14 Streaming Multiprocessors (SM), each with 32 Streaming Processors (SP), also called CUDA cores, represented in Fig. 2. While this is the physical distribution inside the GPU, a developer needs to design the code considering another abstraction model. The general outline of the code to be sent to the GPU follows the SIMD (Single Instruction, Multiple Data) programming model.
Naming the Del operator in Cartesian coordinates ∇ = [∂_x, ∂_y] and ∇_ξ = [∂_ξ, ∂_η] in the reference coordinates, we can write the transformation between both coordinate systems. We use m = 1, ..., N_d basis functions; for x ∈ E_i we use the basis functions mapped from the reference element.

For completeness, here we show the two partial differential equations used to test the ADER-DG algorithm. For the elastic wave equation, matrices A and B are the Jacobian matrices and describe the elastic medium where the waves propagate. The vector U = [σ_xx, σ_yy, σ_xy, u_x, u_y]^T is the unknown vector corresponding to the normal stresses in the x and y directions, the shear stress and the particle velocities, with ρ the density and λ and µ the Lamé constants. See Käser and Dumbser (2006); Castro et al. (2010) for further details.
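As an illustration, the Jacobian A can be assembled from ρ, λ and µ following the standard velocity-stress formulation (entries written here from the general form found in Käser and Dumbser (2006), not copied from the authors' code):

```c
#include <assert.h>
#include <string.h>

/* Jacobian A of the 2-D elastic wave equation in velocity-stress form,
   U = [sxx, syy, sxy, u, v]^T, written as dU/dt + A dU/dx + B dU/dy = 0.
   rho: density; lam, mu: Lame constants. Sketch of the standard
   formulation, not the authors' source code. */
void elastic_jacobian_A(double rho, double lam, double mu, double A[5][5]) {
    memset(A, 0, 25 * sizeof(double));
    A[0][3] = -(lam + 2.0 * mu);  /* d(sxx)/dt couples to du/dx */
    A[1][3] = -lam;               /* d(syy)/dt couples to du/dx */
    A[2][4] = -mu;                /* d(sxy)/dt couples to dv/dx */
    A[3][0] = -1.0 / rho;         /* du/dt couples to d(sxx)/dx */
    A[4][2] = -1.0 / rho;         /* dv/dt couples to d(sxy)/dx */
}
```

The eigenvalues of this matrix are ±c_p, ±c_s and 0, with c_p = sqrt((λ + 2µ)/ρ) and c_s = sqrt(µ/ρ) the P- and S-wave speeds.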

Fig. 1. Element E_i and its direct j-th neighbour E_ij. The normal vector n_ij is defined pointing outward from the j-th edge of element E_i.

Fig. 2. Hardware distribution of the CUDA cores in a Tesla C2070 graphic card.

Fig. 3. Grid and block definition. We call M_gx the maximum x-direction block number, which depends on the GPU used. Here we show the mapping between blocks and elements of our mesh E_i, where one block is assigned to one element. In the same manner, each thread of a block is assigned to one of the degrees of freedom of E_i.
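When the number of elements exceeds M_gx, the one-block-per-element launch has to be folded into a two-dimensional grid. A sketch of that bookkeeping in plain C (the helper is illustrative, not the authors' code; 65535 is the per-dimension grid limit on Fermi-class cards such as the Tesla C2070):

```c
#include <assert.h>

/* Fold a one-block-per-element launch into a 2-D grid when the number of
   elements exceeds the maximum x-direction block count M_gx. Inside a
   kernel the element index would be recovered as
       i = blockIdx.y * gridDim.x + blockIdx.x;
   with a guard `if (i >= num_elements) return;` for the padded tail. */
typedef struct { unsigned gx, gy; } Grid2D;

Grid2D element_grid(unsigned num_elements, unsigned mgx) {
    Grid2D g;
    g.gx = (num_elements < mgx) ? num_elements : mgx;
    g.gy = (num_elements + g.gx - 1) / g.gx;  /* ceiling division */
    return g;
}
```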

Fig. 4. Construction of the array dof to store the degrees of freedom of all elements in one single array inside GPU global memory. Here we see the example for a PDE with 2 components and 3 degrees of freedom to approximate the vector U.
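Such a single contiguous array implies a flat indexing rule. A minimal sketch, assuming an element-major ordering (element, then component, then basis function); the figure fixes only that all elements share one array, so the exact ordering here is an assumption:

```c
#include <assert.h>
#include <stddef.h>

/* Flat index into the global dof array of Fig. 4, assuming the layout
   [elem 0: comp 0 dofs, comp 1 dofs, ...][elem 1: ...]. */
size_t dof_index(size_t elem, size_t comp, size_t m,
                 size_t n_comp, size_t n_dof) {
    return (elem * n_comp + comp) * n_dof + m;
}
```

With 2 components and 3 degrees of freedom as in the figure, each element occupies 6 consecutive entries, which keeps per-element data contiguous for coalesced access by the threads of one block.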

Fig. 6. Relation between memory requirements and accuracy order of the numerical method presented in this manuscript, considering a 5-component PDE and double-precision representation for real numbers.
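The growth shown in Fig. 6 can be estimated directly. A sketch under the assumption of triangular elements, where a method of order N uses polynomials of degree N-1 and hence N_d = N(N+1)/2 basis functions in 2-D:

```c
#include <assert.h>

/* Per-element storage for the degrees of freedom: n_comp components,
   N_d = order*(order+1)/2 basis functions, 8 bytes per double.
   Illustrative estimate; total GPU memory also holds the precomputed
   operator matrices, which this sketch ignores. */
unsigned long bytes_per_element(unsigned order, unsigned n_comp) {
    unsigned long n_dof = (unsigned long)order * (order + 1) / 2;
    return n_comp * n_dof * 8ul;
}
```

For the 5-component elastic system this gives, e.g., 400 bytes per element at fourth order and 1120 bytes at seventh order, reproducing the quadratic growth with the order of the method.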

Fig. 7. Meshes used in the convergence test. From left to right we see three meshes with characteristic lengths h = 0.1, h = 0.05 and h = 0.025.

convergence test 2 including the auxiliary coordinate system (x′, y′).

Fig. 12. Snapshot of the numerical solution of test 3. At time t = 0 the initial condition begins to deform due to the flow field. At time t = 0.75 it reaches the maximum deformation, and the velocity field changes direction to recover the initial condition at time t = 1.5.

Here R denotes the matrix of right eigenvectors and |Λ| = diag(|λ_1|, |λ_2|, ...) a diagonal matrix with the absolute values of the eigenvalues on the diagonal, so that |A| = R|Λ|R⁻¹. If instead we evaluate |A| as the maximum absolute eigenvalue times the identity, we obtain the well-known Rusanov flux.
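The evaluation of |A| = R|Λ|R⁻¹ can be sketched for a 2×2 system; the matrix A = [[0, c], [c, 0]] (eigenvalues ±c) is our own illustrative choice, with R, |Λ| and R⁻¹ supplied explicitly:

```c
#include <assert.h>
#include <math.h>

/* Z = X * Y for 2x2 matrices. */
void matmul2(double X[2][2], double Y[2][2], double Z[2][2]) {
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            Z[i][j] = X[i][0] * Y[0][j] + X[i][1] * Y[1][j];
}

/* |A| = R * |Lambda| * R^{-1}; for A = [[0,c],[c,0]] this yields c*I. */
void abs_matrix2(double R[2][2], double absL[2][2],
                 double Rinv[2][2], double absA[2][2]) {
    double T[2][2];
    matmul2(R, absL, T);
    matmul2(T, Rinv, absA);
}
```

In the code, |A| (or the Rusanov simplification) enters the precomputed flux matrices, so this decomposition is done once per edge in a setup phase rather than inside the time loop.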