Devito: an embedded domain-specific language for finite differences and geophysical exploration

We introduce Devito, a new domain-specific language for implementing high-performance finite difference partial differential equation solvers. The motivating application is exploration seismology where methods such as Full-Waveform Inversion and Reverse-Time Migration are used to invert terabytes of seismic data to create images of the earth's subsurface. Even using modern supercomputers, it can take weeks to process a single seismic survey and create a useful subsurface image. The computational cost is dominated by the numerical solution of wave equations and their corresponding adjoints. Therefore, a great deal of effort is invested in aggressively optimizing the performance of these wave-equation propagators for different computer architectures. Additionally, the actual set of partial differential equations being solved and their numerical discretization is under constant innovation as increasingly realistic representations of the physics are developed, further ratcheting up the cost of practical solvers. By embedding a domain-specific language within Python and making heavy use of SymPy, a symbolic mathematics library, we make it possible to develop finite difference simulators quickly using a syntax that strongly resembles the mathematics. The Devito compiler reads this code and applies a wide range of analysis to generate highly optimized and parallel code. This approach can reduce the development time of a verified and optimized solver from months to days.

the featured wave equation operators in Section 3. We outline the seismic inversion theory in Section 4. Code verification and analysis of accuracy in Section 5 is followed by a discussion of the propagators computational performance in Section 6. We conclude by presenting a set of realistic seismic inversion examples and a discussion of future work.

Background
Improving the runtime performance of a critical piece of code on a particular computing platform is a non-trivial task that has received significant attention throughout the history of computing. The desire to automate the performance optimization process itself across a range of target architectures is not new either, although it is often met with skepticism. Even the very first compiler, A0 Hopper (1952), was received with resistance, as best summarized in the following quote: "Dr. Hopper believes,..., that the result of a compiling technique should be a routine just as efficient as a hand tailored routine. Some others do not completely agree with this. They feel the machine-made routine can approach hand tailored coding, but they believe there are "tricks of the trade" that apply to various special cases that a computer cannot be expected to utilize." (Jones, 1954).
Given the challenges of porting optimized codes to a wide range of rapidly evolving computer architectures, it seems natural to raise again the layer of abstraction and use compiler techniques to replace much of the manual labor.
Community acceptance of these new "automatic coding systems" began when concerns about the performance of the generated code were addressed by the first "optimizing compiler", FORTRAN, released in 1957 -which not only translated code from one language to another but also ensured that the final code performed at least as good as a hand-written low-level code (Backus, 1978). Since then, as program and hardware complexity rose, the same problem has been solved over and over again, each time by the introduction of higher levels of abstractions. The first high-level languages and compilers were targeted at solving a large variety of problems and hence were restricted in the kind of optimizations they could leverage.
As these generic languages became common-place and the need for further improvement in performance was felt, restricted languages focusing on smaller problem domains were developed that could leverage more "tricks of the trade" to optimize performance. This led to the proliferation of DSLs for broad mathematical domains or sub-domains, such as APL (Iverson, 1962), Mathematica, Matlab ® or R.
In addition to these relatively general mathematical languages, more specialized frameworks targeting the automated solution of PDEs have long been of interest (Cárdenas and Karplus, 1970;Umetani, 1985;Cook Jr, 1988;Van Engelen et al., 1996).
More recent examples not only aim to encapsulate the high-level notation of PDEs, but are often centered around a particular numerical method. Two prominent contemporary projects based on the finite element method (FEM), FEniCS (Logg et al., 2012) and Firedrake , both implement a common DSL, UFL (Alnaes et al., 2014), that allows the expression of variational problems in weak form. Multiple DSLs to express stencil-like algorithms have also emerged over time, including Henretty et al. (2013); Zhang and Mueller (2012); Christen et al. (2011); Unat et al. (2011); Köster et al. (2014); Membarth et al. (2012); Osuna et al. (2014); Tang et al. (2011); Bondhugula et al. (2008); Yount (2015). Other stencil DSLs have been developed with the objective of solving PDEs using finite differences Hawick and Playne (2013), Arbona et al. (2017) and Jacobs et al. (2016). However, in all cases their use in seismic imaging problems (or even more broadly in science and engineering) has been limited by a number of factors other than technology inertia. Firstly, they only raise the abstraction to the level of polyhedral-like (affine) loops. As they do not generally use a symbolic mathematics engine to write the mathematical expressions at a high-level, developers must still write potentially complex numerical kernels in the target low-level programming language. For complex formulations this process can be time consuming and error prone, as handtuned codes for wave propagators can reach thousands of lines of code. Secondly, most DSLs rarely offer enough flexibility for extension beyond their original scope (e.g. sparse operators for source terms and interpolation) making it difficult to work the DSL into a more complex science/engineering workflow. Finally, since finite difference wave propagators only form part of the over-arching PDE constrained (wave equation) optimization problem, composability with external packages, such as the SciPy optimization toolbox, is a key requirement that is often ignored by self-contained standalone DSLs. The use of a fully embedded Python DSL, on the other hand, allows users to leverage a variety of higher-level optimization techniques through a rich variety of software packages provided by the scientific Python ecosystem.
Moreover, several computational frameworks for seismic imaging exist, although they provide varying degrees of abstraction and are typically limited to a single representation of the wave equation. IWAVE (Symes et al., 2011;Symes, 2015b;Sun and Symes, 2010;Symes, 2015a), although not a DSL, provides a high-level of abstraction and a mathematical framework to abstract the algebra related to the wave-equation and its solution. IWAVE provides a rigorous mathematical abstraction for linear operations and vector representations including Hilbert space abstraction for norms and distances. However, its C++ implementation limits the extensibility of the framework to new wave-equations. Other software frameworks, such as Madagascar (Fomel et. al, 2013), offer a broad range of applications. Madagascar is based on a set of subroutines for each individual problem and offers modelling and imaging operators for multiple wave-equations. However, the lack of high-level abstraction restricts its flexibility and interfacing with high level external software (i.e Python , Java). The subroutines are also mostly written in C/Fortran and limit the architecture portability.

Symbolic definition of finite difference stencils with Devito
In general, the majority of the computational workload in wave-equation based seismic inversion algorithms arises from computing solutions to discrete wave equations and their adjoints. There are a wide range of mathematical models used in seismic imaging that approximate the physics to varying degrees of fidelity. To improve development and innovation time, including code verification, we describe the use of the symbolic finite difference framework Devito to create a set of explicit matrix-free operators of arbitrary spatial discretization order. These operators are derived for example from the acoustic wave equation where m(x) = 1 c(x) 2 is the squared slowness with c(x) the spatially dependent speed of sound, symbol ∆u(t, x) denotes the Laplacian of the wavefield u(t, x) and q(t, x) is a source usually located at a single location x s in space (q(t, x) = f (t)δ(x s )).
This formulation will be used as running example throughout the section.

Code generation -an overview
Devito aims to combine performance benefits of dedicated stencil frameworks (Bondhugula et al., 2008;Tang et al., 2011;Henretty et al., 2013;Yount, 2015) with the expressiveness of symbolic PDE-solving DSLs (Logg et al., 2012;Rathgeber et al., 2015) through automated code generation and optimization from high-level symbolic expressions of the mathematics.
Thus, the primary design objectives of the Devito DSL are to allow users to define explicit finite difference operators for (timedependent) PDEs in a concise symbolic manner, and provide an API that is flexible enough to fully support realistic scientific use cases. To this end, Devito offers a set of symbolic classes that are fully compatible with the general-purpose symbolic algebra package SymPy that enables users to derive discretized stencil expressions in symbolic form. As we show in Fig. 1, the primary symbols in such expressions are associated with user data that carry domain-specific meta-data information to be used by the compiler engine (e.g. dimensions, data type, grid). The discretized expressions form an abstract operator definition that Devito uses to generate low-level C code (C99) and OpenMP at runtime. The encapsulating Operator object can be used to execute the generated code from within the Python interpreter making Devito natively compatible with the wide range of tools available in the scientific Python software stack.
A Devito Operator takes as input a collection of symbolic expressions and progressively lowers the symbolic representation to semantically equivalent C code. The code generation process consists of a sequence of compiler passes during which multiple automated performance-optimization techniques are employed. These can be broadly categorised into two classes and are performed by distinct sub-packages: (CSE), factorization and loop-invariant code motion are utilized to reduce the number of floating point operations (flops) performed within the computational kernel .
-Devito Loop Engine (DLE): Well-known loop optimization techniques, such as explicit vectorization, thread-level parallelization and loop blocking with auto-tuned block sizes are employed to increase the cache utilization and thus memory bandwidth utilization of the kernels.
A complete description of the compilation pipeline is provided in Luporini et al. (2018).

Discrete function symbols
The primary user-facing API of Devito allows the definition of complex stencil operators from a concise mathematical notation.
For this purpose Devito relies strongly on SymPy . Devito provides two symbolic object types that mimic SymPy symbols, enabling the construction of stencil expressions in symbolic form: -Function: The primary class of symbols provided by Devito behaves like sympy.Function objects, allowing symbolic differentiation via finite difference discretization and general symbolic manipulation through SymPy utilities. Symbolic function objects encapsulate state variables (parameters and solution of the PDE) in the operator definition and associated user data (function value) with the represented symbol. The meta-data, such as grid information and numerical type, which provide domain-specific information to the Devito compiler are also carried by the sympy.Function object.
-Dimension: Each sympy.Function object defines an iteration space for stencil operations through a set of Dimension objects that are used to define and generate the corresponding loop structure from the symbolic expressions.
In addition to sympy.Function and Dimension symbols, Devito supplies the construct Grid, which encapsulates the definition of the computational domain and defines the discrete shape (number of grid points, grid spacing) of the function data. The number of spatial dimensions are hereby derived from the shape of the Grid object and inherited by all Function objects, allowing the same symbolic operator definitions to be used for two and three-dimensional problem definitions. As an example, a two-dimensional discrete representation of the square slowness of an acoustic wave m[x, y] inside a 5 by 6 grid points domain can be created as a symbolic function object as illustrated in Fig. 2.
It is important to note here that m[x, y] is constant in time, while the discrete wavefield u[t, x, y] is time-dependent.
Since time is often used as the stepping dimension for buffered stencil operators, Devito provides an additional function type TimeFunction, which automatically adds a special TimeDimension object to the list of dimensions. As an example, we can create an equivalent symbolic representation of the wavefield as u = TimeFunction(name='u', grid=grid), which is denoted symbolically as u(t, x, y). can be expressed in shorthand simply as u.laplace. The shorthand expression u.laplace is agnostic to the number of spatial dimensions and may be used for two or three-dimensional problems.
The discretization of the spatial derivatives can be defined for any order. In the most general case, we can write the spatial discretization in the x direction of order k (and equivalently in the y and z direction) as: where h x is the discrete grid spacing in dimension x, the constants α j are the coefficients of the finite difference scheme and the spatial discretization error is of order O(h k x ).

Temporal discretization
We consider here a second-order time discretization for the acoustic wave equation, as higher order time discretization requires us to rewrite the PDE (Seongjai Kim, 2007). The discrete second-order time derivative with this scheme can be derived from the Taylor expansion of the discrete wavefield u(t, x, y, z) as In this expression, ∆t is the size of a discrete time step. The discretization error is O(∆t 2 ) (second order in time) and will be verified in Section. 5.
Following the convention used for spatial derivatives, the above expression can be automatically generated using the shorthand expression u.dt2. Combining the temporal and spatial derivative notations, and ignoring the source term q, we can now define the wave propagation component of Equation Produces output equivalent to: . Example code defining the two-dimensional wave equation without damping using Devito symbols and symbolic processing utilities from SymPy . Assuming hx = ∆x, hy = ∆y and s = ∆t the output is equivalent to Eq. 1 without the source term qs.

0)
where Eq is the SymPy representation of an equation. In the resulting expression all spatial and temporal derivatives are expanded using the corresponding finite difference terms. To define the propagation of the wave in time, we can now rearrange the expression to derive a stencil expression for the forward stencil point in time, u(t + ∆t, x, y, z), denoted by the shorthand expression u.forward. The forward stencil corresponds to the explicit Euler time-stepping that updates the next time-step u.forward from the two previous ones u and u.backward (Equation. 4). We use the SymPy utility solve to automatically derive the explicit time-stepping scheme, as shown in Figure 3 for a second order in space discretization.
The iteration over time to obtain the full solution is then generated by the Devito compiler from the time dimension information. Solving the wave-equation with the above explicit Euler scheme is equivalent to a linear system A(m)u = q s where the vector u is the discrete wavefield solution of the discrete wave-equation, q s is the source term and A(m) is the matrix representation of the discrete wave-equation. From Eq. 4 we can see that the matrix A(m) is a lower triangular matrix that reflects the time-marching structure of the stencil. Simulation of the wavefield is equivalent to a forward substitution (solve row by row from the top) on the lower triangular matrix A(m). Since we do not consider complex valued PDEs, the adjoint of A(m) is equivalent to its transpose denoted as A (m) and is an upper triangular matrix. The solution v of the discrete adjoint wave-equation A(m) v = q a for an adjoint source q a is equivalent to a backward substitution (solve from bottom row to top row) on the upper triangular matrix A(m) and is simulated backward in time starting from the last time-step. These matrices are never explicitly formed, but are instead matrix free operators with implicit implementation of the matrix-vector product, u = A(m) −1 q s as a forward stencil. The stencil for the adjoint wave-equation in this self-adjoint case would simply be obtained with solve(eqn, u.backward) and the compiler will detect the backward-in-time update.

Boundary conditions
The field recorded data is measured on a wavefield that propagates in an infinite domain. However, solving the wave equation in a discrete infinite domain is not feasible with finite differences. In order to mimic an infinite domain, Absorbing Boundary Conditions (ABC) or Perfectly Matched Layers (PML) are necessary (Clayton and Engquist, 1977). These two methods allow the approximation of the wavefield as it is in an infinite medium by damping and absorbing the waves within an extra layer at the limit of the domain to avoid unnatural reflections from the edge of the discrete domain.
The least computationally expensive method is the Absorbing Boundary Conditions that adds a single damping mask in a finite layer around the physical domain. This absorbing condition can be included in the wave-equation as The η[x, y, z] parameter is equal to 0 inside the physical domain and increasing from inside to outside within the damping layer. The dampening parameter η can follow a linear or exponential curve depending on the frequency band and width of the dampening layer. For methods based on more accurate modelling, for example in simulation-based acquisition design (Liu and Fomel, 2011;Wason et al., 2017;Naghizadeh and Sacchi, 2009;Kumar et al., 2015), a full implementation of the PML will be necessary to avoid weak reflections from the domain limits.

Sparse point interpolation
Seismic inversion relies on data fitting algorithms, hence we need to support sparse operations such as source injection and wavefield (u[t, x, y, z]) measurement at arbitrary grid locations. Both operations occur at sparse domain points, which do not necessarily align with the logical cartesian grid used to compute the discrete solution u(t, x, y, z). Since such operations are not captured by the finite differences abstractions for implementing PDEs, Devito implements a secondary high-level representation of sparse objects  to create a set of SymPy expressions that perform polynomial interpolation within the containing grid cell from pre-defined coefficient matrices.
The necessary expressions to perform interpolation and injection are automatically generated through a dedicated symbol type, SparseFunction, which associates a set of coordinates with the symbol representing a set of non-aligned points.
For examples, the syntax p.interpolate(expr) provided by a SparseFunction p will generate a symbolic expressions that interpolates a generic expression expr onto the sparse point locations defined by p, while p.inject(field, expr) will evaluate and add expr to each enclosing point in field. The generated SymPy expressions are passed to Devito Operator objects alongside the main stencil expression to be incorporated into the generated C kernel code. A complete setup of the acoustic wave equation with absorbing boundaries, injection of a source function and measurement of wavefields via interpolation at receiver locations can be found in Section 4.2.

Seismic modeling and inversion
Seismic inversion methods aim to reconstruct physical parameters or an image of the earth's subsurface from multi-experiment field measurements. For this purpose a wave is generated at the ocean surface that propagates through to the subsurface and creates reflections at the discontinuities of the medium. The reflected and transmitted waves are then captured by a set of hydrophones that can be classified as either moving receivers (cables dragged behind a source vessel) or static receivers (ocean bottom nodes or cables). From the acquired data, physical properties of the subsurface such as wave speed or density can be reconstructed by minimizing the misfit between the recorded measurements and the numerically modelled seismic data.

Full-Waveform Inversion
Recovering the wave speed of the subsurface from surface seismic measurements is commonly cast into a non-linear optimization problem called full-waveform inversion (FWI). The method aims at recovering an accurate model of the discrete wave velocity, c, or alternatively the square slowness of the wave, m = 1 c 2 (not an overload), from a given set of measurements of the pressure wavefield u. Lions (1971); Tarantola (1984); Virieux and Operto (2009);Haber et al. (2012) shows that this can be expressed as a PDE-constrained optimization problem. After elimination of the PDE constraint, the reduced objective function is defined as: where P r is the sampling operator at the receiver locations, P T s ( T is the transpose or adjoint) is the injection operator at the source locations, A(m) is the operator representing the discretized wave equation matrix, u is the discrete synthetic pressure wavefield, q s is the corresponding pressure source and d is the measured data. While we consider the acoustic isotropic wave equation for simplicity here, in practice, multiple implementations of the wave equation operator A(m) are possible depending on the choice of physics. In the most advanced case, m would not only contain the square slowness but also anisotropic or orthorhombic parameters.
To solve this optimization problem with a gradient-based method, we use the adjoint-state method to evaluate the gradient (Plessix, 2006;Haber et al., 2012): where n t is the number of computational time steps, δd s = (P r u − d) is the data residual (difference between the measured data and the modeled data), J is the Jacobian operator and v tt is the second-order time derivative of the adjoint wavefield that solves: The discretized adjoint system in Eq. 8 represents an upper triangular matrix that is solvable by modelling wave propagation backwards in time (starting from the last time step). The adjoint state method, therefore, requires a wave equation solve for both the forward and adjoint wavefields to compute the gradient. An accurate and consistent adjoint model for the solution of the optimization problem is therefore of fundamental importance.

Acoustic forward modelling operator
We consider the acoustic isotropic wave-equation for a squared slowness m[x, y, z]. We have zero initial conditions assuming the wavefield does not have any energy before zero time. We have an additional dampening term to mimic an infinite domain (see sec 3.2.3). We have zero Dirichlet boundary conditions as the solution is considered to be fully damped at the limit of the computational domain. The PDE is defined in Eq. 5. Figure

Discrete adjoint wave-equation and FWI gradient
To create the adjoint that pairs with the above forward modeling propagator we can make use of the fact that the isotropic acoustic wave equation is self-adjoint. This entails that for the implementation of the forward wave equation eqn, shown in Fig. 5, only the sign of the damping term needs to be inverted, as the dampening time-derivative has to be defined in the    Figure 6. Example definition of a gradient operator.
direction of propagation ( ∂ ∂n(t) ). For the PDE stencil, however, we now rearrange the stencil expression to update the backward wavefield from the two next time steps as v[t − ∆t, x, y, z] = f (v[t, x, y, z], v[t + ∆t, x, y, z]). Moreover, the role of the sparse point symbols has changed (Eq. 8), so that we now inject time-dependent data at the receiver locations (adj_src), while sampling the wavefield at the original source location (adj_rec).
Based on the definition of the adjoint operator, we can now define a similar operator to update the gradient according to Eq. 7.
As shown in Fig. 6, we can replace the expression to sample the wavefield at the original source location with an accumulative update of the gradient field grad via the symbolic expression Eq(grad, grad -u * v.dt2).
To compute the gradient, the forward wavefield at each time step must be available which leads to significant memory requirements. Many methods exist to tackle this memory issue, but all come with their advantages and disadvantages. For instance, we implemented optimal checkpointing based on its library Revolve in Devito to drastically reduce the memory cost by only saving a partial time history and recomputing the forward wavefield when needed . The memory reduction comes at an extra computational cost as optimal checkpointing requires log(n t ) + 2 extra PDE solves.
Another method is boundary wavefield reconstruction (McMechan, 1983;Mittet, 1994;Raknes and Weibull, 2016) that saves the wavefield only at the boundary of the model, but still requires us to recompute the forward wavefield during the backpropagation. This boundary method has a reduced memory cost but necessitates the computation of the forward wavefield twice (one extra PDE solve), once to get the data than a second time from the boundary values to compute the gradient.

FWI using Devito operators
At this point we have a forward propagator to model synthetic data in Fig. 4, the adjoint propagator for Eq 8 and the FWI gradient of Eq. 7 in Fig. 6. With these three operators, we show the implementation of the FWI objective and gradient with Devito in Fig. 8. With the forward and adjoint/gradient operator defined for a given source, we only need to add a loop over all the source experiments and the reduction operation on the gradients (sum the gradient for each source experiment together). In practice, this loop over sources is where the main task-based or MPI based parallelization happens. The waveequation propagator does use some parallelization with multithreading or domain decomposition but that parallelism requires communication. The parallelism over source experiment is task-based and does not require any communication between the separate tasks as the gradient for each source can be computed independently and reduced to obtain the full gradient. With the complete gradient summed over the source experiments, we update the model with a simple fixed step length gradient update (Cauchy, 1847). Fig. 7 can then be included in any black-box optimization toolbox such as SciPy optimize to solve the inversion problem Eq.6. While black-box optimization methods aim to minimize the objective, there are no guarantees they find a global minimum because the objective is highly non-linear in m and other more sophisticated methods are required (Warner and Guasch, 2014;van Leeuwen and Herrmann, 2015;Peters and Herrmann, 2017;.

Verification
Given the operators defined in Section 3 we now verify the correctness of the code generated by the Devito compiler. We first verify that the discretized wave equation satisfies the convergence properties defined by the order of discretization, and secondly we verify the correctness of the discrete adjoint and computed gradient.

Numerical accuracy
The numerical accuracy of the forward modeling operator (Fig. 4) and the runtime achieved for a given spatial discretization order and grid size are compared to the analytical solution of the wave equation in a constant media. We define two measures of the accuracy that compare the numerical wavefield in a constant velocity media to the analytical solution: -Accuracy versus size, where we compare the obtained numerical accuracy as a function of the spatial sampling size (grid spacing).
-Accuracy versus time, where we compare the obtained numerical accuracy as a function of runtime for a given physical model (fixed shape in physical units, variable grid spacing).
The measure of accuracy of a numerical solution relies on a hypothesis that we satisfy for these two tests: -The domain is large enough and the propagation time small enough to ignore boundary related effects, i.e. the wavefield never reaches the limits of the domain.
-The source is located on the grid and is a discrete approximation of the Dirac to avoid spatial interpolation errors. This hypothesis guarantees the existence of the analytical and numerical solution for any spatial discretization (Igel, 2016).

Convergence in time
We analyze the numerical solution against the analytical solution and verify that the error between these two decreases at a second order rate as a function of the time step size ∆t. The velocity model is a 400m × 400m domain with a source at the center. We compare the numerical solution to the analytical solution on Fig. 9, 10.
The analytical solution is defined as (Watanabe, 2015): where H (2) 0 is the Hankel function of second kind and q(ω) is the spectrum of the source function. As we can see on Fig. 11 the error decreases near quadratically with the size of the time step with a time convergence rate of slope of 1.94 in logarithmic scale that matches the theoretical expectation from a second order temporal discretization.

Spatial discretization analysis
The spatial discretization analysis follows the same method as the temporal discretixzation analysis. We model a wavefield for a fixed temporal setup with a small enough time-step to ensure negligeable time discretization error (dt = .00625ms). We vary the grid spacing (dx) and spatial discretization order and the and compute the error between the numerical and analytical solution. The convergence rates should follow the theoretical rates defined in Eq. 2. In details, for a kth order discretization in space, the error between the numerical and analytical solution should decrease as O(dx k ). The best way to look at the 10 1 5 × 10 2 6 × 10 2 7 × 10 2 8 × 10 2 9 × 10 2 Time-step dt (ms) convergence results is yo plot the error in logarithmic scale and verify that the error decrease linearly with slope k. We show the converegence results on Fig. 12. The numerical convergence rates follow the theoretical ones for every tested order k = 2, 4, 6, 8 with the exception of the 10th order for small grid size. This is mainly due to reaching the limits of the numerical accuracy and a value of the error on par with the temporal discretization error. This behavior for high order and small grids is however in accordance with the literature as in in Wang et al. (2017).
The numerical slopes obtained and displayed on Fig. 12 demonstrate that the spatial finite difference follows the theoretical errors and converges to the analytical solution at the expected rate. These two convergence results (time and space) verify the accuracy and correctness of the symbolic discretization with Devito. With this validated simulated wavefield, we can now verify the implementation of the operators for inversion.

Propagators verification for inversion
We concentrate now on two tests, namely the adjoint test (or dot test) and the gradient test. The adjoint state gradient of the objective function defined in Eq. 7 relies on the solutions of the forward and adjoint wave equations, therefore, the first mandatory property to verify is the exact derivation of the discrete adjoint wave equation. The mathematical test we use is the standard adjoint property or dot-test: for any random The adjoint test is also individually performed on the source/receiver injection/interpolation operators in the Devito tests suite. The results, summarized in Tables 1 and 2 with F = P r A(m) −1 P −T s , verify the correct implementation of the adjoint operator for any order in both 2D and 3D. We observe that the discrete adjoint is accurate up to numerical precision for any order in 2D and 3D with an error of order 1e − 16. In combination with the previous numerical analysis of the forward modeling propagator that guarantees that we solve the wave equation, this result verifies that we the adjoint propagator is the exact numerical adjoint of the forward propagator and that it implements the adjoint wave equation.
With the forward and adjoint propagators tested, we finally verify that the Devito operator that implements the gradient of the FWI objective function (Eq. 7, Fig.6) is accurate with respect to the Taylor expansion of the FWI objective function. For a given velocity model and associated squared slowness m, the Taylor expansion of the FWI objective function from Eq. 6 for a model perturbation dm and a perturbation scale h is: These two equations constitute the gradient test where we define a small model perturbation dm and vary the value of h between 10 −6 and 10 0 and compute the error terms: We plot the evolution of the error terms as a function of the perturbation scale h knowing 0 should be first order (linear with slope 1 in a logarithmic scale) and 1 should be second order (linear with slope 2 in a logarithmic scale). We executed the gradient test defined in Eq. 12 in double precision with a 8 th order spatial discretization. The test can be run for higher orders in the same manner but since it has already been demonstrated that the adjoint is accurate for all orders, the same results would be obtained.
In Fig. 13, the matching slope of the error term with the theoretical h and h 2 slopes from the Taylor expansion verifies the accuracy of the inversion operators. With all the individual parts necessary for seismic inversion, we now validate our implementation on a simple but realistic example.

Validation: Full-Waveform Inversion
We finally show a simple example of full-waveform inversion (FWI, Eq 7) on the Marmousi-ii model (Versteeg, 1994). This result obtained with the Julia interface to Devito JULI  that provides high-level abstraction for optimization and linear algebra. The model size is 4km × 16km discretized with a 10m grid in both directions. We use a 10Hz Ricker wavelet with 4s recording. The receivers are placed at the ocean bottom (210m depth) every 10m. We invert for the velocity with all the sources, spaced by 50m at 10m depth for a total of 300 sources. The inversion algorithm used is minConf_PQN (Schmidt et al., 2009), an l-BFGS algorithm with bounds constraints (minimum and maximum velocity values constraints). The result is shown in Fig. 14 and we can see that we obtain a good reconstruction of the true model. More advanced algorithms and constraints will be necessary for more complex problem such as less accurate initial model, noisy data or field recorded data Peters and Herrmann, 2017); however the PDE solve part would not be impacted, making this example a good proof of concept for Devito.
This result highlights two main contributions of Devito. First, we provide PDE simulation tools that allow easy and efficient implementation of inversion operator for seismic problem and potentially any PDE constrained optimization problem. As described in Sec. 3 and 4, we can implement all the required propagators and the FWI gradient in a few lines in a concise and mathematical manner. Second, as we obtained this results with JUDI , a seismic inversion framework that provides a high-level linear abstraction layer on top of Devito for seismic inversion, this example illustrates that Devito is fully compatible with external languages and optimizations toolboxes and allows users to use our symbolic DSL for finite difference within their own inversion framework.

Performance
In this section we demonstrate the performance of Devito from the numerical and the inversion point of view, as well as the absolute performance from the hardware point of view. This section only provides a brief overview of Devito's performance and a more detailed description of the compiler and its performance is covered in Luporini et al. (2018).

Error-cost analysis
Devito's automatic code generation lets users define the spatial and temporal order of FD stencils symbolically and without having to reimplement long stencils by hand. This allows users to experiment with trade-offs between discretization errors and runtime, as higher order FD stencils provide more accurate solutions that come at increased runtime. For our error-cost analysis, we compare absolute error in L 2 -norm between the numerical and the reference solution to the time-to-solution (the numerical and reference solution are defined in the previous Section 5). Figure 15 shows the runtime and numerical error obtained for a fixed physical setup. We use the same parameter as in Sec. 5.1 with a domain of 400m × 400m and we simulate the wave propagation for 150ms.
The results in Fig. 15 illustrate that higher order discretizations produce a more accurate solution on a coarser grid with a smaller runtime. This result is very useful for inverse problems, as a coarser grid requires less memory and fewer time steps. A grid size two times bigger implies a reduction of memory usage by a factor of 2 4 for 3D modeling. Devito then allows users to design FD simulators for inversion in an optimal way, where the discretization order and grid size can be chosen according to the desired numerical accuracy and availability of computational resources. The order of the FD stencils also affects the best possible hardware usage that can theoretically be achieved and whether an algorithm is compute or memory bound, a trade-off that is described by the roofline model.

Roofline analysis
We present performance results of our solver using the roofline model, as previously discussed in Colella (2004) Skylake 8180 architecture (28 physical cores, 38.5 MB shared L3 cache, with cores operating at 2.5 Ghz). The observed Stream TRIAD (McCalpin, 1991(McCalpin, -2007 was 105 GB/s. The maximum single-precision FLOP performance was calculated as #cores · #avx units · #data items per vector register · 2(fused multiply-add) · core frequency = 4480 GFLOPs/s. A (more realistic) performance peak of 3285 GFLOPs/s was determined by running the LINPACK benchmark (Dongarra, 1988). These values are used to construct the roofline plots.
We show three different roofline plots, one plot for each domain size attempted, in Fig. 16, 17 Figure 16. Roofline plots for a 512 × 512 × 512 model on a Skylake 8180 architecture. The run times correspond to 1000ms of modeling for four different spatial discretization orders (4,8,12,16 Figure 17. Roofline plots for a 768 × 768 × 768 model on a Skylake 8180 architecture. The run times correspond to 1000ms of modeling for four different spatial discretization orders (4,8,12,16 Figure 18. Roofline plots for a 1024×1024×1024 model on a Skylake 8180 architecture. The run times correspond to 1000ms of modeling for four different spatial discretization orders (4,8,12,16). use of expression transformations as described in Section refsec:devito; particularly relevant for this problem is the factorization of FD weights.
We observe that the time to solution increases nearly linearly with the size of the domain. For example, for a 16th order discretization, we have a 17.1sec runtime for a 512 × 512 × 512 domain and 162.6sec runtime for a 1024 × 1024 × 1024 domain (8 times bigger domain and about 9 times slower). This is not surprising: the computation lies in the memory-bound regime and the working sets never fit in the L3 cache. We also note a drop in performance with a 16th order discretization (relative to both the other space orders and the attainable peak), especially when using larger domains ( Fig. 17 and 18). Our hypothesis, supported by profiling with Intel VTune (Intel Corporation, 2016), is that this is due to inefficient memory usage, in particular misaligned data accesses. Our plan to improve the performance in this regime consists of resorting to a specialized stencil optimizer such as YASK (see Section 7). These results show that we have a portable framework that achieves good performance on different architectures. There is small room for improvements, as the machine peak is still relatively distant, but 50-60% of the attainable peak is usually considered very good. Finally, we remark that testing on new architectures will only require extensions to the Devito compiler, if any, while the application code remains unchanged.
A key motivation for developing an embedded DSL such as Devito is to enable quicker development, simpler maintenance, and better portability and performance of solvers. The other benefit of this approach is that HPC developer effort can be focused on developing the compiler technology that is reapplied to a wide range of problems. This software reuse is fundamental to keep the pace of technological evolution. For example, one of the current projects in Devito regards the integration of YASK (Yount, 2015), a lower-level stencil optimizer conceived by Intel for Intel architectures. Adding specialized backends such as YASK -meaning that Devito can generate and compile YASK code, rather than pure C/C++ -is the key for long-term performance portability, one of the goals that we are pursuing.

Conclusions
We have introduced a DSL for time-domain simulation for inversion and its application to a seismic inverse problem based on the finite difference method. Using the Devito DSL a highly optimized and parallel finite difference solver can be implemented within just a few lines of Python code. Although the current application focuses on features required for seismic imaging applications, Devito can already be used in problems based on other equations; a series of CFD examples are included in the code repository.
The code traditionally used to solve such problems is highly complex. The primary reason for this is that the complexity introduced by the mathematics is interleaved with the complexity introduced by performance engineering of the code to make it useful for practical use. By introducing a separation of concerns, these aspects are decoupled and both simplified. Devito successfully achieves this decoupling while delivering good computational performance and maintaining generality, both of which shall continue to be improved in future versions.

Code Availability
The code source code, examples and test script are available on github at https://github.com/opesci/devito and contains a README for installation. A more detailed overview of the project, list of publication and documentation of the software generated with Sphinx is available at http://www.devitoproject.org/. To install Devito: git clone https://github.com/opesci/devito cd devito conda env create -f environment.yml source activate devito pip install -e .