The VOLNA-OP2 Tsunami Code (Version 1.0): Response to referees

In this paper, we present the VOLNA-OP2 tsunami model and implementation: a finite volume non-linear shallow water equations (NSWE) solver built on the OP2 domain specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high performance computing platforms: CPUs, the Intel Xeon Phi, and GPUs. This is achieved in such a way that the scientific code is kept separate from the various parallel implementations, enabling easy maintainability. The code has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 shows an ability to deliver productivity to its users, as well as performance and portability across a range of hardware platforms.


Main points:
Point 1 - This is presumably the first model implementation of its kind, and I find that the implementation should be of interest for the tsunami community, as well as other scientific communities with an interest in solving shallow water wave equations or related problems without in-depth knowledge of different types of hardware architecture. This part is highly regarded.
Reply - We give a better overview of the model in the paper; the numerical implementation, however, has been described in detail in previous work (which we now highlight better).
Changes in manuscript - Section 4.2, Page 5 line 28 - Page 6 line 26.
Point 2 - The study of the speedup on different types of hardware is also new, and the findings are interesting in their own right. However, if possible, I encourage the authors to see if it is possible to compare the model speedup also with other models (such as HySEA) for inter-model comparison.
Reply - Direct comparison is very difficult, because the open source models and implementations available all differ from ours in various, but significant, ways - e.g. Tsunami-HySEA only supports structured meshes (vs. our unstructured meshes) and a single GPU - support for nested meshes and multi-GPU is not available in the public version. Instead, we pull results from their published papers to discuss the performance and the speedups they achieve.
Changes in manuscript - Section 2, Page 3 lines 28-33.
Point 3 - The validation of the model is entirely missing. I know that the previous VOLNA codes have been benchmarked against NTHMP tests previously, but this is a new implementation. While one may expect a similar accuracy for this code as well, validation needs to be demonstrated. To emphasise this, the novelty of this paper actually hinges on some kind of proof, i.e. that the model can produce results consistent with previous versions. Moreover, no explicit tsunami results are shown, only results showing the speedup. As a minimum, some results showing that the tsunami code gives a reasonable output need to be included. I would propose that the authors include one or two of the standard tsunami inundation benchmark tests. I'm sure the authors have some such tests available.
Reply - We are adding results of two NTHMP benchmarks to demonstrate the numerical accuracy of the code.
Changes in manuscript - Section 4.3, Page 6 line 28 - Page 9 line 9.
Point 4 - The text and reference list is a bit imbalanced with respect to the authors' own work. It would be beneficial if some more external references were added; reference to external work is
Comment 8 - I think comparing the VOLNA-OP2 performance on different architectures is a bit unfair, for at least two reasons: 1) the number of "degrees-of-freedom" (or simply triangles) per core is generally different, and so is the proportion of time spent in computation and communication; even data locality may have been impacted. 2) These architectures are profoundly different from each other - in theoretical attainable peak performance, price, release date, etc. I don't like sentences such as "The GPU system with X1 nodes was Y times faster than the CPU system with X2 nodes"; I don't think they add much to the paper, and in fact they may be misleading. Perhaps you can normalise over a common metric, but again that's quite tough. Also, the Phi suffers a lot from the lack of vectorisation in 'applyFluxes', and it's unclear whether this issue can be worked around or not.
Reply - While for computer science benchmarking purposes we entirely agree that such a comparison is not fair, this paper is aimed at people who use and run tsunami modelling software (most of the authors are not computer scientists). We believe that for this audience such speedup numbers are relevant, as they will have to choose a platform to run on, and their basis for decision is only the cost of access and the relative performance between platforms. We dedicate a section to this (Running Costs and Power Consumption), and now point out that this kind of comparison is indeed unfair.
Changes in manuscript - Section 5.6, Page 17, lines 2-13
Comment 9 - In the conclusions, I disagree with the sentence: "Through performance scaling and analysis of the code on (...), we have demonstrated that VOLNA-OP2 indeed delivers near-optimal performance on a variety of platforms (...)". My point is that I don't think that you have
demonstrated that you're achieving near-optimal performance - it is true that some loops are relatively close to the architecture's memory bandwidth limit, but others are not. We don't know the reason - in fact, we don't even know whether these loops are compute- or memory-bound (see point above). So I suggest either dropping that sentence, rephrasing it, or adding a roofline plot.
Reply - We have dropped this statement, especially because for unstructured mesh computations this is difficult to quantify, and also because a fair comparison to other codes was not possible.
Changes in manuscript - Section 6, Page 18, lines 1-2
Comment 10 - Of course, I assume that comparing the performance of VOLNA-OP2 to that of other codes is way too difficult, if not impossible.
Reply - Unfortunately we found no open-source codes that would allow direct comparison. We have extracted some performance figures from related papers.
Changes in manuscript - Section 2, Page 3 lines 28-33.
Comment 11 - I see that version 1.0 has been "Zenodoized" - don't forget to add the DOI to the paper. Also, I suggest moving the code to its own organisation on GitHub so that permissions can be more easily set and contributing to VOLNA-OP2 gets easier. I also suggest linking VOLNA-OP2 from the 'apps' section of OP2.
Reply - We have added DOIs to the paper.
Changes in manuscript - Section 6, Page 18 line 17.
Comment 12 - It might be just me, but it feels like different sections of the paper have been written by different people. That is, the transition from some paragraphs to others is not as smooth as it might be. Also, there are some typos (e.g., outline -> outlined) and/or missing words ("the physical across") in various sections. All this should be improved prior to publication.
Reply - We are making changes to the text to make the transitions smoother, and clearing up typos.
Changes in manuscript - Throughout Sections 1-6
Comment 13 - Section 3.1 might benefit from some figures. I know this is not the main point of the article, but I'm not sure that readers who are unfamiliar with OP2 will be able to understand how, for example, GPU parallelisation works. Maybe just cite some prior paper?
Reply - In the interest of brevity, we added citations to previous work detailing the parallelisation approaches.
Changes in manuscript - Section 3.1, Page 5 line 13
Comment 14 - In the performance section, Table 1 gives a name to three meshes: M1, M2, M3.
These could be used systematically throughout the whole section, and in the plots as well, instead of referring to "the 1.4M mesh".
Reply - We have made the suggested changes - with the new non-uniform meshes, these are labelled NU_3...NU_0.
Changes in manuscript - Table 1, Page 13, Figures 9-10, Pages 15 and 17, Sections 5.2-5.5.
For widespread use, three key ingredients are therefore needed: first, the stability and robustness of the numerical approach, which gives confidence in the results produced; second, the computational performance of the code, which allows the right results to be obtained quickly, efficiently utilising the available computational resources; and third, the ability to integrate into a workflow, allowing for simple pre- and post-processing and efficiently supporting the kinds of use cases that come up - for example, large numbers of different initial conditions.
In the Related Work section we discuss a number of codes currently being used in production, which as such are trusted and reliable, and are already being used as part of a workflow. Yet, the computational performance of most of these codes is only "good enough"; they were written by domain scientists, and may have been tuned to one architecture or another, but GPU support, for example, is almost non-existent. In today's and tomorrow's quickly changing hardware landscape, however, "future-proofing" numerical codes is of exceptional importance for continued scientific delivery. OP2 provides an abstraction for expressing unstructured mesh computations at a high level, and then provides automated tools to translate scientific code, written once, into a range of high-performance implementations targeting multi-core CPUs, GPUs, and large heterogeneous supercomputers. The original VOLNA model (Dutykh et al. (2011)) was already discussed and validated in detail; it was used in production for small-scale experiments and modelling, but was inadequate for targeting large-scale scenarios and statistical analysis, and was therefore re-implemented on top of OP2. This paper describes the process, challenges and results of that work.
These applications present a number of challenges, both for integration into the workflow and for scalable performance: the need for extracting snapshots of state variables on the full mesh or at a number of specified locations, and capturing the maximum wave elevation or inundation - all in the context of distributed memory execution.
As the above references indicate, VOLNA-OP2 has already been key in delivering scientific results in a range of scenarios, and through the collaboration of the authors it is now capable of efficiently supporting a number of use cases, making it a versatile tool for the community; we have therefore now publicly released it, and it is freely available at github.com/reguly/volna. The rest of the paper is organised as follows: Section 2 discusses related work, Section 3 presents the OP2 library, upon which VOLNA-OP2 is built, Section 4 discusses the VOLNA simulator itself, its structure and features, Section 5 discusses performance and scalability results on CPUs and GPUs, and finally Section 6 draws conclusions.

Related Work
Tsunamis have long been a key target for scientific simulations. Behrens and Dias (2015) give a detailed look at various mathematical, numerical, and implementational approaches to past and current tsunami simulations. The most common set of equations solved are the shallow water equations, and most codes use structured and nested meshes. A popular discretisation is finite differences; such codes include NOAA's MOST (Titov and Gonzalez (1997)), COMCOT (Liu et al. (1998)), and CENALT (Gailler et al. (2013)). On more flexible meshes, many use the finite element discretisation, such as SELFE (Zhang and Baptista (2008)), TsunAWI (Harig et al. (2008)), ASCETE (Vater and Behrens (2014)) and Firedrake-Fluids (Jacobs and Piggott (2015)), or the finite volume discretisation, such as the VOLNA code (Dutykh et al. (2011)), GeoClaw (George and LeVeque (2006)) and HySEA (Macías et al. (2017)). Another model is described by the Boussinesq equations; these equations and their solvers are more complex than shallow-water solvers, and there is no consensus as to their advantage over the NSWE. As far as we are aware, only Tsunami-HySEA (Macías et al. (2017)), also using finite volumes, is using GPU clusters in production; that code, however, only supports GPUs, and is hand-written in CUDA. The OP2 library (Mudalige et al. (2012)) is a domain specific language embedded in C and Fortran that allows unstructured mesh algorithms to be expressed at a high level, and provides automatic parallelisation and a number of other features. It provides an abstraction that lets the domain scientist describe a mesh using a number of sets (such as quadrilaterals or vertices), connections between these sets (such as edges-to-nodes), and data defined on sets (such as x, y coordinates on vertices).
Once the mesh is defined, an algorithm can be implemented as a sequence of parallel loops, each over all elements of a given set, applying different "kernel functions" and accessing data either directly on the iteration set or indirectly through at most one level of indirection. This abstraction enables the implementation of a wide range of algorithms, such as the finite volume algorithms that VOLNA uses, but it does require that, for any given parallel loop, the order of execution must not affect the end result (within machine precision) - this precludes the implementation of e.g. Gauss-Seidel iterations.
OP2 enables its users to write an application only once using its API, which is then automatically parallelised to utilise multi-core CPUs, GPUs, and large supercomputers through the use of MPI, OpenMP and CUDA. This is done in part through a code generator that parses the parallel loop expressions and generates boilerplate code around the computational kernel to facilitate parallelism and data movement, and in part through different back-end libraries that manage data, including MPI halo exchanges and GPU memory management, as shown in Figure 1.

Parallelisation Approaches in OP2
OP2 takes full responsibility for orchestrating parallelism and data movement - from the user's perspective, the code written looks and feels like sequential C code that makes calls to an external library. To utilise clusters and supercomputers, OP2 uses the Message Passing Interface (MPI) to parallelise in a distributed memory environment; once the mesh is defined by the user, OP2 automatically partitions and distributes it among the available resources. It uses the standard owner-compute approach with halo exchanges, and overlaps computations with communications. In conjunction with MPI, OP2 uses a number of shared-memory parallelisation approaches, such as CUDA and OpenMP.
A key challenge in the fine-grained parallelisation of unstructured mesh algorithms is the avoidance of race conditions when data is indirectly modified. For example, in a parallel loop over edges that indirectly increments data on vertices, multiple edges may try to increment the same vertex, leading to race conditions. OP2 uses a colouring approach to resolve this: elements of the iteration set are grouped into mini-partitions, and each element within these mini-partitions is coloured so that no two elements of the same colour access the same value indirectly. Subsequently, the mini-partitions are coloured as well. For CUDA, we assign mini-partitions of the same colour to different CUDA thread blocks, and for OpenMP, to different threads.

Input and Output
OP2 supports parallel file I/O through the HDF5 library (The HDF Group (2000-2010)), which is critically important to its integration into VOLNA's workflow: reading in the input problem and writing out the data required for analysis simultaneously on multiple processes.
The VOLNA simulator

Model, numerics, previous validation

The finite volume (FV) framework is the most natural numerical method for solving the non-linear shallow water equations (NSWE), in part because of its ability to treat shocks and breaking waves. It belongs to a class of discretisation schemes that are highly efficient in the numerical solution of systems of conservation laws, which are very common in compressible and incompressible fluid dynamics. Finite volume methods are preferred over finite differences, and often over finite elements, because they intrinsically address conservation issues, improving their robustness: total energy, momentum and mass are conserved exactly, assuming no source terms and appropriate boundary conditions. The code was validated against the classical benchmarks in the tsunami community, as described below.

Numerical model
In a typical megatsunami (say, with characteristic wavelength ~100 km, average ocean depth ~4 km and tsunami wave amplitude of ~0.5 m), the non-linearity and dispersion are found to be weak (Dutykh et al. (2011)). Although frequency dispersion may precipitate local effects as the tsunami approaches the shore (or shallow depths), these are overwhelmed by the inundation/run-up caused by the longer waves. Following the needs of the target applications, the following non-dispersive NSWEs (in Cartesian coordinates) form the physical model of VOLNA. Here, d(x,t) is the time-dependent bathymetry, v(x,t) is the horizontal component of the depth-averaged velocity, g is the acceleration due to gravity and H(x,t) is the total water depth. Further, I_2 is the identity matrix of order 2. The tsunami wave height, or elevation of the free surface, η(x,t), is computed from H and the dynamic bathymetry, where the sum of the static bathymetry d_s(x) and the dynamic seabed uplift u_z(x,t) constitutes the dynamic bathymetry; d_s is usually sourced from bathymetry datasets pertaining to the region of interest (say, global datasets like ETOPO1/GEBCO, or regional bathymetries). The vertical component u_z(x,t) of the seabed deformation is calculated depending on the physics of tsunami generation, e.g. via co-seismic displacement for finite fault segmentations by Gopinathan et al. (2017), submarine sliding by Salmanidou et al. (2017, 2018), etc. The time-dependency in u_z(x,t) enables the tsunami to be actively generated (Dutykh and Dias (2009)). This is a step forward from the common passive mode of tsunamigenesis that utilises an instantaneous rupture. The active mode is particularly important for tsunamigenic earthquakes with long and slow ruptures. In addition to the capabilities of employing active generation and consequent tsunami propagation, VOLNA also models the run-up/run-down (i.e. the final inundation stage of the tsunami). These three functionalities qualify VOLNA to simulate the entire tsunami life-cycle. The ability of the NSWEs (1-2) to model both propagation, as well as run-up and run-down processes, was validated in Kervella et al. (2007) and Dutykh et al. (2011), respectively. Thus, the use of a uniform model for the entire life-cycle obviates many technical issues, such as the coupling between the sea bed deformation and the sea surface deformation, and the use of nested grids.
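The equations (1-2) and the free-surface relation referenced above do not survive in this excerpt. As a sketch, reconstructed from the symbol definitions in the surrounding text and the form of the NSWE in Dutykh et al. (2011) (the exact sign conventions should be checked against the revised manuscript), they read:

```latex
\frac{\partial H}{\partial t} + \nabla \cdot (H\,\mathbf{v}) = 0,
\qquad
\frac{\partial (H\mathbf{v})}{\partial t}
  + \nabla \cdot \!\left( H\,\mathbf{v} \otimes \mathbf{v}
      + \frac{g H^{2}}{2}\, I_{2} \right)
  = g H\, \nabla d,
```

with the free-surface elevation and dynamic bathymetry given by

```latex
\eta(\mathbf{x},t) = H(\mathbf{x},t) - d(\mathbf{x},t),
\qquad
d(\mathbf{x},t) = d_{s}(\mathbf{x}) + u_{z}(\mathbf{x},t).
```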

Code structure
The structure of the code is outlined in Algorithm 1: the user inputs a configuration file (.vln), which specifies the mesh to be read in from gmsh files, as well as the initial/boundary conditions of state variables, such as the bathymetry deformation starting the tsunami, which can be defined in various ways (mathematical expressions or files, or a mix of both).
We use a variable timestep third-order (four stage) Runge-Kutta method for evolving the solution in time. In each iteration, events may be triggered; e.g. further bathymetry deformations, displaying the current simulation time, or outputting simulation data to VTK files for visualisation. The original VOLNA source code was implemented in C++, utilising libraries such as Boost (Schling (2011)). This gives a very clear structure, abstracting data management, event handling and low-level array operations for the higher-level algorithm - an example is shown in Figure 5. While this coding style was good for readability, it had its limitations in terms of performance: there were excessive amounts of data movement, and certain operations could not be parallelised - indirect increments with potential race conditions in particular. Some features - such as describing the bathymetry lift with a mathematical formula - were implemented with functionality and simplicity, not performance, in mind. Describing the bathymetry deformation with arbitrary code that is a function of spatial coordinates and time also reduces the computational and memory overheads when running a set of simulations.
Through performance scaling and analysis of the code on traditional CPU clusters, as well as GPUs and Intel's Xeon Phi, we have demonstrated that VOLNA-OP2 indeed delivers near-optimal performance on a variety of platforms and, depending on problem size, scales well to multiple nodes.
There is still a need for even more streamlined and efficient workflows. For instance, we could integrate within VOLNA the finite fault source model for the slip, with some assumptions on the rupture dynamics; we could also integrate bathymetry-based meshing (the mesh needs to be tailored to the depth and gradients of the bathymetry to optimally reduce computational time). Indeed, there would then be even fewer exchanges of files and more efficient computations, especially in the context of uncertainty quantification tasks such as emulation or inversion.
In the end, the gain in computational efficiency will allow higher-resolution modelling, such as using 2 m topography and bathymetry collected from LIDAR, i.e. a greater capability. It will also allow greater capacity by enabling more simulations to be performed. Both of these enhancements will subsequently lead to better warnings, more tailored to the actual impact on the coast, as well as better urban planning, since hazard maps will gain in precision both geographically and probabilistically, due to the possibility of exploring a larger number of more realistic scenarios.