Portable Multi- and Many-Core Performance for Finite-Difference Codes: Application to the Free-Surface Component of NEMO

We present an approach which we call PSyKAl that is designed to achieve portable performance for parallel, finite-difference ocean models. In PSyKAl the code related to the underlying science is formally separated from code related to parallelisation and single-core optimisations. This separation of concerns allows scientists to code their science independently of the underlying hardware architecture and for optimisation specialists to be able to tailor the code for a particular machine independently of the science code. We have taken the free-surface part of the NEMO ocean model and created a new, shallow-water model named NEMOLite2D. In doing this we have a code which is of a manageable size and yet which incorporates elements of full ocean models (input/output, boundary conditions, etc.). We have then manually constructed a PSyKAl version of this code and investigated the transformations that must be applied to the middle/PSy layer in order to achieve good performance, both serial and parallel. We have produced versions of the PSy layer parallelised with both OpenMP and OpenACC; in both cases we were able to leave the natural-science parts of the code unchanged while achieving good performance on both multi-core CPUs and GPUs. In quantifying whether or not the obtained performance is ‘good’ we also consider the limitations of the basic roofline model and improve on it by generating kernel-specific CPU ceilings.


Introduction
The challenge presented to the developers of scientific software by the drive towards Exascale computing is considerable.
With power consumption becoming the overriding design constraint, CPU clock speeds are falling and the complex, multi-purpose compute core is being replaced by multiple, simpler cores. This philosophy can be seen at work in the rise of so-called accelerator-based machines in the Top 500 List of supercomputers (http://www.top500.org/): six of the top-ten machines in the current list make use of accelerators. The PSyKAl approach presented here is influenced by earlier work on OP2 (Bertolli et al., 2012; Rathgeber et al., 2012).
In common with OP2, the PSyKAl approach separates out the science code and the performance-related code into distinct layers. The calls that specify parallelism in both approaches are similar in terms of where they are placed in the code and in their semantics. However, the PSyKAl approach supports the specification of more than one kernel in a parallel region of code, compared with one for OP2, giving more scope for optimisation. In addition, the metadata describing a kernel is included with the kernel code in the PSyKAl approach whereas it is provided as a part of the kernel call in OP2.
While the PSyKAl approach is general, we are currently applying it to Atmosphere and Ocean models written in Fortran where domain decomposition is typically performed in the latitude and longitude dimensions, leaving columns of elements on each domain-decomposed partition.
The top layer, in terms of calling hierarchy, is the Algorithm layer. This layer specifies the algorithm that the scientist would like to perform (in terms of calls to kernel and infrastructure routines) and logically operates on full fields. We say logically here because the fields may be domain-decomposed; the Algorithm layer, however, is not aware of this. It is the scientist's responsibility to write this Algorithm layer.
The bottom layer, in terms of calling hierarchy, is the Kernel layer. The Kernel layer implements the science that the Algorithm layer calls, as a set of subroutines. These kernels operate on fields that are local to the process doing the computation.
(Depending on the type of kernel, these may be a set of elements, a single column of elements, or a set of columns.) Again the scientist is responsible for writing this layer and there is no parallelism specified here, but, depending on the complexity of the kernels, there may be input from an HPC expert and/or some coding rules to help ensure that the kernels compile into efficient code. In an alternative approach (Rathgeber et al., 2015; Logg et al., 2012), kernels are generated from a high-level specification, potentially allowing them to be optimised automatically (Luporini et al., 2014).
The PSy layer sits in-between the Algorithm and Kernel layers and its functional role is to link the algorithm calls to the associated kernel subroutines. As the Algorithm layer works on logically global fields and the Kernel layer works on local fields, the PSy layer is responsible for iterating over columns. It is also responsible for including any distributed-memory operations resulting from the decomposition of the simulation domain, such as halo swaps and reductions.
As the PSy layer iterates over columns, the single-core performance can be optimised by applying transformations such as manipulation of loop bounds (e.g. padding for SIMD) and kernel in-lining. Additionally, the potential parallelism within this iteration space can also be exploited and optimised. The PSy layer can therefore be tailored for a particular hardware (such as multi-core, many-core, GPUs, or some combination thereof) and software (such as compiler, operating system, MPI library, etc.) configuration with no change to the Algorithm or Kernel layer code. This approach therefore offers the potential for portable performance. In this work we apply optimisations to the PSy layer manually. The development of a tool to automate this process will be the subject of a future paper.
Clearly the separation of code into distinct layers may have an effect on performance. The extent of this overhead, how to recover the performance of a parallel, hand-optimised code, and how potentially to improve upon it will be discussed in the remainder of this paper.

The 'NEMOLite2D' Program
For this work we have used a program, 'NEMOLite2D', developed by ourselves (https://puma.nerc.ac.uk/trac/GOcean). NEMOLite2D is a vertically-averaged version of NEMO (Nucleus for European Modelling of the Ocean (Madec, 2014)), retaining only its dynamical part. The whole model system is represented through one continuity equation (1) (for the update of the sea-surface height) and two vertically-integrated momentum equations (2) (for the two velocity components, respectively):

  ∂ζ/∂t + ∇ · (hU) = 0,    (1)

  ∂U/∂t + U · ∇U + 2Ω × U + g∇ζ = ν∇²U + F,    (2)

where ζ and U represent the sea-surface height and horizontal velocity vector, respectively, h is the total water depth, Ω is the Earth's rotation velocity vector, g is the acceleration due to gravity, ν is the kinematic viscosity coefficient and F denotes the external forcing terms.
The external forcing includes surface wind stress, bottom friction, and open-boundary barotropic forcing. A lateral-slip boundary condition is applied along the coast lines. The open boundary condition can be set as a clipped or Flather's radiation condition (Flather, 1976). The bottom friction takes a semi-implicit form for the sake of model stability. As in the original version of NEMO, a constant or Smagorinsky horizontal viscosity coefficient is used for the horizontal viscosity term.
The traditional Arakawa C structured grid is employed here for the discretisation of the computational domain. A two-dimensional integer array is used to identify the different parts of the computational domain; it has the value of one for ocean, zero for land and minus one for ocean cells outside of the computational domain. This array enables the identification of ocean cells, land cells, solid boundaries and open boundaries.
For the sake of simplicity, the explicit Eulerian forward time-stepping method is implemented here, except that the bottom friction takes a semi-implicit form. The Coriolis force can be set in explicit or implicit form. The advection term is computed with a first-order upwind scheme.
The sequence of the model computation is as follows:

1. Set the initial conditions (water depth, sea-surface height, velocity).
2. Solve the continuity equation to update the sea-surface height.
3. Solve the momentum equations to update the two velocity components.
4. Apply the boundary conditions.
5. Copy the updated fields into the previous-time-step fields.
6. Periodically output the model fields.

Steps 2-6 are repeated on each time step. Since any real oceanographic computational model must output results, we ensure that any PSyKAl version of NEMOLite2D retains the Input/Output capability of the original. This aids in limiting the optimisations that can be performed on the PSyKAl version to those that should also be applicable to full oceanographic models. Note that although we retain the I/O functionality, all of the results presented in this work carefully exclude the effects of I/O since it is compute performance that interests us here.
In the Algorithm layer, fields (and grids) are treated as logically global objects. Therefore, as part of creating the PSyKAl version of NEMOLite2D, we represent fields with derived types instead of arrays in this layer. These types then hold information about the associated mesh and the extents of 'internal' and 'halo' regions as well as the data arrays themselves. This frees the natural scientist from having to consider these issues and allows for a certain degree of flexibility in the actual implementation (e.g. padding for alignment or increasing array extent to allow for other optimisations). The support for this is implemented as a library (which we term the GOcean Infrastructure) and is common to the PSyKAl versions of both NEMOLite2D and Shallow.
In re-structuring NEMOLite2D to conform to the PSyKAl separation of concerns we must break up the computation into multiple kernels. The more of these there are, the greater the potential for optimisation of the PSy layer. This re-structuring gave eight distinct kernels, each of which updates a single field at a single point (since we have chosen to use point-wise kernels).
With a little tidying/re-structuring, we found it was possible to express the contents of the main time-stepping loop as a single invoke (a call to the PSy layer) and a call to the I/O system (Figure 1). The single invoke gives us a single PSy-layer routine which consists of applying each of the kernels to all of the points requiring an update on the model mesh. In its basic, unoptimised ('vanilla') form, this PSy-layer routine then contains a doubly-nested loop (over the two dimensions of the model grid) around each kernel call.
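A minimal sketch of such a vanilla PSy-layer routine illustrates this structure (the type and component names here are illustrative, not the actual GOcean identifiers):

```fortran
subroutine invoke_time_step(ssha, sshn_t, un, vn)
  type(r2d_field), intent(inout) :: ssha
  type(r2d_field), intent(in)    :: sshn_t, un, vn
  integer :: ji, jj
  ! One doubly-nested loop per point-wise kernel, iterating over the
  ! internal region of the model grid
  do jj = ssha%internal%ystart, ssha%internal%ystop
    do ji = ssha%internal%xstart, ssha%internal%xstop
      call continuity_code(ji, jj, ssha%data, sshn_t%data, un%data, vn%data)
    end do
  end do
  ! ...similar loop nests follow for the Momentum, boundary-condition
  ! and time-update kernels
end subroutine invoke_time_step
```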
As with any full oceanographic model, boundary conditions must be applied at the edges of the model domain. Since NEMOLite2D applies external boundary conditions (e.g. barotropic forcing), this is done via user-supplied kernels.

Methodology
Our aim in this work is to achieve portable performance, especially between multi-core CPU and many-core GPU systems.
Consequently, we have performed tests on both an Intel Ivy Bridge CPU (E5-2697 at 2.7 GHz) and on an NVIDIA Tesla K40 GPU. On the Intel-based system we have used the Gnu, Intel and Cray Fortran compilers (versions 4.9.1, 15.0.0.090 and 8.3.3, respectively). The code that made use of the GPU was compiled using version 15.10 of the PGI compiler.
We first describe the code transformations performed for the serial version of NEMOLite2D. We then move on to the construction of parallel versions of the code using OpenMP and OpenACC. Again, we describe the key steps we have taken in this process in order to maximise the performance of the code. In both cases our aim is to identify those transformations which must be supported by a tool which seeks to auto-generate a performant PSy layer.

Transformations of Serial NEMOLite2D
In Table 1 we give the optimisation flags used with each compiler. For the Gnu and Intel compilers, we include flags to encourage in-lining of kernel bodies. The Intel flag "-xHost" enables the highest level of SIMD vectorisation supported by the host CPU (AVX in this case). The Intel flag "-fast" and the Cray flags "-O ipa5" and "-h wp" enable inter-procedural optimisation.
Before applying any code transformations, we first benchmark the original, serial version of the code. We also benchmark the vanilla, unoptimised version after it has been re-structured following the PSyKAl approach. In addition to this benchmarking, we profile these versions of the code at the algorithm level (using a high-level timing API). The resulting profiles are given in Table 2. Beginning with the vanilla PSyKAl version, we then apply a series of code transformations while obeying the PSyKAl separation of concerns, i.e. optimisation is restricted to the middle, PSy layer and leaves the kernel and algorithm layers unchanged. The aim of these optimisations is to recover, as much as is possible, the performance of the original version of the code. The transformations we have performed and the reasons for them are described in the following sections.

Constant loop bounds
In the vanilla PSy layer, the lower and upper bounds for each loop over grid points are obtained from the relevant components of the derived type representing the field being updated by the kernel being called from within the loop. In our previous work (Porter et al., 2016) we found that the Cray compiler in particular produced more performant code if we changed the PSy layer such that the array extents are looked up once at the beginning of the PSy routine and then used to specify the loop bounds. We have therefore applied that transformation to the PSy layer of NEMOLite2D.
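The effect of this transformation can be sketched as follows (the derived-type component names are illustrative):

```fortran
! Before: loop bounds looked up from the derived type for every loop nest
do jj = ssha%internal%ystart, ssha%internal%ystop
  do ji = ssha%internal%xstart, ssha%internal%xstop
    call continuity_code(ji, jj, ssha%data, sshn_t%data, un%data, vn%data)
  end do
end do

! After: array extents copied into local integers once, at the start of
! the PSy routine, and used to specify every subsequent loop's bounds
M = ssha%grid%simulation_domain%xstop
N = ssha%grid%simulation_domain%ystop
do jj = 2, N
  do ji = 2, M
    call continuity_code(ji, jj, ssha%data, sshn_t%data, un%data, vn%data)
  end do
end do
```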

Addition of safe_address directives
Many of the optimisations we have performed have been informed by the diagnostic output produced by either the Cray or Intel compilers. Many of the NEMOLite2D kernels contain conditional statements. These statements are there to check whether, e.g., the current grid point is wet or neighbours a boundary point. A compiler is better able to optimise such a loop if it can be sure that all array accesses within the body of the loop are safe for every trip, irrespective of the conditional statements. In its diagnostic output the Cray compiler notes this with messages of the form:

  A loop starting at line 448 would benefit from "!dir$ safe_address".
In order that we could safely add such directives, we altered the GOcean infrastructure to allocate all field-data arrays with extents greater than strictly required. We were then able to safely add the safe_address directive before all of the loops where the Cray compiler indicated it might be useful (the Momentum loops and some of the BC loops).
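In sketch form (the kernel arguments are illustrative), the directive simply precedes the loop nest whose array accesses are now guaranteed safe by the over-allocation:

```fortran
! Field arrays are allocated slightly larger than strictly required, so
! accesses on every loop trip are safe even when the kernel's
! conditional statements evaluate to false
!dir$ safe_address
do jj = 2, N
  do ji = 2, M
    call momentum_u_code(ji, jj, ua%data, un%data, vn%data, hu%data)
  end do
end do
```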

In-line Momentum kernel bodies into middle-layer
The profiling data in Table 2 shows that it is the Momentum section that accounts for the bulk of the model run-time. We therefore chose to attempt to optimise this section first. In keeping with the PSyKAl approach, we are only permitted to optimise the middle (PSy) layer which, for this section, comprises calls to two kernels, one for each of the x and y components of momentum. These kernels are relatively large; each comprises roughly 85 lines of executable Fortran statements.
From our previous work (Porter et al., 2016) on a similar code we know that kernel in-lining is critical to obtaining performance with both the Gnu and Intel compilers. For the Gnu compiler, this is because it cannot perform in-lining when routines are in separate source files. In our previous work we obtained an order-of-magnitude speed-up simply by moving subroutines into the module containing the middle layer (from which the kernels are called). A further performance improvement of roughly 30% was obtained when the kernel code was manually inserted at the site of the subroutine call.
Although the Intel compiler does do in-lining when routines are in separate source files, we have found (both here and in our previous work (Porter et al., 2016)) that the extent of the optimisations it performs is reduced if it first has to in-line a routine.
For the Intel-compiled Shallow code, manually inserting kernel code at the site of the subroutine call increased performance by about 25%.
In fact, in-lining can have a significant effect on the Intel compiler's ability to vectorise a loop. Taking the loop that calls the kernel for the u-component of momentum as an example, the vector-loop cost summaries that the compiler reports before and after in-lining differ markedly. Looking at the 'estimated potential speedup' reported in each case, it is clear that the way in which the compiler vectorises the two versions must be very different. This conclusion is borne out by the fact that if one persuades the compiler to vectorise the first version (through the use of a directive) then the performance of the resulting binary is worse than that where the loop is left un-vectorised. In principle this could be investigated further by looking at the assembler that the Intel compiler generates but that is outside the scope of this work.
For the best possible performance we have therefore chosen to perform full, manual in-lining of the two kernels making up the Momentum section.
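Schematically (with an illustrative first statement standing in for the real kernel body), the transformation replaces the call with the kernel code itself:

```fortran
do jj = 2, N
  do ji = 2, M
    ! The call to momentum_u_code is replaced by its body, with the
    ! dummy arguments substituted by the actual arrays, e.g.:
    u_e = 0.5d0 * (un(ji,jj) + un(ji+1,jj)) * e2t(ji+1,jj)
    ! ...the remaining ~85 lines of executable statements follow here...
  end do
end do
```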

Force SIMD vectorisation of the Momentum kernels using directives
It turns out that the Cray-compiled binaries of both the original and PSyKAl versions of NEMOLite2D perform considerably less well than their Intel-compiled counterparts. Comparison of the diagnostic output from each of the compilers revealed that while the Intel compiler was happy to vectorise the Momentum loops, the Cray compiler was choosing not to:

  A loop starting at line 99 was blocked with block size 256.
  A loop starting at line 99 was not vectorized because it contains conditional code which is more efficient if executed in scalar mode.

Inserting the Cray vector always directive persuaded the compiler to vectorise the loop:

   99. !dir$ vector always
  100. do ji = 2, M-1, 1

  A loop starting at line 100 was blocked with block size 256.
  A loop starting at line 100 requires an estimated 17 vector registers at line 151; 1 of these have been preemptively forced to memory.
  A loop starting at line 100 was vectorized.

This gave a significant performance improvement. This behaviour is in contrast to that obtained with the Intel compiler: its predictions about whether vectorising a loop would be beneficial were generally found to be reliable.

Work around limitations related to derived types
Having optimised the Momentum section as much as permitted by the PSyKAl approach, we turn our attention to the three remaining sections of the code. The profile data in Table 2 shows that these regions are all comparable in terms of cost. What is striking, however, is that the cost of the Continuity section increases by more than a factor of three in moving to the PSyKAl version of the code.
Comparison of the diagnostic output from the Cray and Intel compilers revealed that the Cray compiler was vectorising the Continuity section while the Intel compiler reported that it was unable to do so due to dependencies. After some experimentation we found that this was due to limitations in the compiler's analysis of the way components of Fortran derived types were being used. Each GOcean field object, in addition to the array holding the local section of the field, contains a pointer to a GOcean grid object. If a kernel requires grid-related quantities (e.g. the grid spacing) then these are obtained by passing it a reference to the appropriate array within the grid object. Although these grid-related quantities are read-only within a compute kernel, if they were referenced from the same field object as that containing an array to which the kernel writes then the Intel compiler identified a dependency preventing vectorisation. In the Continuity section, ssha is the only field that is written to by the kernel. We therefore remove this limitation, and any potential confusion on the part of the compiler, by instead obtaining the grid-related (read-only) quantities from a field (sshn_t in this case) that is only read by the kernel.
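A sketch of the change (the grid-array name and argument lists are illustrative) is:

```fortran
! Before: a read-only grid array referenced via the field the kernel
! writes to (ssha); the Intel compiler then assumed a dependency
call continuity_code(ji, jj, ssha%data, sshn_t%data, un%data, vn%data, &
                     ssha%grid%area_t)

! After: the same array referenced via a field that is only read
! (sshn_t), removing the apparent dependency and enabling vectorisation
call continuity_code(ji, jj, ssha%data, sshn_t%data, un%data, vn%data, &
                     sshn_t%grid%area_t)
```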

In-line the Continuity kernel
As with the Momentum kernels, we know that obtaining optimal performance from both the Gnu and Intel compilers requires that a kernel be manually in-lined at its call site. We do this for the Continuity kernel in this optimisation step.

In-line remaining kernels (BCs and time-update)
Having optimised the Continuity section we finally turn our attention to the Boundary Condition and Time-update sections.
The kernels in these sections are small and dominated by conditional statements. We therefore limited our optimisation of them to manually in-lining each of the kernels into the PSy layer.

In-line field-copy operations
The Time-update section includes several array copies where fields for the current time-step become the fields at the previous time-step. Initially we implemented these copies as 'built-in' kernels (in the GOcean infrastructure) as they are specified in the algorithm layer. However, we obtained better performance by simply in-lining these array copies into the PSy layer.
We shall see that the transformations we have just described do not always result in improved performance. Whether or not they do so depends both on the compiler used and on the problem size. We also emphasise that the aim of these optimisations is to make the PSy layer as compiler-friendly as possible, following the lessons learned from our previous work with the Shallow code (Porter et al., 2016). It may well be that transforming the code into some other structure would result in better performance on a particular architecture. However, exploring this optimisation space is beyond the scope of the present work.
Nevertheless, this approach ensures that the scope for compiler-friendly optimisation is exhausted, as well as giving us some insight into the decisions that different compilers make when optimising the code.

Construction of OpenMP-Parallel NEMOLite2D
For this part of the work we began with the optimised PSyKAl version of the code, as obtained after applying the various transformations described in the previous section. As with the transformations of the serial code, our purpose here is to determine the functionality required of a tool that seeks to generate the PSy layer.

Separate PARALLEL DOs
The simplest possible OpenMP-parallel implementation consists of parallelising each loop nest in the PSy layer. This was done by inserting an OpenMP PARALLEL DO directive before each loop nest so that the iterations of the outermost or j loop (over the latitude dimension of the model domain) are shared out amongst the OpenMP threads. This leaves the innermost (i) loop available for SIMD vectorisation by the compiler.
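In sketch form (the shared-variable list is abbreviated for clarity):

```fortran
!$OMP PARALLEL DO default(none), private(ji,jj), &
!$OMP             shared(ssha, sshn_t, un, vn, M, N)
do jj = 2, N            ! iterations shared out amongst the threads
  do ji = 2, M          ! inner loop left free for SIMD vectorisation
    call continuity_code(ji, jj, ssha%data, sshn_t%data, un%data, vn%data)
  end do
end do
!$OMP END PARALLEL DO
```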
The loop nest dealing with the application of the Flather boundary condition to the y-component of velocity (v) has a loop-carried dependency in j which appears to prevent its being executed in parallel.1 This was therefore left unchanged and executed on thread 0 only.

Single PARALLEL region
Although very simple to implement, the use of separate PARALLEL DO directives results in a lot of thread synchronisation and can also cause the team of OpenMP threads to be repeatedly created and destroyed. This may be avoided by keeping the thread team in existence for as long as possible using an OpenMP PARALLEL region. We therefore enclosed the whole of the PSy layer (in this code, a single subroutine) within a single PARALLEL region. The directive preceding each loop nest to be parallelised was then changed to an OpenMP DO. We ensured that the v-Flather loop nest was executed in serial (by the first thread to encounter it) by enclosing it within an OpenMP SINGLE section.
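The resulting structure of the PSy-layer routine can be sketched as (routine names and argument lists are illustrative):

```fortran
!$OMP PARALLEL default(none), private(ji,jj), &
!$OMP          shared(ssha, sshn_t, un, vn, va, hv, M, N)

!$OMP DO                      ! work-sharing only; no new thread team
do jj = 2, N
  do ji = 2, M
    call continuity_code(ji, jj, ssha%data, sshn_t%data, un%data, vn%data)
  end do
end do
!$OMP END DO

!$OMP SINGLE                  ! v-Flather loop nest: one thread only
call bc_flather_v(va%data, hv%data, sshn_t%data)
!$OMP END SINGLE

! ...further OMP DO loop nests for the remaining kernels...
!$OMP END PARALLEL
```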

First-touch policy
When executing an OpenMP-parallel program on a Non-Uniform Memory Access (NUMA) compute node it becomes important to ensure that the memory locations accessed by each thread are local to the hardware core upon which it is executing.
One way of doing this is to implement a so-called 'first-touch policy' whereby memory addresses that will generally be accessed by a given thread during program execution are first initialised by that thread. This is simply achieved by using an OpenMP-parallel loop to initialise newly-allocated arrays to some value, e.g. zero.
1 Only once this work was complete did we establish that boundary conditions are enforced such that it can safely be executed in parallel.
Since data arrays are managed within the GOcean infrastructure this optimisation can again be implemented without changing the natural-science code (i.e. the Algorithm and Kernel layers).
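A sketch of such a first-touch initialisation within the infrastructure (names illustrative):

```fortran
! Newly-allocated field data is initialised in an OpenMP-parallel loop so
! that each row is first touched by (and hence placed on the NUMA node
! of) the thread that will later compute on it
allocate( fld%data(M, N) )
!$OMP PARALLEL DO default(none), private(ji,jj), shared(fld, M, N)
do jj = 1, N
  do ji = 1, M
    fld%data(ji,jj) = 0.0d0
  end do
end do
!$OMP END PARALLEL DO
```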

Minimise thread synchronisation
By default, the OpenMP END DO directive includes an implicit barrier, thus causing all threads to wait until the slowest has completed the preceding loop. Such synchronisation limits performance at larger thread counts and, for the NEMOLite2D code, is frequently unnecessary. For example, if a kernel does not make use of the results of a preceding kernel call then there is clearly no need for threads to wait between the two kernels.
We analysed the inter-dependencies of each of the code sections within the PSy layer and removed all unnecessary barriers by adding the NOWAIT qualifier to the relevant OpenMP END DO or END SINGLE directives. This reduced the number of barriers from eleven down to four.
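For example (kernel argument lists abbreviated), a barrier between two independent loop nests is removed like so:

```fortran
!$OMP DO
do jj = 2, N
  do ji = 2, M
    call momentum_u_code(ji, jj, ua%data, un%data, vn%data)
  end do
end do
!$OMP END DO NOWAIT   ! next kernel does not read ua, so no barrier

!$OMP DO
do jj = 2, N
  do ji = 2, M
    call momentum_v_code(ji, jj, va%data, un%data, vn%data)
  end do
end do
!$OMP END DO          ! barrier kept: subsequent kernels read ua and va
```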

Amortise serial region
As previously mentioned, the v-Flather section was executed in serial because of a loop-carried dependence in j. (In principle we could choose to parallelise the inner, i loop but that would inhibit its SIMD vectorisation.) This introduces a load imbalance between the threads. We attempt to mitigate this by moving this serial section to before the (parallel) u-Flather section. Since these two sections are independent, the aim is that the thread that performs the serial, v-Flather computation then performs a smaller share of the following u-Flather loop. In practice, this requires that some form of dynamic thread scheduling is used.

Thread scheduling
In order to investigate how thread scheduling affects performance we used the 'runtime' argument to the OpenMP SCHEDULE qualifier for all of our OpenMP-parallel loops. The actual schedule to use can then be set at run time using the OMP_SCHEDULE environment variable. We experimented with the standard static, dynamic and guided (with varying chunk size) OpenMP schedules.
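For example, with every loop carrying SCHEDULE(runtime), the schedule is selected per run via the environment (the binary name here is illustrative):

```shell
# Choose one of the standard OpenMP schedules at run time:
export OMP_SCHEDULE="static"      # equal, contiguous chunks per thread
# export OMP_SCHEDULE="dynamic,1" # chunks handed out first-come, first-served
# export OMP_SCHEDULE="guided,4"  # decreasing chunk sizes, minimum of 4
./nemolite2d
```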

Construction of OpenACC-Parallel NEMOLite2D
The advantage of the PSyKAl re-structuring becomes apparent if we wish to run NEMOLite2D on different hardware, e.g. a GPU. This is because the necessary code modifications are, by design, limited to the middle, PSy layer. In order to demonstrate this, and to check for any limitations imposed by the PSyKAl re-structuring, we had an expert from NVIDIA port the Fortran NEMOLite2D to the GPU. OpenACC directives were used since that approach is similar to the use of OpenMP directives and works well within the PSyKAl approach. In order to quantify any performance penalty incurred by taking the PSyKAl/OpenACC approach, we also experimented with using CUDA directly within the original form of NEMOLite2D.

Data Movement
Although the advent of technologies such as NVLink is alleviating the bottleneck presented by the connection of the GPU to the CPU, it remains critical to minimise data movement between the memory spaces of the two processing units. In NEMOLite2D this is achieved by performing all computation on the GPU. The whole time-stepping loop can then be enclosed inside a single OpenACC data region and it is only necessary to bring data back to the CPU for the purposes of I/O.
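A sketch of this structure (the routine names and field list are illustrative):

```fortran
!$acc data copy(ssha, sshn_t, un, vn, ua, va)
do istep = 1, nsteps
  call invoke_time_step(ssha, sshn_t, un, vn, ua, va)  ! all on the GPU
  if (mod(istep, output_freq) == 0) then
    !$acc update host(ssha, un, vn)   ! copy back only what I/O needs
    call model_write(istep, ssha, un, vn)
  end if
end do
!$acc end data
```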

Kernel Acceleration
Moving the kernels to execute on the GPU was achieved by using the OpenACC kernels directive in each Fortran routine containing loops over grid points. (In the PSyKAl version this is just the single PSy-layer subroutine.) This directive instructs the OpenACC compiler to automatically create a GPU kernel for each loop nest it encounters.

Force parallelisation of Flather kernels
The PGI compiler was unable to determine whether the loops applying the Flather boundary condition in the x and y directions were safe to parallelise. This information therefore had to be supplied by inserting a loop independent OpenACC directive before each of the two loops.

Loop collapse
When using the OpenACC kernels directive the compiler creates GPU kernels by parallelising only the outer loop of each loop nest. For the majority of loop nests in NEMOLite2D all of the iterations are independent and the loop bodies consist of just a handful of executable statements. For such small kernels the loop start-up cost can be significant and it is therefore beneficial to generate as many threads as will fit on the GPU and then re-use them as required (i.e. if the data size is greater than the number of threads). For this approach to be efficient we must expose the maximum amount of parallelism to the GPU, and we do this by instructing the compiler to parallelise both levels (i.e. ji and jj) of a loop nest. This is achieved using the OpenACC loop collapse(2) directive for all of the smaller kernels in NEMOLite2D.
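For a small kernel the directives can be sketched as (argument lists illustrative):

```fortran
!$acc kernels
!$acc loop collapse(2)   ! expose all jj*ji iterations as GPU threads
do jj = 2, N
  do ji = 2, M
    call continuity_code(ji, jj, ssha, sshn_t, un, vn)
  end do
end do
!$acc end kernels
```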

Shortloop
In contrast to all of the other NEMOLite2D kernels, the Momentum kernels (u and v) are relatively large in terms of the number of executable statements they contain. In turn, this means that the corresponding kernels will require more state and thus more registers on the GPU. Therefore, rather than generating as many threads as will fit on the GPU, it is more efficient only to generate as many as required by the problem so as to minimise register pressure. Once slots become free on the GPU, new threads will launch, with the associated cost amortised by the longer execution time of the larger kernel. The compiler can be persuaded to adopt the above strategy through the use of the (PGI-specific) loop shortloop directive which tells it that there is not likely to be much thread re-use in the following loop. This directive was applied to both the ji and jj loops in both of the Momentum kernels.

Kernel fusion
Both of the Momentum kernels read from 16 double-precision arrays and thus require considerable memory bandwidth.However, the majority of these arrays are used by both the u and v kernels.It is therefore possible to reduce the memory-bandwidth requirements by fusing the two kernels together.In practice, this means fusing the two, doubly-nested loops so that we have a single loop nest containing both the u and v kernel bodies/calls.
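Schematically, the fusion replaces the two separate loop nests with one (argument lists abbreviated):

```fortran
! Fused Momentum loop nest: arrays needed by both kernels (un, vn, hu,
! hv, sshn_t, ...) are now read once per grid point rather than twice
do jj = 2, N
  do ji = 2, M
    call momentum_u_code(ji, jj, ua, un, vn, hu, hv, sshn_t)
    call momentum_v_code(ji, jj, va, un, vn, hu, hv, sshn_t)
  end do
end do
```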

CUDA
We also experimented with using CUDA directly within the original form of NEMOLite2D in order to quantify any performance penalty incurred by taking the PSyKAl/OpenACC approach. To do this we used PGI's support for CUDA Fortran to create a CUDA kernel for the (fused) Momentum kernel. The only significant code change with this approach is the explicit set-up of the grid- and block-sizes and the way in which the kernel is launched (i.e. use of the call kernel<<<gridDim1, blockDim1>>>(args) syntax).
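A sketch of the launch code (the kernel name, device-array names and block shape are illustrative):

```fortran
use cudafor
type(dim3) :: grid, block
integer :: istat
! Explicit choice of block and grid sizes for the fused Momentum kernel
block = dim3(32, 4, 1)
grid  = dim3((M + block%x - 1)/block%x, (N + block%y - 1)/block%y, 1)
call momentum_kernel<<<grid, block>>>(ua_d, va_d, un_d, vn_d, hu_d, hv_d)
istat = cudaDeviceSynchronize()
```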

Results
We first consider the performance of the code in serial and examine the effects of the transformations described in Section 2.
Once we have arrived at an optimised form for the serial version of NEMOLite2D we then investigate its parallel performance on both CPU- and GPU-based systems.

Serial Performance
In Figure 2 we plot the serial performance of the original version of the NEMOLite2D code for the range of compilers considered here. Unlike the Shallow code (Porter et al., 2016), the original version of NEMOLite2D has not been optimised. Although it is still a single source file it is, in common with NEMO itself, logically structured with separate subroutines performing different parts of the physics within each time step. This structuring and the heavy use of conditional statements favour the Intel compiler, which significantly out-performs both the Gnu and Cray compilers. Only when the problem size spills out of cache does the performance gap begin to narrow. The reason for the performance deficit of the Cray-compiled binary comes down to (a lack of) SIMD vectorisation, an issue that we explore below.
Moving now to the PSyKAl version of NEMOLite2D, Figure 3 plots the performance of the fastest PSyKAl version for each of the compiler/problem-size combinations. While the Intel compiler still produces the best-performing binary, the Cray-compiled binary is now a very close second. In fact, the performance of both the Gnu- and Cray-compiled PSyKAl versions is generally significantly greater than that of their respective original versions. We also note that the best absolute performance (in terms of grid points processed per second) with any compiler is obtained with the 256² domain. There is only a single combination of compiler and domain that is slower with the PSyKAl version of the code (and then only by some 3%). For all other points in the space, the optimised PSyKAl version of the code performs better. The results for the Cray compiler are, however, somewhat skewed by the fact that it did not SIMD vectorise key parts of the original version (see below).
Having shown that we can recover, and often improve upon, the performance of the original version of NEMOLite2D, the next logical step is to examine the necessary code transformations in detail. We do this for the 256² case since it fits within cache on the Ivy Bridge CPUs used here. Table 3 shows detailed performance figures for this case after each transformation has been applied to the code. The same data is visualised in Figure 5.
Looking first at the results for the Gnu compiler (and the Ivy Bridge CPU), all of the step increases in performance correspond to kernel in-lining. None of the other transformations had any effect on the performance of the compiled code. In fact, simply in-lining the two kernels associated with the Momentum section was sufficient to exceed the performance of the original code. With the Intel compiler, the single largest performance increase is again due to kernel in-lining (of the Momentum kernels). This is because the compiler does a much better job of SIMD-vectorising the loops involved than it does when it first has to in-line the kernel itself (as evidenced by its own diagnostic output; see Section 2.1.3). However, although this gives a significant performance increase, it is not sufficient to match the performance of the original version. That is only achieved by in-lining every kernel and making more explicit the lack of data dependencies between arrays accessed from different field objects.
The Cray compiler is distinct from the other two in that kernel in-lining does not give any performance benefit and, for the smaller kernels, can actually hurt performance. For this compiler the key transformation is instead to encourage SIMD vectorisation of the Momentum section via a compiler-specific directive (without this it concludes that such vectorisation would be inefficient). Of the other transformations, only the change to constant loop bounds and the addition of the compiler-specific safe_address directive (see Section 2.1.2) were found to (slightly) improve performance.
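As a hedged illustration of the kind of change involved (the loop structure, bounds and kernel name are placeholders rather than the actual NEMOLite2D source; of the directives shown, only safe_address is named in the text, and the exact vectorisation hint used is described in Section 2.1.2), the transformed PSy-layer loop nest might look like:

```fortran
! Sketch only: directive spellings other than safe_address are
! placeholders for whichever compiler-specific hints are required.
!dir$ safe_address
do jj = 2, jstop            ! constant loop bounds, known at compile time
  !dir$ ivdep               ! assert that SIMD vectorisation of this loop is safe
  do ji = 2, jstop
    call momentum_code(ji, jj, ...)   ! kernel body in-lined in practice
  end do
end do
```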

Parallel Performance
We now turn to transformations related to the parallelisation of the NEMOLite2D code: the introduction of OpenMP and OpenACC directives. In keeping with the PSyKAl approach we do not modify either the Algorithm- or Kernel-layer code. Any code changes are restricted to either the PSy (middle) layer or the underlying library that manages, e.g., the construction of field objects.
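To make this separation concrete, a minimal sketch of a PSy-layer routine follows (the routine name, argument lists and loop bounds are illustrative; only the continuity kernel is named elsewhere in the paper). All parallelisation directives live in this layer while the science kernel remains a plain, single-point subroutine:

```fortran
! PSy-layer sketch: parallelisation lives here, the kernel is untouched.
subroutine invoke_time_step(...)
  !$omp parallel default(shared), private(ji, jj)
  !$omp do
  do jj = 2, jstop
    do ji = 2, jstop
      call continuity_code(ji, jj, ...)   ! pure, single-point science kernel
    end do
  end do
  !$omp end do
  !$omp end parallel
end subroutine invoke_time_step
```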

OpenMP
As with the serial optimisations, we consider the effect of each of the OpenMP optimisation steps described in Section 2.2 for the 256² domain. For this we principally use a single Intel Ivy Bridge socket, which has 12 hardware cores and supports up to 24 threads with hyperthreading (i.e. two threads per core). Figures 7, 8 and 9 show the performance of each of the versions of the code on this system for the Gnu, Intel and Cray compilers, respectively.
In order to quantify the scaling behaviour of the different versions of NEMOLite2D with the different compilers/run-time environments, we also plot the parallel efficiency in Figures 7, 8 and 9. For the Gnu and Cray compilers, removing synchronisation points (via the NOWAIT qualifier) improves the performance of the executable on eight or more threads while, for the Intel compiler, it only gives an improvement for the 24-thread case. Moving the SINGLE region before a parallel loop is marginally beneficial for the Gnu- and Cray-compiled binaries and yet reduces the performance of the Intel binary (purple lines and right-pointing triangles).
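Here we take the standard definition of parallel efficiency (an assumption; the paper does not state its definition explicitly): for a run on $n$ threads with wall-clock time $T(n)$,

```latex
E(n) = \frac{T(1)}{n \, T(n)},
```

so that $E(n) = 1$ corresponds to perfect linear scaling.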
We have used different scales for the y-axes in each of the plots in Figures 7, 8 and 9 in order to highlight the performance differences between the code versions with a given compiler. However, the best performance obtained on twelve threads (i.e. without hyperthreading) is 52.1, 55.6 and 46.3 million points/s for the Gnu, Intel and Cray compilers, respectively. This serves to emphasise the levelling effect that the introduction of OpenMP has had: in the serial case the Cray- and Intel-compiled executables were a factor of two faster than the Gnu-compiled binary (Figure 5), whereas in the OpenMP version the Gnu-compiled binary is only 6% slower than that produced by the Intel compiler and is 13% faster than that of the Cray compiler. In part this situation comes about because of the effect that adding the OpenMP compiler flag has on the optimisations performed by the compiler. In particular, the Cray compiler no longer vectorises the key Momentum section, despite the directive added during the serial optimisation work. This deficiency has been reported to Cray.
Since NEMO and similar finite-difference codes tend to be memory-bandwidth bound, we checked the sensitivity of our performance results to this quantity by benchmarking on two sockets of Intel Ivy Bridge (i.e. using a complete node of ARCHER, a Cray XC30). For this configuration we ensured that threads were evenly shared over the two sockets. The performance obtained for the 256² case with the 'early SINGLE' version of the code is compared with the single-socket performance in Figure 10. Surprisingly, doubling the available memory bandwidth in this way has little effect on performance: the two-socket performance figures track those from a single socket very closely. The only significant difference in performance is at 24 threads where, in addition to the difference in available memory bandwidth, the single-socket configuration is using hyperthreading while the two-socket case is not. The discrepancy in the performance of the Cray- and Intel-compiled binaries at this thread count is under investigation by Cray.

A further complication is the choice of scheduling of the OpenMP threads. We have investigated the performance of each of the executables (and thus the associated OpenMP run-time library) with the standard OpenMP static, dynamic and guided scheduling policies. For the Intel compiler/run-time, static loop scheduling was found to be best for all versions apart from that where we have attempted to amortize the cost of the SINGLE section. This is to be expected since that strategy requires some form of dynamic loop scheduling in order to reduce the load imbalance introduced by the SINGLE section.
In contrast, some form of dynamic scheduling gave a performance improvement with the Gnu compiler/run-time even for the 'first-touch' version of the code. This is despite the fact that this version contains (implicit) thread synchronisation after every parallel loop. For the Cray compiler/run-time, some form of dynamic scheduling became optimal once inter-thread synchronisation was reduced using the NOWAIT qualifiers.
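The synchronisation-related transformations discussed above might be sketched as follows (a hedged illustration only: the kernel names and loop structure are placeholders, not the actual NEMOLite2D PSy layer):

```fortran
!$omp parallel default(shared), private(ji, jj)

! Serial section (e.g. a boundary-condition update) executed by the first
! thread to arrive; NOWAIT lets the others proceed into the next loop.
!$omp single
call flather_bc(...)
!$omp end single nowait

!$omp do schedule(runtime)   ! schedule chosen per compiler/run-time library
do jj = 2, jstop
  do ji = 2, jstop
    call momentum_code(ji, jj, ...)
  end do
end do
!$omp end do nowait          ! NOWAIT removes the implicit barrier at loop end

!$omp end parallel
```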

OpenACC
In contrast to the CPU work, we only had access to the PGI compiler when looking at OpenACC performance. We therefore investigate the effect of the various optimisations described in Section 2.3 for the full range of problem sizes. All of the reported performance figures were obtained on an NVIDIA Tesla K40m GPU running in 'boost' mode (enabled with the command nvidia-smi -ac 3004,875).
The performance of the various versions of the code is plotted in Figure 11. We do not show the performance of the OpenACC version of the original code as it was identical to that of the Vanilla PSyKAl version.
The smallest domain (64²) does not contain sufficient parallelism to fully utilise the GPU and only the parallelisation of the Flather kernels has much effect on performance. In fact, this is a significant optimisation for all except the largest domain size, where only changes to the Momentum kernel yield sizeable gains. Collapsing the 2D loop nests has a small but beneficial effect for the majority of problem sizes. Gains here are limited by the fact that this optimisation was only beneficial for the non-Momentum kernels and these account for only some 30% of the run-time.
Given the significance of the two Momentum kernels in the execution profile, it is no surprise that their optimisation yields the most significant gains. Using the shortloop directive gives roughly a 25% increase in performance for domains of 512² and greater. Fusing the two Momentum kernels gives a further boost of around 5%.
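These OpenACC transformations might be sketched as below (a hedged illustration: loop bounds, kernel names and exact clause placement are placeholders; collapse is standard OpenACC, while shortloop is a PGI-specific hint as discussed in Section 2.3):

```fortran
! Non-Momentum kernels: collapse the 2D loop nest into one parallel loop.
!$acc parallel loop collapse(2)
do jj = 2, jstop
  do ji = 2, jstop
    call continuity_code(ji, jj, ...)
  end do
end do

! Momentum section: in practice the u- and v-Momentum bodies are fused
! into a single kernel before applying the directives.
!$acc parallel loop
do jj = 2, jstop
  !$acc loop vector shortloop   ! PGI hint: inner trip count is short/fixed
  do ji = 2, jstop
    call momentum_code(ji, jj, ...)
  end do
end do
```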
Note that all of the previous optimisations are restricted to the PSy layer and in most cases simply involve adding directives, or clauses to directives. The question then arises as to the cost of restricting optimisations to the PSy layer: how does the performance of the OpenACC version of NEMOLite2D compare with a version where those restrictions are lifted? Figure 11 also shows the performance of the version of the code where the (fused) Momentum kernel has been replaced by a CUDA kernel. For domains up to 256² the difference in the performance of the two versions is less than 5% and for the larger domains it is at most 12%. This demonstrates that the performance cost of the PSyKAl approach in utilising a GPU for NEMOLite2D is minimal. The discrepancy in performance between the CUDA and OpenACC versions has been fed back to the PGI compiler team.
In Figure 12 we compare the absolute performance of the OpenMP and OpenACC implementations of NEMOLite2D across the range of problem sizes considered. The OpenMP figures are the maximum performance obtained from a whole Intel Ivy Bridge socket by any version of the code on any number of threads for a given compiler/run-time. For the smallest domain size (64²) the OpenMP version significantly outperforms the GPU because there is insufficient parallelism to fully utilise the GPU and one time step takes only 80 µs; the execution of a single time step is then dominated by the time taken to launch the kernels on the GPU rather than by the execution of the kernels themselves. Once the problem size is increased to 128², a single time step takes roughly 200 µs and only the Intel-compiled OpenMP version is comparable in performance to the OpenACC version. For all of the larger problem sizes plotted in Figure 12 the GPU version is considerably faster than the CPU. For problem sizes of 1024² and greater, the 30 MB cache of the Ivy Bridge CPU is exhausted and performance becomes limited by the bandwidth to main memory. At this stage the OpenACC version of the code on the GPU is some 3.4 times faster than the best OpenMP version on the CPU.

Performance Analysis with the Roofline Model
Although we have investigated how the performance of the PSyKAl version of NEMOLite2D compares with that of the original, we have not addressed how efficient the original actually is. Without this information we have no way of knowing whether further optimisations might yield worthwhile performance improvements. We therefore consider the performance of the original, serial NEMOLite2D code on the Intel Ivy Bridge CPU using the Roofline Model (Williams et al., 2009), which provides a relatively simple way of characterising the performance of a code in terms of whether it is memory-bandwidth bound or compute bound. To do so we follow the approach suggested in Andreolli et al. (2015) and construct a roofline model using the STREAM (McCalpin, 1995) and LINPACK (Dongarra, 1987) benchmarks in order to obtain appropriate upper bounds on the memory bandwidth and floating-point operations per second (FLOPS), respectively. Since we are using an Intel Ivy Bridge CPU we used the Intel Math Kernel Library implementation of LINPACK.
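The resulting bound is the standard roofline ceiling of Williams et al. (2009): attainable performance is the lesser of the compute peak and the product of memory bandwidth and arithmetic intensity,

```latex
P_{\mathrm{attainable}} = \min\bigl(P_{\mathrm{peak}},\; B_{\mathrm{peak}} \times \mathrm{AI}\bigr),
```

where here $P_{\mathrm{peak}}$ is taken from LINPACK, $B_{\mathrm{peak}}$ from STREAM, and AI is the arithmetic intensity defined next.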
A key component of the Roofline model is the Operational or Arithmetic Intensity (AI) of the code being executed:

AI = (number of floating-point operations) / (bytes fetched from memory)

We calculated this quantity manually by examining the source code and counting the number of memory references and arithmetic operations that it contained. In doing this counting we assume that any references to adjacent array elements (e.g. u(i, j) and u(i + 1, j)) are fetched in a single cache line and thus only count once.
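As a purely hypothetical illustration of this counting (the loop below is not one of the NEMOLite2D kernels): a double-precision update of the form t(i, j) = a*u(i, j) + b*u(i + 1, j) performs three floating-point operations; counting u(i, j) and u(i + 1, j) once (they share a cache line), it reads one 8-byte value and writes another, so

```latex
\mathrm{AI} = \frac{3\ \text{FLOPs}}{(1 + 1) \times 8\ \text{bytes}} = 0.1875\ \text{FLOPs/byte}.
```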
In Figure 13 we show the performance of kernels from both Shallow and NEMOLite2D on the roofline model for an Intel E5-1620 CPU. This demonstrates that the Shallow kernel achieves a performance roughly consistent with saturating the available memory bandwidth. In contrast, the kernel taken from NEMOLite2D struggles to reach a performance consistent with saturating the bandwidth to main memory. We experimented with reducing the problem size (so as to ensure it fitted within cache) but this did not significantly improve the performance of the kernel. This points to more fundamental issues with the way that the kernel is implemented, issues which are not captured by the simple roofline model.
In order to aid our understanding of kernel performance we have developed a tool, "Habakkuk" (https://github.com/arporter/habakkuk), capable of parsing Fortran code and generating a Directed Acyclic Graph (DAG) of the data flow.

Conclusions
The application of code transformations to the middle/PSy layer is key to the performance of the PSyKAl version of a code.
For both NEMOLite2D and Shallow we have found that, for serial performance, the most important transformation is in-lining of the kernel source at the call site, i.e. within the PSy layer. (Although we have done this in-lining manually for this work, our aim is that in future such transformations will be performed automatically at compile time and will therefore not affect the code that a scientist writes.) For the more complex NEMOLite2D code the Cray compiler also had to be coerced into performing SIMD vectorisation through the use of source-code directives.
In this work we have also demonstrated the introduction of parallelism into the PSy layer with both OpenMP and OpenACC directives. In both cases we were able to leave the natural-science parts of the code unchanged. For OpenMP we achieved a parallel efficiency of ∼70% on 12 threads by enclosing the body of the (single) PSy-layer routine within a single PARALLEL region. Removal of unnecessary synchronisation points through use of the NOWAIT clause boosted 12-thread performance by approximately 10% with the Gnu and Cray compilers. The PSyKAl restructuring of this (admittedly small) code was not found to pose any problems for the introduction of performant OpenMP. Similarly, we have also demonstrated good GPU performance using OpenACC in a PSyKAl version of the code.
This paper demonstrates that the PSyKAl separation of concerns may be applied to 2D finite-difference codes without loss of performance. We have also shown that the resulting code is amenable to efficient parallelisation on both GPU and shared-memory CPU systems. It is therefore possible to achieve performance portability while maintaining single-source science code.
Our next steps will be, first, to consider the automatic generation of the PSy layer and, second, to look at extending the approach to the full NEMO model (i.e. three dimensions). In future work we will analyse the performance of a domain-specific compiler that performs the automatic generation of the PSy layer. This compiler (which we have named 'PSyclone', see https://github.com/stfc/psyclone) is currently under development.

Figure 1 .
Figure 1. A schematic of the top level of the PSyKAl version of the NEMOLite2D code. The kernels listed as arguments to the invoke call specify the operations to be performed.
Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-150. Manuscript under review for journal Geosci. Model Dev. Discussion started: 13 July 2017. © Author(s) 2017. CC BY 4.0 License.

Figure 2. Serial performance of the original version of NEMOLite2D for each of the compilers considered.

Figure 3. Serial performance of the fastest PSyKAl version of NEMOLite2D for each compiler/problem-size combination.

Figure 4 .
Figure 4. Comparison of the performance of the best PSyKAl version with that of the original version of the code. A negative value indicates that the PSyKAl version is slower than the original.

Figure 5 .
Figure 5. Serial performance of the PSyKAl version of NEMOLite2D for the 256² domain at each stage of optimisation. The first (black) bar of each cluster gives the performance of the original version of NEMOLite2D for that compiler/CPU combination.

Figure 7 .
Figure 7. Performance of the OpenMP-parallel version of PSyKAl NEMOLite2D for the 256² domain with the Gnu compiler on a single Intel Ivy Bridge socket. The corresponding parallel efficiencies are shown using open symbols and dashed lines. The 24-thread runs employed hyperthreading and the optimal OpenMP schedule is given in parentheses.

Figure 8. Performance of the OpenMP-parallel version of PSyKAl NEMOLite2D for the Intel compiler on a single Intel Ivy Bridge socket. The corresponding parallel efficiencies are shown using open symbols and dashed lines. The 24-thread runs employed hyperthreading and the optimal OpenMP schedule is given in parentheses.

Figure 9. As Figure 8, but for the Cray compiler.

Figure 10 .
Figure 10. Performance of the OpenMP-parallel version of PSyKAl NEMOLite2D on one and two sockets of Intel Ivy Bridge. The 24-thread runs on a single socket used hyperthreading and the two-socket runs had the threads shared equally between the sockets.

Figure 11 .
Figure 11. Performance of the PSyKAl (OpenACC) and CUDA GPU implementations of NEMOLite2D. Optimisations are added to the code in an incremental fashion. The performance of the OpenACC version of the original code is not shown since it is identical to that of the Vanilla PSyKAl version.

Figure 12 .
Figure 12. Performance of the best OpenMP-parallel version of PSyKAl NEMOLite2D (on a single Intel Ivy Bridge socket) compared with the PSyKAl GPU implementation (using OpenACC).

Figure 13 .
Figure 13. Comparison of the performance achieved by kernels from NEMOLite2D and Shallow on a roof-line plot for the E5-1620 CPU. Results for the former are for the 256² domain since that gave the best performance. See the text for a discussion of the different CPU ceilings (dashed lines).
Habakkuk eases the laborious and error-prone process of counting memory accesses and FLOPs as well as providing information on those operations that are rate-limiting or on the critical path. Using the details of the Intel Ivy Bridge microarchitecture published by Fog (2016a, b) we have constructed performance estimates for the NEMOLite2D kernel. By ignoring all Instruction-Level Parallelism (ILP), i.e. assuming that all nodes in the DAG are executed serially, we obtain a lower-bound performance estimate of 0.6391 × CLOCK_SPEED FLOPS, which gives 2.46 GFLOPS at a clock speed of 3.85 GHz. Alternatively, we may construct an upper bound by assuming that the out-of-order execution engine of the Ivy Bridge core is able to perfectly schedule and pipeline all operations such that those that go to different execution ports always run in parallel.

Table 1 .
The compiler flags used in this work.

Table 2 .
The performance profile of the original and PSyKAl versions of NEMOLite2D on the Intel Ivy Bridge CPU (for 2000 time steps of the 128² domain and the Intel compiler).

Table 3 .
Performance (millions of points updated per second) on an Intel Ivy Bridge CPU for the 256² case after each code transformation. Where an optimisation utilises compiler-specific directives, performance figures for the other compilers are omitted.