We present an approach, which we call PSyKAl, designed to achieve portable performance for parallel finite-difference, finite-volume, and finite-element earth-system models. In PSyKAl the code related to the underlying science is formally separated from the code related to parallelization and single-core optimizations. This separation of concerns allows scientists to code their science independently of the underlying hardware architecture and allows optimization specialists to tailor the code for a particular machine, independently of the science code. We have taken the free-surface part of the NEMO ocean model and created a new shallow-water model named NEMOLite2D. The result is a code of manageable size that nonetheless incorporates key elements of full ocean models (input/output, boundary conditions, etc.). We have then manually constructed a PSyKAl version of this code and investigated the transformations that must be applied to the middle, PSy, layer in order to achieve good performance, both serial and parallel. We have produced versions of the PSy layer parallelized with both OpenMP and OpenACC; in both cases we were able to leave the natural-science parts of the code unchanged while achieving good performance on both multi-core CPUs and GPUs. In quantifying whether or not the obtained performance is “good” we also consider the limitations of the basic roofline model and improve upon it by generating kernel-specific CPU ceilings.
The challenge presented to the developers of scientific software by the drive
towards exascale computing is considerable. With power consumption becoming
the overriding design constraint, CPU clock speeds are falling and the
complex multi-purpose compute core is being replaced by multiple simpler
cores. This philosophy can be seen at work in the rise of so-called
accelerator-based machines in the Top 500 List of supercomputers
(
Achieving good performance on large numbers of light-weight cores requires exploiting as much parallelism in an application as possible, and this results in increased complexity in the programming models that must be used. This in turn increases the burden of code maintenance and code development, in part because two specialisms are required: that of the scientific domain which a code is modelling (e.g. oceanography) and that of computational science. The situation is currently complicated still further by the existence of competing hardware technologies; if one were to begin writing a major scientific application today it is unclear whether one would target GPU, Xeon Phi, traditional CPU, FPGA, or something else entirely. This is a problem because, generally speaking, these different technologies require different programming approaches.
In a previous paper
The PSyKAl approach attempts to address the problems described in the
previous section. It separates code into three layers: the Algorithm layer,
the PSy layer, and the Kernel layer. The approach has been developed in the
GungHo project
While the PSyKAl approach is general, we are currently applying it to atmosphere and ocean models written in Fortran where domain decomposition is typically performed in the latitude–longitude dimension, leaving columns of elements on each domain-decomposed partition.
The top layer, in terms of calling hierarchy, is the Algorithm layer. This layer specifies the algorithm that the scientist would like to perform (in terms of calls to kernel and infrastructure routines) and logically operates on full fields. We say logically here as the fields may be domain decomposed; however, the Algorithm layer is not aware of this. It is the scientist's responsibility to write this Algorithm layer.
The bottom layer, in terms of calling hierarchy, is the Kernel layer. The Kernel layer implements the science that the Algorithm layer calls, as a set of subroutines. These kernels operate on fields that are local to the process doing the computation. (Depending on the type of kernel, these may be a set of elements, a single column of elements, or a set of columns.) Again the scientist is responsible for writing this layer and there is no parallelism specified here but, depending on the complexity of the kernels, there may be input from a High-Performance Computing (HPC) expert and/or some coding rules to help ensure that the kernels compile into efficient code.
The PSy layer sits between the Algorithm and Kernel layers and its functional role is to link the algorithm calls to the associated kernel subroutines. As the Algorithm layer works on logically global fields and the Kernel layer works on local fields, the PSy layer is responsible for iterating over columns. It is also responsible for including any distributed-memory operations resulting from the decomposition of the simulation domain, such as halo swaps and reductions.
As the PSy layer iterates over columns, the single-core performance can be optimized by applying transformations such as manipulation of loop bounds (e.g. padding for single instruction multiple data (SIMD) instructions) and kernel in-lining. Additionally, the potential parallelism within this iteration space can also be exploited and optimized. The PSy layer can therefore be tailored for a particular hardware (such as multi-core, many-core, GPUs, or some combination thereof) and software (such as compiler, operating system, message passing interface (MPI) library, etc.) configuration with no change to the Algorithm or Kernel layer code. This approach therefore offers the potential for portable performance. In this work we apply optimizations to the PSy layer manually. The development of a tool to automate this process will be the subject of a future paper.
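By way of illustration only, a PSy-layer routine for a single point-wise kernel might look like the sketch below. The routine name, argument list, and loop bounds are schematic assumptions rather than actual GOcean code; the point is that the loop nest shown is where loop-bound changes, in-lining, and OpenMP or OpenACC directives are applied, leaving the kernel and Algorithm layers untouched.

! Schematic PSy-layer routine for one point-wise kernel (names assumed).
! The Algorithm layer passes in whole (locally held) fields; this routine
! owns the loops over grid points and dispatches to the kernel per point.
subroutine invoke_my_kernel(fld_out, fld_in, nx, ny)
  use iso_fortran_env, only: wp => real64
  use my_kernel_mod,   only: my_kernel_code   ! hypothetical kernel module
  implicit none
  integer,  intent(in)    :: nx, ny
  real(wp), intent(inout) :: fld_out(nx, ny)
  real(wp), intent(in)    :: fld_in(nx, ny)
  integer :: ji, jj
  ! All single-core and parallel optimizations (constant loop bounds,
  ! SIMD padding, kernel in-lining, OpenMP/OpenACC directives) are
  ! confined to this loop nest.
  do jj = 2, ny - 1
     do ji = 2, nx - 1
        call my_kernel_code(ji, jj, fld_out, fld_in)
     end do
  end do
end subroutine invoke_my_kernel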
Clearly the separation of code into distinct layers may have an effect on performance. This overhead, and the question of how to recover the performance of a parallel, hand-optimized code and potentially improve on it, will be discussed in the remainder of this paper.
This paper is concerned with the implications of PSyKAl as a design for code architecture. The implementation of an associated tool (which we have named “PSyclone”) for generating the middle, PSy, layer will be the subject of a future paper. However, comparison with other approaches necessarily involves discussing other tools rather than simply architectures.
As already mentioned, our approach is heavily influenced by the OP2 system
In the PSyKAl approach there is an implicit assumption that the majority of
the kernels in an application will be provided by the application developer.
In the GridTools
The PSyKAl, GridTools, and Firedrake approaches are all based on the concept
of (various flavours of) a domain-specific language (DSL) for
finite-difference and finite-element applications. This is distinct from
other, lower-level abstractions such as Kokkos
An overview of the functionality of similar approaches. Static compilation here means that all code is compiled before programme execution begins.
For this work we have used the NEMOLite2D programme, developed by ourselves
(
The external forcing includes surface wind stress, bottom friction, and
open-boundary barotropic forcing. A lateral-slip boundary condition is
applied along the coast lines. The open boundary condition can be set as a
clipped or Flather's radiation condition
The traditional Arakawa C structured grid is employed here for the
discretization of the computational domain. A two-dimensional integer array
is used to identify the different parts of the computational domain; it has
the value of 1 for ocean, 0 for land, and
For the sake of simplicity, the explicit Eulerian forward-time-stepping method is implemented here, except that the bottom friction takes a semi-implicit form. The Coriolis force can be set in explicit or implicit form. The advection term is computed with a first-order upwind scheme.
The sequence of the model computation is as follows:
(1) Set the initial conditions (water depth, sea surface height, velocity);
(2) integrate the continuity equation for the new sea surface height;
(3) update the different terms in the right-hand side of the momentum equations: advection, Coriolis forcing (if set in explicit form), pressure gradient, and horizontal viscosity;
(4) update the velocity vectors by summing up the values in (3) together with the implicitly treated bottom friction and Coriolis forcing (if set in implicit form);
(5) apply the boundary conditions on the open- and solid-boundary cells.
Since any real oceanographic computational model must output results, we ensure that any PSyKAl version of NEMOLite2D retains the Input/Output capability of the original. This aids in limiting the optimizations that can be performed on the PSyKAl version to those that should also be applicable to full oceanographic models. Note that although we retain the I/O functionality, all of the results presented in this work carefully exclude the effects of I/O since it is compute performance that interests us here.
In the Algorithm layer, fields (and grids) are treated as logically global objects. Therefore, as part of creating the PSyKAl version of NEMOLite2D, we represent fields with derived types instead of arrays in this layer. These types then hold information about the associated mesh and the extents of “internal” and “halo” regions as well as the data arrays themselves. This frees the natural scientist from having to consider these issues and allows for a certain degree of flexibility in the actual implementation (e.g. padding for alignment or increasing array extent to allow for other optimizations). The support for this is implemented as a library (which we term the GOcean Infrastructure) and is common to the PSyKAl versions of both NEMOLite2D and Shallow.
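A minimal sketch of what such a derived type might contain is given below. The component names are illustrative assumptions rather than the actual GOcean Infrastructure definitions, but they convey the idea that a field carries its data array together with its grid and region metadata.

! Illustrative field derived type (component names are assumptions).
module field_mod
  use iso_fortran_env, only: wp => real64
  implicit none

  type :: region_type
     integer :: xstart, xstop, ystart, ystop    ! loop bounds for a region
  end type region_type

  type :: grid_type
     integer :: nx, ny                          ! grid extents
     real(wp), allocatable :: dx(:,:), dy(:,:)  ! grid spacings
     integer,  allocatable :: tmask(:,:)        ! e.g. 1 = ocean, 0 = land
  end type grid_type

  type :: r2d_field
     type(grid_type), pointer :: grid => null() ! associated mesh
     type(region_type)        :: internal       ! "internal" points
     type(region_type)        :: whole          ! internal plus halo points
     real(wp), allocatable    :: data(:,:)      ! local data (possibly padded)
  end type r2d_field
end module field_mod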
In restructuring NEMOLite2D to conform to the PSyKAl separation of concerns
we must break up the computation into multiple kernels. The more of these
there are, the greater the potential for optimization of the PSy layer.
This restructuring gave eight distinct kernels, each of which updates a
single field at a single point (since we have chosen to use point-wise
kernels). With a little bit of tidying and restructuring, we found it was
possible to express the contents of the main time-stepping loop as a single
invoke (a call to the PSy layer) and a call to the I/O system
(Fig.
A schematic of the top-level of the PSyKAl version of the NEMOLite2D
code. The kernels listed as arguments to the
As with any full oceanographic model, boundary conditions must be applied at the edges of the model domain. Since NEMOLite2D applies external boundary conditions (e.g. barotropic forcing), this is done via user-supplied kernels.
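Relating this back to the single invoke mentioned above, the Algorithm-layer time-stepping loop can be sketched as follows. In the manually constructed PSyKAl version the invoke is simply a call to one PSy-layer subroutine; the field and routine names here are illustrative and not the exact NEMOLite2D argument list.

! Schematic Algorithm layer: one invoke (a call into the PSy layer) plus
! a call to the I/O system per time step. No loops over grid points,
! halo operations, or directives appear at this level.
do istep = nit000, nitend
   call invoke_time_step(istep, ssha, ssha_u, ssha_v,          &
                         sshn, sshn_u, sshn_v, un, vn, ua, va)
   call model_write(grid, istep, sshn, un, vn)
end do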
Our aim in this work is to achieve portable performance, especially between multi-core CPU and many-core GPU systems. Consequently, we have performed tests on both an Intel Ivy Bridge CPU (E5-2697 at 2.7 GHz) and on an NVIDIA Tesla K40 GPU. On the Intel-based system we have used the Gnu, Intel, and Cray Fortran compilers (versions 4.9.1, 15.0.0.090, and 8.3.3, respectively). The code that made use of the GPU was compiled using version 15.10 of the PGI compiler.
We first describe the code transformations performed for the serial version of NEMOLite2D. We then move on to the construction of parallel versions of the code using OpenMP and OpenACC. Again, we describe the key steps we have taken in this process in order to maximize the performance of the code. In both cases our aim is to identify those transformations which must be supported by a tool which seeks to auto-generate a performant PSy layer.
In Table
The compiler flags used in this work.
Before applying any code transformations, we first benchmark the original
serial version of the code. We also benchmark the unoptimized vanilla
version after it has been restructured following the PSyKAl approach. In
addition to this benchmarking, we profile these versions of the code at the
algorithm level (using a high-level timing API). The resulting profiles are
given in Table
The performance profile of the original and PSyKAl versions of
NEMOLite2D on the Intel Ivy Bridge CPU (for 2000 time steps of the
Beginning with the vanilla PSyKAl version, we then apply a series of code transformations while obeying the PSyKAl separation of concerns, i.e. optimization is restricted to the middle, PSy, layer and leaves the kernel and algorithm layers unchanged. The aim of these optimizations is to recover, as much as is possible, the performance of the original version of the code. The transformations we have performed and the reasons for them are described in the following sections.
In the vanilla PSy layer, the lower and upper bounds for each loop over
grid points are obtained from the relevant components of the derived type
representing the field being updated by the kernel being called from within
the loop. In our previous work
Many of the optimizations we have performed have been informed by the
diagnostic output produced by either the Cray or Intel compilers. Many of the
NEMOLite2D kernels contain conditional statements. These statements are there
to check whether, for example, the current grid point is wet or neighbours a boundary
point. A compiler is better able to optimize such a loop if it can be sure
that all array accesses within the body of the loop are safe for every trip,
irrespective of the conditional statements. In its diagnostic output the Cray
compiler notes this with messages of the form:
The profiling data in Table
From our previous work
Although the Intel compiler can perform in-lining when routines are in
separate source files, we have found (both here and in our previous work;
In fact, in-lining can have a significant effect on the Intel compiler's
ability to vectorize a loop. Taking the loop that calls the kernel for the
For the best possible performance, we have therefore chosen to do full, manual inlining for the two kernels making up the Momentum section.
It turns out that the Cray-compiled binaries of both the original and PSyKAl
versions of NEMOLite2D perform considerably less well than their
Intel-compiled counterparts. Comparison of the diagnostic output from each of
the compilers revealed that while the Intel compiler was happy to vectorize
the Momentum loops, the Cray compiler was choosing not to.
Having optimized the Momentum section as much as permitted by the PSyKAl
approach, we turn our attention to the three remaining sections of the code.
The profile data in Table
Comparison of the diagnostic output from the Cray and Intel compilers
revealed that the Cray compiler was vectorizing the Continuity section while
the Intel compiler reported that it was unable to do so due to dependencies.
After some experimentation we found that this was due to limitations in the
compiler's analysis of the way components of Fortran derived types were being
used. Each GOcean field object, in addition to the array holding the local
section of the field, contains a pointer to a GOcean grid object. If a kernel
requires grid-related quantities (e.g. the grid spacing) then these are
obtained by passing it a reference to the appropriate array within the grid
object. Although these grid-related quantities are read-only within a compute
kernel, if they were referenced from the same field object as that containing
an array to which the kernel writes then the Intel compiler identified a
dependency preventing vectorization. This issue was avoided simply by
ensuring that all read-only quantities were accessed via field objects that
were themselves read-only for the kernel at hand. For instance, the call to
the continuity kernel, which confused the Intel compiler, originally looked
like this:
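The original listing is not reproduced here; instead, the hypothetical sketch below illustrates the pattern described, with the grid-array and argument names assumed. In the first form a read-only grid quantity is reached through the field being written to; in the second it is reached through a field that is read-only for this kernel.

! Hypothetical sketch only (names assumed, not the actual NEMOLite2D code).
! Before: the read-only grid array is accessed via ssha, the field the
! kernel writes to, so the compiler assumes a possible dependency and
! declines to vectorize the enclosing loop.
call continuity_code(ji, jj, ssha%data, sshn%data, un%data, vn%data, &
                     ssha%grid%area_t)

! After: the same grid array is accessed via sshn, which is read-only for
! this kernel, so no dependency is inferred and the loop vectorizes.
call continuity_code(ji, jj, ssha%data, sshn%data, un%data, vn%data, &
                     sshn%grid%area_t)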
As with the Momentum kernel, we know that obtaining optimal performance from both the Gnu and Intel compilers requires that a kernel be manually in-lined at its call site. We do this for the Continuity kernel in this optimization step.
Having optimized the Continuity section we finally turn our attention to the Boundary Condition and Time-update sections. The kernels in these sections are small and dominated by conditional statements. We therefore limited our optimization of them to manually in-lining each of the kernels into the PSy layer.
The Time-update section includes several array copies where fields for the
current time step become the fields at the previous time step. Initially we
implemented these copies as “built-in” kernels (in the GOcean
infrastructure) as they are specified in the Algorithm layer. However, we
obtained better performance (for the Gnu and Intel compilers) by simply
manually in-lining these array copies into the PSy layer. As discussed in
Sect.
We shall see that the transformations we have just described do not always
result in improved performance. Whether or not they do so depends both on the
compiler used and the problem size. We also emphasize that the aim of these
optimizations is to make the PSy layer as compiler-friendly as possible,
following the lessons learned from our previous work with the Shallow code
We explore the extent to which performance depends upon the problem size by using square domains of dimension 64, 128, 256, 512, and 1024 for the traditional cache-based CPU systems. This range allows us to investigate what happens when cache is exhausted as well as giving us some insight into the decisions that different compilers make when optimizing the code.
For this part of the work we began with the optimized PSyKAl version of the code, as obtained after applying the various transformations described in the previous section. As with the transformations of the serial code, our purpose here is to determine the functionality required of a tool that seeks to generate the PSy layer.
The simplest possible OpenMP-parallel implementation consists of
parallelizing each loop nest in the PSy layer. This was done by
inserting an OpenMP PARALLEL DO directive before each loop nest so
that the iterations of the outermost or
The loop nest dealing with the application of the Flather boundary condition
to the y component of velocity ( Only once
this work was complete did we establish that boundary conditions are enforced
such that it can safely be executed in parallel.
Although very simple to implement, the use of separate PARALLEL DO
directives results in a lot of thread synchronization and can also
cause the team of OpenMP threads to be repeatedly created and
destroyed. This may be avoided by keeping the thread team in existence
for as long as possible using an OpenMP PARALLEL region. We therefore
enclosed the whole of the PSy layer (in this code, a single
subroutine) within a single PARALLEL region. The directive preceding
each loop nest to be parallelized was then changed to an OpenMP DO. We
ensured that the
When executing an OpenMP-parallel programme on a non-uniform memory access (NUMA) compute node it becomes important to ensure that the memory locations accessed by each thread are local to the hardware core upon which it is executing. One way of doing this is to implement a so-called “first-touch policy” whereby memory addresses that will generally be accessed by a given thread during programme execution are first initialized by that thread. This is simply achieved by using an OpenMP-parallel loop to initialize newly allocated arrays to some value, e.g. zero.
Since data arrays are managed within the GOcean infrastructure, this optimization can again be implemented without changing the natural-science code (i.e. the Algorithm and Kernel layers).
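A minimal sketch of this first-touch initialization, as it might appear inside the infrastructure's field constructor, is given below; the routine and variable names are illustrative.

! First-touch initialization (illustrative): each thread touches, and
! therefore places in its local NUMA memory, the portion of the array
! that it will later update in the parallel kernel loops.
subroutine init_field_first_touch(field_data, nx, ny)
  use iso_fortran_env, only: wp => real64
  implicit none
  integer,  intent(in)               :: nx, ny
  real(wp), allocatable, intent(out) :: field_data(:,:)
  integer :: ji, jj
  allocate(field_data(nx, ny))
  !$omp parallel do schedule(static), default(none), &
  !$omp   shared(field_data, nx, ny), private(ji, jj)
  do jj = 1, ny
     do ji = 1, nx
        field_data(ji, jj) = 0.0_wp
     end do
  end do
  !$omp end parallel do
end subroutine init_field_first_touch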
By default, the OpenMP END DO directive includes an implicit barrier, thus causing all threads to wait until the slowest has completed the preceding loop. Such synchronization limits performance at larger thread counts and, for the NEMOLite2D code, is frequently unnecessary. For example, if a kernel does not make use of the results of a preceding kernel call, then there is clearly no need for threads to wait between the two kernels.
We analysed the interdependencies of each of the code sections within the PSy layer and removed all unnecessary barriers by adding the NOWAIT qualifier to the relevant OpenMP END DO or END SINGLE directives. This reduced the number of barriers from 11 down to 4.
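Schematically, and with illustrative arrays and loop bodies standing in for the real kernels, the pattern is as follows: the barrier after a loop is dropped only when the next loop does not read what the preceding one wrote.

! Sketch of barrier removal with NOWAIT (illustrative arrays and bodies).
program nowait_sketch
  use iso_fortran_env, only: wp => real64
  implicit none
  integer, parameter :: n = 256
  real(wp) :: a(n,n), b(n,n), c(n,n)
  integer :: ji, jj
  a = 1.0_wp; b = 2.0_wp; c = 0.0_wp

  !$omp parallel default(none), shared(a, b, c), private(ji, jj)

  !$omp do schedule(runtime)
  do jj = 1, n                      ! "kernel" 1: writes a
     do ji = 1, n
        a(ji, jj) = a(ji, jj) + 1.0_wp
     end do
  end do
  !$omp end do nowait               ! next loop does not read a: no barrier

  !$omp do schedule(runtime)
  do jj = 1, n                      ! "kernel" 2: reads b, writes c
     do ji = 1, n
        c(ji, jj) = 2.0_wp * b(ji, jj)
     end do
  end do
  !$omp end do                      ! barrier kept if a later loop reads c

  !$omp end parallel
  print *, sum(a) + sum(c)
end program nowait_sketch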
As previously mentioned, the
In order to investigate how thread scheduling affects performance we used the “runtime” argument to the OpenMP SCHEDULE qualifier for all of our OpenMP parallel loops. The actual schedule to use can then be set at runtime using the OMP_SCHEDULE environment variable. We experimented with using the standard static, dynamic, and guided (with varying chunk size) OpenMP schedules.
The advantage of the PSyKAl restructuring becomes apparent if we wish to run NEMOLite2D on different hardware, e.g. a GPU. This is because the necessary code modifications are, by design, limited to the middle PSy layer. In order to demonstrate this and to check for any limitations imposed by the PSyKAl restructuring, we had an expert from NVIDIA port the Fortran NEMOLite2D to GPU. OpenACC directives were used as that approach is similar to the use of OpenMP directives and works well within the PSyKAl approach. In order to quantify any performance penalty incurred by taking the PSyKAl/OpenACC approach, we experimented with using CUDA directly within the original form of NEMOLite2D.
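Before discussing the individual steps, the overall directive-based pattern can be sketched as below. The arrays, loop bodies, and the choice of the parallel loop construct are illustrative only; the actual NEMOLite2D directives and clauses may differ.

! Illustrative OpenACC pattern: keep the fields resident on the GPU for
! the whole time-stepping loop and turn each PSy-layer loop nest into a
! GPU kernel.
program acc_sketch
  use iso_fortran_env, only: wp => real64
  implicit none
  integer, parameter :: n = 256, nsteps = 100
  real(wp) :: ssha(n,n), sshn(n,n)
  integer :: ji, jj, istep
  sshn = 1.0_wp; ssha = 0.0_wp

  !$acc data copy(sshn) create(ssha)
  do istep = 1, nsteps
     !$acc parallel loop collapse(2)
     do jj = 2, n - 1
        do ji = 2, n - 1
           ssha(ji, jj) = 0.25_wp * (sshn(ji-1, jj) + sshn(ji+1, jj) &
                                   + sshn(ji, jj-1) + sshn(ji, jj+1))
        end do
     end do
     !$acc end parallel loop

     !$acc parallel loop collapse(2)
     do jj = 2, n - 1
        do ji = 2, n - 1
           sshn(ji, jj) = ssha(ji, jj)
        end do
     end do
     !$acc end parallel loop
  end do
  !$acc end data
  print *, sshn(n/2, n/2)
end program acc_sketch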
Although the advent of technologies such as NVLink are alleviating the
bottleneck presented by the connection of the GPU to the CPU, it
remains critical to minimize data movement between the memory spaces
of the two processing units. In NEMOLite2D this is achieved by
performing all computation on the GPU. The whole time-stepping loop
can then be enclosed inside a single OpenACC
Moving the kernels to execute on the GPU was achieved by using the OpenACC
The PGI compiler was unable to determine whether the loops applying
the Flather boundary condition in the
When using the OpenACC
In contrast to all of the other NEMOLite2D kernels, the Momentum kernels (
Both of the Momentum kernels read from 16 double-precision arrays and
thus require considerable memory bandwidth. However, the majority of
these arrays are used by both the
We also experimented with using CUDA directly within the original form of
NEMOLite2D in order to quantify any performance penalty incurred by taking
the PSyKAl/OpenACC approach. To do this we used PGI's support for CUDA
Fortran to create a CUDA kernel for the (fused) Momentum kernel. The only
significant code change with this approach is the explicit set-up of the
grid- and block-sizes and the way in which the kernel is launched (i.e. use
of the
We first consider the performance of the code in serial and examine the
effects of the transformations described in Sect.
In Fig.
Summary of the performance of the original version of the NEMOLite2D code on an Intel Ivy Bridge CPU for the range of compilers under consideration.
Moving now to the PSyKAl version of NEMOLite2D,
Fig.
Summary of the best performance achieved by any PSyKAl version of NEMOLite2D for each of the compilers under consideration.
Figure
Comparison of the performance of the best PSyKAl version with that of the original version of the code. A negative value indicates that the PSyKAl version is slower than the original.
Having shown that we can recover, and often improve upon, the performance of
the original version of NEMOLite2D, the next logical step is to examine the
necessary code transformations in detail. We do this for the
Looking at the results for the Gnu compiler (and the Ivy Bridge CPU) first, all of the step increases in performance correspond to kernel in-lining. None of the other transformations had any effect on the performance of the compiled code. In fact, simply in-lining the two kernels associated with the Momentum section was sufficient to exceed the performance of the original code.
With the Intel compiler, the single largest performance increase is again due
to kernel in-lining (of the Momentum kernels). This is because the compiler
does a much better job of SIMD vectorizing the loops involved than it does
when it first has to in-line the kernel itself (as evidenced by its own
diagnostic output – see Sect.
The Cray compiler is distinct from the other two in that kernel in-lining
does not give any performance benefit and in fact, for the smaller kernels,
it can actually hurt performance. Thus the key transformation is to encourage
the compiler to SIMD vectorize the Momentum section via a compiler-specific
directive (without this it concludes that such vectorization would be
inefficient). Of the other transformations, only the change to constant loop
bounds and the addition of the compiler-specific
Performance (millions of points updated per second) on an Intel Ivy
Bridge CPU for the
Serial performance of the PSyKAl version of NEMOLite2D for the
We now turn to transformations related to parallelization of the NEMOLite2D code; the introduction of OpenMP and OpenACC directives. In keeping with the PSyKAl approach, we do not modify either the Algorithm- or Kernel-layer code. Any code changes are restricted to either the PSy (middle) layer or the underlying library that manages, for example, the construction of field objects.
As with the serial optimizations, we consider the effect of each of the
OpenMP optimization steps described in Sect.
In order to quantify the scaling behaviour of the different versions of
NEMOLite2D with the different compilers or runtime environments, we also plot
the parallel efficiency in Figs.
Since the space to explore consists of three different compilers, six
different domain sizes, five stages of optimization and six different thread
counts, we can only consider a slice through it in what follows. In order to
inform our choice of domain size, Fig.
The scaling behaviour of the most performant OpenMP-parallel version of PSyKAl NEMOLite2D for the full range of domain sizes considered on the CPU. Results are for the Intel compiler on a single Intel Ivy Bridge socket. The corresponding parallel efficiencies are shown using open symbols and dashed lines. The 24-thread runs employed hyperthreading.
Performance of the OpenMP-parallel version of PSyKAl NEMOLite2D for
the
The simplest OpenMP implementation (black lines, circle symbols) fails to
scale well for any of the compilers. For the Intel and Gnu versions, parallel
efficiency is already less than 50 % on just four threads. The Cray
version, however, does better and is about 45 % efficient on eight threads
(right axis, Fig.
With the move to a single PARALLEL region, the situation is greatly improved
with all three executables now scaling out to at least 12 threads with
In restricting ourselves to a single socket, we are keeping all threads
within a single NUMA region. It is therefore surprising that implementing a
“first-touch” policy has any effect and yet, for the Gnu- and
Intel-compiled binaries, it appears to improve performance when
hyperthreading is employed to run on 24 threads (green lines and diamond
symbols in Figs.
Performance of the OpenMP-parallel version of PSyKAl NEMOLite2D for the Intel compiler on a single Intel Ivy Bridge socket. The corresponding parallel efficiencies are shown using open symbols and dashed lines. The 24-thread runs employed hyperthreading and the optimal OpenMP schedule is given in parentheses.
Performance of the OpenMP-parallel version of PSyKAl NEMOLite2D for the Cray compiler on a single Intel Ivy Bridge socket. The corresponding parallel efficiencies are shown using open symbols and dashed lines. The 24-thread runs employed hyperthreading and the optimal OpenMP schedule is given in parentheses.
The final optimization step that we found to have any significant effect is to minimize the amount of thread synchronization by introducing the NOWAIT qualifier wherever possible (blue lines and upward-triangle symbols). For the Gnu compiler, this improves the performance of the executable on eight or more threads, while, for the Intel compiler, it only gives an improvement for the 24-thread case. Moving the SINGLE region before a parallel loop is marginally beneficial for the Gnu- and Cray-compiled binaries and yet reduces the performance of the Intel binary (purple lines and right-pointing triangles).
We have used different scales for the
Performance of the OpenMP-parallel version of PSyKAl NEMOLite2D on one and two sockets of Intel Ivy Bridge. The 24-thread runs on a single socket used hyperthreading and the two-socket runs had the threads shared equally between the sockets.
Since NEMO and similar finite-difference codes tend to be memory-bandwidth
bound, we checked the sensitivity of our performance results to this quantity
by benchmarking using two sockets of Intel Ivy Bridge (i.e. using a complete
node of ARCHER, a Cray XC30). For this configuration, we ensured that threads
were evenly shared over the two sockets. The performance obtained for the
A further complication is the choice of scheduling of the OpenMP
threads. We have investigated the performance of each of the
executables (and thus the associated OpenMP runtime library) with the
standard OpenMP
In contrast, some form of dynamic scheduling gave a performance improvement with the Gnu compiler/runtime even for the “first-touch” version of the code. This is despite the fact that this version contains (implicit) thread synchronization after every parallel loop. For the Cray compiler/runtime, some form of dynamic scheduling became optimal once inter-thread synchronization was reduced using the NOWAIT qualifiers.
In contrast to the CPU, we only had access to the PGI compiler when looking
at OpenACC performance. We therefore investigate the effect of the various
optimizations described in Sect.
The performance of the various versions of the code is plotted in
Fig.
Performance of the PSyKAl (OpenACC) and CUDA GPU implementations of NEMOLite2D. Optimizations are added to the code in an incremental fashion. The performance for the OpenACC version of the original code is not shown since it is identical to that of the vanilla PSyKAl version.
The smallest domain (
Given the significance of the two Momentum kernels in the execution profile,
it is no surprise that their optimization yields the most significant gains.
Using the
Note that all of the previous optimizations are restricted to the PSy layer
and in most cases are simply a case of adding directives or clauses to
directives. However, the question then arises as to the cost of restricting
optimizations to the PSy layer. In particular, how does the performance of
the OpenACC version of NEMOLite2D compare with a version where those
restrictions are lifted? Figure
In order to check the efficiency of the CUDA implementation, we profiled the
code on the GPU. For large problem sizes this showed that the non-Momentum
kernels are memory-bandwidth bound and account for about 40 % of the
runtime. However, the Momentum kernels were latency limited, getting
All of this demonstrates that the performance cost of the PSyKAl approach in utilizing a GPU for NEMOLite2D is minimal. The discrepancy in performance between the CUDA and OpenACC versions has been fed back to the PGI compiler team.
In Fig.
Performance of the best OpenMP-parallel version of PSyKAl NEMOLite2D (on a single Intel Ivy Bridge socket) compared with the PSyKAl GPU implementation (using OpenACC).
Once the problem size is increased to
Although we have investigated how the performance of the PSyKAl version of
NEMOLite2D compares with that of the original, we have not addressed how
efficient the original actually is. Without this information we have no way
of knowing whether further optimizations might yield worthwhile performance
improvements. Therefore, we consider the performance of the original serial
NEMOLite2D code on the Intel Ivy Bridge CPU and use the Roofline model
A key component of the Roofline model is the operational or arithmetic
intensity (AI) of the code being executed:
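For completeness, the standard definition is

\[ \mathrm{AI} = \frac{\text{number of floating-point operations performed}}{\text{number of bytes moved to and from memory}}, \]

and the corresponding roofline bound on the attainable performance of a kernel is \( \min\bigl(P_{\text{peak}},\; \mathrm{AI}\times B_{\text{peak}}\bigr) \), where \(P_{\text{peak}}\) is the peak floating-point performance of the processor and \(B_{\text{peak}}\) its peak memory bandwidth.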
Comparison of the performance achieved by kernels from NEMOLite2D
and Shallow on a roofline plot for the E5-1620 CPU. Results for the former
are for the
In Fig.
In order to aid our understanding of kernel performance, we have developed a
tool, “Habakkuk” (
Alternatively, we may construct an upper bound by assuming the out-of-order
execution engine of the Ivy Bridge core is able to perfectly schedule and
pipeline all operations such that those that go to different execution ports
are always run in parallel. In the Ivy Bridge core, floating-point
multiplication and division operations go to port 0 while addition and
subtraction go to port 1
These performance estimates are plotted as CPU ceilings (coloured, horizontal lines) in
Fig.
Enabling SIMD vectorization for this kernel does not significantly improve
its performance (Fig.
We have investigated the application of the PSyKAl separation of concerns approach to the domain of shared-memory, parallel, finite-difference shallow-water models. This approach enables the computational-science-related aspects (performance) of a computer model to be kept separate from the natural-science aspects (oceanography).
We have used a new and unoptimized two-dimensional model extracted from the
NEMO ocean model for this work. As a consequence of this, the introduction of
the PSyKAl separation of concerns followed by suitable transformations of the
PSy layer is actually found to improve performance. This is in contrast to
our previous experience
Investigation of the absolute serial performance of the NEMOLite2D code using the Roofline model revealed that it was still significantly below any of the traditional roofline ceilings. We have developed Habakkuk, a code-analysis tool that is capable of providing more realistic ceilings by analysing the nature of the floating-point computations performed by a kernel. The bounds produced by Habakkuk are in good agreement with the measured performance of the principal (Momentum) kernel in NEMOLite2D. In future work we aim to extend this tool to account for SIMD operations and make it applicable to code parallelized using OpenMP.
The application of code transformations to the middle, PSy, layer is key to the performance of the PSyKAl version of a code. For both NEMOLite2D and Shallow we have found that for serial performance, the most important transformation is that of in-lining the kernel source at the call site, i.e. within the PSy layer. (Although we have done this in-lining manually for this work, our aim is that, in future, such transformations will be performed automatically at compile-time and therefore do not affect the code that a scientist writes.) For the more complex NEMOLite2D code, the Cray compiler also had to be coerced into performing SIMD vectorization through the use of source-code directives.
In this work we have also demonstrated the introduction of parallelism into
the PSy layer with both OpenMP and OpenACC directives. In both cases we were
able to leave the natural-science parts of the code unchanged. For OpenMP we
achieved a parallel efficiency of
This paper demonstrates that the PSyKAl separation of concerns may be applied to 2-D finite-difference codes without loss of performance. We have also shown that the resulting code is amenable to efficient parallelization on both GPU and shared-memory CPU systems. This then means that it is possible to achieve performance portability while maintaining single-source science code.
Our next steps will be, first, to consider the automatic generation of the
PSy layer and, second, to look at extending the approach to the full NEMO
model (i.e. three dimensions). In future work we will analyse the performance
of a domain-specific compiler that performs the automatic generation of the
PSy layer. This compiler (which we have named “PSyclone”, see
The NEMOLite2D Benchmark Suite 1.0 is available from the
eData repository at:
HL wrote the original NEMOLite2D, based on code from the NEMO ocean model. AP and RF restructured NEMOLite2D following the PSyKAl separation of concerns. AP performed the CPU performance optimizations. JA ported NEMOLite2D to OpenACC and optimized its GPU performance. AP prepared the manuscript with contributions from all co-authors.
The authors declare that they have no conflict of interest.
This work made use of the ARCHER UK National Supercomputing Service
(