Improving ocean modelling software NEMO 4.0 benchmarking and communication efficiency

Communications in distributed-memory supercomputers still limit the scalability of geophysical models. Considering the recent trends of the semiconductor industry, we think this problem is here to stay. We present the optimisations that have been implemented in the 4.0 reference version of the ocean model NEMO to improve its scalability. Thanks to the collaboration of oceanographers and HPC experts, we identified and removed unnecessary communications in two bottleneck routines: the computation of the free surface pressure gradient and the forcing in the straits or unstructured open boundaries. Since a wrong parallel decomposition choice could undermine computing performance, we impose its automatic definition in all cases, including when subdomains containing only land points are excluded from the decomposition. For a smaller audience of developers and vendors, we propose a new benchmark configuration that is easy to use while offering the full complexity of operational versions.

Note that land only subdomains can be suppressed from the computational domain. The number of MPI tasks is, in this case, smaller than jpni × jpnj (see Figure 8.6 in Madec and Team).
The choice of the domain decomposition proposed by default in NEMO until version 3.6 was very basic, as 2 was the only prime factor considered when looking for divisors of the number of MPI tasks! Tintó et al. (2017) underlined this deficiency.
In their Figure 4, they demonstrated that the choice of the domain decomposition is a key factor in obtaining the appropriate model scalability. In their test with the ORCA025 configuration on 4096 cores, the number of simulated years per day is almost doubled when using their optimum domain decomposition instead of the default one. Benchmarking NEMO with the default domain decomposition would therefore be completely misleading. Tintó et al. (2017) proposed a strategy to optimize the choice of the domain decomposition in a preprocessing phase. Finding the best domain decomposition is thus the starting point of any work dedicated to NEMO numerical performance.
We detail in this section how we implemented a similar approach, but online, in the initialization phase of NEMO. Our idea is to propose, by default, the best domain decomposition for a given number of MPI tasks and thus avoid the waste of CPU resources by non-expert users.

Optimal domain decomposition search algorithm
Our method is based on the minimization of the size of the MPI subdomains, taking into account the fact that land only subdomains can be excluded from the computation. The algorithm we wrote can be summarized in 3 steps (a short sketch follows the list):

1. Get the land fraction (LF: the total number of land points divided by the total number of points, thus between 0 and 1) of the configuration we are running. The land fraction provides the maximum number of subdomains that could potentially be removed from the computational domain. If we want to run the model on N_mpi processes, we must look for a domain decomposition generating a maximum of N_sub_max = N_mpi/(1 − LF) subdomains, as we won't be able to remove more than N_mpi × LF land only subdomains.

2. Provide the best domain decompositions, defined by the following rules: (a) a decomposition generates a maximum of N_sub_max subdomains, (b) it gives the smallest subdomain size for a given number N_sub of subdomains and (c) no other domain decomposition with fewer subdomains has a subdomain size smaller than or equal to it. This last constraint requires in fact that we build the list of best domain decompositions incrementally, from N_sub = 1 to N_sub = N_sub_max.

3. Having this list, take the largest value of N_sub that is listed and check whether, once its land only subdomains are removed, the number of remaining ocean subdomains is lower than or equal to N_mpi. If it is not the case, discard this choice and test the next domain decomposition listed (in decreasing order of number of subdomains) until it fits the limit of N_mpi processes.
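To make the three steps concrete, here is a minimal Python sketch of the selection logic (the actual implementation is in Fortran inside NEMO's initialization); `best_decompositions` stands for the sorted list built in step 2 and `count_land_only` for the mask counting of step 3, both assumptions of this illustration.

```python
# Illustrative sketch (not the NEMO Fortran code) of the three-step selection.
# `best_decompositions` is assumed sorted by increasing number of subdomains;
# `count_land_only(jpni, jpnj)` is a hypothetical helper (see step 3).

def choose_decomposition(n_mpi, land_fraction, best_decompositions,
                         count_land_only):
    # Step 1: upper bound on the subdomains we may generate, since at most
    # n_mpi * land_fraction land-only subdomains can be removed.
    n_sub_max = int(n_mpi / (1.0 - land_fraction))

    # Step 3: scan candidates from the largest number of subdomains down,
    # keeping the first one whose ocean subdomains fit into n_mpi tasks
    # (if fewer ocean subdomains than n_mpi, some land-only ones are kept).
    candidates = [(jpni, jpnj) for (jpni, jpnj) in best_decompositions
                  if jpni * jpnj <= n_sub_max]
    for jpni, jpnj in reversed(candidates):
        n_ocean = jpni * jpnj - count_land_only(jpni, jpnj)
        if n_ocean <= n_mpi:
            return jpni, jpnj
    raise ValueError("no decomposition fits the requested task count")
```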
Note that we could imagine that, in a few cases, a non-optimal domain decomposition would allow the removal of more land only subdomains than the selected optimal decomposition and would become a better choice. Taking this possibility into account would increase tenfold the number of domain decompositions to be tested, which would make the selection extremely costly and impossible to use. We consider that the probability of facing such a case becomes extremely unlikely as we increase the number of subdomains (which is the usual target), as smaller subdomains fit the coastline better. We therefore ignore this possibility and consider only optimal decompositions when looking at land only subdomains. If the optimal decomposition found has a number of ocean subdomains N_sub smaller than N_mpi, we print a warning message saying that the user provides more MPI tasks than needed and we simply keep (N_mpi − N_sub) land only subdomains in the computational domain in order to fit the required number of N_mpi MPI tasks. This simple trick, which may look quite useless for simple configurations, is in fact required when using AGRIF (Debreu et al., 2008) because (as implemented today in NEMO) parent and child domains share the same number of MPI tasks, which can therefore rarely be optimized for each domain at the same time.
The next two subsections detail the key parts of steps (1) and (3), and of step (2).

Getting land-sea mask information
We need to read the global land-sea information for 2 purposes: getting the land fraction (step 1) and finding the number of land only subdomains in a given domain decomposition (step 3). By default, reading NetCDF configuration files in NEMO can be trivially done by means of a dedicated Fortran module ("iom"). However, in this specific context, things must be done more carefully to avoid a one-off but potentially extremely large memory allocation associated with a massive disk access. Up to and including revision 3.6, each MPI process was reading the whole global land-sea mask, which is clearly not the proper strategy when aiming at running very large domains on large numbers of MPI processes with limited memory. The difficulty here lies in minimizing the memory used to get the needed information from the global land-sea mask. To overcome this issue, we dedicate some processes to reading only horizontal stripes of the land-sea mask file. This solution requires less memory and ensures the contiguity of the data to be read, which optimizes the reading process.
When computing the land fraction for step (1), we need the total number of ocean points, and we use as many processes as available to read the file, with 2 limits: each process must read at least 2 lines, and we use no more than 100 processes, to avoid saturating the file system with too many processes accessing the same input file (an arbitrary value that could be changed). The number of oceanic points in each stripe is then summed and shared among processes through a collective communication.
When looking for the land only subdomains in a jpni × jpnj domain decomposition for step (3), we need the number of ocean points in each of the jpni × jpnj subdomains. In this case, we read jpnj stripes of the land-sea mask corresponding to bands of subdomains. This work is distributed among a maximum of jpnj processes accessing the input file concurrently.
Each of these processes reads sequentially one or several stripes of data and communicates only the number of ocean points for the jpni subdomains included in the data stripes loaded in memory.
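The following Python sketch illustrates the stripe-based counting under the assumptions of this section; `read_rows` is a hypothetical helper standing in for the NetCDF stripe read, and the exact NEMO bookkeeping is not reproduced.

```python
import numpy as np

# Hypothetical sketch of the stripe-based counting: a reader process loads
# one band of land-sea mask rows and reduces it to jpni per-subdomain ocean
# point counts, so the full mask is never held in memory by one process.
# `read_rows(j0, j1)` is an assumed helper returning rows j0:j1 as a 2D
# array (1 = ocean, 0 = land); `i_edges` holds the subdomain bounds in i.

def ocean_points_in_stripe(read_rows, i_edges, j0, j1):
    stripe = read_rows(j0, j1)               # only this band is in memory
    return [int(stripe[:, i0:i1].sum())      # ocean points per subdomain
            for i0, i1 in zip(i_edges[:-1], i_edges[1:])]
```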

Getting the best domain decompositions sorted from 1 to N_sub_max subdomains
The second step of our algorithm starts with the simple fact that domain decomposition uses Euclidean division: the division of the horizontal domain by the number of processes along the i and j directions rarely results in whole numbers. In consequence, some MPI subdomains will potentially be 1 point larger in the i and/or j direction than others. Increasing the number of processes does not always reduce the size of the largest MPI subdomains, especially when using a large number of processes compared to the global domain size. Table 1 illustrates this point with a simple example: a 1D domain of 10 points (jpiglo) distributed among jpni tasks, with jpni ranging from 1 to 9. Because of the halos required for MPI communications, the total domain size is (jpiglo + 2 × jpni − 2), which must be divided by jpni. One can see that using jpni = 4 to 7 will always give the same size for the largest subdomains: 4. Only specific values of jpni (1, 2, 3, 4 and 8) will result in a reduction of the largest subdomain size; these are the optimized values of jpni that should be chosen. Using other values would simply increase the number of smaller MPI subdomains without affecting the size of the largest subdomain. This result must be extended to the 2D domain decomposition used in NEMO. When searching for all couples (jpni, jpnj) that should be considered when looking for the optimal decomposition, we can quickly reduce their number by selecting jpni and jpnj only among the values suitable for the 1D decomposition of jpiglo along the i direction and jpjglo along the j direction.
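The example of Table 1 can be checked with a few lines of Python, assuming the largest subdomain size is the ceiling of the Euclidean division described above:

```python
from math import ceil

# Reproduces the 1D example of Table 1: a 10-point domain (jpiglo), with
# 2 halo points added per internal subdomain boundary, split over jpni tasks.
jpiglo = 10
for jpni in range(1, 10):
    largest = ceil((jpiglo + 2 * jpni - 2) / jpni)
    print(jpni, largest)
# Prints 10, 6, 5, 4, 4, 4, 4, 3, 3: only jpni = 1, 2, 3, 4 and 8 reduce
# the largest subdomain size and are worth considering.
```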

The number of these values corresponds roughly to the number of divisors of jpiglo and jpjglo, which can be approximated by 2√jpiglo and 2√jpjglo. This first selection thus provides about (2√jpiglo) × (2√jpjglo) = 4√(jpiglo × jpjglo) couples (jpni, jpnj) instead of the (jpiglo × jpjglo) couples that could be defined by default. Next, we discard from the list of couples (jpni, jpnj) all cases ending up with more subdomains than the N_sub_max provided by the previous step.
The final part of this second step is to build the list of optimal decompositions, each one defined by a couple (jpni, jpnj), with a number of subdomains (jpni × jpnj) ranging from 1 to a maximum of N_sub_max. This work is done with an iterative algorithm starting with the couple (jpni, jpnj) = (1, 1). The recurrence relation to find element N + 1, knowing element N of the list of optimal decompositions, is the following: first, we keep only the couples (jpni, jpnj) whose maximum subdomain size is smaller than the maximum subdomain size found for element N. Next, we define element N + 1 as the couple (jpni, jpnj) that gives the smallest number of subdomains (jpni × jpnj). It rarely happens that several couples (jpni, jpnj) correspond to this definition. In this case, we decided to keep the couple with the smaller subdomain perimeter, so that the volume of data exchanged during the communications is minimized. This choice is quite arbitrary, as NEMO scalability is limited by the number of communications and not by their volume. Limiting the subdomain perimeter should therefore have a very limited effect. We stop the iteration process once there is no remaining couple with a subdomain size smaller than the one selected at rank N.
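A minimal sketch of this recurrence, under the assumptions above, could look as follows; `pairs` is the pre-selected list of couples, and `size_of` and `perim_of` are hypothetical helpers returning the largest subdomain size and the subdomain perimeter of a couple:

```python
# Minimal sketch (not the NEMO code) of building the list of optimal
# decompositions by the recurrence described in the text.

def optimal_list(pairs, size_of, perim_of):
    best, current_size = [], float("inf")
    while True:
        # keep only couples strictly improving on the last selected size
        remaining = [p for p in pairs if size_of(*p) < current_size]
        if not remaining:
            return best                 # no smaller subdomain size exists
        # element N+1: fewest subdomains, ties broken by smaller perimeter
        nxt = min(remaining, key=lambda p: (p[0] * p[1], perim_of(*p)))
        best.append(nxt)
        current_size = size_of(*nxt)
```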

Once this list of the best domain decompositions sorted from 1 to N_sub_max subdomains is established, we just have to follow the third step of our algorithm to get the best subdomain decomposition (see Section 2.1.1).

The BENCH configuration
Once the best domain decomposition is found, exploring the numerical performance of the code requires a proper benchmark that is easy to install and configure but, at the same time, keeps the code usage close to the production mode.

The NEMO framework proposes various configurations based on a core ocean model: e.g. global (ORCA family) or regional grids, with different vertical and horizontal spatial resolutions. Different components can moreover be added to the ocean dynamical core: sea-ice (i.e. SI3) and/or bio-geo-chemistry (i.e. TOP-PISCES). A comprehensive performance analysis must thus be able to scrutinise the subroutines of every module set to deliver a relevant message about the computing performance of the NEMO model to its whole user community. On the other hand, as pointed out by the former RAPS consortium (Mozdzynski, 2012), such a benchmarking exercise must be kept simple, since it is often done by people with basic knowledge of NEMO or physical oceanography (e.g. HPC experts and hardware vendors).
We detail in this section a new NEMO configuration we specifically developed to simplify future benchmarking activities by responding to a double constraint: (1) be light and trivial to use and (2) allow the testing of any NEMO configuration (size, periodicity pattern, components, physical parameterizations...). In contrast to the dwarf concept (Müller et al., 2019), this configuration encompasses the full complexity of the model and helps to address the current issues of the community. This BENCH configuration was used in Maisonnave and Masson (2019) to assess the performance of the global configurations (ORCA family). We use it in the current study to continue this work and now improve the performance of the regional configurations.

This new configuration, called BENCH, is made available in the tests/BENCH directory of the NEMO distribution.
BENCH is based on a flat-bottom rectangular cuboid, which allows us to bypass any input configuration file and gives the possibility to define the domain dimensions simply via namelist parameters (nn_isize, nn_jsize and nn_ksize in namusr_def).
Note that the horizontal grid resolution is fixed (100 km) whatever grid size is defined, which limits the growth of instabilities and allows us to keep the same time step length in all tests.

Initial conditions and forcing fields do not correspond to any reality and have been specifically designed (1) to ensure maximum robustness and (2) to facilitate BENCH handling, as they do not require any input file. In consequence, BENCH results are meaningless from a physical point of view and should be used only for benchmarking purposes. The model temperature and salinity are almost constant everywhere, with a light stratification which keeps the model vertically stable. We add, at each point of each horizontal level, a very small perturbation which keeps the solution very stable but lets the oceanic adjustment processes occur, keeping the associated amount of calculations at a realistic level. This perturbation also guarantees that each point of the domain has a unique initial value of temperature and salinity, which facilitates the detection of potential MPI issues. We apply a zero forcing at the surface, except for the wind stress, which is small and slowly spatially varying.
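The idea of a stable yet point-unique initial state can be sketched as follows; the formula and amplitudes here are illustrative assumptions and not the actual BENCH values:

```python
import numpy as np

# Hedged sketch of a BENCH-like initial temperature field: a light vertical
# stratification plus a tiny perturbation unique at every grid point, small
# enough not to disturb the stratification (all numbers are hypothetical).
ni, nj, nk = 32, 24, 10
k = np.arange(nk)                             # vertical level index
base = 10.0 - 0.1 * k                         # stable stratification
unique = 1.0e-6 * np.arange(ni * nj * nk).reshape(ni, nj, nk)
temperature = base[np.newaxis, np.newaxis, :] + unique  # unique everywhere
```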
To make it as simple as possible to use, the BENCH configuration does not need any input files, so that a full simulation can be run without downloading anything but the source code. In addition, the lack of input (or output) files prevents any disk access from perturbing our performance measurements. However, output can be activated as in any NEMO simulation, for example to make possible a performance evaluation of the XIOS (Meurdesoif, 2018) output suite.

BENCH flexibility
Any NEMO numerical schemes and parameterizations can be used in BENCH. To help users in their namelist choices and to test diverse applications, a selection of namelist parameters is provided with BENCH to give the benchmarks the numerical properties corresponding to popular global configurations: ORCA1, ORCA025 and ORCA12.
The SI3 sea-ice and TOP-PISCES modules can be activated or not by choosing the appropriate namelist parameters and CPP keys when compiling. The initial temperature/salinity definition was designed in such a way that, if SI3 is activated in the simulation, the sea ice will cover about 1/5 of the domain, which corresponds approximately to the annual fraction of ORCA ocean grid points covered by sea ice.

Any closed or periodic conditions can be used in BENCH and specified through a namelist parameter (nn_perio in namusr_def). Users can, for example, use closed boundaries, East-West periodic conditions, bi-periodic conditions (nn_perio = 7, to make sure that all MPI subdomains have exactly 4 neighbours) or even the north pole folding (nn_perio = 4 or 6) used in the global configuration ORCA with its bi-polar grid in the northern hemisphere. The specific periodic conditions of the ORCA global grids have a big impact on performance, which motivated the possibility of using them in the new BENCH configuration: a simple change of the nn_perio definition gives BENCH the ORCA periodic characteristics and reproduces the same communication pattern between subdomains located on the northernmost part of the grid.
The previous section describes the main BENCH features that are set by default, but we must keep in mind that each of them can be modified through namelist parameters if needed. By default, BENCH does not use any input file to define the configuration, the initial conditions and the forcing fields. We could however decide to specify some input files if this were required by the functionality to be tested.

BENCH grid size, MPI domain decomposition and land only subdomains
As in any other NEMO configuration, BENCH computations can be distributed over a set of MPI processes. The BENCH grid size is defined in the namelist with 2 different options. If nn_isize and nn_jsize are positive, they simply define the total grid size on which the domain decomposition will be applied. This is the usual case: the size of each MPI subdomain is comparable (but not necessarily equal) for each MPI task and computed by the code according to the chosen pattern of domain decomposition.
If nn_isize and nn_jsize are negative, the absolute values of these parameters no longer define the global domain size but the MPI subdomain size. In this case, the size of the global domain is computed by the code to fit the prescribed subdomain size. This option forces each MPI subdomain to have exactly the same size whatever the number of MPI processes used, which facilitates the measurement of model weak scalability.
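The two sizing modes can be summarised by the following hedged illustration (not NEMO code; NEMO's exact halo bookkeeping is ignored here):

```python
# Illustration of the two sizing modes: positive values set the global grid
# directly; negative values set the size of every MPI subdomain, so the
# global domain grows with the decomposition (weak-scaling setup).

def bench_global_size(nn_isize, nn_jsize, jpni, jpnj):
    if nn_isize > 0 and nn_jsize > 0:
        return nn_isize, nn_jsize               # usual case: global size given
    return abs(nn_isize) * jpni, abs(nn_jsize) * jpnj   # weak-scaling case
```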

In NEMO, calculations are, in a large majority, processed at each grid point, and a mask is then applied to take into account land grid points, if any. Consequently, the amount of calculation is the same with or without bathymetry, and the computing performance of BENCH and of a realistic configuration are extremely close. However, we must underline that the absence of continents in the default usage of BENCH prevents testing the removal of subdomains entirely covered by land points (Molines, 2004). Note that if we want to test a realistic bathymetry and the removal of land only subdomains in BENCH, we can use any NEMO configuration input file (as in all NEMO configurations). In this case, we just have to define ln_read_cfg = .true. and provide the configuration file name in the variable cn_domcfg in the namelist block namcfg.

(Figure caption: BENCH configuration, same grid as the global ocean at 1 degree resolution (ORCA1), including SI3 sea-ice subroutines, without any bathymetry (standard case, red line) or with land only subdomains removal (dark blue line); in each case, the time spent waiting for boundary conditions is also plotted (resp. orange and light blue).)
A comparison of the BENCH scalability without (named STD) and with land only subdomains removal (named LSR) was performed to assess the impact of the removal.
The subdomain sizes of configurations that do or do not remove land only subdomains differ if they are decomposed with the same number of MPI processes; a fair comparison of computing performance must therefore be done at identical subdomain size. The comparison of waiting times in STD and LSR helps to understand the origin of the small overall performance discrepancy. Most of it comes from the shorter waiting time in LSR, which can be explained by the smaller amount of communication between processes: in this configuration, some subdomains have boundaries with land only subdomains and thus no communication across them. However, the scalability regimes, slopes and limits in STD and LSR, for computations as for communications, are practically the same. As already mentioned (Ticco et al., 2020), BENCH is able to reproduce the computational performance of any usual NEMO configuration, and gives a simplified alternative for benchmarking work.
Since the 4.0 version release, the BENCH configuration has been tested on various platforms and has shown good numerical stability properties (Maisonnave and Masson, 2019).

Dedicated tool for communication cost measurement
The last piece of the puzzle that we implemented to set up an effective benchmarking environment in NEMO is a tool that collects and summarizes information related to the performance of the MPI part of the code.
In NEMO, MPI communications are wrapped in a small number of high-level routines co-located in a single Fortran file (lib_mpp.F90). These routines provide functionalities for horizontal halo exchanges and for global averages or min/max operations between all horizontal subdomains. With very few code changes in this file, it is possible to identify and characterise the whole MPI communication pattern. This instrumentation does not replace external solutions based on automatic instrumentation, e.g.
with "Extrae-Paraver" (Prims et al., 2018) or "Intel Trace Analyzer and Collector" (Koldunov et al., 2019), which provide a comprehensive timeline of the exchanges. The amount of information collected by our solution is much smaller, and the possibilities of analysis reduced, but we are able to deliver without any external library, without additional computing cost or 250 additional postprocessing, a simplified information for any kind of machine (portability).
A counter of the time spent in MPI routines (i.e. the "waiting time", as defined in Subsection 2.2.3) is incremented at any call of an MPI send or receive operation on the one hand, and at any MPI collective operation on the other hand. Symmetrically, a second counter is filled outside these MPI-related operations. On each process, only two values are collected, in two floats: the cumulative time spent sending/receiving/gathering or waiting for MPI messages and the complementary period spent in other operations (named "computation time").
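The two-counter idea can be sketched in a few lines of Python (the real instrumentation lives in lib_mpp.F90 in Fortran; this is only an illustration of the principle):

```python
import time
from functools import wraps

# Sketch of the two-counter instrumentation: every wrapped MPI operation
# adds to the "waiting" counter, and everything in between is counted as
# computation time. `timed_mpi` would decorate the MPI wrapper routines.

counters = {"waiting": 0.0, "computing": 0.0}
_last = time.perf_counter()

def timed_mpi(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        global _last
        now = time.perf_counter()
        counters["computing"] += now - _last   # time since the last MPI call
        result = fn(*args, **kwargs)
        _last = time.perf_counter()
        counters["waiting"] += _last - now     # time inside the MPI call
        return result
    return wrapper
```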
After a few time steps, we are able to produce, in a dedicated output file called communication_report.txt, the listing of the subroutines exchanging halos and how many times they did so, on the one hand, and the listing of the subroutines using collective communications, on the other hand. The total number of exchanged halos, the number of 3D halos, and their maximum size are also provided.

At the end of the simulation, we also produce a separate count, per MPI process, of the total duration of (i) halo exchanges for 2D/3D and simple/multiple arrays, (ii) collective communications needed to produce global sums/min/max, (iii) any other model operation independent from MPI (named "computing") and (iv) the whole simulation. These numbers exclude the first and last time steps, so that any possible initialization or finalization operations are excluded from the count. This analysis is output jointly with the existing information related to the per-subroutine timing (timing.output file).

Reducing or removing unnecessary MPI communications
This third section presents the code optimizations that were done following the implementation of the benchmarking environment described in the second section.
In a former study (Maisonnave and Masson, 2019), we relied on the measurement tool (Section 2.3) to assess the performance of the BENCH configuration (Section 2.2). That work focused only on global configurations, with grid sizes equivalent to 1° to 1/12° horizontal resolution, including or not the sea-ice and bio-geo-chemistry modules. Our analysis revealed that the scalability of the global configurations was limited by a major load imbalance due to the special halo exchange required by the north fold treatment. A fine study of the north fold communication pattern revealed that it was possible to further reduce the number of array lines involved in these specific communications, achieving a speedup of 1.4× at a new scalability limit.
Maisonnave and Masson (2019) also modified various NEMO subroutines to reduce (or group) MPI exchanges, for example communicating a 3D halo in a single step instead of a sequence of 2D halos, or avoiding collective communications (i) during the prognostic variable sanity check, (ii) during the computation of a convergence criterion in sea-ice thermodynamics, and (iii) during bio-geo-chemistry tracer advection to spread negative values locally.
In this section, we propose to complement the work of Maisonnave and Masson (2019) by improving NEMO scalability in regional configurations instead of global configurations. This implies configuring the BENCH namelist in a regional setup including open boundaries, as detailed in Appendix A. We also deactivated the sea-ice component, which was deeply rewritten in NEMO version 4 and necessitates dedicated work on its performance.
As evidenced by Tintó et al. (2019), the MPI efficiency in NEMO is not limited by the volume of data to be transferred between processes but by the number of communications itself. Regrouping the communications that cannot be removed is therefore a good strategy to improve NEMO performance. Our goal is thus to find the parts of the code which perform most of the communications and next figure out how we can reduce this number, either by removing communications that appear to be useless or by regrouping as much as possible the indispensable communications.

Free Surface Computation Optimization
In most of the configurations based on NEMO, including our BENCH test, the surface pressure gradient term in the prognostic ocean dynamics equation is computed using a "Forward-Backward" time-splitting formulation (Shchepetkin and McWilliams, 2005). At each time step n of the model, a simplified 2D dynamics is solved at a much smaller sub-time step ∆t*, resulting in M sub-time steps per time step, with the sub-step index m ranging from 1 to M (≈ 50). This 2D dynamics is then averaged to obtain the surface pressure gradient term.

In the previous NEMO version, each sub-time step completes a set of computations [equations not reproduced here] with η the sea surface height, U_h the velocity integrated over the vertical, Ũ_h the flux, D^m = H + η^m the height of the water column, ∆e the length of the cell, P the precipitation, E the evaporation, g the gravity acceleration, f the Coriolis frequency, k the vertical unit vector, G a forcing term, and β, r, δ, γ and ε constants. NEMO uses a staggered Arakawa C grid, meaning that, among other things, zonal velocities are evaluated at the middle of the eastern grid edges, meridional velocities at the middle of the northern grid edges and the sea surface height at the grid centre.
Due to this feature, spatial interpolations are sometimes needed to get variables at points other than those where they are initially defined.
For instance, η^(m+1/2) must be interpolated to U and V points to be used in the equation for Ũ_h. The associated halo exchange can not be removed altogether, as the variable is also used for other purposes that are not detailed here.
Following this improvement, the number of communications per sub-time step in the time-splitting formulation has been reduced from 3 to 2. This translates into a reduction from 135 (44%) to 90 (29%) communications per time step in the surface pressure gradient routine for the examined configuration.

Open Boundaries Communication Optimisation
Configurations with open boundaries require frequent corrections of the fields on the boundaries. Boundary conditions in NEMO are often based on the Neumann condition, ∂φ/∂n = 0, where φ is the field on which the condition is applied and n the outgoing normal, or on the Sommerfeld condition, ∂φ/∂t + c ∂φ/∂n = 0, where c is the speed of the transport through the boundary. For both boundary conditions, the only spatial derivative involved is ∂φ/∂n. The main focus is therefore to find the best way to compute ∂φ/∂n in various cases. Since boundaries along domain edges are far more common and easier to address, we will examine them first. Unstructured open boundaries allow the user to define any boundary shape. In Figure 5, the best choice is to take the average of the values of the closest available points. Here, applying the Neumann condition is done by setting φ(x_i, y_i) to (φ(x_{i+1}, y_i) + φ(x_i, y_{i−1}))/2. Indeed, if only one of these two points were used, there would not be good symmetry properties, and if the top-right point were also taken into the average, the result would be a bit better in case 1 for a non-linear field but worse in cases 2, 3 and 4.
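This averaging rule amounts to the following one-liner, shown here as a hedged sketch; the index convention of the 2D array `phi` is an assumption of the illustration:

```python
# Sketch of the averaging rule above for an unstructured-boundary corner:
# the boundary value is rebuilt from its two closest available ocean
# neighbours (here assumed to lie at i+1 along one axis and j-1 along
# the other), which enforces the Neumann condition with good symmetry.

def neumann_corner(phi, i, j):
    phi[i, j] = 0.5 * (phi[i + 1, j] + phi[i, j - 1])
```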

Straight Open Boundaries along domain edges
Using the same method, we finally summarized all Neumann conditions and their neighbour contributions in the 5 cases shown in Figure 6. All other possible dispositions are only rotations of one of these 5 cases. Thanks to this classification, we have been able to figure out, from the model initialisation phase, the communication pattern required by the treatment of the Neumann conditions and therefore to restrict the number of communication phases to their strict minimum according to the chosen MPI subdomain decomposition.

Performance improvement
The computing performance is estimated using a realistic simulation of the West Atlantic between 100° West and 31° West. Figure 8 shows that there are many outliers in the computing time that shift the mean to higher values, while the median is not sensitive to them. As the exact same instructions are run at every time step, the outliers are likely to be a consequence of instabilities of the supercomputer or of preemption. Since the median effectively filters out outliers, it corresponds to the computing time one would get on a perfectly stable supercomputer with no preemption. Figure 9 shows the improvement of NEMO strong scaling brought by our optimizations. In this figure, the model performance is quantified by means of the number of years that can be simulated during a 24-hour period (Simulated Years Per Day, SYPD). For small numbers of cores, the optimizations have no noticeable effect, as the time spent in the communication phases is, in any case, very small. However, as the number of cores rises, each MPI subdomain gets smaller and communications take a growing share of the runtime. Filtering out the outliers by using the median gives results that are better and more robust. The scalability curves (dashed lines) are less noisy and, except for the last point, the distance between the two dashed lines steadily increases as we use more cores. The impact of our optimizations is therefore more important at high numbers of cores. The number of SYPD is, for example, increased by 35% on 37,600 cores when using the optimized version. One can also note that the optimized version run on 23,000 cores simulates the same number of SYPD as the old version on 37,600 cores, which represents a reduction of resource usage by nearly 40%.
The differences between the unfiltered and the filtered results can be understood as a consequence of each core having similar chances of suffering from instabilities or going into preemption; hence, at high numbers of cores, unusually slow cores are more common. Since communications tend to synchronise cores, a single slow core slows down the whole run. The median gets rid of the time steps in which preemption or great instabilities occur. It indicates the performance we could get on a "perfect" machine which would not present any "anomalies" during the execution of the code. Such a machine unfortunately does not exist, and the trend in new machine architectures, with ever more complexity and heterogeneity, suggests that performance "anomalies" during the model integration may occur more and more often and become a common feature. Our results point out this new constraint, which is already eroding a significant part of our optimisation gains (from 40% to 20%) and will have to be taken into account in future optimizations of the code.
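The effect of the median filtering can be reproduced with toy numbers (all values below are hypothetical, including the steps-per-simulated-year count):

```python
import statistics

# Toy illustration of why the median filters machine "anomalies": a few
# abnormally slow time steps inflate the mean of the per-step times but
# barely move the median, hence the higher median-based SYPD.
step_times = [0.50] * 95 + [2.0] * 5     # seconds per time step, 5 outliers

def sypd(sec_per_step, steps_per_simulated_year=9000):
    # simulated years per 24 h of wall clock
    return 86400.0 / (sec_per_step * steps_per_simulated_year)

print(sypd(statistics.mean(step_times)))    # ~16.7 SYPD, dragged down
print(sypd(statistics.median(step_times)))  # 19.2 SYPD, outliers filtered
```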

Conclusions
We presented in this paper the new HPC optimisations that have been implemented in NEMO 4.0, the current reference version of the code. The different skills we gathered among the co-authors allowed us to improve NEMO performance while facilitating its use. The automatic and optimized domain decomposition is a key feature for performing proper benchmarking work, but it also benefits all users by selecting the optimum use of the available resources. This new feature also points out possible wastes of resources to the users, making them aware of the critical impact of the choice of the domain decomposition on the code performance. The new BENCH test case results from a close collaboration between ocean physicists and HPC engineers. We distorted the model input, boundary and forcing conditions in such a way that the model is stable enough to do benchmark simulations in any configuration with as few input files as possible (basically the namelist files). Note that the stability of this configuration even allows developers to carry out some unorthodox HPC tests. Anomalies in the functioning of the supercomputers will happen more and more as we use more and more cores. Code optimisations will have to take this new constraint into account. We will have to adapt our codes in order to absorb (or at least limit the effects of) the asynchrony that will appear during the execution on the different cores at a growing frequency. The way we perform the communication phase in NEMO, with its point-to-point synchronisation between, first, east-west neighbours and, next, north-south neighbours, will have to be revisited, for example with non-blocking sends and receives or with new features such as neighbour collective communications. In recent architectures, the number of cores inside a node has increased. This leads to a two-level parallelism where the communication speed/latency differs between inter-node exchanges and intra-node exchanges.
Inter-node communications are probably slower, and probably the source of more asynchrony, than intra-node communications. A possible optimisation could therefore be to minimize the ratio of inter-node to intra-node communications.
Figure 10 illustrates this idea. The domain decomposition of NEMO is represented with each square being an MPI process and each node a rectangle of the same colour. In this representation, the perimeter of a "rectangle-node" directly gives the number of inter-node communications (if we assume, to simplify, that the domain is bi-periodic and each MPI process always has 4 neighbours). If we make the hypothesis that each node can host x² MPI processes, these x² processes are placed by default on a "line-node" with a perimeter of 2x² + 2. In this case, the ratio of inter-node to intra-node communications is (2x² + 2)/4x². Now, if we distribute the x² processes on a "square-node", the number of inter-node communications becomes 4x and the ratio 4x/4x² = 1/x. If we take x = 8 for 64-core nodes, we get a reduction of 75% of the number of inter-node communications when comparing the "line-node" dispatch (130) with the "square-node" dispatch (32). The ratio of inter-node to intra-node communications drops from 50% to 12.5%. These numbers are of course given for ideal cases where the number of cores per node is a square. We should also consider additional constraints, like the removal of MPI processes containing only land points, or the use of some of the cores of each node for XIOS, the IO server used with NEMO. The optimal dispatch of the MPI processes in a real application is maybe not so trivial, but there is certainly here an easy way to minimize the inter-node communications, which could be an advantage when we are confronted with the occurrence of more and more asynchrony.

Figure 10. NEMO domain decomposition with the dispatch of the MPI processes among the different nodes. In this schematic representation, each square represents one MPI process of NEMO. We consider that NEMO is distributed over 216 MPI processes and that each node has 9 cores. The distribution of the MPI processes on the cores is represented by their colour: processes on the same node have the same colour.
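The placement arithmetic above can be verified directly, under the same idealised assumptions (bi-periodic domain, 4 neighbours per process, x² processes per node):

```python
# Quick check of the node-placement arithmetic for x*x processes per node.
x = 8                                  # 64-core nodes
line = 2 * x**2 + 2                    # inter-node links, 1-row "line-node"
square = 4 * x                         # inter-node links, x-by-x "square-node"
print(line, square)                    # 130 vs 32: a 75% reduction
print(line / (4 * x**2), square / (4 * x**2))   # ratios ~0.51 and 0.125
```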