Parallel computing efficiency of SWAN

Abstract. Effective and accurate ocean and coastal wave predictions are necessary for engineering, safety and recreational purposes. Refining predictive capabilities is increasingly critical to reduce the uncertainties faced with a changing global wave climatology. Simulating WAves in the Nearshore (SWAN) is a widely used spectral wave modelling tool employed by coastal engineers and scientists, including for operational wave forecasting purposes. Fore- and hindcasts can span hours to decades, and a detailed understanding of the computational efficiencies is required to design optimized operational protocols and hindcast scenarios. To date, there exists limited knowledge on the relationship between the size of a SWAN computational domain and the optimal number of parallel computational threads required to execute a simulation effectively. To test this, a hindcast cluster of 28 computational threads (1 node) was used to determine the computational efficiencies of a SWAN model configuration for southern Africa. The model extent and resolution emulate the current operational wave forecasting configuration developed by the South African Weather Service (SAWS). We implemented and compared both the OpenMP (shared memory) and Message Passing Interface (MPI, distributed memory) architectures. Three sequential simulations (corresponding to typical grid cell numbers) were compared to various permutations of parallel computations via the speed-up ratio, time saving ratio and efficiency tests. Generally, a computational node configuration of 6 threads produced the most effective computational set-up, based on wave hindcasts of one-week duration. The use of more than 20 threads resulted in a decrease in the speed-up ratio for the smallest computational domain, owing to the increased sub-domain communication times for limited domain sizes.


Introduction
The computational efficiency of metocean (meteorological-ocean) modelling has been the topic of ongoing deliberation for decades. The applications range from long-term atmospheric and ocean hindcast simulations to the fast-responding simulations related to operational forecasting. Long-duration simulations are usually associated with climate change related research, with simulation periods of at least 30 years across multiple spatial and temporal resolutions needed to capture key oscillations (Babatunde et al., 2013). Such hindcasts are frequently used by coastal and offshore engineering consultancies for purposes such as those related to infrastructure design (Kamphuis, 2020) or environmental impact assessments (Frihy, 2001; Liu, Sheu & Tseng, 2013).
https://doi.org/10.5194/gmd-2020-314 | Preprint. Discussion started: 2 December 2020 | © Author(s) 2020. CC BY 4.0 License.
Operational (or forecasting) agencies are usually concerned with achieving simulation speeds that allow them to accurately forewarn their stakeholders of immediate, imminent and upcoming metocean hazards. The main stakeholders are usually other governmental agencies (e.g. disaster response or environmental affairs departments), commercial entities and the public. Atmospheric and marine forecasts share similar numerical schemes for solving the governing equations and thus share a similar need for computational efficiency. Fast simulation times are also required in other forecasting fields, such as hydrological dam-break models (e.g. Zhang et al., 2014). Significant advancement in operational forecasting can be made by examining the way in which the code interfaces with the computational nodes, and how results are stored during simulation.
Numerous operational agencies (both private and public) make use of Simulating WAves in the Nearshore (SWAN) to predict nearshore wave dynamics (refer to Genseberger & Donners (2020) for details regarding the SWAN numerical code and solution schemes). These agencies include the South African Weather Service, MetOcean Solutions (a division of the Meteorological Office of New Zealand) (e.g. de Souza et al., 2020), the United Kingdom Met Office (e.g. O'Neill et al., 2016) and the Norwegian Meteorological Service (e.g. Jeuring et al., 2019). In general, these agencies have substantial computational facilities but nonetheless still face the challenge of optimizing the use of their computational clusters between various models (being executed simultaneously). These models may include atmospheric models (e.g. the Weather Research and Forecasting (WRF) model), hydrodynamic models (e.g. the Regional Ocean Modeling System (ROMS) and the Semi-implicit Cross-scale Hydroscience Integrated System Model (SCHISM)) and spectral wave models (e.g. Wave Watch III (WW3) and SWAN (Holthuijsen, 2007; The SWAN Team, 2006)). Holthuijsen (2007) presents a theoretical background to the spectral wave equations, wave measurement techniques and statistics, as well as a concluding chapter linking the book to the SWAN numerical model. There must also be a balance between hindcast and forecast priorities and client needs. Some of these agencies use a regular grid (instead of irregular grids, e.g. Zhang et al., 2016), with nested domains, in many of their operational and hindcast projects. Here we focus only on the computational performance of a structured regular grid (typically implemented for spectral wave models). Kerr et al. (2013) performed an inter-model comparison of computational efficiencies by comparing SWAN, coupled with ADCIRC, and the NOAA official storm surge forecasting model, SLOSH; they did not, however, investigate the optimal thread usage of a single model.
Other examples of coupled wave and storm surge model computational benchmarking experiments include Tanaka et al. (2011) and Dietrich et al. (2012) for unstructured meshes during Hurricanes Katrina, Rita, Gustav and Ike in the Gulf of Mexico. These studies also present their results on a log-log scale, and their experimental designs tested computational thread numbers not easily obtainable by smaller agencies and companies. The latter rather require sequential versus parallel computational efficiencies using smaller-scale efficiency metrics. Genseberger & Donners (2015) explored the scalability of SWAN using a case study focused on the Wadden Sea in the Netherlands. By investigating the efficiency of both the OpenMP (OMP) and MPI versions of the then-current SWAN, they found that OpenMP was more efficient on a single node. They also proposed a hybrid version of SWAN to combine the strengths of both implementations: using OpenMP to share memory more optimally within a node and MPI to distribute memory over the computational nodes.

Here we build on the case study of Genseberger & Donners, using results produced in the present study for southern Africa, to answer the following research questions: 1) when using SWAN, is it always better to have as many threads as possible available to solve the problem at hand? 2) What is the speed-up relationship between the number of threads and the computational grid size? 3) At what point (number of threads) do the domain sub-communications start to make the whole computation less effective? 4) What is the scalability of a rectangular-grid SWAN set-up?

Methodology and background
Details of the model configuration can be found in Rautenbach et al. (2020a) and Rautenbach et al. (2020b). The computational domain and physics used here were the same as presented in those studies. All computations were performed on Intel Xeon E5-2670, 2.3 GHz computational nodes. Twenty-eight threads (cores), each with 96 GB RAM, were used. SWAN 40.91 was implemented with the Van der Westhuysen whitecapping formulation (van der Westhuysen et al., 2007) and the Collins bottom friction formulation (Collins, 1972) with a coefficient value of 0.015. Fully spectral wave boundary conditions were extracted from a global Wave Watch III model at 0.5 geographical degree resolution.
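As an illustration of such a configuration, a SWAN command-file fragment of the kind described above might look as follows (a sketch only: the dates and values shown are placeholders, not the exact command file used in this study):

```
$ Illustrative SWAN command-file fragment (placeholder values and dates)
GEN3 WESTH                            $ Van der Westhuysen whitecapping
FRICTION COLLINS 0.015                $ Collins bottom friction, coefficient 0.015
COMPUTE STATIONARY 20200101.000000    $ single stationary computation (spin-up)
COMPUTE NONSTATIONARY 20200101.000000 1.0 HR 20200108.000000
```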
Here, the validation of the model was not the main aim, but rather the relative computational scalabilities, as described at the end of the previous section. It should, however, be noted that no nested domains were employed during the present study; only the parent domain was used as a measure for scalability. The computational extent given in Rautenbach, Barnes, et al. (2020a) and (b) contains numerous non-wet grid cells that are not included in the computational expense of the current study. In Table 1, the size of the computational domain and resolution, together with the labelling convention, are given. For clarity, we define the resolutions as low, medium and high, denoted L, M and H, respectively, in the present study (noting that, given the domain size, these resolutions would be classified as intermediate to high regional resolution for operational purposes). The test for scalability used here was the ability of the model to respond to an increased number of computations with an increasing amount of resources; in the present study these resources are computational threads. An arbitrary week of computations was performed to assess model performance. Model spin-up was done via a single stationary computation. The rest of the computation was performed as a non-stationary computation using an hourly time-step. This implies that wind-wave generation within the model occurred on the timescale of the wind forcing resolution. The grid resolutions used in the present study corresponded to 0.1, 0.0625 and 0.05 geographical degrees. Local bathymetric features were typically resolved through downscaled, rotated, rectangular grids, similar to the methodology employed by Rautenbach et al. (2020a). A nested resolution increase of more than 5-times is also not recommended (given that the regional model is nested in the global Wave Watch III output at 0.5 geographical degree resolution; refer to Rautenbach et al. (2020a)).
Given these constraints, these resolutions represent realistic and typical SWAN model set-ups for both operational and hindcast scenarios.
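The timing experiments can be sketched as a small driver that launches SWAN with a varying thread count and records the wall-clock time. A minimal sketch, assuming hypothetical executable names (`swan_omp.exe`, `swan_mpi.exe`) and launch commands that will differ per installation:

```python
# Hedged sketch of a benchmark driver for the scalability tests.
# The executable names and the mpirun invocation are assumptions for
# illustration; real installations will differ.
import os
import subprocess
import time

def run_case(mode: str, threads: int, input_file: str = "INPUT") -> list:
    """Return the command line for one SWAN run (placeholder names)."""
    if mode == "omp":
        # OpenMP build: thread count is controlled via the environment.
        os.environ["OMP_NUM_THREADS"] = str(threads)
        return ["./swan_omp.exe", input_file]
    if mode == "mpi":
        # MPI build: one rank (sub-domain) per thread.
        return ["mpirun", "-np", str(threads), "./swan_mpi.exe", input_file]
    raise ValueError(f"unknown mode: {mode}")

def time_case(mode: str, threads: int) -> float:
    """Wall-clock seconds for a single timed run."""
    cmd = run_case(mode, threads)
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0
```

Sweeping `threads` over 1 to 28 for each grid (L, M and H) and both modes yields the kind of timing table on which the performance metrics below operate.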
The three main metrics for estimating computational efficiency are the Speed-up, Time saving and Efficiency ratios. The fourth parameter, and arguably the most important, is the Scalability, which is estimated using the other three parameters as metrics.
The Speed-up ratio is given as:

S_p = T_1 / T_p,    (1)

where T_1 is the time in seconds a sequential computation takes on one thread and T_p is the time a simulation takes with p computational threads (S. Zhang et al., 2014). The Time saving ratio is given by:

TS_p = (T_1 - T_p) / T_1,    (2)

and the Efficiency ratio follows, with the same variable definitions, as:

E_p = S_p / p.    (3)

The Scalability of SWAN was tested based on the Speed-up ratios for the grid resolutions in Table 1. Zafari et al. (2019) compiled their code with the GNU Compiler Collection (gcc) 7.2.0, linked with OpenMPI, as well as with the Intel C++ compilers and Intel MPI, for relatively small computational problems. Their numerical computations considered models with 600K, 300K and 150K grid cell sizes (what they called matrix sizes). These computational grid sizes were deemed "small", but they still acknowledged the significant computational resources required to execute geographical models of this size, due to the large number of time steps usually involved in solving these problems. From a practical point of view, regular SWAN grids will rarely be used in dimensions exceeding the resolutions presented in the previous section. The reason for this statement is twofold: 1) downscaling a spectral wave model from a global resolution to a regional resolution should not exceed a five-times refinement factor, and 2) when higher resolutions are required in the nearshore (to take complex bathymetric features into account), nested domains are preferred. The reasoning will be different for an unstructured grid approach (Dietrich et al., 2012). Given these limitations on the widely used structured SWAN grid approach, SWAN grids will almost exclusively be deemed a low spatial computational demand model. Small tasks create a sharp drop in performance via the Intel C++ compiler due to the "work stealing" algorithm, aimed at balancing the computational load between threads (Zafari et al., 2019).
In this scenario, the threads compete against each other, resulting in an unproductive simulation. In their experiments, each task performed via the Intel compiler was approximately 13-times faster, but the overall performance was 16-times slower than the equivalent gcc-compiled version of the shallow water model presented by Zafari et al. (2019).
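As a minimal sketch, the three ratios of Equations (1)-(3) can be computed directly from measured wall-clock times; the timing values below are illustrative placeholders, not measurements from this study:

```python
# The three performance metrics of Equations (1)-(3).
# t1: sequential wall-clock time on one thread; tp: time on p threads.

def speed_up(t1: float, tp: float) -> float:
    """Speed-up ratio S_p = T_1 / T_p (Equation 1)."""
    return t1 / tp

def time_saving(t1: float, tp: float) -> float:
    """Time saving ratio TS_p = (T_1 - T_p) / T_1 (Equation 2)."""
    return (t1 - tp) / t1

def efficiency(t1: float, tp: float, p: int) -> float:
    """Efficiency ratio E_p = S_p / p (Equation 3)."""
    return speed_up(t1, tp) / p

# Illustrative placeholder timings: a sequential run and a 6-thread run.
t1, t6 = 3600.0, 750.0
print(speed_up(t1, t6))       # 4.8
print(time_saving(t1, t6))
print(efficiency(t1, t6, 6))
```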

Results
In Figure 1, the computational scalability of SWAN is given as a function of the number of computational threads. In Figure 1 (a) the computational time in seconds is presented. Here the model resolutions group together, with little differentiation between them. These results also highlight the need for performance metrics, as described in the previous section.

Figure 1: (a) Computational time (in seconds), (b) Efficiency (Equation (3)), (c) Speed-up ratio (Equation (1)) and (d) the Time saving ratio (Equation (2)).
Near-linear speed-up is observed for small numbers of computational threads. This agrees with the results reported by Zafari et al. (2019). In Figure 1 (d) the same results are reflected in the Time saving ratio: a clear and distinct flattening is observed for thread counts larger than approximately 6.

Discussion
The behaviour noted in the results is similar to the dam-break computational results reported by S. Zhang et al. (2014). Genseberger & Donners (2020) present the latest findings on the scalability and benchmarking of SWAN. However, their focus was quantifying the performance of their new hybrid version of SWAN. In their benchmarking experiments (for the Wadden Sea, in the Netherlands), they obtained very similar results to Figure 1 (a), with OMP producing faster wall-clock computational times. They also considered the physical distances between computational threads and found that this parameter has a negligible effect compared to OMP vs MPI, over an increasing number of threads. Their benchmarking also differed from the results presented here in that they only provided results as a function of node number, with each of their nodes consisting of 24 threads. In the present study, the benchmarking of a single node (28 threads) is evaluated against a serial computation on a single thread. For benchmarking without performance metrics, they found that the wall-clock times, for the iterations and not a full simulation, reached a minimum (for large computational domains) at 16 nodes (16 × 24 threads) for the MPI SWAN and 64 nodes (64 × 24 threads) for the hybrid SWAN. These results were based on the Cartesius 2690 v3 nodes (Genseberger & Donners, 2020). With the hybrid SWAN, the optimal wall-clock turning point, for iterations, increased with an increasing number of computational cells. All the reported turning points (optimal points) occurred at node counts well above 4 nodes (4 × 24 threads). The wall-clock performance estimation of Genseberger & Donners (2015) did, however, indicate similar results to those presented in Figure 1 (a), with OMP running faster than MPI.
It must still be noted that, with an increased number of nodes (and thus threads), the total computational time should continue to decrease up to the point where the internal domain decomposition and communication overheads start to outweigh the gain in computational power. Based on the results of Genseberger & Donners (2020), we can estimate that, for our node configuration and region of interest, the communication inefficiencies will become dominant at approximately 16 nodes (16 × 24 threads).
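This turning point can be illustrated with a simple Amdahl-style performance model extended with a communication-cost term; the serial fraction and communication coefficient below are illustrative placeholders, not values fitted to the SWAN runs in this study:

```python
# Illustrative Amdahl-style model with a linear communication-cost term:
#   S(p) = 1 / (serial + (1 - serial)/p + comm * (p - 1))
# 'serial' and 'comm' are placeholder coefficients, not fitted values.

def modelled_speedup(p: int, serial: float = 0.02, comm: float = 0.001) -> float:
    return 1.0 / (serial + (1.0 - serial) / p + comm * (p - 1))

# The speed-up peaks where added communication outweighs added threads.
best_p = max(range(1, 129), key=modelled_speedup)
print(best_p)  # 31 for these placeholder coefficients
```

Fitting the two coefficients to measured timings (per grid size) would give a rough estimate of the thread count at which communication becomes dominant.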

Conclusion
The present study investigated the scalability of SWAN, a widely used spectral wave model. Three typical wave model resolutions were used for these purposes. Both the OpenMP (OMP) and the Message Passing Interface (MPI) implementations of SWAN were tested. The scalability is presented via three performance metrics: the Efficiency, Speed-up ratio and the Time saving ratio. The MPI version of SWAN outperformed the OMP version based on all three metrics. The MPI version of SWAN performed best with the largest computational domain resolution, resulting in the highest speed-up ratios. The Time saving ratio indicated a decrease after approximately six computational threads. This result suggests that six threads are the most effective configuration for executing SWAN. The largest increases in speed-up and efficiency were observed at small thread counts. According to Genseberger & Donners (2020), computational times decrease up to ~16 nodes (16 × 24 threads), indicating the wall-clock optimal computational time for their case study. This result suggests that multiple nodes will be required to reach the optimal wall-clock computational time, even though this turning point might not be the most efficient computational configuration. Ultimately, the efficiencies recommended here can improve operational performance
substantially, particularly when implemented over the range of modelling software needed to produce useful metocean forecasts.

Code/Data availability
The open source version of SWAN was run for the purposes of the present study. SWAN may be downloaded from: http://swanmodel.sourceforge.net/. The bathymetry used for the present study may be downloaded from: https://www.gebco.net/ and the wind forcing may be found at: https://climatedataguide.ucar.edu/climate-data/climate-forecast-system-reanalysis-cfsr.