Influence of the compiler on multi-CPU performance of WRFv3

The Weather Research and Forecasting system version 3 (WRFv3) is an open source, state of the art numerical regional climate model used in the climate related sciences. Over the years the model has been successfully optimized on a wide variety of clustered compute nodes connected with high speed interconnects.

Introduction

In numerical weather modelling, limited computing capacity is a crucial problem. Especially for climate-relevant time spans of more than 30 years of simulation time, for extremely high model resolutions, and/or for big (global) domains, every percent of gained model performance counts. Thus, the performance of the Regional Climate Model (RCM) Weather Research and Forecasting (WRF) system is evaluated in this paper. More precisely, the benchmark was done for one of the two available numerical solvers of WRF: the Advanced Research WRF (ARW) version 3. Since access was restricted to Tornado, a High Performance Cluster (HPC) of the Deutsches Klimarechenzentrum (DKRZ), an inter-comparison study between different HPC hardware was not possible. Such a study would also not have been as interesting, because many hardware-related studies are already available for WRF. Most of them are collected by WRF developer John Michalakes of the National Center for Atmospheric Research (NCAR) and made publicly available on the central web page www.mmm.ucar.edu/wrf/WG2/benchv3, which has been updated with new benchmark measurements at least yearly since 2008. Instead of such a hardware-related study, the performance of WRF was evaluated here for different software influencing WRF's performance. This includes different versions of WRF, the Message Passing Interface (MPI) library, and the compiler. This was partly done before, for example by Shainer et al. (2009), by Roe and Stevens (2010), or by Newby and Morton (2008), with a focus on MPI and network issues. It was also done lately in two presentations with an additional focus on Intel versus the GNU Compiler Collection (GCC) and Intel versus PGI compilers by the HPC Advisory Council (December 2010) and Sankaran (October 2010), respectively. Both benchmarks were done on HPCs with Intel CPUs, the latter additionally on AMD CPUs.
For this paper a benchmark suite was implemented on the Tornado HPC, which operates AMD CPUs of another series; thus the results allow only a relative comparison with the results of Sankaran and the HPC Advisory Council. As an extension to their presentations, this paper compares three major compilers instead of two on a single machine running WRFv3, and documents the whole implementation process in such detail that other users should be able to conduct comparable benchmarks on their own systems.

Important software influencing WRF's performance
The compiler is the most important performance-related software for any program that the user can compile. The compilation process translates the source code, written in a programming language like C or Fortran, into a machine-readable binary format. This process enables the performance optimization of the software for different hardware.
In the case of software that can take advantage of HPCs through the parallel use of many CPUs (like WRF), the MPI library is a second important piece of software influencing performance. MPI is responsible for the efficient communication between the CPUs, which often is a bottleneck due to the limited transfer bandwidth of their interconnect. Hence, four different MPI versions available on Tornado were compared in this performance evaluation.
The third important software is WRF itself. It is not always necessary to use the most recent version with regard to new features or model accuracy, because they may not be needed for a specific research question. In such cases one should simply use the fastest version. One can expect older, less complex versions to perform better, but one can also expect newer versions to perform better because of source code optimizations. Therefore the two ARW versions 3.1.1 and 3.2.1 were evaluated.
The fourth parameter influencing the performance is set in the configuration file configure.wrf, which is generated right before the compilation. In configure.wrf one can set the option to compile WRF for single or multi-CPU hardware. Since practical application cases of WRF running on a single CPU are rare (especially if an HPC is available), only the multi-CPU options were of interest. WRF offers three of them: the first multi-CPU option is named smpar and works with an OpenMP shared memory thread paradigm, not to be confused with MPI. The second option is dmpar and works with an MPI task distributed memory paradigm. Thirdly, a hybrid approach exists that combines MPI and OpenMP, where each MPI task spawns a number of threads to utilize shared memory within a node. But as already shown by Morton et al. (2009), the HPC Advisory Council (December 2010), and others, for WRF the dmpar case often clearly outperforms the hybrid and smpar cases. After Morton et al. (2009) some groups found opposing results on other architectures and with other compilers, but only with "slightly" and not clearly better performance of the hybrid approach. So no emphasis is laid on the hybrid versus dmpar comparison within this paper.
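Since all benchmark runs in this paper therefore use dmpar, a typical build and run sequence looks like the following sketch; the compile target and core count are from this study, while the exact configure menu entry differs between compilers and systems:

  ./configure                          # choose a (dmpar) entry for the intended compiler
  ./compile em_real >& compile.log     # build the real-data ARW core
  cd test/em_real                      # (after real.exe has prepared the input files)
  mpirun -np 32 ./wrf.exe              # dmpar: one MPI task per core, no OpenMP threads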
In a nutshell, this software benchmark suite consists of an inter-comparison of the compiler, the MPI library, and the WRF version influencing WRF's performance. However, the focus is on the compiler as the most important tool in the performance optimization of software.

About the WRF model
WRF is a mesoscale numerical weather prediction system and also a regional to global (experimental stage) climate model allowing simulations reflecting either real data or idealized configurations. It features a 3-dimensional variational data assimilation system and an open source software architecture allowing for computational parallelism and system extensibility due to the modularity of its components (see Fig. 1).
WRF ships with two dynamical cores, the Nonhydrostatic Mesoscale Model (NMM) and ARW. The ARW solver used here is recommended for research questions, while the NMM solver is recommended for operational use in weather forecasting. Both are freely available via the same source code packages at www.mmm.ucar.edu/wrf/src for versions 2.0 up to 3.2.1. Precompiled binaries are only available for version 3.1 and x86 CPUs via Robert Rozumalski of the US National Weather Service at http://strc.comet.ucar.edu/wrfems/index.htm. WRF's development is organized and promoted by the National Oceanic and Atmospheric Administration, consisting of the National Centers for Environmental Prediction (NCEP) and the Forecast Systems Laboratory, by the Air Force Weather Agency (AFWA), the Naval Research Laboratory, the University of Oklahoma, the Federal Aviation Administration, and NCAR. WRF's advances in physics, numerics, and data assimilation are contributions by a broad and rapidly growing research community. It is in operational use at NCEP, AFWA and other centres (WRF Homepage, 2011). Technical and physical details can be found in the user guide by Wang et al. (2011) and in the technical description by Skamarock et al. (2008).

The benchmark setup
After describing the benchmark goal, the environmental software components, and the model to benchmark in the previous sections, this section documents the detailed setup of the benchmark suite, consisting of the specifications of hardware, environmental software, model software, and model domain.

Hardware: specifications of Tornado
The Tornado HPC of the DKRZ is a 2048 core Linux cluster. A core is a subunit of today's multi-core CPUs, each capable of processing one task at a time. The nodes are interconnected via Gigabit Ethernet and a low latency Infiniband network. For more details, e.g. on storage and login nodes (non-crucial components with respect to performance), see Fig. 2 and https://tornado-wiki.dkrz.de/farm/HardwareOverview.

Environmental software: operating system, grid engine, MPI, and compiler
The operating system of Tornado is Debian GNU/Linux, compiled with Kernel 2.6.16.60-0.31 from March 2006. Security updates and bugfixes are as current as January 2008. The performance of WRF is definitely influenced by the old Kernel version. However, because it is considered a running system, the administrators will not touch it unless absolutely necessary. The installed Sun Grid Engine (SGE) software version 12.1 is responsible for the submission and scheduling of jobs from Tornado's login nodes to its compute node queue. After a job submission to the queue, MPI starts distributing the job to the requested number of cores.
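For illustration, a dmpar WRF job would be submitted to SGE with a small wrapper script along the following lines; the parallel environment name (here openmpi) is a site-specific assumption that must match the local configuration:

  #!/bin/bash
  #$ -N wrf_bench        # job name
  #$ -pe openmpi 32      # request 32 slots, i.e. four of Tornado's 8-core nodes
  #$ -cwd                # start in the submission directory
  mpirun -np $NSLOTS ./wrf.exe

The script is then handed to the queue with qsub from a login node, and SGE schedules it onto the compute nodes.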
MPI libraries are available from different proprietary software manufacturers like Intel and PGI. The MPI preinstalled on Tornado is an open source implementation named Open MPI. The installation and benchmarking of another MPI, even of just another version of Open MPI, was not a trivial issue; it was partially successful for Open MPI 1.5.1. Altogether the Open MPI versions 1.3.3, 1.4.0, 1.4.3, and 1.5.1 were evaluated. (Note: Open MPI should not in any way be confused with OpenMP, which is needed for smpar shared memory runs, see Sect. 1.2.)

The most straightforward compiler for WRF is the GCC. It consists of the C compiler gcc, the Fortran compiler gfortran, and several more. gfortran and gcc are both needed, because WRF's framework connecting all parts of WRF is written in C, while all physical modules are written in Fortran. However, only the latter influences the application's performance significantly. This was verified by turning off all optimizations for the C code compilation in configure.wrf (-O0 instead of -O3 for the Intel compilers), which did not change the benchmark results.
Besides the open source GCC, proprietary Fortran and C compilers of the manufacturers Intel, PGI, Sun (now Oracle) and NAG are preinstalled on Tornado. While WRF does not support the NAG compilers, the Sun compilers come without a proper Open MPI installation, which should be compiled with the same compiler as WRF to avoid complications; hence, PGI, Intel and GCC were the ones left to evaluate. (Note: until version 11 the Intel compilers were free of charge for non-commercial use; since the current version 12 they are chargeable, like those from PGI.) There are more compilers supported by WRF (see Wang et al., 2011), but they are not compatible with Tornado's hardware, with the exception of the PathScale compiler, which is not preinstalled. It would be worth testing in future research.
An important prerequisite to compile WRF is an installation of the Network Common Data Format (netCDF), used as input and output format by WRF. Further prerequisites for netCDF and WRF are numerous and usually shipped with the Linux distribution.
Only the most important packages shall be noted here: zlib, perl, a shell like bash or csh, make, and M4.
A specialized version of netCDF that affects the performance in combination with large domains is parallel netCDF (pNetCDF, http://trac.mcs.anl.gov/projects/parallel-netcdf). As discussed in Morton et al. (2010), in its most common usage WRF decomposes its domain into a number of patches or tasks equal to the number of used cores. In this mode all tasks hold a roughly equally sized sub-domain, and "Task 0 will have the additional responsibility to perform I/O operations and coordinate scatter/gather operations" (Morton et al., 2010, p. 4). This works up to domain sizes of several hundred million cells. Because a much smaller domain was used in this case, pNetCDF or one of its alternatives, also described in Morton et al. (2010), was not needed. As a consequence the common netCDF version 4.0.1 was used and compiled, once for every compiler manufacturer, to avoid software conflicts. The preinstalled Open MPI was likewise available in versions for every main compiler release to avoid software conflicts. (Note: to compile with a certain combination of compiler, MPI and netCDF on a system where many of those are available, one first has to set all the environmental paths to their installation directories manually. A detailed online tutorial on those prerequisites and finally the compiling of WRF can be found at www.mmm.ucar.edu/wrf/OnLineTutorial/index.htm.)
The two WRF ARW versions 3.1.1 and 3.2.1 had to be compiled with special MPI and compiler options. Those are specific to this case and therefore documented neither within the WRF user's guide (Wang et al., 2011) nor in the online tutorial just mentioned. The general compile options, also known as flags, can be altered in configure.wrf, just like the option that tells WRF to compile for single or multi-CPU usage (see Sect. 1.1). In most cases all flags are automatically chosen correctly by the configure script. But because Open MPI did not recognize the flags -f90=$(SFC) and -cc=$(SCC), those had to be erased from the lines DM_FC and DM_CC on the one hand. On the other hand, the flag -DMPI2_SUPPORT had to be added to the line DM_CC. Additionally the flag -ip had to be erased from the line CFLAGS_LOCAL for versions using an Intel compiler. For the PGI and GCC compilers the auto-generated configure.wrf was sufficient (see Appendix A).
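To make this concrete, the following sketch shows the kind of environment setup and the edited configure.wrf lines for an Intel build; the installation paths are hypothetical placeholders, not Tornado's real directories, and the option values are illustrative:

  # select one compiler/MPI/netCDF combination before running ./configure
  export NETCDF=/sw/netcdf-4.0.1-intel            # netCDF built with the same compiler
  export PATH=/sw/openmpi-1.4.3-intel/bin:$PATH   # matching Open MPI build

  # configure.wrf after the manual edits described above (Intel, dmpar):
  DM_FC        = mpif90                 # -f90=$(SFC) erased
  DM_CC        = mpicc -DMPI2_SUPPORT   # -cc=$(SCC) erased, support flag added
  CFLAGS_LOCAL = -w -O3                 # -ip erased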

Model domain
For the model domain the Default January 2000 case of the WRF online tutorial was used; little modifications were made solely to resolution and duration via the namelist configuration files (see Appendix B). This has two advantages. First, other users can implement the same benchmark easily by following the detailed instructions and download locations of the online tutorial case studies. Second, in consequence no new documentation is needed. The Default January 2000 case is a winter storm of 24 to 25 January over the east coast of North America with a horizontal resolution of 30 km and 28 vertical levels. At this resolution the domain comprises 74 × 61 × 28 = 126 392 grid points. On Tornado this relatively small number of points turned out to slow down the computation when using more than 8 cores. This can be explained by Tornado's limitations in network resources, or the so-called computations to communications ratio, which was too small for more than 8 cores. To elaborate, the computations to communications ratio describes the phenomenon that the more cores share the computation of a domain, the more they have to communicate about the physical fluxes between the sub-domains. Even though Tornado's nodes have low latency Infiniband interconnects, the time needed to communicate those fluxes was longer than the time needed to process the sub-domains. (Note: the intra-node communication between the 8 cores of a node is much faster than the inter-node communication using Infiniband and thus represents no bottleneck.)

To circumvent the problem with the computations to communications ratio, the horizontal resolution (and the time step) of the Default January 2000 case was raised from 30.0 km to 3.333 km, yielding 10 237 752 grid points. Consequently, computation of the resulting domain scaled up to 32 cores. The use of 64 cores was still much slower, sometimes as slow as with 8 cores. This extreme performance hit is an unusual observation that may be unique to Tornado. The administration of Tornado could not fix the problem until now, but suspects a cause: they observed that under high network load Tornado in some cases switches from the lower latency Infiniband to the higher latency Gigabit Ethernet interconnects. The administration argues that a new MPI version, which should be available in the near future, might fix the problem. For this benchmark suite, however, the problem was circumvented by evaluating results from at most 32 cores.

An optional tuning parameter within the namelist.input, not dependent on environmental software, is numtiles. As elaborated in Sect. 2.3, WRF decomposes the domain into tasks or patches, each assigned to a core via an MPI process. Each patch can be further decomposed into tiles that are processed separately, but by default there is only one tile (numtiles not set or = 1). If the single tile is too large to fit into the cache of the CPU and/or core, it slows down computation due to WRF's memory bandwidth sensitivity (Roman, 2009). In order to reduce the size of the tiles, it is possible to increase their number via numtiles = x (see Appendix B). However, the optimal value of x heavily depends on the cache sizes of core and CPU, on the number of assigned cores, and on the size of the domain. Thus there is no other way than experimentation to find the value that gives the best performance. For reference purposes, the best values found here were 64/32/16 for 8/16/32 cores, respectively. This led to a speed up of 22% for 8 cores, rising to 26% for 32 cores.
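As a sketch, the tiling is enabled with a single entry in the &domains block of namelist.input; the value shown is the best one found for 8 cores on Tornado, while other systems will need their own experimentation (see Appendix B for the full namelist):

  &domains
   numtiles = 64,    ! 64/32/16 performed best for 8/16/32 cores on Tornado
   ...
  /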
Altogether, a benchmark setup should not be expensive in terms of the user's workload and the machine's computing time. To minimize the duration of a run, instead of the default 12 h only 15 min of model time were computed, equal to 45 time steps of 20 s each. Further, an intelligent matrix of compiler/WRF/MPI combinations instead of testing every possible combination (see Table 1 and Sect. 4) leads to a small test sample with a maximum of explanatory power.
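The corresponding namelist.input modifications are sketched below; all other entries of the tutorial case remain unchanged, and the grid dimensions e_we and e_sn have to be enlarged to match the finer spacing:

  &time_control
   run_minutes = 15,      ! 45 time steps instead of the default 12 h
   ...
  /
  &domains
   time_step = 20,        ! seconds, scaled down with the grid spacing
   dx = 3333, dy = 3333,  ! metres, raised from the default 30 km resolution
   ...
  /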

Method of performance measurement
In most model benchmarks the performance metric is the average time per model time step over a representative period of model integration, ignoring the additional time needed for the model initialization. Alternatively the metric is just the whole runtime, including initialization, averaged over three or more runs. (Note: the initialization of WRF takes longer the more cores are used.)
Here, instead, the performance metric used is the minimum computing time needed for one of the 45 model time steps. This was done because of large performance fluctuations on Tornado between time steps, both within and across identical runs. These fluctuations sum up to large differences in the computation time of whole runs, and likewise in the averaged model time step performance.
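For reference, in dmpar runs WRF logs one line per model time step in the file rsl.error.0000, so the minimum can be extracted with standard tools; the column position of the elapsed seconds is an assumption that may shift slightly between WRF versions:

  # fastest of the 45 time steps ("Timing for main: ... 2.27100 elapsed seconds")
  grep "Timing for main" rsl.error.0000 | awk '{print $9}' | sort -n | head -1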
Those fluctuations are supposed to be caused by the contention for network resources as well. But this time the network resources were not limited by the many cores communicating about the sub-domains of a single job, which limited the maximum usable core count and introduced large performance fluctuations for core counts > 32. This time they were limited by many users computing many jobs, especially during rush hours, creating short term performance fluctuations even for runs using only 2 nodes or 16 cores. Ideally an HPC's network should be able to handle parallel computation on all its nodes without performance degradation, so that jobs of different users do not hit each other's performance. But in practice the processor within the Infiniband network switch, which handles all the communication requests, often is the bottleneck. Thus the almost random load of the network due to the users leads to random performance fluctuations. Therefore the 45 time steps sometimes had a span of up to 500% for 64 cores (see Fig. 3). This decreases with fewer cores due to the increasing computations to communications ratio; at 8 cores (1 node) almost no fluctuation is left, because no communication has to move through the Infiniband.
The only non-random pattern in the fluctuations was that especially before lunch break and on Friday afternoons a rush hour of job computations appeared, during which the fluctuation was largest.
Hence, on the one hand, with the random fluctuations in mind, measuring the average time over many time steps or whole runs would mask the optimal performance gain. This optimal gain is possible with the optimal compiler on HPCs where the network is not the limiting factor, as shown by Shainer et al. (2009), who ran WRF on 192 cores with the performance scaling up almost linearly. Newby and Morton (2008) even used 16 384 cores without a huge performance degradation per core.
On the other hand, regarding the rush-hour fluctuations, one must run the model several times spread over a day, first to detect if and when rush hours appear on the system, and second to sort out rush-hour influenced measurements. In conclusion, model runs with an identical setup (of core count, compiler, MPI, and WRF version) had to be repeated at least twice on different days at some distance from the rush hours. They had to be repeated even more often if the minimum time needed for one time step was not reproducible or differed strongly from the expected value. Section 4 consequently shows only the reproducible minimum time needed for a time step. This method overcomes the influence of the hardware (Infiniband switch) and shows only the possible performance gain through the software (compiler, WRF, MPI), as intended by this work in the first place.

Results
The benchmark results for WRF 3.2.1 are shown in Table 1. The main finding is that the Intel compilers gain up to 26% performance compared to GCC on Tornado, an AMD CPU (model Opteron 2384) based system. In consequence, testing and then choosing the right compiler before doing a big run is absolutely worth the work. While this trend will be valid for different hardware, the percentage will vary. For example, if an HPC is equipped with Intel instead of AMD CPUs, the Intel compilers are likely to gain even more performance compared to GCC. This was shown by the HPC Advisory Council (2010), which found up to 92% performance gain for Intel 12.0 versus GCC 4.4 on their Intel CPU (model Xeon 5670) based system; or by Sankaran (2010), who found up to 37% performance gain for PGI 9 versus Intel 11.1 on an Intel CPU (Xeon 5500 series) based system and 34% on an AMD CPU (Opteron 2400 series) based system. But if a newer GCC version (currently 4.5.1) is available, this already might reduce the performance gap a little. In the long run, even the trend shown by the results in Table 1 may change with the advancement of hardware and software. Therefore one should consider testing the available software combinations influencing WRF's performance before submitting the main job, if no comparable benchmark result not older than one or two years is available.

Another important result is that there is almost no performance difference between the costly Intel and PGI compilers on this platform, or between different compiler release versions of the same manufacturer, or between the different MPI release versions. (Note: while the benchmarking with Open MPI 1.3.3, 1.4.0, and 1.4.3 worked seamlessly, the new version 1.5.1 was able to compile and run WRF only with GCC and worked only on a single node. This will hopefully be fixed in the future.) Furthermore, the gap between GCC and Intel/PGI decreases with increasing node numbers, down to 20% for four nodes or 32 cores on this system.

Table 2 shows the results for WRF 3.1.1. A maximum performance gain over WRF 3.2.1 of 2.7% was found with PGI 9.04 on two nodes, which almost lies within the range of measuring accuracy. It is unlikely that this performance gap would increase with other compiler/MPI combinations, thus no further tests with different compilers for WRF 3.1.1 were done. But Table 2 additionally shows results for WRF 3.1.1 versions compiled with different PGI compiler flags, set within configure.wrf. This was done to get an idea of why the PGI and Intel compilers (under the assumption that both have similar optimizations) performed better than GCC, and it additionally shows which flags are responsible for which percentage of the performance gain. To this end the PGI optimization flags fastsse, Mvect=noaltcode, Msmartalloc, Mprefetch=distance:8, and Mfprelaxed were deleted consecutively from the configure.wrf line FCOPTIM before compilation. It turned out that the first two flags were responsible for the biggest performance gain. Fastsse enables the use of special instruction sets common in today's x86 CPUs of AMD and Intel and adds a 17% speed up at 8 cores. Mvect instructs the vectorizer to generate alternate code for vectorized loops when appropriate (PGI flags, 2011). This feature does not work well with WRF; since by default it is set to noaltcode, this adds performance. To verify this, Mvect was set to altcode, which resulted in a performance degradation of 27%. The other flags had no effect. Asked for the cause of the performance discrepancy between GCC and PGI compilers, Mathew Colgrove of PGI wrote (e-mail of 12 January 2011): "It's possible that it's our auto-vectorizer (SSE) but more likely a combination of many optimizations".
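For orientation, the FCOPTIM line of a PGI-generated configure.wrf, from which those flags were removed one by one, looks roughly as follows; the exact default set depends on the WRF version and the configure option chosen:

  FCOPTIM = -fastsse -Mvect=noaltcode -Msmartalloc -Mprefetch=distance:8 -Mfprelaxed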
PGI and Intel are able to optimize their compilers better because these are restricted to the x86 hardware platform. GCC instead aims to support a broad range of hardware and operating systems and thus has to focus more on compatibility than on performance optimization (GCC platforms, 2011).
Another important result of the whole work was the detection of three different performance fluctuations, discussed in Sects. 2.4 and 3. The first fluctuation appeared with a too small computations to communications ratio, suspected to be caused by the switching from Infiniband to the slower Gigabit Ethernet under high network loads. It was temporarily circumvented by raising the domain's resolution and by limiting the study to 32 cores. The second fluctuation appeared randomly due to the user-induced random load of the network. The third and largest fluctuation appeared periodically at specific rush hours, before lunch and on Friday afternoons. All three phenomena appeared as different types of time fluctuation and were not easy to identify, but are important to know about for accurate measurements. These findings may help others occupied with benchmarking tasks in the future.

Conclusions and future work
First testing and then choosing a compiler is worth the work when it comes to running WRF on an HPC, considering the many WRF users who just stick to the GCC compilers as the easiest "out of the box" solution. In addition, careful monitoring for exceptional performance fluctuations is also needed for accurate benchmarking. These are tasks which cannot be accomplished by every WRF user. Hence, more and continuous publications by professionals on this matter are important to give the community at least a rough overview of the performance of the countless possible combinations of hardware and software like the compiler.
Aspects particularly missing in this paper are performance values with larger core counts, with a broader range of compiler and MPI manufacturers/versions, with versions older than WRF 3.1.1, and with upcoming WRF releases (version 3.3, March 2011).
Appendix A

Excerpts of the configure.wrf settings used (dmpar builds):

DMPARALLEL = 1   # 1 for dmpar and hybrid mode, 0 for smpar and serial (single-CPU)

# Settings for Linux x86_64, PGI compiler with gcc (dmpar)
[...]

# Settings for Linux x86_64, ifort compiler with icc (dmpar)
FCBASEOPTS_NO_G = -w -ftz -align all -fno-alias -fp-model precise $(FORMAT_FREE) $(BYTESWAPIO)
FCBASEOPTS = $(FCBASEOPTS_NO_G) $(FCDEBUG)
[...]

# Settings for x86_64 Linux, gfortran compiler with gcc (dmpar)
BYTESWAPIO = -fconvert=big-endian -frecord-marker=4
FCBASEOPTS_NO_G = -w $(FORMAT_FREE) $(BYTESWAPIO)
FCBASEOPTS = $(FCBASEOPTS_NO_G) $(FCDEBUG)
[...]

Fig. 1. A schematic of the model environment used. The WRF Modelling System mainly consists of the WRF Preprocessing System (WPS) and the physical numerical core Advanced Research WRF (ARW) (from Wang et al., 2011).

Table 1. Results for WRF 3.2.1. All results are the minimal seconds needed for one of the 45 model time steps of the Default January 2000 case at 3.333 km horizontal resolution. The strongly varying and sometimes not reproducible (> 5 runs) results of runs with 64 cores are marked with *.

Table 2. Results for WRF 3.1.1. All results are the minimal seconds needed for one of the 45 model time steps of the Default January 2000 case at 3.333 km horizontal resolution. The strongly varying and sometimes not reproducible (> 5 runs) results of runs with 64 cores are marked with *.