Articles | Volume 8, issue 3
Model experiment description paper
24 Mar 2015

Efficient performance of the Met Office Unified Model v8.2 on Intel Xeon partially used nodes

I. Bermous and P. Steinle

Abstract. The atmospheric Unified Model (UM) developed at the UK Met Office is used for weather and climate prediction by forecast teams at a number of international meteorological centres and research institutes, across a wide variety of hardware and software environments. Over its 25-year history, the UM source code has been optimised for better application performance on a number of High Performance Computing (HPC) systems, including NEC SX vector architecture systems and, more recently, the IBM Power6/Power7 platforms. Understanding the influence of compiler flags, Message Passing Interface (MPI) libraries and run configurations is crucial to achieving the shortest elapsed times for a UM application on any particular HPC system. These aspects are especially important for applications that must run within operational time frames. The current study is driven by the HPC industry trend, evident since 1980, for processor arithmetic performance to increase at a faster rate than memory bandwidth. This gap has been growing especially fast for multicore processors over the past 10 years, and it can have significant implications for the performance and performance scaling of memory-bandwidth-intensive applications such as the UM. This paper analyses the use of partially used nodes on Intel Xeon clusters for short- and medium-range weather forecasting systems in global and limited-area configurations. It is shown that on Intel Xeon-based clusters the fastest elapsed times and the most efficient system usage can be achieved using partially committed nodes.

Short summary
The trend in High Performance Computing (HPC) is towards less memory bandwidth relative to the computational power of each core. With each CPU having multiple cores, the best way of using HPC systems is not always straightforward. For some time-critical applications, shorter run times can be obtained by using only some of the cores on each CPU and keeping the others idle. A number of factors need to be considered, but this provides a simple technique for a significant gain in application speed.
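
The trade-off described above can be illustrated with a minimal roofline-style sketch. It is not the method used in the paper: it simply assumes that each MPI task is limited either by its core's arithmetic rate or by its share of the node's memory bandwidth, whichever is slower. The function names (`task_time`, `nodes_needed`) and every hardware and workload figure below are hypothetical values chosen for illustration only, not measurements of the UM or of any Intel Xeon system.

```python
def task_time(active_cores, flops_per_task, bytes_per_task,
              core_flops, node_bandwidth, core_bandwidth_cap):
    """Estimate the time (s) one MPI task needs when `active_cores`
    cores on a node are in use and share the node's memory bandwidth."""
    if active_cores < 1:
        raise ValueError("at least one core per node must be active")
    compute_time = flops_per_task / core_flops
    # Each active core gets an equal share of the node bandwidth,
    # but a single core cannot exceed its own bandwidth limit.
    bandwidth_per_core = min(core_bandwidth_cap, node_bandwidth / active_cores)
    memory_time = bytes_per_task / bandwidth_per_core
    # The slower of the two resources determines the elapsed time.
    return max(compute_time, memory_time)


def nodes_needed(total_tasks, active_cores):
    """Number of nodes required to place `total_tasks` MPI tasks
    with `active_cores` tasks per node (ceiling division)."""
    return -(-total_tasks // active_cores)


if __name__ == "__main__":
    # Hypothetical node: 16 cores, 10 GFLOP/s per core,
    # 80 GB/s per node, at most 12 GB/s for a single core.
    CORE_FLOPS = 10e9
    NODE_BANDWIDTH = 80e9
    CORE_BANDWIDTH_CAP = 12e9
    # Hypothetical memory-bound workload: 512 tasks, each doing
    # 1e12 floating-point operations and moving 1e12 bytes.
    TOTAL_TASKS = 512
    FLOPS_PER_TASK = 1e12
    BYTES_PER_TASK = 1e12

    for active in (16, 8, 4):
        elapsed = task_time(active, FLOPS_PER_TASK, BYTES_PER_TASK,
                            CORE_FLOPS, NODE_BANDWIDTH, CORE_BANDWIDTH_CAP)
        nodes = nodes_needed(TOTAL_TASKS, active)
        print(f"{active:2d} cores/node: {nodes:3d} nodes, "
              f"{elapsed:6.1f} s elapsed, {nodes * elapsed:8.0f} node-seconds")
```

Under these assumed numbers, running 8 instead of 16 tasks per node halves the estimated elapsed time while the cost in node-seconds stays the same, because each task receives twice the memory bandwidth. Dropping to 4 tasks per node gives no further speed-up in this toy model, since arithmetic rather than bandwidth becomes the limit, and the node-second cost doubles. Real behaviour also depends on factors the sketch ignores, such as MPI communication, cache effects and I/O.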