<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" xml:lang="en" dtd-version="3.0">
  <front>
    <journal-meta><journal-id journal-id-type="publisher">GMD</journal-id><journal-title-group>
    <journal-title>Geoscientific Model Development</journal-title>
    <abbrev-journal-title abbrev-type="publisher">GMD</abbrev-journal-title><abbrev-journal-title abbrev-type="nlm-ta">Geosci. Model Dev.</abbrev-journal-title>
  </journal-title-group><issn pub-type="epub">1991-9603</issn><publisher>
    <publisher-name>Copernicus Publications</publisher-name>
    <publisher-loc>Göttingen, Germany</publisher-loc>
  </publisher></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5194/gmd-12-749-2019</article-id><title-group><article-title>MP CBM-Z V1.0: design for a new Carbon Bond Mechanism Z (CBM-Z) gas-phase chemical mechanism
architecture for next-generation processors</article-title><alt-title>MP CBM-Z V1.0</alt-title>
      </title-group><?xmltex \runningtitle{MP CBM-Z V1.0}?><?xmltex \runningauthor{H. Wang et al.}?>
      <contrib-group>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Wang</surname><given-names>Hui</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2 aff4">
          <name><surname>Lin</surname><given-names>Junmin</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="yes" rid="aff1">
          <name><surname>Wu</surname><given-names>Qizhong</given-names></name>
          <email>wqizhong@bnu.edu.cn</email>
        <ext-link>https://orcid.org/0000-0001-6308-3083</ext-link></contrib>
        <contrib contrib-type="author" corresp="yes" rid="aff3">
          <name><surname>Chen</surname><given-names>Huansheng</given-names></name>
          <email>chenhuansheng@mail.iap.ac.cn</email>
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff3">
          <name><surname>Tang</surname><given-names>Xiao</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff3">
          <name><surname>Wang</surname><given-names>Zifa</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff3">
          <name><surname>Chen</surname><given-names>Xueshun</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Cheng</surname><given-names>Huaqiong</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Wang</surname><given-names>Lanning</given-names></name>
          
        </contrib>
        <aff id="aff1"><label>1</label><institution>College of Global Change and Earth System Science, Joint Center for Global Changes Studies,
<?xmltex \hack{\break}?> Beijing Normal University, Beijing 100875, China</institution>
        </aff>
        <aff id="aff2"><label>2</label><institution>Intel (China) Corporation, Beijing 100013, China</institution>
        </aff>
        <aff id="aff3"><label>3</label><institution>State Key Laboratory of Atmospheric Boundary Layer Physics and Atmospheric Chemistry,
<?xmltex \hack{\break}?> Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China</institution>
        </aff>
        <aff id="aff4"><label>a</label><institution>now at: Artificial Intelligence Research Department, JD Corp., Beijing 100101, China</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">Qizhong Wu (wqizhong@bnu.edu.cn) and Huansheng
Chen (chenhuansheng@mail.iap.ac.cn)</corresp></author-notes><pub-date><day>20</day><month>February</month><year>2019</year></pub-date>
      
      <volume>12</volume>
      <issue>2</issue>
      <fpage>749</fpage><lpage>764</lpage>
      <history>
        <date date-type="received"><day>15</day><month>February</month><year>2018</year></date>
           <date date-type="rev-request"><day>24</day><month>April</month><year>2018</year></date>
           <date date-type="rev-recd"><day>13</day><month>November</month><year>2018</year></date>
           <date date-type="accepted"><day>22</day><month>January</month><year>2019</year></date>
      </history>
      <permissions>
        <copyright-statement>Copyright: © 2019 </copyright-statement>
        <copyright-year>2019</copyright-year>
      <license license-type="open-access"><license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p></license></permissions><self-uri xlink:href="https://gmd.copernicus.org/articles/.html">This article is available from https://gmd.copernicus.org/articles/.html</self-uri><self-uri xlink:href="https://gmd.copernicus.org/articles/.pdf">The full text article is available as a PDF file from https://gmd.copernicus.org/articles/.pdf</self-uri>
      <abstract>
    <p id="d1e180">Precise and rapid air quality simulations and forecasting are
limited by the computational performance of the air quality model used, and
the gas-phase chemistry module is the most time-consuming function in the air
quality model. In this study, we designed a new framework for the widely used
Carbon Bond Mechanism Z (CBM-Z) gas-phase chemical kinetics kernel to
adapt it to the single-instruction, multiple-data (SIMD) technology in next-generation
processors and thereby improve its computational performance. The
optimization implements the fine-grain level parallelization of CBM-Z by
improving its vectorization ability. Through constructing loops and
integrating the main branches, e.g., diverse chemistry sub-schemes, multiple
spatial points in the model can be operated simultaneously on vector
processing units (VPUs). Two generations of CPUs – Intel Xeon E5-2680 V4 and
Intel Xeon Gold 6132 – and Intel Xeon Phi 7250 Knights Landing (KNL) are
used as the benchmark processors. The validation of the CBM-Z module outputs
indicates that the relative bias reaches a maximum of 0.025 % after 10 h
integration with the <italic>-fp-model fast</italic> <inline-formula><mml:math id="M1" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula> compile flag. The results of
the module test show that the Multiple-Points CBM-Z (MP CBM-Z) resulted in
5.16<inline-formula><mml:math id="M2" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 8.97<inline-formula><mml:math id="M3" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedup on a single core of Intel Xeon E5-2680
V4 and Intel Xeon Gold 6132 CPUs, respectively, and KNL had a speedup of
3.69<inline-formula><mml:math id="M4" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> compared with the performance of CBM-Z on the Intel Xeon E5-2680
V4 platform. For the single-node tests, the speedups on the two CPU
generations reach 104.63<inline-formula><mml:math id="M5" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 198.50<inline-formula><mml:math id="M6" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> using message passing
interface (MPI) and 101.02<inline-formula><mml:math id="M7" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 194.60<inline-formula><mml:math id="M8" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> using OpenMP, and the
speedup on the KNL node can reach 175.23<inline-formula><mml:math id="M9" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> using MPI and 167.45<inline-formula><mml:math id="M10" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> using OpenMP. The speedup of
the optimized CBM-Z is approximately 40 % higher on a one-socket KNL
platform than on a two-socket Broadwell platform and about 13 %–16 %
lower than on a two-socket Skylake platform. We also tested a
three-dimensional chemistry transport model (CTM) named Nested Air Quality
Prediction Model System (NAQPMS) equipped with the MP CBM-Z. The tests
illustrate a clear improvement in the performance of the CTM after
adopting the MP CBM-Z. The results show that the MP CBM-Z leads to speedups
of 3.32 and 1.96 for the gas-phase chemistry module and the whole CTM, respectively, on the Intel
Xeon E5-2680 platform. Moreover, on the newer Intel Xeon Gold 6132 platform,
the MP CBM-Z gains 4.90<inline-formula><mml:math id="M11" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 2.22<inline-formula><mml:math id="M12" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedups for the gas-phase
chemistry module and the whole CTM. For the KNL, the MP CBM-Z enables a
3.52<inline-formula><mml:math id="M13" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedup for the gas-phase chemistry module, but the whole model
lost 24.10 % performance compared to the CPU platform due to the poor
performance of other modules. In addition, since this optimization seeks to
improve the utilization of the VPU, the model is more suitable for the new
generation processors adopting the more advanced SIMD technology. The results
of our tests already show that the benefit<?pagebreak page750?> of upgrading the CPU increases by about
47 % when using the MP CBM-Z, since the optimized code has better
adaptability for the new hardware. This work improves the performance of the
CBM-Z chemical kinetics kernel as well as the calculation efficiency of the
air quality model, which can directly improve the practical value of the air
quality model in scientific simulations and routine forecasting.</p>
  </abstract>
    </article-meta>
  </front>
<body>
      

<sec id="Ch1.S1" sec-type="intro">
  <title>Introduction</title>
      <p id="d1e289">Air pollution and its impacts on human health have attracted widespread
attention all over the world, especially in developing countries (Gurjar et
al., 2016; Zhang et al., 2017). As a useful tool for addressing air quality problems,
chemistry transport models (CTMs) are widely used in studies of air quality
(Gao et al., 2016; Chen et al., 2015; Wu et al., 2014) and in establishing air
quality forecasting (AQF) systems. As the core of the AQF system, a CTM
requires a large amount of computational resources to simulate the complex
chemical and physical processes. To satisfy the demand of routine air
quality forecasting in a timely manner, coarse spatial resolution and
relatively simple processes are adopted in CTMs to minimize the use of
computational resources. Meanwhile, other simulation studies with more
complex processes are also limited by computational resources. Therefore,
air quality studies can benefit significantly by improving the performance
of the CTM used.</p>
      <p id="d1e292">In a CTM, the most time-consuming module is the gas-phase chemistry module
(Wang et al., 2017). The gas-phase chemistry module is described as a system
of ordinary differential equations (ODEs) to simulate the chemical kinetics
of trace gases in an atmosphere model (Seinfeld and Pandis, 2012).
Linford et al. (2009) reported that the Regional Acid Deposition Model version 2 (RADM2)
(Zimmermann and Poppe, 1994; Chang et al., 1987), a chemical kinetics kernel,
accounted for 90 % of the computational time in the Weather Research and
Forecasting model coupled with Chemistry (WRF-Chem) (Grell et al., 2005). Another widely
used chemical kinetics kernel, the Carbon Bond Mechanism version Z (CBM-Z)
(Zaveri and Peters, 1999), accounts for approximately 68 % of the
computation time in the Global Nested Air Quality Prediction Model System
(GNAQPMS) (Chen et al., 2015; Wang et al., 2017). Therefore,
accelerating the gas-phase chemistry module can directly improve the
performance of the CTM as well as the whole AQF system. The AQF system can
also benefit from the performance improvement by adopting a higher model
resolution and improving the frequency of air quality forecasting.</p>
      <p id="d1e295">The performance of models improves with updated hardware. However, because
single-core designs have reached the power-density and thermal limits of silicon
technology, frequently upgrading hardware is no longer an
efficient way to improve a scientific model's performance. Additionally,
multicore and heterogeneous computing architectures, such as the
Many Integrated Core (MIC) and the graphics processing unit (GPU), have become
the hardware trend for high-performance computing (Xu et al., 2015; Lawrence
et al., 2018). Meanwhile, to take full advantage of the advanced features of
new processor architecture, the applications or the models must be
redesigned or rewritten. Xu et al. (2015) rewrote the Princeton Ocean Model
(POM) using Compute Unified Device Architecture-C (CUDA-C) to port it from a CPU to a GPU platform.
Linford et al. (2009)
also tried to solve the computation bottleneck of RADM2 mentioned above by
using a heterogeneous platform such as GPU–CPU. In addition, our previous
work described the primary optimizations we performed to accelerate the GNAQPMS on the new generation CPU and Intel MIC platforms (Knights Landing,
KNL; Sodani et al., 2016), which yielded a significant performance improvement on
both platforms: a 2.77<inline-formula><mml:math id="M14" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedup on the CPU and a 3.51<inline-formula><mml:math id="M15" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedup on the KNL node
(Wang et al., 2017). In this study, we redesign the code structure of the
chemical kinetics kernel CBM-Z to improve its vectorization performance on
the CPU and KNL platforms, which significantly improves its performance by
fully utilizing the single-instruction, multiple-data (SIMD) technology. We
tested the performance of this optimized CBM-Z module as well as a regional
CTM equipped with it. The test code contains only this single module,
which makes it easier for CTM developers to reuse the code.</p>
      <p id="d1e312">Section 2.1 in this paper introduces the CBM-Z scheme, and Sect. 2.2
describes the new architecture we designed for CBM-Z. Since multiple spatial
points were operated simultaneously in the optimized CBM-Z scheme, the
optimized CBM-Z scheme was called the Multiple-Points CBM-Z Version 1.0 (MP
CBM-Z V1.0). In Sect. 3.1, we present our benchmark platforms. In Sect. 3.2
and 3.3, we introduce the test cases and present the test results of
single-model tests and CTM tests separately. The conclusions and discussions
are given in Sect. 4.</p>
</sec>
<sec id="Ch1.S2">
  <title>Method description</title>
      <p id="d1e321">CBM-Z is a core module in CTMs that simulates the complex gas-phase chemical
processes in the atmosphere. In this module, the many branch options and the poor load
balance among the model grid boxes make it challenging to improve
performance at the vectorization level. This leads to poor performance of
CBM-Z on the new generation processors, which depend heavily on powerful
vector processing units (VPUs). In our previous work, we conducted several
optimizations on CBM-Z to enhance its vectorization and parallel
performance (Wang et al., 2017). In this work, we attempt to further enhance
its vector calculation ability by constructing a new code structure that makes
the CBM-Z module suitable for vectorization. The CBM-Z module was extracted
as<?pagebreak page751?> an individual box model to test its performance and improve code
reusability.</p>
<sec id="Ch1.S2.SS1">
  <title>Description of CBM-Z</title>
      <p id="d1e329">CBM-Z is a lumped-structure photochemical mechanism that was developed to
meet the needs of city-scale to global-scale tropospheric chemical
simulations (Zaveri and Peters, 1999). The original scheme contains 67
species and 132 reactions. CBM-Z has been widely used in CTMs, e.g., the
WRF-Chem (San José et al., 2015), the Nested Air Quality Prediction
Model System (NAQPMS) (Wang et al., 2001) and the GNAQPMS. In the
NAQPMS and GNAQPMS, CBM-Z was further modified by Li et al. (2012):
it was updated to 76 species, and 28 heterogeneous reactions were added.
CBM-Z uses the modified backward Euler (MBE) solver developed by Feng
et al. (2015), a faster and more robust algorithm that overcomes
inflexibility and preserves non-negativity.</p>
      <p id="d1e332">The main control flow of CBM-Z is shown in Fig. 1. The <italic>IntegrateChemistry</italic>
function is treated as the
core function of the module. CBM-Z contains five chemistry sub-schemes. They
are the Common Chemistry Scheme (COM), the Urban Chemistry Scheme (URB), the
Biogenic Chemistry Scheme (BIO), the Marine Chemistry Scheme (MAR), and the
Heterogeneous Chemistry Scheme (HET). The integration of different
sub-schemes is used to satisfy the simulation of diverse scenarios and
scales. The combination of sub-schemes relies on the concentration and
emission of each chemical species in the specific model grid, which is
implemented in the <italic>SelectGasRegime</italic> function. The variable
<italic>iregime</italic> stores the return value of
<italic>SelectGasRegime</italic> and controls the subsequent calculation
processes of CBM-Z. The possible values and the sub-schemes they represent are
shown in Table 1. Every combination includes the COM
and HET schemes, while the other schemes are added when the concentration or
emission of a corresponding species in that scheme is greater than
zero. Compared with an algorithm that computes all chemical reactions, this
approach saves computational resources on a single core,
but such irregular and unbalanced calculations lack well-structured loops
and impede the vectorization of the code. Besides the chemistry sub-schemes
mentioned above, CBM-Z uses other functional branches, e.g., nocturnal and
diurnal chemistry, which also impede the vectorization of the computation.
CBM-Z also contains many unstructured scalar operations. We previously
integrated some of these scalar operations by using indirect indexing to construct
loops for vectorization (Wang et al., 2017). However, this method required
significant effort and reconstructed only a limited number of scalar
operations, so the CBM-Z module still contains many scalar operations. With
multi-level control flow divergences and many scalar calculations, automatic
vectorization by the Intel compiler is not feasible.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F1"><caption><p id="d1e349">The framework of the CBM-Z gas-phase chemistry module. The
functions shown in yellow font are the inner functions of
<italic>IntegrateChemistry</italic>.</p></caption>
          <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/12/749/2019/gmd-12-749-2019-f01.png"/>

        </fig>

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T1"><caption><p id="d1e365">The possible values of <italic>iregime</italic> and the combination of chemical
schemes.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="7">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:colspec colnum="7" colname="col7" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1"><italic>iregime</italic></oasis:entry>
         <oasis:entry colname="col2">1</oasis:entry>
         <oasis:entry colname="col3">2</oasis:entry>
         <oasis:entry colname="col4">3</oasis:entry>
         <oasis:entry colname="col5">4</oasis:entry>
         <oasis:entry colname="col6">5</oasis:entry>
         <oasis:entry colname="col7">6</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2">COM</oasis:entry>
         <oasis:entry colname="col3">COM</oasis:entry>
         <oasis:entry colname="col4">COM</oasis:entry>
         <oasis:entry colname="col5">COM</oasis:entry>
         <oasis:entry colname="col6">COM</oasis:entry>
         <oasis:entry colname="col7">COM</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Sub-</oasis:entry>
         <oasis:entry colname="col2">HET</oasis:entry>
         <oasis:entry colname="col3">HET</oasis:entry>
         <oasis:entry colname="col4">HET</oasis:entry>
         <oasis:entry colname="col5">HET</oasis:entry>
         <oasis:entry colname="col6">HET</oasis:entry>
         <oasis:entry colname="col7">HET</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">schemes</oasis:entry>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3">URB</oasis:entry>
         <oasis:entry colname="col4">URB</oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">URB</oasis:entry>
         <oasis:entry colname="col7">URB</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4">BIO</oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7">BIO</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">MAR</oasis:entry>
         <oasis:entry colname="col6">MAR</oasis:entry>
         <oasis:entry colname="col7">MAR</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>
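      <p>As an illustration only, the mapping in Table 1 can be written as a small lookup table. The sketch below is plain Python rather than the Fortran of the model, and the names <italic>REGIME_SCHEMES</italic>, <italic>schemes_for</italic>, and <italic>distinct_paths</italic> are hypothetical; only the mapping itself is transcribed from Table 1. It shows how a run of neighbouring grid boxes with different <italic>iregime</italic> values translates into several distinct calculation paths, which is the control-flow divergence described above.</p>
<preformat preformat-type="code"><![CDATA[
```python
# Illustrative sketch; names are hypothetical, the mapping is transcribed from Table 1.
REGIME_SCHEMES = {
    1: ("COM", "HET"),
    2: ("COM", "HET", "URB"),
    3: ("COM", "HET", "URB", "BIO"),
    4: ("COM", "HET", "MAR"),
    5: ("COM", "HET", "URB", "MAR"),
    6: ("COM", "HET", "URB", "BIO", "MAR"),
}


def schemes_for(iregime):
    """Return the chemistry sub-schemes that are active for one iregime value."""
    return REGIME_SCHEMES[iregime]


def distinct_paths(iregimes):
    """Return the set of distinct sub-scheme combinations in a run of grid boxes."""
    return {schemes_for(r) for r in iregimes}


# Four neighbouring grid boxes with made-up regime values.
neighbours = [2, 2, 6, 1]
print(len(distinct_paths(neighbours)), "distinct calculation paths")
```
]]></preformat>
      <p>In a scalar, one-box-at-a-time design each distinct path is simply a different branch; in a vectorized design every additional path within one vector has to be handled by masking.</p>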

      <p id="d1e536">Fortunately, contiguous model grid boxes may have similar chemical processes
in air quality simulations, which provides the opportunity to group
grid boxes with similar or identical chemical processes and to use
vectorization to calculate the processes of multiple grid boxes
simultaneously. The following section describes how the
chemistry sub-schemes are integrated to implement the vectorization.</p>
</sec>
<sec id="Ch1.S2.SS2">
  <title>Algorithm description</title>
      <p id="d1e545">The new generation Intel CPU (e.g., Skylake) and Intel MIC chips are
equipped with the AVX-512 (AVX – Advanced Vector Extensions) or more advanced vectorization instructions, which
support a maximum of 8 double-precision or 16 single-precision
operations at once with 512 bit wide vector registers. Fully exploiting
the potential of the AVX-512 is critical to reaching the peak performance of
the next-generation CPUs and MICs (Mielikainen et al., 2014). As mentioned in Sect. 2.1,
automatic vectorization using a compiler is impeded by the features of
CBM-Z, and common manual measures, including constructing loops, avoiding
loop-carried data dependences, and aligning the data with directives, are needed to
further vectorize CBM-Z. To implement the vectorization of
the module, the general design allows<?pagebreak page752?> the CBM-Z module to handle multiple
grid boxes in one calling cycle, and the functions in CBM-Z were
reconstructed by adding a regular loop over these grid boxes. Subsequently,
these loops can be vectorized to implement the fine-grained parallelization
on a VPU.</p>
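      <p>The lane counts quoted above follow directly from the register width. The short Python sketch below only restates that arithmetic; the helper name <italic>lanes</italic> is hypothetical and is not part of CBM-Z.</p>
<preformat preformat-type="code"><![CDATA[
```python
def lanes(register_bits, element_bits):
    """Number of elements of a given width that fit into one vector register."""
    return register_bits // element_bits


AVX512_BITS = 512  # register width stated in the text

single_precision_lanes = lanes(AVX512_BITS, 32)  # 32 bit floats
double_precision_lanes = lanes(AVX512_BITS, 64)  # 64 bit floats
print(single_precision_lanes, double_precision_lanes)  # prints "16 8"
```
]]></preformat>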
      <p id="d1e548">All of the model grid boxes are distributed to multiple cores using a
message passing interface (MPI) and OpenMP, which is a type of coarse-grained
parallelization. Our goal is to implement fine-grained parallelization based
on SIMD, so that the grid boxes distributed to a specific core
are processed in parallel using the VPUs on that core. As shown in
Fig. 2, the calling method of the CBM-Z module
changes from calculating one model grid box at a time to calculating
multiple model grid boxes at the same time. The step length
(VLEN in
Fig. 2) of the loops represents the number of the
grid boxes operated simultaneously, and it is determined by the length of
the vector register. The VLEN was set to 16 since the 512 bit wide vector of
the AVX-512 can support 16 single-precision operations at the same time. Using
this framework, each function in CBM-Z contains an extra loop over the
point dimension, and the corresponding variables require an extra
dimension to store the information of multiple grid boxes. With this
extra loop, it is easier to implement vectorization.
Meanwhile, to handle remaining points that cannot fill a full vector of length VLEN,
we set a common variable array, <italic>pmask</italic> (VLEN) as shown in
Fig. 2, to store the availability label of the
model grid boxes. When the number of remaining grid boxes does not reach
VLEN, the <italic>pmask</italic> values of the excess entries
are set to “false” to mask them in the calculation.
Furthermore, the latitude and longitude dimension loops were merged, from
nested loops into a single loop, to reduce the number of unavailable points, as
shown in Fig. 2. Achieving such large-scale
vectorization also requires load balancing of the calculation processes, but
the calculation branches in CBM-Z are an obstacle to this. Therefore, the
branches in CBM-Z should be taken into consideration in constructing the
loops, especially the chemical schemes chosen in
Table 1. As mentioned in Sect. 2.1, the
contiguous model grid boxes may have similar chemical processes in the
atmosphere. This provides an opportunity to integrate the sub-schemes by
masking the heterogeneous model grid boxes, and this type of masking
operation can be used in the functions <italic>GasRateConstants</italic>
and <italic>ODEsolver</italic> (Fig. 1). Figure 3 shows the flowchart for masking the model grid boxes to enable the
vectorization of the grid array. A set of VLEN grid boxes
(16 in this study) is operated on simultaneously, and the
variable <italic>pmask</italic> flags the valid grid boxes. Meanwhile, the
variable <italic>iregime</italic>, which is described in
Table 1 and represents the combination of sub-schemes, is used to determine whether a model grid box must perform the
subsequent operation. The grid boxes with the same property or
calculation are kept by setting the variable <italic>bmask</italic> to
“true”. The COM and HET schemes are common for all grid boxes, and the mask
operation for COM and HET schemes only determines the availability of the
grid boxes. As shown in Fig. 3, for the URB, BIO, and MAR schemes, the <italic>iregime</italic> value and
<italic>pmask</italic> are both used to filter the heterogeneous grid boxes
and the <italic>bmask</italic> stores the results. To improve the
efficiency of vectorization, the <italic>bmask</italic> does not prevent
the calculation of heterogeneous grid boxes but prevents the calculation
results from being copied back to the return value. Thus, all computations
are performed on all grid boxes, but only the results of the valid grid
boxes are returned. This improves the utilization of data as well as the
efficiency of vectorization. Because the grid boxes are independent,
the computations over the VLEN-length arrays are independent and satisfy the
requirement of vectorization. After the reconstruction of the code, the corresponding directives were added to
declare the independence of the arrays and to force the compiler to perform
data alignment and vectorization.
Overall, by constructing the loops, the computations of the independent grid
boxes were integrated, enabling fine-grained parallelism through
SIMD. In addition, the efficiency of such an algorithm, in principle,
improves linearly as the vector width of the VPU increases.</p>
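      <p>The chunking and masking logic described above can be sketched in a few lines. The following is a plain-Python illustration of the control flow only, not the Fortran implementation and not a performance model: the function names, the 5 by 7 grid, and the sample values are hypothetical, while VLEN, <italic>pmask</italic>, <italic>bmask</italic>, and the set of regimes that include URB are taken from the text and Table 1.</p>
<preformat preformat-type="code"><![CDATA[
```python
VLEN = 16  # grid boxes handled per call, as chosen in the text


def chunks_with_pmask(n_points, vlen=VLEN):
    """Yield (start, pmask) for each VLEN-wide chunk of a flattened grid."""
    for start in range(0, n_points, vlen):
        valid = min(vlen, n_points - start)
        # Entries beyond the last real grid box are flagged False.
        yield start, [lane < valid for lane in range(vlen)]


def idle_lanes(n_points, vlen=VLEN):
    """Count the vector entries masked out by pmask over all chunks."""
    return sum(mask.count(False) for _, mask in chunks_with_pmask(n_points, vlen))


def masked_update(old, new, iregime, pmask, regimes_with_scheme):
    """Evaluate every entry, but keep the new value only where bmask is True."""
    bmask = [p and (r in regimes_with_scheme) for p, r in zip(pmask, iregime)]
    result = [n if b else o for o, n, b in zip(old, new, bmask)]
    return result, bmask


# Merging the i and j loops, shown for a hypothetical 5 x 7 horizontal grid.
ni, nj = 5, 7
merged = idle_lanes(ni * nj)                        # one loop over all 35 points
row_wise = sum(idle_lanes(nj) for _ in range(ni))   # one partly filled vector per row

# The URB sub-scheme is part of iregime 2, 3, 5 and 6 (Table 1).
URB_REGIMES = {2, 3, 5, 6}
pmask = [True, True, True, False]   # last entry is padding
iregime = [2, 1, 6, 2]
old = [1.0, 1.0, 1.0, 1.0]
new = [9.0, 9.0, 9.0, 9.0]          # stand-in for the computed URB results
updated, bmask = masked_update(old, new, iregime, pmask, URB_REGIMES)
```
]]></preformat>
      <p>In this sketch the merged loop leaves 13 masked entries, compared with 45 when each row of 7 points occupies its own vector, which illustrates why merging the latitude and longitude loops reduces the number of unavailable points. The masked update keeps the new value only for the first and third entries, since the second belongs to a regime without URB and the fourth is padding.</p>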

      <?xmltex \floatpos{t}?><fig id="Ch1.F2" specific-use="star"><caption><p id="d1e587">A schematic diagram of the changes in the calling method of CBM-Z.
The calling method of the CBM-Z module changes from calculating one model
grid box at a time to calculating multiple model grid boxes at the same time.
VLEN represents the number of points operated on simultaneously, which is
determined by the length of the register in the vector processing unit
(VPU). The <inline-formula><mml:math id="M16" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula> and <inline-formula><mml:math id="M17" display="inline"><mml:mi>j</mml:mi></mml:math></inline-formula> loops, corresponding to the latitude and longitude loops, were merged
to construct one vector to reduce the number of unfilled vectors. Panels <bold>(b)</bold> and
<bold>(c)</bold> illustrate the sample code before and after integrating grid boxes.</p></caption>
          <?xmltex \igopts{width=483.69685pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/12/749/2019/gmd-12-749-2019-f02.png"/>

        </fig>

      <?xmltex \floatpos{t}?><fig id="Ch1.F3" specific-use="star"><caption><p id="d1e619">The flowchart <bold>(a)</bold> shows how the heterogeneous grid boxes are masked to
integrate grid boxes and perform the vectorized operations according to
the <italic>iregime</italic> values. Panels <bold>(b)</bold> and <bold>(c)</bold> illustrate the sample code before and after
integrating grid boxes. In panel <bold>(b)</bold>, <italic>iregime</italic> leads to different calling
processes; in panel <bold>(c)</bold>, the calling processes are integrated into one flow,
and the functions are called for all grid boxes but only the values of valid
grid boxes are returned.</p></caption>
          <?xmltex \igopts{width=483.69685pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/12/749/2019/gmd-12-749-2019-f03.png"/>

        </fig>

</sec>
</sec>
<sec id="Ch1.S3">
  <title>Test results</title>
      <p id="d1e657">The validation and evaluation of the improvement of the new method were
conducted using the box model of CBM-Z as well as a regional CTM named
Nested Air Quality Prediction Model System with the optimized
CBM-Z scheme. We tested the theoretical performance of vectorization by
using the box model, and the CTM tests illustrate its potential in three
dimensions with varying chemical regimes.</p>
<sec id="Ch1.S3.SS1">
  <title>Benchmark platform description</title>
      <?pagebreak page754?><p id="d1e665">The computation cluster for the tests was provided by the Institute of Atmospheric
Physics (IAP), Chinese Academy of Sciences (CAS). The CPU and KNL platforms
were used for testing the code. The CPU platforms in this study include two
generations of CPUs: two-socket nodes with 2.4 GHz
14-core Intel Xeon E5-2680 V4 processors (Broadwell architecture) and two-socket nodes with 2.6 GHz
14-core Intel Xeon Gold 6132 processors (Skylake architecture). Regarding the vector
instructions, the previous-generation Broadwell adopts the AVX-2 vector
instructions and the new-generation Skylake uses the AVX-512 vector instructions.
The AVX-512 and AVX-2 instructions support 16 and 8 single-precision
floating-point calculations simultaneously, respectively. Comparing the two
CPU generations helps to show the potential of the new MP CBM-Z to fully exploit
advances in hardware. The KNL node contains one 1.4 GHz 68-core
Intel Xeon Phi 7250 processor, which also adopts the AVX-512 vector
instructions. The operating system was CentOS Linux 7.4.1708 for all
platforms. All code was compiled using the Intel Fortran Compiler 2017
update 4, and the compile flags for vectorization and floating-point
accuracy of the CBM-Z module and the NAQPMS are shown in
Tables 2 and 3, respectively. The
corresponding flags for vectorization (e.g., <italic>-xCORE-AVX2</italic>, <italic>-xCOMMON-AVX512</italic>,
<italic>-xMIC-AVX512</italic>, <italic>-align array64byte</italic>) were adopted for MP CBM-Z. We also tested
the code using diverse options for the compile flag <?xmltex \hack{\mbox\bgroup}?><italic>-fp-model</italic><?xmltex \hack{\egroup}?>, which
controls the balance between accuracy and performance of floating-point
calculations, to investigate its impact on the code. We mainly consider the two
options of <italic>-fp-model precise</italic> and <italic>-fp-model fast<inline-formula><mml:math id="M18" display="inline"><mml:mo mathvariant="normal">=</mml:mo></mml:math></inline-formula>1</italic>.
The <italic>fast</italic><inline-formula><mml:math id="M19" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> option is the default when the
<italic>-fp-model</italic> flag is not specified. Compared with the option <italic>precise</italic>,
<italic>fast</italic><inline-formula><mml:math id="M20" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> improves
the computational performance but reduces the accuracy of the floating-point
calculations. Using <italic>precise</italic> is a safer option and forces the compiler to avoid
the vectorization of some calculations to preserve accuracy. We compare the
results of the two options, including the model outputs and performance,
to investigate their impact and discuss a suitable choice of compile flag.</p>
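<p>The lane counts quoted above follow from dividing the vector register width by the element width. The short Python sketch below only illustrates that arithmetic; the names <italic>REGISTER_BITS</italic> and <italic>lanes</italic> are hypothetical and are not part of CBM-Z or the NAQPMS.</p>
<preformat preformat-type="code">
```python
# Lanes per vector register = register width in bits / element width in bits.
# The widths below are the architectural AVX2 and AVX-512 register widths.
REGISTER_BITS = {"AVX2": 256, "AVX-512": 512}
SINGLE_PRECISION_BITS = 32  # IEEE 754 binary32


def lanes(instruction_set, element_bits=SINGLE_PRECISION_BITS):
    """Return how many elements of the given width fit in one vector register."""
    return REGISTER_BITS[instruction_set] // element_bits


for name in REGISTER_BITS:
    print(name, lanes(name))
```
</preformat>
<p>Under the same reasoning, double-precision (64-bit) elements would halve these counts, which is one reason the single-precision lane count is the relevant figure here.</p>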

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T2" specific-use="star"><caption><p id="d1e734">Compile flags of the different versions of CBM-Z.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="4">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="left"/>
     <oasis:thead>
       <oasis:row>

         <oasis:entry colname="col1">Version of CBM-Z</oasis:entry>

         <oasis:entry colname="col2">Processor</oasis:entry>

         <oasis:entry rowsep="1" namest="col3" nameend="col4" align="center">Intel compiler flags </oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1"/>

         <oasis:entry colname="col2"/>

         <oasis:entry colname="col3"/>

         <oasis:entry colname="col4">Flags for</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1"/>

         <oasis:entry colname="col2"/>

         <oasis:entry colname="col3">Flags for vectorization</oasis:entry>

         <oasis:entry colname="col4">floating-point accuracy</oasis:entry>

       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col1" morerows="3">Baseline CBM-Z</oasis:entry>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry colname="col3"><italic>-xCORE-AVX2</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model precise</italic></oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3"><italic>-xCORE-AVX2</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model fast</italic> <inline-formula><mml:math id="M21" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon Gold 6132</oasis:entry>

         <oasis:entry colname="col3"><italic>-xCOMMON-AVX512</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model precise</italic></oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3"><italic>-xCOMMON-AVX512</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model fast</italic> <inline-formula><mml:math id="M22" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" morerows="4">MP CBM-Z</oasis:entry>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry colname="col3"><italic>-xCORE-AVX2</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model precise</italic></oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3"><italic>-xCORE-AVX2</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model fast</italic> <inline-formula><mml:math id="M23" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon Gold 6132</oasis:entry>

         <oasis:entry colname="col3"><italic>-xCOMMON-AVX512</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model precise</italic></oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3"><italic>-xCOMMON-AVX512</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model fast</italic> <inline-formula><mml:math id="M24" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2">Xeon Phi 7250</oasis:entry>

         <oasis:entry colname="col3"><italic>-xMIC-AVX512</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model fast</italic> <inline-formula><mml:math id="M25" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T3" specific-use="star"><caption><p id="d1e968">Compile flags of the different versions of NAQPMS.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="4">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="left"/>
     <oasis:thead>
       <oasis:row>

         <oasis:entry colname="col1">Version of NAQPMS</oasis:entry>

         <oasis:entry colname="col2">Processor</oasis:entry>

         <oasis:entry rowsep="1" namest="col3" nameend="col4" align="center">Intel compiler flags </oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1"/>

         <oasis:entry colname="col2"/>

         <oasis:entry colname="col3"/>

         <oasis:entry colname="col4">Flags for</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1"/>

         <oasis:entry colname="col2"/>

         <oasis:entry colname="col3">Flags for vectorization</oasis:entry>

         <oasis:entry colname="col4">floating-point accuracy</oasis:entry>

       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col1" morerows="3">Baseline NAQPMS</oasis:entry>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry colname="col3"><italic>-xCORE-AVX2</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model precise</italic></oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3"><italic>-xCORE-AVX2</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model fast</italic> <inline-formula><mml:math id="M26" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon Gold 6132</oasis:entry>

         <oasis:entry colname="col3"><italic>-xCOMMON-AVX512</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model precise</italic></oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3"><italic>-xCOMMON-AVX512</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model fast</italic> <inline-formula><mml:math id="M27" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" morerows="4">NAQPMS with MP CBM-Z</oasis:entry>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry colname="col3"><italic>-xCORE-AVX2</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model precise</italic></oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3"><italic>-xCORE-AVX2</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model fast</italic> <inline-formula><mml:math id="M28" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon Gold 6132</oasis:entry>

         <oasis:entry colname="col3"><italic>-xCOMMON-AVX512</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model precise</italic></oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col3"><italic>-xCOMMON-AVX512</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model fast</italic> <inline-formula><mml:math id="M29" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2">Xeon Phi 7250</oasis:entry>

         <oasis:entry colname="col3"><italic>-xMIC-AVX512</italic></oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model fast</italic> <inline-formula><mml:math id="M30" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
<sec id="Ch1.S3.SS2">
  <title>Box model test</title>
      <p id="d1e1205">The box model of MP CBM-Z was used to validate the model outputs and
investigate the ideal parallel performance of the single module. We also
tested the results using different parallelization techniques, e.g., MPI and
OpenMP. Each test was repeated 10 times to reduce the impact from any
platform variability.</p>
<sec id="Ch1.S3.SS2.SSS1">
  <title>Test case description</title>
      <p id="d1e1213">Two cases were used for the CBM-Z box model. One was a 10 h
single-grid-box case with all species to validate the outputs of the model,
and the other was a 1 h simulation with <inline-formula><mml:math id="M31" display="inline"><mml:mrow><mml:mn mathvariant="normal">160</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">148</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula> grid boxes to test the
performance of the module under a more realistic scenario. The initial
values for the single grid box are shown in Table S1 in the Supplement. The meteorological
conditions were constant and emissions were set to zero to test the error of
the algorithms. The time step was 5 s for the two cases. For
validation purposes, output every 5 min was used, while the
computational performance test did not include the output function to
eliminate any impact from input–output (I/O). The different compiling flags for the
precision of floating-point calculations are presented in Table 2. We tested
the baseline and the optimized model on the CPU and
KNL platforms, and the computational time was measured using the
<italic>system_clock</italic> function.</p>
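<p>To give a sense of the workload implied by these settings, the following Python sketch derives the number of grid boxes, time steps, and output records from the values stated above. It is illustrative arithmetic only; all variable names are hypothetical.</p>
<preformat preformat-type="code">
```python
# Values taken from the test case description.
NX, NY, NZ = 160, 148, 20      # grid dimensions of the performance case
TIME_STEP_S = 5                # chemistry time step in seconds
OUTPUT_INTERVAL_MIN = 5        # output interval of the validation case

# Performance case: 1 h simulation on the full grid.
grid_boxes = NX * NY * NZ
steps_performance_case = 1 * 3600 // TIME_STEP_S
box_updates = grid_boxes * steps_performance_case

# Validation case: 10 h simulation of a single grid box.
steps_validation_case = 10 * 3600 // TIME_STEP_S
outputs_validation_case = 10 * 60 // OUTPUT_INTERVAL_MIN

print("grid boxes:", grid_boxes)
print("time steps (1 h case):", steps_performance_case)
print("grid-box integrations (1 h case):", box_updates)
print("time steps (10 h case):", steps_validation_case)
print("output records (10 h case):", outputs_validation_case)
```
</preformat>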
</sec>
<sec id="Ch1.S3.SS2.SSS2">
  <title>Box model validation</title>
      <p id="d1e1241">We evaluate the chemical species including ozone (<inline-formula><mml:math id="M32" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>), nitrogen dioxide
(<inline-formula><mml:math id="M33" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>), nitrogen monoxide (NO), hydrogen peroxide (<inline-formula><mml:math id="M34" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">H</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>),
sulfur dioxide (<inline-formula><mml:math id="M35" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>), sulfuric acid (<inline-formula><mml:math id="M36" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">H</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">4</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>), hydroxyl (OH)
radical, hydroperoxyl (<inline-formula><mml:math id="M37" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">HO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>) radical, and alkyl peroxy (<inline-formula><mml:math id="M38" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">RO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>)
radical. These species are important for tropospheric gas-phase chemistry and
sulfate aerosol formation and hence suitable for validating whether the
optimization significantly changed the simulated results or not.
Figure 4 shows the time series of the simulated
concentrations of the species from the baseline (base) and the
optimized (opt) model with the <italic>precise</italic> and <italic>fast</italic><inline-formula><mml:math id="M39" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flags.
The result of the baseline
code with the <italic>precise</italic> compile flag is the benchmark, and there
is no difference between the results of the baseline and optimized code
with the same <italic>precise</italic> compile flag. The
<italic>precise</italic> compile flag is a relatively safe compile flag and
prohibits optimizations that can affect the accuracy. The
<italic>fast</italic><inline-formula><mml:math id="M40" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flag can lead to errors
even with the same code, but the relative error (RE) of the baseline code
with the <italic>fast</italic><inline-formula><mml:math id="M41" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flag relative to
the benchmark is extremely small (<inline-formula><mml:math id="M42" display="inline"><mml:mrow><mml:mo>&lt;</mml:mo><mml:mn mathvariant="normal">0.0002</mml:mn></mml:mrow></mml:math></inline-formula> %). As shown in Fig. 4,
with the optimized code, the <italic>fast</italic><inline-formula><mml:math id="M43" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic>
compile flag results in a maximum RE of 0.025 % for NO and <inline-formula><mml:math id="M44" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> at the
end of the simulation. We find that the error caused by the
<italic>fast</italic><inline-formula><mml:math id="M45" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flag
did not become noticeable for the low-concentration species OH and
<inline-formula><mml:math id="M46" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">RO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>. We will further discuss
the impact of the <italic>fast</italic><inline-formula><mml:math id="M47" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flag
in Sect. 3.3.2 in the context of CTM simulations.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F4" specific-use="star"><caption><p id="d1e1447">Comparison of the time-series concentrations of <inline-formula><mml:math id="M48" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, NO,
<inline-formula><mml:math id="M49" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M50" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">H</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M51" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, OH, <inline-formula><mml:math id="M52" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">HO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M53" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">RO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, and
<inline-formula><mml:math id="M54" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">H</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">4</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> <bold>(a–i)</bold> from the baseline and optimized CBM-Z simulation
with diverse <italic>-fp-model</italic> options. The simulation result of the baseline code
with the <italic>-fp-model precise</italic> compile flag was used as the benchmark. The solid
lines show the time-series concentrations of the species from the different
experiments, and the dashed lines show the relative errors (RE) of the
simulated concentrations between the benchmark and the results of the other
combinations of the code and <italic>-fp-model</italic> options.</p></caption>
            <?xmltex \igopts{width=483.69685pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/12/749/2019/gmd-12-749-2019-f04.png"/>

          </fig>

</sec>
<sec id="Ch1.S3.SS2.SSS3">
  <title>Box model computational performance</title>
      <p id="d1e1562">The case with <inline-formula><mml:math id="M55" display="inline"><mml:mrow><mml:mn mathvariant="normal">160</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">148</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula> grid boxes was used to test the computational
performance. Both the baseline and the optimized version of CBM-Z contained
the same 76 species. The computational time of the baseline version on a
single core of the E5-2680 V4 CPU with the <italic>precise</italic> compile
flag was considered as the benchmark time. The tests were done with two
generations of CPUs and KNL.</p>
      <p id="d1e1584">The choice of <italic>-fp-model</italic> option directly affects the performance. As shown in
Table 4, the benchmark wall time was 1014.67 s on the E5-2680 V4
platform. On the newer platform with the Intel Xeon Gold 6132, the baseline
code achieves a 1.52<inline-formula><mml:math id="M56" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedup with the <italic>precise</italic> compile flag. The
<italic>fast</italic><inline-formula><mml:math id="M57" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flag leads to 1.28<inline-formula><mml:math id="M58" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and
2.04<inline-formula><mml:math id="M59" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedups for the baseline code on the E5-2680 V4 and Gold 6132 CPUs, respectively, relative to the benchmark. Meanwhile, upgrading the
CPU enables the original CBM-Z module to gain speedups of about 1.52 and
1.59 with the <italic>precise</italic> and
<italic>fast</italic><inline-formula><mml:math id="M60" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flags, respectively.</p>
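<p>The speedups quoted for the baseline code can be reproduced from the wall times in Table 4 as the ratio of a reference time to a measured time. The Python sketch below is a minimal illustration of that calculation; the helper <italic>speedup</italic> and the dictionary layout are hypothetical.</p>
<preformat preformat-type="code">
```python
# Wall times (s) of the baseline CBM-Z on one core, copied from Table 4.
BASELINE_WALL_S = {
    ("E5-2680 V4", "precise"): 1014.67,  # benchmark run
    ("E5-2680 V4", "fast=1"): 792.03,
    ("Gold 6132", "precise"): 665.44,
    ("Gold 6132", "fast=1"): 497.64,
}
BENCHMARK_S = BASELINE_WALL_S[("E5-2680 V4", "precise")]


def speedup(reference_s, measured_s):
    """Speedup of a run relative to a reference wall time."""
    return reference_s / measured_s


# Speedup of every baseline run relative to the benchmark.
vs_benchmark = {
    key: speedup(BENCHMARK_S, wall_s) for key, wall_s in BASELINE_WALL_S.items()
}

# Gain from the CPU upgrade alone, with the compile flag held fixed.
cpu_gain = {
    flag: speedup(
        BASELINE_WALL_S[("E5-2680 V4", flag)],
        BASELINE_WALL_S[("Gold 6132", flag)],
    )
    for flag in ("precise", "fast=1")
}

for key, value in vs_benchmark.items():
    print(key, f"{value:.2f}")
for flag, value in cpu_gain.items():
    print("CPU upgrade,", flag, f"{value:.2f}")
```
</preformat>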

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T4" specific-use="star"><caption><p id="d1e1643">Performance tests of the baseline and optimized code on
different CPU and KNL platforms with one physical core. The unit of the
wall times is seconds (s).</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="6">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="left"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1"/>

         <oasis:entry colname="col2">Processor</oasis:entry>

         <oasis:entry colname="col3">Vector instruction</oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model</italic></oasis:entry>

         <oasis:entry colname="col5">Wall time</oasis:entry>

         <oasis:entry colname="col6">Speedup</oasis:entry>

       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col1" morerows="3">Baseline CBM-Z</oasis:entry>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry rowsep="1" colname="col3" morerows="1">AVX2</oasis:entry>

         <oasis:entry colname="col4">precise</oasis:entry>

         <oasis:entry colname="col5">1014.67</oasis:entry>

         <oasis:entry colname="col6">1.00</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col4">fast<inline-formula><mml:math id="M61" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1</oasis:entry>

         <oasis:entry colname="col5">792.03</oasis:entry>

         <oasis:entry colname="col6">1.28</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon Gold 6132</oasis:entry>

         <oasis:entry rowsep="1" colname="col3" morerows="1">AVX512</oasis:entry>

         <oasis:entry colname="col4">precise</oasis:entry>

         <oasis:entry colname="col5">665.44</oasis:entry>

         <oasis:entry colname="col6">1.52</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col4">fast<inline-formula><mml:math id="M62" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1</oasis:entry>

         <oasis:entry colname="col5">497.64</oasis:entry>

         <oasis:entry colname="col6">2.04</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" morerows="5">MP CBM-Z</oasis:entry>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry rowsep="1" colname="col3" morerows="1">AVX2</oasis:entry>

         <oasis:entry colname="col4">precise</oasis:entry>

         <oasis:entry colname="col5">581.14</oasis:entry>

         <oasis:entry colname="col6">1.75</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col4">fast<inline-formula><mml:math id="M63" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1</oasis:entry>

         <oasis:entry colname="col5">153.32</oasis:entry>

         <oasis:entry colname="col6">6.62</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon Gold 6132</oasis:entry>

         <oasis:entry rowsep="1" colname="col3" morerows="1">AVX512</oasis:entry>

         <oasis:entry colname="col4">precise</oasis:entry>

         <oasis:entry colname="col5">352.00</oasis:entry>

         <oasis:entry colname="col6">2.88</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col4">fast<inline-formula><mml:math id="M64" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1</oasis:entry>

         <oasis:entry colname="col5">55.42</oasis:entry>

         <oasis:entry colname="col6">18.31</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2" morerows="1">Xeon Phi 7250</oasis:entry>

         <oasis:entry colname="col3" morerows="1">AVX512</oasis:entry>

         <oasis:entry colname="col4">precise</oasis:entry>

         <oasis:entry colname="col5">3454.90</oasis:entry>

         <oasis:entry colname="col6">0.29</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col4">fast<inline-formula><mml:math id="M65" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1</oasis:entry>

         <oasis:entry colname="col5">214.09</oasis:entry>

         <oasis:entry colname="col6">4.74</oasis:entry>

       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <p id="d1e1886">The MP CBM-Z module shows good performance on both CPUs. On the E5-2680 V4
CPU with Broadwell architecture, the optimized code with two different
compile flags consumed 581.14 and 153.32 s, respectively;
meanwhile, the speedups reach 1.75<inline-formula><mml:math id="M66" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 6.62<inline-formula><mml:math id="M67" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> compared with the benchmark
performance. On the Intel Xeon Gold 6132 platform, the
optimized version of CBM-Z consumed 352.00 and 55.42 s with the
<italic>precise</italic> and <italic>fast</italic><inline-formula><mml:math id="M68" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic>
compile flags, respectively. Compared with the benchmark time, the speedups
reach 2.88<inline-formula><mml:math id="M69" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 18.31<inline-formula><mml:math id="M70" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula>. With the same
<italic>fast</italic><inline-formula><mml:math id="M71" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile option, the MP CBM-Z
shows 5.16<inline-formula><mml:math id="M72" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 8.97<inline-formula><mml:math id="M73" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedups over the
original CBM-Z code on the Broadwell and Skylake CPUs, respectively.</p>
      <p id="d1e1958">The results also illustrate that the optimized code benefits more from
newer CPUs than the baseline code because of its better vectorization.
Comparing the performance of the optimized code on the two CPUs, we find that
upgrading the CPU leads to about 1.65 times and 2.76 times acceleration
with the <italic>precise</italic> and
<italic>fast</italic><inline-formula><mml:math id="M74" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flags, respectively,
which is higher than the 1.5<inline-formula><mml:math id="M75" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedup gained with the baseline code.</p>
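<p>The ratios quoted for the optimized code can likewise be checked against the single-core wall times in Table 4. In the Python sketch below, the computed ratios are truncated to two decimals because that reproduces the quoted values (for example, 792.03/153.32 is close to 5.166); truncation rather than rounding is an assumption about how the values were reported, and <italic>truncate2</italic> is a hypothetical helper.</p>
<preformat preformat-type="code">
```python
import math

# Single-core wall times (s), copied from Table 4.
BASELINE_FAST_S = {"E5-2680 V4": 792.03, "Gold 6132": 497.64}
OPTIMIZED_FAST_S = {"E5-2680 V4": 153.32, "Gold 6132": 55.42}
OPTIMIZED_PRECISE_S = {"E5-2680 V4": 581.14, "Gold 6132": 352.00}


def truncate2(value):
    """Truncate (not round) to two decimals; an assumed reporting convention."""
    return math.floor(value * 100) / 100


# Optimized versus baseline code on the same CPU, both compiled with fast=1.
same_flag_gain = {
    cpu: BASELINE_FAST_S[cpu] / OPTIMIZED_FAST_S[cpu] for cpu in BASELINE_FAST_S
}

# Optimized code only: gain from moving from the E5-2680 V4 to the Gold 6132.
upgrade_gain = {
    "precise": OPTIMIZED_PRECISE_S["E5-2680 V4"] / OPTIMIZED_PRECISE_S["Gold 6132"],
    "fast=1": OPTIMIZED_FAST_S["E5-2680 V4"] / OPTIMIZED_FAST_S["Gold 6132"],
}

for cpu, gain in same_flag_gain.items():
    print("MP CBM-Z vs baseline on", cpu, truncate2(gain))
for flag, gain in upgrade_gain.items():
    print("CPU upgrade with", flag, truncate2(gain))
```
</preformat>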
      <?pagebreak page755?><p id="d1e1982">Compile flags largely affect the code performance on KNL. On the Xeon Phi
7250 platform, the optimized code took 3454.90 s with the
<italic>precise</italic> compile flag because most vectorization
was inhibited, which is even slower than the benchmark; by contrast, it
took only 214.09 s and obtained a speedup of 4.74<inline-formula><mml:math id="M76" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> with the
<italic>fast</italic><inline-formula><mml:math id="M77" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flag. Compared with
the baseline CBM-Z with the <italic>fast</italic><inline-formula><mml:math id="M78" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> flag on Intel Xeon E5-2680 V4, KNL gains a 3.69<inline-formula><mml:math id="M79" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedup with
the MP CBM-Z.</p>
      <p id="d1e2025">In addition, the baseline and optimized code with
<italic>fast</italic><inline-formula><mml:math id="M80" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> were also analyzed by using the
high-performance computing (HPC) performance characterization
from the Intel VTune tools on the CPU
platform. On the Intel Xeon Gold 6132 platform, the single-precision
giga-floating-point operations per second (GFLOPS) increased from
4.81 in the original CBM-Z module to 21.37 in the MP CBM-Z, and the vector
capacity usage improved from 14.3 % in the baseline CBM-Z to 89.4 % in
the MP CBM-Z, which implies that the majority of floating-point instructions
in the MP CBM-Z were vectorized.</p>
      <?pagebreak page757?><p id="d1e2039">We also tested the parallel version of the MP CBM-Z by
compiling with the <italic>fast</italic><inline-formula><mml:math id="M81" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> option and
with MPI and OpenMP separately. We evaluated the speedups relative to the
performance of the baseline CBM-Z on the Intel Xeon E5-2680 V4 platform with the
<italic>fast</italic><inline-formula><mml:math id="M82" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> option. The results are shown
in Table 5. The MPI and OpenMP versions of the MP CBM-Z achieved speedups of 104.63<inline-formula><mml:math id="M83" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and
101.02<inline-formula><mml:math id="M84" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula>, respectively, on the Intel Xeon E5-2680 V4 platform. On the newer Intel Xeon
Gold 6132, the MP CBM-Z achieved speedups of 198.50<inline-formula><mml:math id="M85" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 194.60<inline-formula><mml:math id="M86" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> with MPI and
OpenMP, respectively. For the KNL, the speedup reached 175.23<inline-formula><mml:math id="M87" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> with MPI and 167.45<inline-formula><mml:math id="M88" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> with
OpenMP, which was approximately 40 % faster than those on the
two-socket Broadwell platform with AVX2 vector instructions and about
13 %–16 % slower than those on the two-socket Skylake platform
with the same AVX-512 vector instructions. The combination of the
fine-grain vectorization and the coarse-grain parallelization of OpenMP/MPI
results in a significant performance improvement on the new generation
processors. The enhancement of the vectorization performance may be the key
to fully using the new generation processors equipped with advanced and
wider vectors and can be important in making full use of the new MIC
architecture processors such as KNL.</p>
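      <p>The speedups in Table 5 are ratios of the single-core reference wall time to the parallel wall time, and the KNL comparison is a relative difference in wall time. The short Python sketch below reproduces this arithmetic from the tabulated values. It is only an illustration: the helper names <italic>speedup</italic> and <italic>relative_wall_time_change</italic> and the dictionary layout are assumptions and are not part of the MP CBM-Z code.</p>
<preformat preformat-type="code">
```python
# Wall times (s) copied from Table 5; the dictionary layout is an assumption.
SINGLE_CORE_WALL_TIME = 792.03  # single-core reference on Xeon E5-2680 V4

WALL_TIMES = {
    ("MPI", "Xeon E5-2680 V4"): 7.57,
    ("MPI", "Xeon Gold 6132"): 3.99,
    ("MPI", "Xeon Phi 7250"): 4.52,
    ("OpenMP", "Xeon E5-2680 V4"): 7.84,
    ("OpenMP", "Xeon Gold 6132"): 4.07,
    ("OpenMP", "Xeon Phi 7250"): 4.73,
}


def speedup(reference_time, wall_time):
    """Speedup as the ratio of reference wall time to measured wall time."""
    return reference_time / wall_time


def relative_wall_time_change(wall_time, other_wall_time):
    """Fractional change of wall_time relative to other_wall_time.

    A negative value means wall_time is shorter than other_wall_time.
    """
    return (wall_time - other_wall_time) / other_wall_time


# Print the speedup of every parallel run relative to the single-core run.
for (model, processor), wall_time in WALL_TIMES.items():
    value = speedup(SINGLE_CORE_WALL_TIME, wall_time)
    print(f"{model} on {processor}: speedup = {value:.2f}")

# Compare KNL with the two CPU platforms for the MPI runs.
knl = WALL_TIMES[("MPI", "Xeon Phi 7250")]
broadwell = WALL_TIMES[("MPI", "Xeon E5-2680 V4")]
skylake = WALL_TIMES[("MPI", "Xeon Gold 6132")]
print(f"KNL vs Broadwell: {100 * relative_wall_time_change(knl, broadwell):.1f} %")
print(f"KNL vs Skylake: {100 * relative_wall_time_change(knl, skylake):.1f} %")
```
</preformat>
      <p>With these definitions, the MPI wall time on KNL is about 40 % shorter than on the Broadwell platform and about 13 % longer than on the Skylake platform.</p>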

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T5" specific-use="star"><caption><p id="d1e2110">The performance tests of the optimized code on different CPUs and
KNL platforms with MPI and OpenMP. The unit of the wall times for the tests
is seconds (s).</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="6">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:tbody>

       <oasis:row>

         <oasis:entry colname="col1"/>

         <oasis:entry rowsep="1" namest="col2" nameend="col6">Single core test </oasis:entry>

       </oasis:row>

       <oasis:row>

         <oasis:entry colname="col1"/>

         <oasis:entry colname="col2"/>

         <oasis:entry colname="col3">Vector</oasis:entry>

         <oasis:entry colname="col4">Number</oasis:entry>

         <oasis:entry colname="col5">Wall</oasis:entry>

         <oasis:entry colname="col6"/>

       </oasis:row>

       <oasis:row>

         <?xmltex \rotentry?><oasis:entry colname="col1" morerows="9">MP CBM-Z</oasis:entry>

         <oasis:entry rowsep="1" colname="col2">Processor</oasis:entry>

         <oasis:entry rowsep="1" colname="col3">instruction</oasis:entry>

         <oasis:entry rowsep="1" colname="col4">of cores</oasis:entry>

         <oasis:entry rowsep="1" colname="col5">time</oasis:entry>

         <oasis:entry rowsep="1" colname="col6">Speedup</oasis:entry>

       </oasis:row>

       <oasis:row rowsep="1">

         <oasis:entry colname="col2">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry colname="col3">AVX2</oasis:entry>

         <oasis:entry colname="col4">1</oasis:entry>

         <oasis:entry colname="col5">792.03</oasis:entry>

         <oasis:entry colname="col6">1.00</oasis:entry>

       </oasis:row>

       <oasis:row rowsep="1">

         <oasis:entry namest="col2" nameend="col6">MPI with vectorization </oasis:entry>

       </oasis:row>

       <oasis:row>

         <oasis:entry colname="col2">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry colname="col3">AVX2</oasis:entry>

         <oasis:entry colname="col4">28</oasis:entry>

         <oasis:entry colname="col5">7.57</oasis:entry>

         <oasis:entry colname="col6">104.63</oasis:entry>

       </oasis:row>

       <oasis:row>

         <oasis:entry colname="col2">Xeon Gold 6132</oasis:entry>

         <oasis:entry colname="col3">AVX512</oasis:entry>

         <oasis:entry colname="col4">28</oasis:entry>

         <oasis:entry colname="col5">3.99</oasis:entry>

         <oasis:entry colname="col6">198.50</oasis:entry>

       </oasis:row>

       <oasis:row rowsep="1">

         <oasis:entry colname="col2">Xeon Phi 7250</oasis:entry>

         <oasis:entry colname="col3">AVX512</oasis:entry>

         <oasis:entry colname="col4">68</oasis:entry>

         <oasis:entry colname="col5">4.52</oasis:entry>

         <oasis:entry colname="col6">175.23</oasis:entry>

       </oasis:row>

       <oasis:row rowsep="1">

         <oasis:entry namest="col2" nameend="col6">OpenMP with vectorization </oasis:entry>

       </oasis:row>

       <oasis:row>

         <oasis:entry colname="col2">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry colname="col3">AVX2</oasis:entry>

         <oasis:entry colname="col4">28</oasis:entry>

         <oasis:entry colname="col5">7.84</oasis:entry>

         <oasis:entry colname="col6">101.02</oasis:entry>

       </oasis:row>

       <oasis:row>

         <oasis:entry colname="col2">Xeon Gold 6132</oasis:entry>

         <oasis:entry colname="col3">AVX512</oasis:entry>

         <oasis:entry colname="col4">28</oasis:entry>

         <oasis:entry colname="col5">4.07</oasis:entry>

         <oasis:entry colname="col6">194.60</oasis:entry>

       </oasis:row>

       <oasis:row>

         <oasis:entry colname="col2">Xeon Phi 7250</oasis:entry>

         <oasis:entry colname="col3">AVX512</oasis:entry>

         <oasis:entry colname="col4">68</oasis:entry>

         <oasis:entry colname="col5">4.73</oasis:entry>

         <oasis:entry colname="col6">167.45</oasis:entry>

       </oasis:row>

     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
</sec>
<sec id="Ch1.S3.SS3">
  <title>CTM test</title>
      <p id="d1e2331">The regional CTM, the NAQPMS (Wang et al., 2001; ZiFa et al., 2006), was used to test the MP CBM-Z
module under more realistic conditions. The following subsections
describe the CTM test case and present the results of the scientific
validation and the computational performance tests.</p>
<sec id="Ch1.S3.SS3.SSS1">
  <title>CTM test case description</title>
      <p id="d1e2339">The NAQPMS is a regional CTM developed by IAP, CAS (Li et al., 2011, 2013),
and has been widely used in air quality research (Wang et al., 2018) and
routine air quality forecasting (Wu et al., 2010; Chen et al., 2013). NAQPMS
involves all essential processes including diffusion, advection, dry and wet
deposition, and multiphase chemistry reactions. More details can be found
in Li et al. (2013). As in the box model test case, the NAQPMS with the baseline and optimized CBM-Z modules was compiled with the
various compile flags shown in Table 3.</p>
      <p id="d1e2342">The test case is a 72 h simulation covering the East Asia region. The
horizontal resolution is 15 km with <inline-formula><mml:math id="M89" display="inline"><mml:mrow><mml:mn mathvariant="normal">339</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">432</mml:mn></mml:mrow></mml:math></inline-formula> grid boxes. The model adopted 20
vertical layers. The meteorological fields driving the NAQPMS were
provided by the Weather Research and Forecasting (WRF) model (Skamarock et
al., 2008). The anthropogenic emission inventory was from the Hemispheric
Transport of Air Pollution (HTAP) V2 and the biogenic emission inventory was
provided by results from Sindelarova et al. (2014) using the Model of
Emissions of Gases and Aerosols from Nature (MEGAN) (Guenther et al., 2006,
2012). The simulation started at 00:00 UTC, 17 August 2015, and ended at
00:00 UTC, 20 August 2015. We used only one node for testing to exclude
interference from network communication. Each experiment was repeated five times,
and the performance was assessed on the basis of the average value.</p>
</sec>
<sec id="Ch1.S3.SS3.SSS2">
  <title>CTM validation</title>
      <p id="d1e2363">We chose four major gas pollutants, i.e., <inline-formula><mml:math id="M90" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M91" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M92" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, and
CO, after 72 h integration to evaluate the optimized code. The simulation
results of the baseline NAQPMS code compiled with the <italic>precise</italic>
flag were used as the benchmark, and we mainly compared them with the simulation
results of the baseline NAQPMS code compiled with the
<italic>fast</italic><inline-formula><mml:math id="M93" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> flag and of the optimized
NAQPMS compiled with <italic>precise</italic> and
<italic>fast</italic><inline-formula><mml:math id="M94" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic>.</p>
      <p id="d1e2428">Figures 5 and 6 present the spatial distributions of <inline-formula><mml:math id="M95" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>,
<inline-formula><mml:math id="M96" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M97" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, and CO as well as the absolute errors (AEs) of their
concentrations from other experiments relative to the baseline. We find that
all model results show the same spatial distribution of pollutants. In
general, for <inline-formula><mml:math id="M98" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M99" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M100" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, the AEs in the majority of grid
boxes are in the range of <inline-formula><mml:math id="M101" display="inline"><mml:mrow><mml:mo>±</mml:mo><mml:mn mathvariant="normal">0.02</mml:mn></mml:mrow></mml:math></inline-formula> ppbv for the three experiments;
for CO, the AEs of the baseline and optimized NAQPMS compiled with
<italic>fast</italic> <inline-formula><mml:math id="M102" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula> fall outside that range and are more pronounced than those of the other species.</p>

      <?xmltex \floatpos{p}?><fig id="Ch1.F5" specific-use="star"><caption><p id="d1e2523"><inline-formula><mml:math id="M103" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M104" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> concentrations outputted by baseline and
optimized code with different accuracy compile flags. Panels <bold>(a)</bold> and <bold>(h)</bold> are from
baseline code compiled with the <italic>precise</italic> option and are treated as the benchmark for
comparison. Panels <bold>(b)</bold> and <bold>(i)</bold> are from optimized code compiled with the <italic>precise</italic>
option. Panels <bold>(c)</bold> and <bold>(j)</bold> are from baseline code compiled with the <italic>fast</italic><inline-formula><mml:math id="M105" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> flag.
Panels <bold>(d)</bold> and <bold>(k)</bold> are from
optimized code compiled with the <italic>fast</italic><inline-formula><mml:math id="M106" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> flag. Panels <bold>(e–g)</bold> and <bold>(l–m)</bold> are the output
concentration differences of optimized code (precise), baseline code
(fast<inline-formula><mml:math id="M107" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1), and optimized code (fast<inline-formula><mml:math id="M108" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1) compared with baseline code
(precise).</p></caption>
            <?xmltex \igopts{width=483.69685pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/12/749/2019/gmd-12-749-2019-f05.png"/>

          </fig>

      <?xmltex \floatpos{p}?><fig id="Ch1.F6" specific-use="star"><caption><p id="d1e2631"><inline-formula><mml:math id="M109" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> and CO concentrations outputted by baseline and
optimized code with different accuracy compile flags. Panels <bold>(a)</bold> and <bold>(h)</bold> are from
baseline code compiled with the <italic>precise</italic> option and are treated as the benchmark for
comparison. Panels <bold>(b)</bold> and <bold>(i)</bold> are from optimized code compiled with the <italic>precise</italic>
option. Panels <bold>(c)</bold> and <bold>(j)</bold> are from baseline code compiled with the <italic>fast</italic><inline-formula><mml:math id="M110" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic>
flag. Panels <bold>(d)</bold> and <bold>(k)</bold> are from
optimized code compiled with the <italic>fast</italic><inline-formula><mml:math id="M111" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> flag. Panels <bold>(e–g)</bold> and <bold>(l–m)</bold> are the output
concentration differences of optimized code (precise), baseline code
(fast<inline-formula><mml:math id="M112" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1), and optimized code (fast<inline-formula><mml:math id="M113" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1) compared with baseline code
(precise).</p></caption>
            <?xmltex \igopts{width=483.69685pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/12/749/2019/gmd-12-749-2019-f06.png"/>

          </fig>

      <p id="d1e2725">With the <italic>precise</italic> option, the results of the two versions
are more consistent. Figure 7 shows the distributions of AEs and relative
errors (REs) for the four species in the near-surface model layer. For the
majority of points, the AEs and REs are in a relatively small range.
However, some points show exceptionally large errors. With <italic>precise</italic>, the maximum AEs
for <inline-formula><mml:math id="M114" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M115" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M116" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, and CO are 0.166, 0.197, 0.001, and 0.03 ppbv, respectively,
over the whole domain after 72 h of integration. The
<italic>fast</italic><inline-formula><mml:math id="M117" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> option shows more obvious
errors for both versions. For the baseline NAQPMS code, using
<italic>fast</italic><inline-formula><mml:math id="M118" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> leads to maximum AEs of 0.23,
4.5, 0.17, and 2.6 ppbv for <inline-formula><mml:math id="M119" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M120" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M121" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, and CO, respectively.
For the NAQPMS with the MP CBM-Z, using the <italic>fast</italic><inline-formula><mml:math id="M122" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> option leads to maximum AEs of 0.13, 0.93, 0.76, and
0.64 ppbv for <inline-formula><mml:math id="M123" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M124" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M125" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, and CO, respectively, over the whole domain,
which are mostly smaller than those of the baseline NAQPMS.</p>
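      <p>The error metrics used above can be sketched as follows. The sketch assumes that the AE is the difference between a test run and the benchmark run in each grid box and that the RE is the AE divided by the benchmark concentration; these definitions are not spelled out in this section, and the concentrations in the example are made-up values, not model output.</p>
<preformat preformat-type="code">
```python
import math


def absolute_errors(test, benchmark):
    """AE per grid box: test concentration minus benchmark concentration."""
    return [t - b for t, b in zip(test, benchmark)]


def relative_errors(test, benchmark):
    """RE per grid box: AE divided by the benchmark value (NaN where it is zero)."""
    return [(t - b) / b if b != 0.0 else math.nan for t, b in zip(test, benchmark)]


def max_abs(values):
    """Largest magnitude in a list, ignoring NaN entries."""
    return max(abs(v) for v in values if not math.isnan(v))


# Hypothetical concentrations (ppbv) in three grid boxes, for illustration only.
benchmark_run = [40.0, 20.0, 10.0]  # e.g. baseline code compiled with precise
test_run = [40.1, 19.9, 10.0]       # e.g. optimized code compiled with fast=1

ae = absolute_errors(test_run, benchmark_run)
rel = relative_errors(test_run, benchmark_run)
print("maximum |AE| (ppbv):", round(max_abs(ae), 3))
print("maximum |RE| (%):", round(100.0 * max_abs(rel), 2))
```
</preformat>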

      <?xmltex \floatpos{t}?><fig id="Ch1.F7" specific-use="star"><caption><p id="d1e2867">The distributions of absolute errors and relative errors for
<inline-formula><mml:math id="M126" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M127" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M128" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, and CO in the near-surface model layer. The
reference points are 1 %, 25 %, 50 %, 75 %, and 99 %.</p></caption>
            <?xmltex \igopts{width=497.923228pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/12/749/2019/gmd-12-749-2019-f07.png"/>

          </fig>

      <p id="d1e2909">In addition to the accuracy discussed above, the impact of the
<italic>-fp-model</italic> option on performance should be considered. In some practical
applications like routine air quality prediction, it is reasonable to
sacrifice some accuracy to gain computational performance. Conversely, for
applications like long-term climate simulations, it is advisable to choose safer compile
flags or to adopt double precision for calculations to avoid the
accumulation of errors.</p>
</sec>
<sec id="Ch1.S3.SS3.SSS3">
  <title>CTM computational performance</title>
      <p id="d1e2921">The performance of the baseline NAQPMS compiled with <italic>precise</italic> is
the benchmark for comparison with the other tests. As shown in Table 6, in the
original version of NAQPMS, the CBM-Z module accounts for 72.26 % of the
wall-clock time of the whole simulation. Relaxing the
<italic>-fp-model</italic> compile option to improve performance at the cost of accuracy leads to 1.34<inline-formula><mml:math id="M129" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and
1.25<inline-formula><mml:math id="M130" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedups for the CBM-Z module and the whole model on the Intel Xeon
E5-2680 platform, respectively. Upgrading the CPU from the Intel Xeon E5-2680
to the Intel Xeon Gold 6132 gives the CBM-Z module and the whole model 1.28<inline-formula><mml:math id="M131" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and
1.29<inline-formula><mml:math id="M132" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedups, respectively. The speedups improve to 1.68<inline-formula><mml:math id="M133" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 1.58<inline-formula><mml:math id="M134" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> for
CBM-Z and the whole model, respectively, by using the
<italic>fast</italic><inline-formula><mml:math id="M135" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flag on the Xeon Gold 6132
platform. The benefit from upgrading hardware alone is limited with the baseline
code, which supports the need to optimize the code for the new hardware
features.</p>

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T6" specific-use="star"><caption><p id="d1e2987">The performance tests of the baseline and optimized code on the
diverse platforms with different compile flags. The unit of the wall times
for the tests is seconds (s).</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="8">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="left"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:colspec colnum="7" colname="col7" align="right"/>
     <oasis:colspec colnum="8" colname="col8" align="right"/>
     <oasis:thead>
       <oasis:row>

         <oasis:entry colname="col1"/>

         <oasis:entry colname="col2"/>

         <oasis:entry colname="col3">Vector</oasis:entry>

         <oasis:entry colname="col4"/>

         <oasis:entry colname="col5">Wall time</oasis:entry>

         <oasis:entry colname="col6">Wall time</oasis:entry>

         <oasis:entry colname="col7">Speedup</oasis:entry>

         <oasis:entry colname="col8">Speedup</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1"/>

         <oasis:entry colname="col2">Processor</oasis:entry>

         <oasis:entry colname="col3">instruction</oasis:entry>

         <oasis:entry colname="col4"><italic>-fp-model</italic></oasis:entry>

         <oasis:entry colname="col5">(CBMZ)</oasis:entry>

         <oasis:entry colname="col6">(Total)</oasis:entry>

         <oasis:entry colname="col7">(CBMZ)</oasis:entry>

         <oasis:entry colname="col8">(Total)</oasis:entry>

       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col1" morerows="3">Baseline NAQPMS</oasis:entry>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry rowsep="1" colname="col3" morerows="1">AVX2</oasis:entry>

         <oasis:entry colname="col4">precise</oasis:entry>

         <oasis:entry colname="col5">17 675.86</oasis:entry>

         <oasis:entry colname="col6">24 460.54</oasis:entry>

         <oasis:entry colname="col7">1.00</oasis:entry>

         <oasis:entry colname="col8">1.00</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col4">fast<inline-formula><mml:math id="M136" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1</oasis:entry>

         <oasis:entry colname="col5">13 201.56</oasis:entry>

         <oasis:entry colname="col6">19 619.20</oasis:entry>

         <oasis:entry colname="col7">1.34</oasis:entry>

         <oasis:entry colname="col8">1.25</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon Gold 6132</oasis:entry>

         <oasis:entry rowsep="1" colname="col3" morerows="1">AVX512</oasis:entry>

         <oasis:entry colname="col4">precise</oasis:entry>

         <oasis:entry colname="col5">13 817.24</oasis:entry>

         <oasis:entry colname="col6">18 950.95</oasis:entry>

         <oasis:entry colname="col7">1.28</oasis:entry>

         <oasis:entry colname="col8">1.29</oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col4">fast<inline-formula><mml:math id="M137" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1</oasis:entry>

         <oasis:entry colname="col5">10 544.60</oasis:entry>

         <oasis:entry colname="col6">15 502.39</oasis:entry>

         <oasis:entry colname="col7">1.68</oasis:entry>

         <oasis:entry colname="col8">1.58</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1"/>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon E5-2680 V4</oasis:entry>

         <oasis:entry rowsep="1" colname="col3" morerows="1">AVX2</oasis:entry>

         <oasis:entry colname="col4">precise</oasis:entry>

         <oasis:entry colname="col5">11 127.90</oasis:entry>

         <oasis:entry colname="col6">17 454.95</oasis:entry>

         <oasis:entry colname="col7">1.59</oasis:entry>

         <oasis:entry colname="col8">1.40</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1"/>

         <oasis:entry rowsep="1" colname="col4">fast<inline-formula><mml:math id="M138" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1</oasis:entry>

         <oasis:entry rowsep="1" colname="col5">3971.48</oasis:entry>

         <oasis:entry rowsep="1" colname="col6">10 019.21</oasis:entry>

         <oasis:entry rowsep="1" colname="col7">4.45</oasis:entry>

         <oasis:entry rowsep="1" colname="col8">2.44</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1">NAQPMS with</oasis:entry>

         <oasis:entry rowsep="1" colname="col2" morerows="1">Xeon Gold 6132</oasis:entry>

         <oasis:entry rowsep="1" colname="col3" morerows="1">AVX512</oasis:entry>

         <oasis:entry colname="col4">precise</oasis:entry>

         <oasis:entry colname="col5">9584.59</oasis:entry>

         <oasis:entry colname="col6">14 698.38</oasis:entry>

         <oasis:entry colname="col7">1.84</oasis:entry>

         <oasis:entry colname="col8">1.66</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1">MP CBM-Z</oasis:entry>

         <oasis:entry rowsep="1" colname="col4">fast<inline-formula><mml:math id="M139" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1</oasis:entry>

         <oasis:entry rowsep="1" colname="col5">2150.20</oasis:entry>

         <oasis:entry rowsep="1" colname="col6">6994.43</oasis:entry>

         <oasis:entry rowsep="1" colname="col7">8.22</oasis:entry>

         <oasis:entry rowsep="1" colname="col8">3.50</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1"/>

         <oasis:entry colname="col2">Xeon Phi 7250</oasis:entry>

         <oasis:entry colname="col3">AVX512</oasis:entry>

         <oasis:entry colname="col4">fast<inline-formula><mml:math id="M140" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>1</oasis:entry>

         <oasis:entry colname="col5">2997.96</oasis:entry>

         <oasis:entry colname="col6">19 239.20</oasis:entry>

         <oasis:entry colname="col7">5.90</oasis:entry>

         <oasis:entry colname="col8">1.27</oasis:entry>

       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <?pagebreak page760?><p id="d1e3321">The computational performance of the gas-phase chemistry module and of the
NAQPMS is greatly improved by adopting the MP CBM-Z described
in this paper. As shown in Table 6, the CBM-Z module and the whole NAQPMS show speedups of 1.59<inline-formula><mml:math id="M141" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 1.40<inline-formula><mml:math id="M142" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula>, respectively,
on the old Xeon E5-2680 platform
with the same <italic>precise</italic> compile flag, and the speedups
improve to 4.45<inline-formula><mml:math id="M143" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 2.44<inline-formula><mml:math id="M144" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> with the <italic>fast</italic><inline-formula><mml:math id="M145" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flag. With the same <italic>fast</italic><inline-formula><mml:math id="M146" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> flag, the MP CBM-Z is 3.32 and 1.96 times
faster than the baseline CBM-Z for the gas-phase chemistry module
and the whole NAQPMS, respectively. These results illustrate that optimizing for
vectorization exploits more of the potential of existing hardware, and that
performance improves markedly even with the relatively strict
<italic>precise</italic> compile flag, which prevents most vectorization.</p>
      <p id="d1e3381">The new generation CPU further improves the performance of the MP CBM-Z.
Using the platform with the new generation processor Xeon Gold 6132, the
speedups reach 1.84<inline-formula><mml:math id="M147" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 1.66<inline-formula><mml:math id="M148" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> for the CBM-Z and the NAQPMS with the
<italic>precise</italic> compile flag, respectively, and adopting the
<italic>fast</italic><inline-formula><mml:math id="M149" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flag improves the
speedups to 8.22<inline-formula><mml:math id="M150" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 3.50<inline-formula><mml:math id="M151" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> compared with the benchmark performance. On the
same Xeon Gold 6132 platform with the <italic>fast</italic><inline-formula><mml:math id="M152" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> compile flag, the MP CBM-Z
is 4.90 and 2.22 times faster than the baseline CBM-Z for
the gas-phase chemistry module and the whole NAQPMS, respectively. Moreover, the
proportion of wall time taken by the gas-phase chemistry declined to 30.74 %,
compared to 72.26 % in the baseline model.</p>
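      <p>The share of wall time spent in the gas-phase chemistry and the speedups quoted above can be recomputed from the wall times in Table 6. The Python sketch below does this for three of the configurations; the dictionary keys and the helper names <italic>module_share</italic> and <italic>speedups</italic> are hypothetical and serve only to illustrate the arithmetic.</p>
<preformat preformat-type="code">
```python
# (CBM-Z wall time, total wall time) in seconds, copied from Table 6.
RUNS = {
    "baseline, E5-2680 V4, precise": (17675.86, 24460.54),  # benchmark run
    "baseline, Gold 6132, fast=1": (10544.60, 15502.39),
    "MP CBM-Z, Gold 6132, fast=1": (2150.20, 6994.43),
}


def module_share(run):
    """Fraction of the total wall time spent in the CBM-Z module."""
    cbmz_time, total_time = run
    return cbmz_time / total_time


def speedups(reference, run):
    """Speedups of run relative to reference as (CBM-Z module, whole model)."""
    return reference[0] / run[0], reference[1] / run[1]


benchmark = RUNS["baseline, E5-2680 V4, precise"]
optimized = RUNS["MP CBM-Z, Gold 6132, fast=1"]
same_platform_baseline = RUNS["baseline, Gold 6132, fast=1"]

print(f"CBM-Z share, benchmark: {100 * module_share(benchmark):.2f} %")
print(f"CBM-Z share, MP CBM-Z:  {100 * module_share(optimized):.2f} %")

module, whole = speedups(benchmark, optimized)
print(f"vs benchmark: {module:.2f}x (CBM-Z), {whole:.2f}x (whole model)")

module, whole = speedups(same_platform_baseline, optimized)
print(f"vs baseline on the same platform: {module:.2f}x (CBM-Z), {whole:.2f}x (whole model)")
```
</preformat>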
      <?pagebreak page761?><p id="d1e3439">In addition, the MP CBM-Z extends the benefit gained from advanced hardware.
Using the same <italic>fast</italic> <inline-formula><mml:math id="M153" display="inline"><mml:mrow><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula> compile option, the performance of the baseline
CBM-Z on the AVX-512 platform is about 1.25 times that on the AVX-2 platform,
whereas the performance of the MP CBM-Z on the AVX-512 platform is about 1.84 times that on the AVX-2
platform. The gain from moving to the new CPUs is therefore about 47 % larger
when the MP CBM-Z is adopted. Enhancing the vectorization of code thus
helps applications, like the CTM in this paper, to benefit from future
improvements in the vector capabilities of processors.</p>
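      <p>As a rough check, the figure of about 47 % is consistent with taking the ratio of the two generation-to-generation gains quoted above. That reading is an assumption rather than a stated derivation, and the names <monospace>relative_benefit</monospace>, <monospace>baseline_gain</monospace>, and <monospace>mp_gain</monospace> in the short Python sketch below are illustrative only.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Generation-to-generation gains with -fp-model fast=1, as quoted in the text.
baseline_gain = 1.25   # baseline CBM-Z: AVX-512 platform relative to AVX-2 platform
mp_gain = 1.84         # MP CBM-Z: AVX-512 platform relative to AVX-2 platform


def relative_benefit(optimized_gain, reference_gain):
    """Return how much larger the optimized gain is, in percent."""
    return (optimized_gain / reference_gain - 1.0) * 100.0


# Assumed interpretation: the "47 %" is the ratio of the two gains minus one.
extra = relative_benefit(mp_gain, baseline_gain)
print(f"extra benefit from the new hardware: {extra:.1f} %")
```
]]></preformat>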
      <p id="d1e3455">According to the test results, KNL is more reliant on SIMD for performance.
The CBM-Z module is accelerated on KNL with a speedup of 5.9<inline-formula><mml:math id="M154" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula>, but the whole
model achieves only a 1.27 times acceleration compared with the benchmark
performance. Compared with the baseline CBM-Z on the Intel Xeon Gold 6132 platform,
the MP CBM-Z achieves a speedup of 3.52<inline-formula><mml:math id="M155" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> for the gas-phase chemistry on KNL; however, the performance of the whole model declines by 24 %. Therefore,
the MP CBM-Z substantially improves the efficiency of CBM-Z on KNL by improving
its vectorization, but further optimization is required to make
the whole CTM efficient on the KNL architecture.</p>
</sec>
</sec>
</sec>
<sec id="Ch1.S4" sec-type="conclusions">
  <title>Conclusions and discussion</title>
      <p id="d1e3480">A new framework was designed to help the chemical kinetics kernel CBM-Z
adapt to next-generation processors by improving its vectorization.
By packing multiple spatial points together, the optimized CBM-Z module handles
them simultaneously. The functions in the original CBM-Z were restructured
with loops, which provided the opportunity to implement fine-grained
parallelization through vectorization. Meanwhile, we masked the heterogeneous
grid boxes to integrate the chemistry sub-schemes in the CBM-Z so that
the calculation of multiple grid boxes can be performed simultaneously. Since contiguous
grid boxes have similar chemistry processes, the impact of this masking on the
scientific performance was limited, and the code was highly
vectorized.</p>
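      <p>The packing-and-masking idea can be sketched as follows. This is a minimal illustration in Python, not the Fortran of the model: the regime identifiers, the rate coefficients, and the functions <monospace>step_packed</monospace> and <monospace>step_scalar</monospace> are hypothetical and are not taken from the CBM-Z source. The point is only that every box in a pack evaluates every sub-scheme term and a 0/1 mask removes the terms that do not apply, which gives a branch-free loop body of the kind a compiler can vectorize in a compiled language.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical sub-scheme identifiers (illustrative, not CBM-Z's own).
URBAN, BIOGENIC = 0, 1


def step_packed(conc, regime, dt):
    """Advance a pack of grid boxes by one explicit step without branching.

    Each box computes every sub-scheme term; a 0/1 mask zeroes the terms
    that do not belong to the box's regime.
    """
    out = []
    for c, r in zip(conc, regime):
        common = -0.10 * c            # term shared by all sub-schemes
        urban_term = -0.05 * c        # computed for every box
        bio_term = -0.02 * c          # computed for every box
        m_urban = float(r == URBAN)   # mask: 1.0 where the term applies
        m_bio = float(r == BIOGENIC)
        out.append(c + dt * (common + m_urban * urban_term + m_bio * bio_term))
    return out


def step_scalar(c, r, dt):
    """Reference version: one box at a time with explicit branches."""
    rate = -0.10 * c
    if r == URBAN:
        rate += -0.05 * c
    elif r == BIOGENIC:
        rate += -0.02 * c
    return c + dt * rate


conc = [40.0, 42.0, 41.0, 10.0]
regime = [URBAN, URBAN, URBAN, BIOGENIC]
packed = step_packed(conc, regime, dt=0.5)
scalar = [step_scalar(c, r, 0.5) for c, r in zip(conc, regime)]
```
]]></preformat>
      <p>In this sketch the masked terms are wasted work for boxes in the other regime, so the approach pays off mainly when neighbouring boxes in a pack share the same regime.</p>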
      <p id="d1e3483">A computation cluster equipped with two generations of CPUs (Intel Xeon
E5-2680 V4 and Intel Xeon Gold 6132) and KNL (Intel Xeon Phi 7250), provided
by IAP, CAS, was used to test the performance. We tested the code with two
different compile options, <italic>-fp-model</italic> <italic>precise</italic> and
<italic>-fp-model</italic> <italic>fast</italic><inline-formula><mml:math id="M156" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic>, to assess their impact
on the accuracy of single-precision computation and on performance. The
validation test confirmed the reliability of the model results after
optimization: the errors in all diagnostic chemical species caused by the
single-precision calculations were lower than about 0.025 % after 10 h of
integration with the <italic>fast</italic><inline-formula><mml:math id="M157" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> option.
Based on the HPC performance characterization from the Intel VTune tools on
the Intel Xeon Gold 6132, the GFLOPS of CBM-Z increased from 4.81 to 21.37,
and the vector capacity usage improved from 14.30 % in the baseline CBM-Z
to 89.40 % in the optimized CBM-Z.</p>
      <p id="d1e3518">The single-core tests showed that the vectorization optimization
led to speedups of 5.16<inline-formula><mml:math id="M158" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 8.97<inline-formula><mml:math id="M159" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> on Intel Xeon E5-2680 V4 and Intel Xeon
Gold 6132 CPUs, respectively, and KNL achieved a speedup of 3.69<inline-formula><mml:math id="M160" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> compared with
the baseline CBM-Z on the Intel Xeon E5-2680 V4 platform. This highlights the
importance of vectorization on the KNL platform. We also tested
the MPI and OpenMP versions of CBM-Z. The speedup on the two generations of CPUs
can reach 104.63<inline-formula><mml:math id="M161" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 198.50<inline-formula><mml:math id="M162" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> using MPI and
101.02<inline-formula><mml:math id="M163" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 194.60<inline-formula><mml:math id="M164" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> using OpenMP, respectively. The speedup on
the KNL node can reach 175.23<inline-formula><mml:math id="M165" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> using MPI and 167.45<inline-formula><mml:math id="M166" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> using OpenMP. The
speedup of the optimized CBM-Z is approximately 40 % higher on a one-socket
KNL platform than on a two-socket Broadwell platform and about 13 %–16 % lower
than on a two-socket Skylake platform.</p>
      <p id="d1e3585">The regional CTM NAQPMS was also used to test the practical improvement of
the MP CBM-Z in more realistic scenarios. The baseline and optimized code of
NAQPMS, compiled with the <italic>precise</italic> and
<italic>fast</italic><inline-formula><mml:math id="M167" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> options,<?pagebreak page762?> were
tested on diverse platforms. The model outputs after a 72 h simulation were
used to evaluate the error introduced by the code changes as well as by the compile flags. The
differences between the baseline and optimized code are generally in the
range of <inline-formula><mml:math id="M168" display="inline"><mml:mrow><mml:mo>±</mml:mo><mml:mn mathvariant="normal">0.02</mml:mn></mml:mrow></mml:math></inline-formula> ppbv using <italic>precise</italic>. The maximum
discrepancy over the whole map is about 0.166, 0.197, 0.001, and 0.03 ppbv
for <inline-formula><mml:math id="M169" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">NO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M170" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">O</mml:mi><mml:mn mathvariant="normal">3</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M171" display="inline"><mml:mrow class="chem"><mml:msub><mml:mi mathvariant="normal">SO</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, and CO, respectively. The
<italic>fast</italic><inline-formula><mml:math id="M172" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> option leads to larger errors;
however, it substantially improves computational performance.</p>
      <p id="d1e3661">The results of the CTM test with the <italic>fast</italic><inline-formula><mml:math id="M173" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> option show that the MP CBM-Z
leads to speedups of 3.32 and 1.96 for the gas-phase chemistry module and
the CTM on the Intel Xeon E5-2680 platform, respectively. Moreover, on the newer
Intel Xeon Gold 6132 platform, the MP CBM-Z gains 4.90<inline-formula><mml:math id="M174" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> and 2.22<inline-formula><mml:math id="M175" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedups
for the gas-phase chemistry module and the whole CTM, respectively. For the KNL, the MP
CBM-Z enables a 3.52<inline-formula><mml:math id="M176" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> speedup for the gas-phase chemistry module, but the
whole model lost 24.10 % performance compared to the CPU platform due to
the poor performance of other modules. Since this optimization seeks to
improve the utilization of the VPU, the model is better suited to new-generation
processors that adopt more advanced SIMD technology. The
results of our tests already show that the benefit of upgrading the CPU increased
by about 47 % with the MP CBM-Z, since the optimized code
adapts better to the new hardware.</p>
      <p id="d1e3696">In general, the choice of <italic>-fp-model</italic> compile flag determines the balance between
accuracy and performance. According to our tests, using
the <italic>fast</italic><inline-formula><mml:math id="M177" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> option substantially improves the performance of
the code at the cost of some accuracy. However, the loss
of accuracy is relatively small, and in practical applications that do
not require high-accuracy floating-point calculations, it is acceptable to
use the <italic>fast</italic><inline-formula><mml:math id="M178" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula><italic>1</italic> option.</p>
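      <p>One general reason a relaxed floating-point model can change results is that floating-point addition is not associative, so a compiler that is permitted to regroup operations may produce slightly different sums. The sketch below is a generic illustration in Python double precision; it does not reproduce the behaviour of the Intel compiler or the single-precision arithmetic of the model, and the helper <monospace>relative_difference</monospace> is a hypothetical name.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def relative_difference(a, b):
    """Relative difference between two values, using b as the reference."""
    return abs(a - b) / abs(b)


# The same three numbers summed with two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)

# The two results differ in the last bit, although they are
# mathematically identical.
rel = relative_difference(left_to_right, regrouped)
print(left_to_right == regrouped, rel)
```
]]></preformat>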
      <p id="d1e3724">Besides the CBM-Z chemical scheme, this algorithm is also suitable for
improving the vectorization of models with a similar code structure. In
this study, CBM-Z served as an example to describe a
simple optimization strategy for new-generation
processors, which emphasize the importance of vectorization. However, some
specific issues should also be considered before adoption.
Optimization methods such as constructing loops from discrete scalar
calculations, as described in Wang et al. (2017), diminish the
readability of the source code by using indirect indexing and could cause
problems for subsequent developers. Therefore, it is essential to adopt good
practice, e.g., commenting code well and controlling the compile process, for
ease of maintenance and development.</p>
</sec>

      
      </body>
    <back><notes notes-type="codeavailability">

      <p id="d1e3732">The source code of the baseline and optimized versions of the CBM-Z box model,
including OpenMP and MPI versions, is available online via ZENODO
(<ext-link xlink:href="https://doi.org/10.5281/zenodo.1161576" ext-link-type="DOI">10.5281/zenodo.1161576</ext-link>; Wang et al., 2018).</p>
  </notes><app-group>
        <supplementary-material position="anchor"><p id="d1e3738">The supplement related to this article is available online at: <inline-supplementary-material xlink:href="https://doi.org/10.5194/gmd-12-749-2019-supplement" xlink:title="pdf">https://doi.org/10.5194/gmd-12-749-2019-supplement</inline-supplementary-material>.</p></supplementary-material>
        </app-group><notes notes-type="authorcontribution">

      <p id="d1e3747">QW, HSC, XT, and ZFW planned and organized the project. JL and HW designed
the fine-grained parallelization algorithm for the CBM-Z module, and QW, HSC, XT, and XC
took part in the discussion. JL, HW, HSC, and XT prepared the CBM-Z test
cases and input datasets, and HW, HSC, and QW carried out the box model validation. XT,
HSC, and ZFW prepared the NAQPMS code and its input dataset, HW coupled
the MP CBM-Z module to NAQPMS, and QW, HW, and XT validated and discussed
the model results. HW and JL analyzed the model performance data, and QW, HQC,
and LW validated its reasonableness. HW and QW wrote the manuscript. HW, JL,
QW, XT, HSC, and XC revised the manuscript. HSC, XT, ZFW, XC, HQC, and LW
reviewed and provided key comments on the paper.</p>
  </notes><notes notes-type="competinginterests">

      <p id="d1e3753">The authors declare that they have no conflict of interest.</p>
  </notes><ack><title>Acknowledgements</title><p id="d1e3759">The National Key R&amp;D Program of China (2017YFC0209805 and
2016YFB0200800), the CAS Information Technology Program (XXH13506-302), the
National Natural Science Foundation of China (41305121), and the Fundamental
Research Funds for the Central Universities funded this work. The authors
would like to thank the Institute of Atmospheric Physics and Intel Corporation's
Software Support Group (SSG) for providing the high-performance computing
(HPC) environment and technical support. The authors thank the topic editor
and three anonymous referees for their valuable comments.<?xmltex \hack{\newline}?><?xmltex \hack{\newline}?>
Edited by: Fiona O'Connor<?xmltex \hack{\newline}?>
Reviewed by: three anonymous referees</p></ack><ref-list>
    <title>References</title>

      <ref id="bib1.bib1"><label>1</label><mixed-citation>
Chang, J. S., Brost, R. A., Isaksen, I. S. A., Madronich, S., Middleton, P.,
Stockwell, W. R., and Walcek, C. J.: A three-dimensional Eulerian acid
deposition model: Physical concepts and formulation, J. Geophys.
Res.-Atmos., 92, 14681–14700, 1987.</mixed-citation></ref>
      <ref id="bib1.bib2"><label>2</label><mixed-citation>Chen, H., Wang, Z., Qizhong, W. U., Jianbin, W. U., Yan, P., Tang, X., and
Wang, Z.: Application of Air Quality Multi-Model Forecast System in
Guangzhou: Model Description and Evaluation of PM<inline-formula><mml:math id="M179" display="inline"><mml:msub><mml:mi/><mml:mn mathvariant="normal">10</mml:mn></mml:msub></mml:math></inline-formula> Forecast Performance,
Clim. Environ. Res., 18, 427–435, 2013.</mixed-citation></ref>
      <ref id="bib1.bib3"><label>3</label><mixed-citation>Chen, H. S., Wang, Z. F., Li, J., Tang, X., Ge, B. Z., Wu, X. L., Wild, O.,
and Carmichael, G. R.: GNAQPMS-Hg v1.0, a global nested atmospheric mercury
transport model: model description, evaluation and application to
trans-boundary transport of Chinese anthropogenic emissions, Geosci. Model
Dev., 8, 2857–2876, <ext-link xlink:href="https://doi.org/10.5194/gmd-8-2857-2015" ext-link-type="DOI">10.5194/gmd-8-2857-2015</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bib4"><label>4</label><mixed-citation>
Feng, F., Wang, Z., Li, J., and Carmichael, G. R.: A nonnegativity preserved
efficient algorithm for atmospheric chemical kinetic equations, Appl.
Mathe. Comput., 271, 519–531, 2015.</mixed-citation></ref>
      <ref id="bib1.bib5"><label>5</label><mixed-citation>Gao, M., Carmichael, G. R., Wang, Y., Saide, P. E., Yu, M., Xin, J., Liu, Z.,
and Wang, Z.: Modeling study of the 2010 regional haze event in the North
China Plain, Atmos. Chem. Phys., 16, 1673–1691,
<ext-link xlink:href="https://doi.org/10.5194/acp-16-1673-2016" ext-link-type="DOI">10.5194/acp-16-1673-2016</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bib6"><label>6</label><mixed-citation>Grell, G. A., Peckham, S. E., Schmitz, R., McKeen, S. A., Frost, G.,
Skamarock, W. C., and Eder, B.: Fully coupled “online” chemistry within
the WRF model, Atmos. Environ., 39, 6957–6975, <ext-link xlink:href="https://doi.org/10.1016/j.atmosenv.2005.04.027" ext-link-type="DOI">10.1016/j.atmosenv.2005.04.027</ext-link>, 2005.</mixed-citation></ref>
      <ref id="bib1.bib7"><label>7</label><mixed-citation>Guenther, A., Karl, T., Harley, P., Wiedinmyer, C., Palmer, P. I., and Geron,
C.: Estimates of global terrestrial isoprene emissions using MEGAN (Model of
Emissions of Gases and Aerosols from Nature), Atmos. Chem. Phys., 6,
3181–3210, <ext-link xlink:href="https://doi.org/10.5194/acp-6-3181-2006" ext-link-type="DOI">10.5194/acp-6-3181-2006</ext-link>, 2006.</mixed-citation></ref>
      <ref id="bib1.bib8"><label>8</label><mixed-citation>Guenther, A. B., Jiang, X., Heald, C. L., Sakulyanontvittaya, T., Duhl, T.,
Emmons, L. K., and Wang, X.: The Model of Emissions of Gases and Aerosols
from Nature version 2.1 (MEGAN2.1): an extended and updated framework for
modeling biogenic emissions, Geosci. Model Dev., 5, 1471–1492,
<ext-link xlink:href="https://doi.org/10.5194/gmd-5-1471-2012" ext-link-type="DOI">10.5194/gmd-5-1471-2012</ext-link>, 2012.</mixed-citation></ref>
      <ref id="bib1.bib9"><label>9</label><mixed-citation>Gurjar, B. R., Ravindra, K., and Nagpure, A. S.: Air pollution trends over
Indian megacities and their local-to-global implications, Atmos.
Environ., 142, 475–495, <ext-link xlink:href="https://doi.org/10.1016/j.atmosenv.2016.06.030" ext-link-type="DOI">10.1016/j.atmosenv.2016.06.030</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bib10"><label>10</label><mixed-citation>Lawrence, B. N., Rezny, M., Budich, R., Bauer, P., Behrens, J., Carter, M.,
Deconinck, W., Ford, R., Maynard, C., Mullerworth, S., Osuna, C., Porter, A.,
Serradell, K., Valcke, S., Wedi, N., and Wilson, S.: Crossing the chasm: how
to develop weather and climate models for next generation computers?, Geosci.
Model Dev., 11, 1799–1821, <ext-link xlink:href="https://doi.org/10.5194/gmd-11-1799-2018" ext-link-type="DOI">10.5194/gmd-11-1799-2018</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bib11"><label>11</label><mixed-citation>Li, J., Wang, Z., Wang, X., Yamaji, K., Takigawa, M., Kanaya, Y., Pochanart,
P., Liu, Y., Irie, H., Hu, B., Tanimoto, H., and Akimoto, H.: Impacts of
aerosols on summertime tropospheric photolysis frequencies and
photochemistry over Central Eastern China, Atmos. Environ., 45,
1817–1829, <ext-link xlink:href="https://doi.org/10.1016/j.atmosenv.2011.01.016" ext-link-type="DOI">10.1016/j.atmosenv.2011.01.016</ext-link>,
2011.</mixed-citation></ref>
      <ref id="bib1.bib12"><label>12</label><mixed-citation>Li, J., Wang, Z., Zhuang, G., Luo, G., Sun, Y., and Wang, Q.: Mixing of Asian mineral
dust with anthropogenic pollutants over East Asia: a model case study of a
super-duststorm in March 2010, Atmos. Chem. Phys., 12, 7591–7607,
<ext-link xlink:href="https://doi.org/10.5194/acp-12-7591-2012" ext-link-type="DOI">10.5194/acp-12-7591-2012</ext-link>, 2012.</mixed-citation></ref>
      <ref id="bib1.bib13"><label>13</label><mixed-citation>Li, J., Wang, Z., Huang, H., Hu, M., Meng, F., Sun, Y., Wang, X., Wang, Y.,
and Wang, Q.: Assessing the effects of trans-boundary aerosol transport
between various city clusters on regional haze episodes in spring over East
China, Tellus B, 65, 20052,
<ext-link xlink:href="https://doi.org/10.3402/tellusb.v65i0.20052" ext-link-type="DOI">10.3402/tellusb.v65i0.20052</ext-link>, 2013.</mixed-citation></ref>
      <ref id="bib1.bib14"><label>14</label><mixed-citation>
Linford, J. C., Michalakes, J., Vachharajani, M., and Sandu, A.: Multi-core
acceleration of chemical kinetics for simulation and prediction, in:
Proceedings of the Conference on High Performance Computing Networking,
Storage and Analysis/ACM, 14–20 November 2009, Portland, Oregon, USA, 1–11,
2009.</mixed-citation></ref>
      <ref id="bib1.bib15"><label>15</label><mixed-citation>Mielikainen, J., Huang, B., and Huang, A. H.-L.: Intel Xeon Phi accelerated
Weather Research and Forecasting (WRF) Goddard microphysics scheme, Geosci.
Model Dev. Discuss., 7, 8941–8973, <ext-link xlink:href="https://doi.org/10.5194/gmdd-7-8941-2014" ext-link-type="DOI">10.5194/gmdd-7-8941-2014</ext-link>,
2014.</mixed-citation></ref>
      <ref id="bib1.bib16"><label>16</label><mixed-citation>San José, R., Pérez, J. L., Balzarini, A., Baró, R., Curci, G.,
Forkel, R., Galmarini, S., Grell, G., Hirtl, M., Honzak, L., Im, U.,
Jiménez-Guerrero, P., Langer, M., Pirovano, G., Tuccella, P., Werhahn,
J., and Žabkar, R.: Sensitivity of feedback effects in CBMZ/MOSAIC
chemical mechanism, Atmos. Environ., 115, 646–656,
<ext-link xlink:href="https://doi.org/10.1016/j.atmosenv.2015.04.030" ext-link-type="DOI">10.1016/j.atmosenv.2015.04.030</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bib17"><label>17</label><mixed-citation>
Seinfeld, J. H. and Pandis, S. N.: Atmospheric Chemistry and Physics: From
Air Pollution to Climate Change, 2nd Edition, John Wiley &amp; Sons, New York,
USA, 2012.</mixed-citation></ref>
      <ref id="bib1.bib18"><label>18</label><mixed-citation>Sindelarova, K., Granier, C., Bouarar, I., Guenther, A., Tilmes, S.,
Stavrakou, T., Müller, J.-F., Kuhn, U., Stefani, P., and Knorr, W.: Global
data set of biogenic VOC emissions calculated by the MEGAN model over the
last 30 years, Atmos. Chem. Phys., 14, 9317–9341,
<ext-link xlink:href="https://doi.org/10.5194/acp-14-9317-2014" ext-link-type="DOI">10.5194/acp-14-9317-2014</ext-link>, 2014.</mixed-citation></ref>
      <ref id="bib1.bib19"><label>19</label><mixed-citation>Skamarock, W. C., Klemp, J. B., Dudhia, J., Gill, D. O., Barker, D. M.,
Duda, M. G., Huang, X.-Y., Wang, W., and Powers, J. G.: A description of the
advanced research WRF version 3, NCAR Technical Note NCAR/TN-475<inline-formula><mml:math id="M180" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula>STR,
2008.</mixed-citation></ref>
      <ref id="bib1.bib20"><label>20</label><mixed-citation>
Sodani, A., Gramunt, R., Corbal, J., Kim, H. S., Vinod, K., Chinthamani, S.,
Hutsell, S., Agarwal, R., and Liu, Y. C.: Knights Landing: Second-Generation
Intel Xeon Phi Product, IEEE Micro, 36, 34–46, 2016.</mixed-citation></ref>
      <ref id="bib1.bib21"><label>21</label><mixed-citation>Wang, H., Chen, H., Wu, Q., Lin, J., Chen, X., Xie, X., Wang, R., Tang, X.,
and Wang, Z.: GNAQPMS v1.1: accelerating the Global Nested Air Quality
Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors, Geosci.
Model Dev., 10, 2891–2904, <ext-link xlink:href="https://doi.org/10.5194/gmd-10-2891-2017" ext-link-type="DOI">10.5194/gmd-10-2891-2017</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bib22"><label>22</label><mixed-citation>Wang, H., Lin, J., Wu, Q., Chen, H., Tang, X., Wang, Z.,
Chen, X., and Cheng, H.g: Design a new architecture of CBMZ
gas-phase chemical mechanism for the next generation processors,
<ext-link xlink:href="https://doi.org/10.5281/zenodo.1161576" ext-link-type="DOI">10.5281/zenodo.1161576</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bib23"><label>23</label><mixed-citation>Wang, Y., Chen, H., Wu, Q., Chen, X., Wang, H., Gbaguidi, A., Wang, W., and
Wang, Z.: Three-year, 5 km resolution China PM<inline-formula><mml:math id="M181" display="inline"><mml:msub><mml:mi/><mml:mn mathvariant="normal">2.5</mml:mn></mml:msub></mml:math></inline-formula> simulation: Model
performance evaluation, Atmos. Res., 207, 1–13, <ext-link xlink:href="https://doi.org/10.1016/j.atmosres.2018.02.016" ext-link-type="DOI">10.1016/j.atmosres.2018.02.016</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bib24"><label>24</label><mixed-citation>Wang, Z., Maeda, T., Hayashi, M., Hsiao, L. F., and Liu, K. Y.: A Nested Air
Quality Prediction Modeling System for Urban and Regional Scales:
Application for High-Ozone Episode in Taiwan, Water Air Soil
Pollut., 130, 391–396, <ext-link xlink:href="https://doi.org/10.1023/A:1013833217916" ext-link-type="DOI">10.1023/A:1013833217916</ext-link>, 2001.</mixed-citation></ref>
      <ref id="bib1.bib25"><label>25</label><mixed-citation>
Wu, Q., Wang, Z., Gbaguidi, A., Tang, X., and Zhou, W.: Numerical Study of
The Effect of Traffic Restriction on Air Quality in Beijing, Sola, 6, 17–20,
2010.</mixed-citation></ref>
      <ref id="bib1.bib26"><label>26</label><mixed-citation>Wu, Q. Z., Xu, W. S., Shi, A. J., Li, Y. T., Zhao, X. J., Wang, Z. F., Li, J.
X., and Wang, L. N.: Air quality forecast of PM10 in Beijing with Community
Multi-scale Air Quality Modeling (CMAQ) system: emission and improvement,
Geosci. Model Dev., 7, 2243–2259, <ext-link xlink:href="https://doi.org/10.5194/gmd-7-2243-2014" ext-link-type="DOI">10.5194/gmd-7-2243-2014</ext-link>,
2014.</mixed-citation></ref>
      <ref id="bib1.bib27"><label>27</label><mixed-citation>Xu, S., Huang, X., Oey, L.-Y., Xu, F., Fu, H., Zhang, Y., and Yang, G.:
POM.gpu-v1.0: a GPU-based Princeton Ocean Model, Geosci. Model Dev., 8,
2815–2827, <ext-link xlink:href="https://doi.org/10.5194/gmd-8-2815-2015" ext-link-type="DOI">10.5194/gmd-8-2815-2015</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bib28"><label>28</label><mixed-citation>
Zaveri, R. A. and Peters, L. K.: A new lumped structure photochemical
mechanism for long-scale applications, J. Geophys. Res.-Atmos.,
104, 30387–30415, 1999.</mixed-citation></ref>
      <ref id="bib1.bib29"><label>29</label><mixed-citation>Zhang, Q., Jiang, X., Tong, D., Davis, S. J., Zhao, H., Geng, G., Feng, T.,
Zheng, B., Lu, Z., Streets, D. G., Ni, R., Brauer, M., van Donkelaar, A.,
Martin, R. V., Huo, H., Liu, Z., Pan, D., Kan, H., Yan, Y., Lin, J., He, K.,
and Guan, D.: Transboundary health impacts of transported global<?pagebreak page764?> air
pollution and international trade, Nature, 543, 705–709,
<ext-link xlink:href="https://doi.org/10.1038/nature21712" ext-link-type="DOI">10.1038/nature21712</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bib30"><label>30</label><mixed-citation>ZiFa, W., FuYing, X., XiQuan, W., JunLing, A., and Jiang, Z.: Development
and Application of Nested Air Quality Prediction Modeling System, Chin. J.
Atmos. Sci., 30, 778–790, 2006.
 </mixed-citation></ref><?xmltex \hack{\newpage}?>
      <ref id="bib1.bib31"><label>31</label><mixed-citation>
Zimmermann, J. and Poppe, D.: A Supplement for the RADM2 Chemical
Mechanism: The Photooxidation of Isoprene, Atmos. Environ., 30,
1255–1269, 1994.</mixed-citation></ref>

  </ref-list></back>
    <!--<article-title-html>MP CBM-Z V1.0: design for a new Carbon Bond Mechanism Z (CBM-Z) gas-phase chemical mechanism architecture for next-generation processors</article-title-html>
<abstract-html><p>Precise and rapid air quality simulations and forecasting are
limited by the computational performance of the air quality model used, and
the gas-phase chemistry module is the most time-consuming function in the air
quality model. In this study, we designed a new framework for the widely used
the Carbon Bond Mechanism Z (CBM-Z) gas-phase chemical kinetics kernel to
adapt the single-instruction, multiple-data (SIMD) technology in next-generation
processors to improve its calculation performance. The
optimization implements the fine-grain level parallelization of CBM-Z by
improving its vectorization ability. Through constructing loops and
integrating the main branches, e.g., diverse chemistry sub-schemes, multiple
spatial points in the model can be operated simultaneously on vector
processing units (VPUs). Two generation CPUs – Intel Xeon E5-2680 V4 CPU and
Intel Xeon Gold 6132 – and Intel Xeon Phi 7250 Knights Landing (KNL) are
used as the benchmark processors. The validation of the CBM-Z module outputs
indicates that the relative bias reaches a maximum of 0.025&thinsp;% after 10&thinsp;h
integration with <i>-fp-model fast</i>&thinsp; = 1 compile flag. The results of
the module test show that the Multiple-Points CBM-Z (MP CBM-Z) resulted in
5.16 ×  and 8.97 ×  speedup on a single core of Intel Xeon E5-2680
V4 and Intel Xeon Gold 6132 CPUs, respectively, and KNL had a speedup of
3.69 ×  compared with the performance of CBM-Z on the Intel Xeon E5-2680
V4 platform. For the single-node tests, the speedup on the two generation
CPUs can reach 104.63 ×  and 198.50 ×  using message passing
interface (MPI) and 101.02 ×  and 194.60 ×  using OpenMP, and the
speedup on the KNL node can reach 175.23 ×  using MPI and 167.45 ×  using OpenMP. The speedup of
the optimized CBM-Z is approximately 40&thinsp;% higher on a one-socket KNL
platform than on a two-socket Broadwell platform and about 13&thinsp;%–16&thinsp;%
lower than on a two-socket Skylake platform. We also tested a
three-dimensional chemistry transport model (CTM) named Nested Air Quality
Prediction Model System (NAQPMS) equipped with the MP CBM-Z. The tests
illustrate an obvious improvement on the performance for the CTM after
adopting the MP CBM-Z. The results show that the MP CBM-Z leads to a speedup
of 3.32 and 1.96 for the gas-phase chemistry module and the CTM on the Intel
Xeon E5-2680 platform. Moreover, on the new Intel Xeon Gold 6132 platform,
the MP CBM-Z gains 4.90 ×  and 2.22 ×  speedups for the gas-phase
chemistry module and the whole CTM. For the KNL, the MP CBM-Z enables a
3.52 ×  speedup for the gas-phase chemistry module, but the whole model
lost 24.10&thinsp;% performance compared to the CPU platform due to the poor
performance of other modules. In addition, since this optimization seeks to
improve the utilization of the VPU, the model is more suitable for the new
generation processors adopting the more advanced SIMD technology. The results
of our tests already show that the benefit of updating CPU improved by about
47&thinsp;% by using the MP CBM-Z since the optimized code has better
adaptability for the new hardware. This work improves the performance of the
CBM-Z chemical kinetics kernel as well as the calculation efficiency of the
air quality model, which can directly improve the practical value of the air
quality model in scientific simulations and routine forecasting.</p></abstract-html>
<ref-html id="bib1.bib1"><label>1</label><mixed-citation>
Chang, J. S., Brost, R. A., Isaksen, I. S. A., Madronich, S., Middleton, P.,
Stockwell, W. R., and Walcek, C. J.: A three-dimensional Eulerian acid
deposition model: Physical concepts and formulation, J. Geophys.
Res.-Atmos., 92, 14681–14700, 1987.
</mixed-citation></ref-html>
<ref-html id="bib1.bib2"><label>2</label><mixed-citation>
Chen, H., Wang, Z., Qizhong, W. U., Jianbin, W. U., Yan, P., Tang, X., and
Wang, Z.: Application of Air Quality Multi-Model Forecast System in
Guangzhou: Model Description and Evaluation of PM<sub>10</sub> Forecast Performance,
Clim. Environ. Res., 18, 427–435, 2013.
</mixed-citation></ref-html>
<ref-html id="bib1.bib3"><label>3</label><mixed-citation>
Chen, H. S., Wang, Z. F., Li, J., Tang, X., Ge, B. Z., Wu, X. L., Wild, O.,
and Carmichael, G. R.: GNAQPMS-Hg v1.0, a global nested atmospheric mercury
transport model: model description, evaluation and application to
trans-boundary transport of Chinese anthropogenic emissions, Geosci. Model
Dev., 8, 2857–2876, <a href="https://doi.org/10.5194/gmd-8-2857-2015" target="_blank">https://doi.org/10.5194/gmd-8-2857-2015</a>, 2015.
</mixed-citation></ref-html>
<ref-html id="bib1.bib4"><label>4</label><mixed-citation>
Feng, F., Wang, Z., Li, J., and Carmichael, G. R.: A nonnegativity preserved
efficient algorithm for atmospheric chemical kinetic equations, Appl.
Mathe. Comput., 271, 519–531, 2015.
</mixed-citation></ref-html>
<ref-html id="bib1.bib5"><label>5</label><mixed-citation>
Gao, M., Carmichael, G. R., Wang, Y., Saide, P. E., Yu, M., Xin, J., Liu, Z.,
and Wang, Z.: Modeling study of the 2010 regional haze event in the North
China Plain, Atmos. Chem. Phys., 16, 1673–1691,
<a href="https://doi.org/10.5194/acp-16-1673-2016" target="_blank">https://doi.org/10.5194/acp-16-1673-2016</a>, 2016.
</mixed-citation></ref-html>
<ref-html id="bib1.bib6"><label>6</label><mixed-citation>
Grell, G. A., Peckham, S. E., Schmitz, R., McKeen, S. A., Frost, G.,
Skamarock, W. C., and Eder, B.: Fully coupled “online” chemistry within
the WRF model, Atmos. Environ., 39, 6957–6975, <a href="https://doi.org/10.1016/j.atmosenv.2005.04.027" target="_blank">https://doi.org/10.1016/j.atmosenv.2005.04.027</a>, 2005.
</mixed-citation></ref-html>
<ref-html id="bib1.bib7"><label>7</label><mixed-citation>
Guenther, A., Karl, T., Harley, P., Wiedinmyer, C., Palmer, P. I., and Geron,
C.: Estimates of global terrestrial isoprene emissions using MEGAN (Model of
Emissions of Gases and Aerosols from Nature), Atmos. Chem. Phys., 6,
3181–3210, <a href="https://doi.org/10.5194/acp-6-3181-2006" target="_blank">https://doi.org/10.5194/acp-6-3181-2006</a>, 2006.
</mixed-citation></ref-html>
<ref-html id="bib1.bib8"><label>8</label><mixed-citation>
Guenther, A. B., Jiang, X., Heald, C. L., Sakulyanontvittaya, T., Duhl, T.,
Emmons, L. K., and Wang, X.: The Model of Emissions of Gases and Aerosols
from Nature version 2.1 (MEGAN2.1): an extended and updated framework for
modeling biogenic emissions, Geosci. Model Dev., 5, 1471–1492,
<a href="https://doi.org/10.5194/gmd-5-1471-2012" target="_blank">https://doi.org/10.5194/gmd-5-1471-2012</a>, 2012.
</mixed-citation></ref-html>
<ref-html id="bib1.bib9"><label>9</label><mixed-citation>
Gurjar, B. R., Ravindra, K., and Nagpure, A. S.: Air pollution trends over
Indian megacities and their local-to-global implications, Atmos.
Environ., 142, 475–495, <a href="https://doi.org/10.1016/j.atmosenv.2016.06.030" target="_blank">https://doi.org/10.1016/j.atmosenv.2016.06.030</a>, 2016.
</mixed-citation></ref-html>
<ref-html id="bib1.bib10"><label>10</label><mixed-citation>
Lawrence, B. N., Rezny, M., Budich, R., Bauer, P., Behrens, J., Carter, M.,
Deconinck, W., Ford, R., Maynard, C., Mullerworth, S., Osuna, C., Porter, A.,
Serradell, K., Valcke, S., Wedi, N., and Wilson, S.: Crossing the chasm: how
to develop weather and climate models for next generation computers?, Geosci.
Model Dev., 11, 1799–1821, <a href="https://doi.org/10.5194/gmd-11-1799-2018" target="_blank">https://doi.org/10.5194/gmd-11-1799-2018</a>, 2018.
</mixed-citation></ref-html>
<ref-html id="bib1.bib11"><label>11</label><mixed-citation>
Li, J., Wang, Z., Wang, X., Yamaji, K., Takigawa, M., Kanaya, Y., Pochanart,
P., Liu, Y., Irie, H., Hu, B., Tanimoto, H., and Akimoto, H.: Impacts of
aerosols on summertime tropospheric photolysis frequencies and
photochemistry over Central Eastern China, Atmos. Environ., 45,
1817–1829, <a href="https://doi.org/10.1016/j.atmosenv.2011.01.016" target="_blank">https://doi.org/10.1016/j.atmosenv.2011.01.016</a>,
2011.
</mixed-citation></ref-html>
<ref-html id="bib1.bib12"><label>12</label><mixed-citation>
Li, J., Wang, Z., Zhuang, G., Luo, G., Sun, Y., and Wang, Q.: Mixing of Asian mineral
dust with anthropogenic pollutants over East Asia: a model case study of a
super-duststorm in March 2010, Atmos. Chem. Phys., 12, 7591–7607,
<a href="https://doi.org/10.5194/acp-12-7591-2012" target="_blank">https://doi.org/10.5194/acp-12-7591-2012</a>, 2012.
</mixed-citation></ref-html>
<ref-html id="bib1.bib13"><label>13</label><mixed-citation>
Li, J., Wang, Z., Huang, H., Hu, M., Meng, F., Sun, Y., Wang, X., Wang, Y.,
and Wang, Q.: Assessing the effects of trans-boundary aerosol transport
between various city clusters on regional haze episodes in spring over East
China, Tellus B, 65, 20052,
<a href="https://doi.org/10.3402/tellusb.v65i0.20052" target="_blank">https://doi.org/10.3402/tellusb.v65i0.20052</a>, 2013.
</mixed-citation></ref-html>
<ref-html id="bib1.bib14"><label>14</label><mixed-citation>
Linford, J. C., Michalakes, J., Vachharajani, M., and Sandu, A.: Multi-core
acceleration of chemical kinetics for simulation and prediction, in:
Proceedings of the Conference on High Performance Computing Networking,
Storage and Analysis/ACM, 14–20 November 2009, Portland, Oregon, USA, 1–11,
2009.
</mixed-citation></ref-html>
<ref-html id="bib1.bib15"><label>15</label><mixed-citation>
Mielikainen, J., Huang, B., and Huang, A. H.-L.: Intel Xeon Phi accelerated
Weather Research and Forecasting (WRF) Goddard microphysics scheme, Geosci.
Model Dev. Discuss., 7, 8941–8973, <a href="https://doi.org/10.5194/gmdd-7-8941-2014" target="_blank">https://doi.org/10.5194/gmdd-7-8941-2014</a>,
2014.
</mixed-citation></ref-html>
<ref-html id="bib1.bib16"><label>16</label><mixed-citation>
San José, R., Pérez, J. L., Balzarini, A., Baró, R., Curci, G.,
Forkel, R., Galmarini, S., Grell, G., Hirtl, M., Honzak, L., Im, U.,
Jiménez-Guerrero, P., Langer, M., Pirovano, G., Tuccella, P., Werhahn,
J., and Žabkar, R.: Sensitivity of feedback effects in CBMZ/MOSAIC
chemical mechanism, Atmos. Environ., 115, 646–656,
<a href="https://doi.org/10.1016/j.atmosenv.2015.04.030" target="_blank">https://doi.org/10.1016/j.atmosenv.2015.04.030</a>, 2015.
</mixed-citation></ref-html>
<ref-html id="bib1.bib17"><label>17</label><mixed-citation>
Seinfeld, J. H. and Pandis, S. N.: Atmospheric Chemistry and Physics: From
Air Pollution to Climate Change, 2nd Edition, John Wiley &amp; Sons, New York,
USA, 2012.
</mixed-citation></ref-html>
<ref-html id="bib1.bib18"><label>18</label><mixed-citation>
Sindelarova, K., Granier, C., Bouarar, I., Guenther, A., Tilmes, S.,
Stavrakou, T., Müller, J.-F., Kuhn, U., Stefani, P., and Knorr, W.: Global
data set of biogenic VOC emissions calculated by the MEGAN model over the
last 30 years, Atmos. Chem. Phys., 14, 9317–9341,
<a href="https://doi.org/10.5194/acp-14-9317-2014" target="_blank">https://doi.org/10.5194/acp-14-9317-2014</a>, 2014.
</mixed-citation></ref-html>
<ref-html id="bib1.bib19"><label>19</label><mixed-citation>
Skamarock, W. C., Klemp, J. B., Dudhia, J., Gill, D. O., Barker, D. M.,
Duda, M. G., Huang, X.-Y., Wang, W., and Powers, J. G.: A description of the
advanced research WRF version 3, NCAR Technical Note NCAR/TN-475+STR,
2008.
</mixed-citation></ref-html>
<ref-html id="bib1.bib20"><label>20</label><mixed-citation>
Sodani, A., Gramunt, R., Corbal, J., Kim, H. S., Vinod, K., Chinthamani, S.,
Hutsell, S., Agarwal, R., and Liu, Y. C.: Knights Landing: Second-Generation
Intel Xeon Phi Product, IEEE Micro, 36, 34–46, 2016.
</mixed-citation></ref-html>
<ref-html id="bib1.bib21"><label>21</label><mixed-citation>
Wang, H., Chen, H., Wu, Q., Lin, J., Chen, X., Xie, X., Wang, R., Tang, X.,
and Wang, Z.: GNAQPMS v1.1: accelerating the Global Nested Air Quality
Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors, Geosci.
Model Dev., 10, 2891–2904, <a href="https://doi.org/10.5194/gmd-10-2891-2017" target="_blank">https://doi.org/10.5194/gmd-10-2891-2017</a>, 2017.
</mixed-citation></ref-html>
<ref-html id="bib1.bib22"><label>22</label><mixed-citation>
Wang, H., Lin, J., Wu, Q., Chen, H., Tang, X., Wang, Z.,
Chen, X., and Cheng, H.: Design a new architecture of CBMZ
gas-phase chemical mechanism for the next generation processors,
<a href="https://doi.org/10.5281/zenodo.1161576" target="_blank">https://doi.org/10.5281/zenodo.1161576</a>, 2018.
</mixed-citation></ref-html>
<ref-html id="bib1.bib23"><label>23</label><mixed-citation>
Wang, Y., Chen, H., Wu, Q., Chen, X., Wang, H., Gbaguidi, A., Wang, W., and
Wang, Z.: Three-year, 5&thinsp;km resolution China PM<sub>2.5</sub> simulation: Model
performance evaluation, Atmos. Res., 207, 1–13, <a href="https://doi.org/10.1016/j.atmosres.2018.02.016" target="_blank">https://doi.org/10.1016/j.atmosres.2018.02.016</a>, 2018.
</mixed-citation></ref-html>
<ref-html id="bib1.bib24"><label>24</label><mixed-citation>
Wang, Z., Maeda, T., Hayashi, M., Hsiao, L. F., and Liu, K. Y.: A Nested Air
Quality Prediction Modeling System for Urban and Regional Scales:
Application for High-Ozone Episode in Taiwan, Water Air Soil
Pollut., 130, 391–396, <a href="https://doi.org/10.1023/A:1013833217916" target="_blank">https://doi.org/10.1023/A:1013833217916</a>, 2001.
</mixed-citation></ref-html>
<ref-html id="bib1.bib25"><label>25</label><mixed-citation>
Wu, Q., Wang, Z., Gbaguidi, A., Tang, X., and Zhou, W.: Numerical Study of
The Effect of Traffic Restriction on Air Quality in Beijing, Sola, 6, 17–20,
2010.
</mixed-citation></ref-html>
<ref-html id="bib1.bib26"><label>26</label><mixed-citation>
Wu, Q. Z., Xu, W. S., Shi, A. J., Li, Y. T., Zhao, X. J., Wang, Z. F., Li, J.
X., and Wang, L. N.: Air quality forecast of PM<sub>10</sub> in Beijing with Community
Multi-scale Air Quality Modeling (CMAQ) system: emission and improvement,
Geosci. Model Dev., 7, 2243–2259, <a href="https://doi.org/10.5194/gmd-7-2243-2014" target="_blank">https://doi.org/10.5194/gmd-7-2243-2014</a>,
2014.
</mixed-citation></ref-html>
<ref-html id="bib1.bib27"><label>27</label><mixed-citation>
Xu, S., Huang, X., Oey, L.-Y., Xu, F., Fu, H., Zhang, Y., and Yang, G.:
POM.gpu-v1.0: a GPU-based Princeton Ocean Model, Geosci. Model Dev., 8,
2815–2827, <a href="https://doi.org/10.5194/gmd-8-2815-2015" target="_blank">https://doi.org/10.5194/gmd-8-2815-2015</a>, 2015.
</mixed-citation></ref-html>
<ref-html id="bib1.bib28"><label>28</label><mixed-citation>
Zaveri, R. A. and Peters, L. K.: A new lumped structure photochemical
mechanism for large-scale applications, J. Geophys. Res.-Atmos.,
104, 30387–30415, 1999.
</mixed-citation></ref-html>
<ref-html id="bib1.bib29"><label>29</label><mixed-citation>
Zhang, Q., Jiang, X., Tong, D., Davis, S. J., Zhao, H., Geng, G., Feng, T.,
Zheng, B., Lu, Z., Streets, D. G., Ni, R., Brauer, M., van Donkelaar, A.,
Martin, R. V., Huo, H., Liu, Z., Pan, D., Kan, H., Yan, Y., Lin, J., He, K.,
and Guan, D.: Transboundary health impacts of transported global air
pollution and international trade, Nature, 543, 705–709,
<a href="https://doi.org/10.1038/nature21712" target="_blank">https://doi.org/10.1038/nature21712</a>, 2017.
</mixed-citation></ref-html>
<ref-html id="bib1.bib30"><label>30</label><mixed-citation>
ZiFa, W., FuYing, X., XiQuan, W., JunLing, A., and Jiang, Z.: Development
and Application of Nested Air Quality Prediction Modeling System, Chin. J.
Atmos. Sci., 30, 778–790, 2006.
</mixed-citation></ref-html>
<ref-html id="bib1.bib31"><label>31</label><mixed-citation>
Zimmermann, J. and Poppe, D.: A Supplement for the RADM2 Chemical
Mechanism: The Photooxidation of Isoprene, Atmos. Environ., 30,
1255–1269, 1994.
</mixed-citation></ref-html>--></article>
