<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" xml:lang="en" dtd-version="3.0" article-type="research-article"><?xmltex \makeatother\@nolinetrue\makeatletter?>
  <front>
    <journal-meta><journal-id journal-id-type="publisher">GMD</journal-id><journal-title-group>
    <journal-title>Geoscientific Model Development</journal-title>
    <abbrev-journal-title abbrev-type="publisher">GMD</abbrev-journal-title><abbrev-journal-title abbrev-type="nlm-ta">Geosci. Model Dev.</abbrev-journal-title>
  </journal-title-group><issn pub-type="epub">1991-9603</issn><publisher>
    <publisher-name>Copernicus Publications</publisher-name>
    <publisher-loc>Göttingen, Germany</publisher-loc>
  </publisher></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5194/gmd-14-7411-2021</article-id><title-group><article-title>Machine-learning models to replicate large-eddy simulations of air pollutant concentrations along boulevard-type streets</article-title><alt-title>Machine-learning models to replicate large-eddy simulations</alt-title>
      </title-group><?xmltex \runningtitle{Machine-learning models to replicate large-eddy simulations}?><?xmltex \runningauthor{M.~Lange et al.}?>
      <contrib-group>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Lange</surname><given-names>Moritz</given-names></name>
          
        <ext-link>https://orcid.org/0000-0001-7109-7813</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Suominen</surname><given-names>Henri</given-names></name>
          
        <ext-link>https://orcid.org/0000-0001-8814-1040</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2">
          <name><surname>Kurppa</surname><given-names>Mona</given-names></name>
          
        <ext-link>https://orcid.org/0000-0003-2538-1068</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2 aff3">
          <name><surname>Järvi</surname><given-names>Leena</given-names></name>
          
        <ext-link>https://orcid.org/0000-0002-5224-3448</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Oikarinen</surname><given-names>Emilia</given-names></name>
          
        <ext-link>https://orcid.org/0000-0002-9623-6282</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Savvides</surname><given-names>Rafael</given-names></name>
          
        <ext-link>https://orcid.org/0000-0002-5796-6274</ext-link></contrib>
        <contrib contrib-type="author" corresp="yes" rid="aff1 aff2">
          <name><surname>Puolamäki</surname><given-names>Kai</given-names></name>
          <email>kai.puolamaki@helsinki.fi</email>
        <ext-link>https://orcid.org/0000-0003-1819-1047</ext-link></contrib>
        <aff id="aff1"><label>1</label><institution>Department of Computer Science, University of Helsinki, Helsinki, Finland</institution>
        </aff>
        <aff id="aff2"><label>2</label><institution>Institute of Atmospheric and Earth System Research (INAR)/Physics, University of Helsinki, Helsinki, Finland</institution>
        </aff>
        <aff id="aff3"><label>3</label><institution>Helsinki Institute of Sustainability Science, University of Helsinki, Helsinki, Finland</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">Kai Puolamäki (kai.puolamaki@helsinki.fi)</corresp></author-notes><pub-date><day>2</day><month>December</month><year>2021</year></pub-date>
      
      <volume>14</volume>
      <issue>12</issue>
      <fpage>7411</fpage><lpage>7424</lpage>
      <history>
        <date date-type="received"><day>18</day><month>June</month><year>2020</year></date>
           <date date-type="rev-request"><day>31</day><month>August</month><year>2020</year></date>
           <date date-type="rev-recd"><day>14</day><month>June</month><year>2021</year></date>
           <date date-type="accepted"><day>5</day><month>October</month><year>2021</year></date>
      </history>
      <permissions>
        <copyright-statement>Copyright: © 2021 Moritz Lange et al.</copyright-statement>
        <copyright-year>2021</copyright-year>
      <license license-type="open-access"><license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p></license></permissions><self-uri xlink:href="https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021.html">This article is available from https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021.html</self-uri><self-uri xlink:href="https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021.pdf">The full text article is available as a PDF file from https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021.pdf</self-uri>
      <abstract><title>Abstract</title>

      <p id="d1e149">Running large-eddy simulations (LESs) can be too burdensome and computationally expensive for practical applications such as supporting urban planning. In this study, regression models are used to replicate modelled air pollutant concentrations from LES in urban boulevards. We study the performance of regression models and discuss how to detect situations where the models are applied outside their training domain and their outputs cannot be trusted.
Regression models from 10 different model families are trained and a cross-validation methodology is used to evaluate their performance and to find the best set of features needed to reproduce the LES outputs. We also test the regression models on an independent testing dataset.
Our results suggest that in general, log-linear regression gives the best and most robust performance on new independent data. It clearly outperforms the dummy model which would predict constant concentrations for all locations (multiplicative minimum RMSE (mRMSE) of <inline-formula><mml:math id="M1" display="inline"><mml:mn mathvariant="normal">0.76</mml:mn></mml:math></inline-formula> vs. <inline-formula><mml:math id="M2" display="inline"><mml:mn mathvariant="normal">1.78</mml:mn></mml:math></inline-formula> of the dummy model). Furthermore, we demonstrate that it is possible to detect concept drift, i.e. situations where the model is applied outside its training domain and a new LES run may be necessary to obtain reliable results.
Regression models can replace LES in estimating air pollutant concentrations, unless higher accuracy is needed. To obtain reliable results, however, model and feature selection must be done carefully to avoid overfitting, and methods for detecting concept drift should be applied.</p>
  </abstract>
    </article-meta>
  </front>
<body>
      

<sec id="Ch1.S1" sec-type="intro">
  <label>1</label><title>Introduction</title>
      <p id="d1e175">Exposure to ambient air pollution leads to cardiovascular and pulmonary diseases, and is estimated to cause 3 million premature deaths worldwide every year <xref ref-type="bibr" rid="bib1.bibx25 bib1.bibx38" id="paren.1"/>, of which 0.8 million occur in Europe <xref ref-type="bibr" rid="bib1.bibx26" id="paren.2"/>. Urban areas are generally characterized not only by high population densities but also by higher air pollutant concentration levels than rural areas. The degraded air quality, particularly in street canyons, results from high local emissions, such as traffic combustion near the ground, as well as limited dispersion of these traffic-related pollutants. Streets with traffic are generally flanked with buildings and/or vegetation that inhibit pollutant ventilation upwards from the pedestrian level. This matters because dispersion is the main factor determining air quality <xref ref-type="bibr" rid="bib1.bibx19" id="paren.3"><named-content content-type="pre">e.g.</named-content></xref>. Furthermore, these obstacles block, decelerate and modify air flow, leading to a highly turbulent wind field and pollutant dispersion patterns <xref ref-type="bibr" rid="bib1.bibx5" id="paren.4"/>. Consequently, certain urban planning solutions can be applied to enhance air pollutant dispersion and hence to improve local air quality to some extent <xref ref-type="bibr" rid="bib1.bibx20 bib1.bibx30 bib1.bibx40" id="paren.5"><named-content content-type="pre">see, e.g.</named-content></xref>. This, in turn, creates a need for high-resolution, building-resolving air pollution modelling.</p>
      <p id="d1e197">Successful modelling of urban air pollutant dispersion requires taking into account the detailed properties of adjacent buildings and vegetation in the area of interest as well as in its surroundings. To date, high-resolution dispersion modelling has mainly been based on physical modelling<?pagebreak page7412?> techniques, of which computational fluid dynamics (CFD) models, notably Reynolds-averaged Navier–Stokes (RANS) equations and large-eddy simulation (LES), are the most applicable tools for the purpose. CFD models solve the flow and dispersion around individual buildings, and with constantly increasing computational resources the modelling domains can currently be extended to cover entire neighbourhoods and even cities <xref ref-type="bibr" rid="bib1.bibx3" id="paren.6"/>. However, conducting reliable and high-quality CFD simulations requires expertise in model application. Moreover, for building-resolving simulations that apply a grid resolution of 2 <inline-formula><mml:math id="M3" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:math></inline-formula> or finer, LES in particular requires supercomputing resources. At the same time, LES has been found to be more accurate than RANS in resolving finer-scale details <xref ref-type="bibr" rid="bib1.bibx35 bib1.bibx36" id="paren.7"/> and is therefore particularly suitable for modelling flow and pollutant concentrations within real urban environments. In contrast to CFD, statistical models based on machine learning may offer a significantly less expensive alternative for predicting urban air quality and pollutant dispersion. Consequently, the number of studies conducting machine-learning-based air quality modelling has increased rapidly <xref ref-type="bibr" rid="bib1.bibx34" id="paren.8"/>.</p>
      <p id="d1e217">Machine learning allows a relationship to be found between a target variable, e.g. the concentration of air pollutants at a certain location, and its predictors, which are often called features. Machine-learning models of this type are called regression models. The models are trained on a specified training dataset: following some rule, e.g. maximizing the likelihood under the assumption that the relationship is linear with normally distributed noise, they learn a relationship between the target variable and its features. To evaluate the trustworthiness of the results, a part of the available data is often withheld from training and used for evaluation instead. These “unseen” evaluation data give an estimate of the model performance in a realistic urban planning scenario. Perhaps the largest advantage of regression models over CFD models is their speed. However, the increase in speed comes at the cost of accuracy. Another disadvantage is that accurate predictions require the predicted data to follow approximately the same distribution as the training data, which restricts the use of the models to modelling setups similar to those on which they have been trained.</p>
      <p id="d1e220">Most of the previous studies on developing a statistical air pollution model using machine learning have been based on field measurements, and the spatiotemporal distribution of pollutants has been assessed by utilizing multiple stationary sites in model training <xref ref-type="bibr" rid="bib1.bibx2 bib1.bibx39" id="paren.9"><named-content content-type="pre">e.g.</named-content></xref>. To further improve the spatial resolution of modelling in urban areas, mobile air quality measurements have also been utilized <xref ref-type="bibr" rid="bib1.bibx1 bib1.bibx13 bib1.bibx18 bib1.bibx37" id="paren.10"><named-content content-type="pre">e.g.</named-content></xref>. However, due to spatial accuracy constraints, the spatial resolution applied has been limited to the order of 10–50 <inline-formula><mml:math id="M4" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:math></inline-formula>, which does not allow investigating the impact of individual buildings on pollutant dispersion. Output data from a numerical model have been used for training, but only at the regional scale with a 5–15 <inline-formula><mml:math id="M5" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">km</mml:mi></mml:mrow></mml:math></inline-formula> resolution <xref ref-type="bibr" rid="bib1.bibx7 bib1.bibx32" id="paren.11"/>. Machine-learning studies applying LES data have so far been mainly restricted to turbulence closure modelling <xref ref-type="bibr" rid="bib1.bibx17" id="paren.12"><named-content content-type="pre">e.g.</named-content></xref>.</p>
      <p id="d1e259">In this study, the application of machine learning for emulating LES outputs of local-scale air pollutant dispersion in urban areas is investigated. We use LES outputs from two different studies conducted in different boulevard-type street canyons in Helsinki. Specifically, we construct features describing the neighbourhoods around the street canyons from the LES inputs, use these features to train the machine-learning models, and then evaluate the performance and reliability of different algorithms. The motivation is to approximate the computationally expensive simulations with machine-learning models that are faster to evaluate. The ultimate goal is to develop a model that can easily be applied to support urban planning.</p>
      <p id="d1e262">This study is structured as follows: first, the LES datasets and feature construction are described. Then, brief descriptions of the machine-learning models used and of the training and evaluation process are provided. Finally, the applications, limitations and future work are discussed.</p>
</sec>
<sec id="Ch1.S2">
  <label>2</label><title>Methods and material</title>
      <p id="d1e273">In this section, we introduce the LES datasets used in this study and explain our pre-processing steps. Then, we present the features available to the regression models. Finally, we describe how the optimal set of features is chosen by forward feature selection and describe the performance measures used.
All analyses were carried out in R version 3.6.2 <xref ref-type="bibr" rid="bib1.bibx33" id="paren.13"/>.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F1" specific-use="star"><?xmltex \currentcnt{1}?><?xmltex \def\figurename{Figure}?><label>Figure 1</label><caption><p id="d1e281">Simulation domains of the LES output data applied in the study: <bold>(a)</bold> four city-planning alternatives (V1–V4) investigated in KU18 <xref ref-type="bibr" rid="bib1.bibx20" id="paren.14"/> and <bold>(b)</bold> city-boulevard scenario S1 and its surroundings studied in KA20 <xref ref-type="bibr" rid="bib1.bibx14" id="paren.15"/>. Green dots illustrate trees. The city boulevard is 54  and 58 <inline-formula><mml:math id="M6" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:math></inline-formula> wide in KU18 and KA20, respectively.</p></caption>
        <?xmltex \igopts{width=369.885827pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021-f01.png"/>

      </fig>

<sec id="Ch1.S2.SS1">
  <label>2.1</label><title>Large-eddy simulation datasets</title>
      <p id="d1e317">LES models solve the three-dimensional prognostic equations for momentum and scalar variables. In LES, all turbulence scales larger than a chosen filter width are resolved directly. The smaller scales, which should represent less than 10 % of the turbulence energy <xref ref-type="bibr" rid="bib1.bibx12" id="paren.16"/>, are parameterized using a subgrid-scale model. This study uses output data from two studies to train and evaluate the regression methods: <xref ref-type="bibr" rid="bib1.bibx20" id="text.17"/> and <xref ref-type="bibr" rid="bib1.bibx14" id="text.18"/>. Both studies apply the LES model PALM <xref ref-type="bibr" rid="bib1.bibx27 bib1.bibx28" id="paren.19"/> to assess the impact of city planning on pedestrian-level air quality.</p>
      <?pagebreak page7413?><p id="d1e332">The first study by <xref ref-type="bibr" rid="bib1.bibx20" id="text.20"><named-content content-type="post">model revision 1904, hereafter KU18</named-content></xref> investigates the impact of city-block orientation and variation in the building height and shape on the dispersion of traffic-related air pollutants. Specifically, LES is run over a 54 <inline-formula><mml:math id="M7" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:math></inline-formula>-wide city boulevard applying four alternative city-planning solutions (V1–V4 in Fig. <xref ref-type="fig" rid="Ch1.F1"/>a). Here, simulations with a neutral atmospheric stratification and wind from the east (90<inline-formula><mml:math id="M8" display="inline"><mml:msup><mml:mi/><mml:mo>∘</mml:mo></mml:msup></mml:math></inline-formula>) and the south-west (225<inline-formula><mml:math id="M9" display="inline"><mml:msup><mml:mi/><mml:mo>∘</mml:mo></mml:msup></mml:math></inline-formula>) are considered. Air pollutant dispersion is studied using a Lagrangian particle model embedded in PALM. Air pollutants are represented as inert particles that are released above streets with traffic and follow the air flow without interacting with any surface. In the present study, 40 min averaged concentration fields from KU18 are employed.</p>
      <p id="d1e368">The second study by <xref ref-type="bibr" rid="bib1.bibx14" id="text.21"><named-content content-type="post">model revision 3698, hereafter KA20</named-content></xref> assesses the impact of street-tree layout on the concentrations of traffic-related aerosol particles. The study is conducted over a 50 to 58 <inline-formula><mml:math id="M10" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:math></inline-formula>-wide city boulevard (Fig. <xref ref-type="fig" rid="Ch1.F1"/>b) with neutral atmospheric stratification under two wind directions, parallel and perpendicular to the boulevard. In contrast to KU18, aerosol particle concentrations and size distributions as well as aerosol dry deposition on surfaces and vegetation are explicitly modelled by applying an aerosol module embedded in PALM <xref ref-type="bibr" rid="bib1.bibx21" id="paren.22"/>. The aerosol module applies the Eulerian approach. Here, 1 h averaged concentration fields of PM<inline-formula><mml:math id="M11" display="inline"><mml:msub><mml:mi/><mml:mn mathvariant="normal">2.5</mml:mn></mml:msub></mml:math></inline-formula> (particulate matter with aerodynamic diameter <inline-formula><mml:math id="M12" display="inline"><mml:mrow><mml:mo>&lt;</mml:mo><mml:mn mathvariant="normal">2.5</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M13" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:math></inline-formula>) for one modelling scenario (S1) are applied.</p>
      <p id="d1e419">Both studies apply a grid spacing of 1.0 <inline-formula><mml:math id="M14" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:math></inline-formula> in the horizontal and 0.75–1.0 <inline-formula><mml:math id="M15" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:math></inline-formula> in the vertical to directly resolve the most relevant turbulent structures related to buildings and vegetation. The surface description is given to PALM as maps of topography elevation, tree height and emission strength per area. The simulations were conducted in a supercomputing environment and each simulation took approximately <inline-formula><mml:math id="M16" display="inline"><mml:mrow><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mn mathvariant="normal">3</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> d of CPU time.</p>
      <p id="d1e450">The machine-learning models in this study are trained with KU18 and evaluated with KA20. Using KA20 for evaluation mimics a realistic urban planning scenario, where KA20 would correspond to a new city plan considered by the urban planner.
KU18 is selected as the training dataset because it has a greater variety of building layouts than KA20, in which the variation is mainly limited to different street-tree scenarios. Clear differences in the building layouts lead to differing pollutant dispersion and concentration distributions, which improves the generalization performance. An alternative approach for training would entail creating training and evaluation data using random samples from both KU18 and KA20. However, random sampling is impractical in this case due to significant differences between KU18 and KA20. KU18 simulates dispersion qualitatively and assumes weightless and inert particles that imitate air pollutants in general. In contrast, KA20 models realistic aerosol particle concentration values and includes realistic model physics for the aerosol dry deposition. These differences affect the scaling and distributions of the simulated particle concentrations (Fig. <xref ref-type="fig" rid="Ch1.F2"/>), which is further discussed in Sects. <xref ref-type="sec" rid="Ch1.S3.SS2"/> and <xref ref-type="sec" rid="Ch1.S3.SS3"/>.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F2" specific-use="star"><?xmltex \currentcnt{2}?><?xmltex \def\figurename{Figure}?><label>Figure 2</label><caption><p id="d1e461">Histograms of air pollutant concentrations in KU18 and KA20. Note the log scale of the counts on the <inline-formula><mml:math id="M17" display="inline"><mml:mi>y</mml:mi></mml:math></inline-formula> axis and the differing scales on the <inline-formula><mml:math id="M18" display="inline"><mml:mi>x</mml:mi></mml:math></inline-formula> axis. KA20 has units <inline-formula><mml:math id="M19" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">kg</mml:mi><mml:mspace width="0.125em" linebreak="nobreak"/><mml:msup><mml:mi mathvariant="normal">m</mml:mi><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">3</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>.</p></caption>
          <?xmltex \igopts{width=455.244094pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021-f02.png"/>

        </fig>

</sec>
<sec id="Ch1.S2.SS2">
  <label>2.2</label><title>Data pre-processing</title>
      <p id="d1e510">The LES outputs KU18 and KA20 are pre-processed into a suitable format for training the regression models. The pre-processing is divided into two parts: aggregating the target variable over time and height, and constructing expressive features from the LES inputs.</p>
<sec id="Ch1.S2.SS2.SSS1">
  <label>2.2.1</label><title>Target variable</title>
      <p id="d1e520">In regression, the target variable is predicted using predictor variables (features). Here, the target variable is the (air) pollutant concentration <inline-formula><mml:math id="M20" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula>. The raw LES outputs contain <inline-formula><mml:math id="M21" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> on a spatiotemporal grid <inline-formula><mml:math id="M22" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>.
We pre-process the target variable by averaging <inline-formula><mml:math id="M23" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> over time and considering
a fixed height (at which pollutants are emitted). Then, <inline-formula><mml:math id="M24" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> values are predicted on the <inline-formula><mml:math id="M25" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> grid instead of the <inline-formula><mml:math id="M26" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> grid. This simplifies the modelling task significantly while still retaining the relevant information on <inline-formula><mml:math id="M27" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> at the pedestrian level.</p>
      <p id="d1e638">In addition to time averaging, we also restrict the spatial<?pagebreak page7414?> extent in which the regression models are trained. The area of interest in this study is the city boulevard in the middle of the maps, which is the same as in the original studies that generated KU18 and KA20. In KA20, the boulevard is surrounded by artificial buildings that imitate the aerodynamic roughness of a suburban environment; these are of no interest here and are therefore omitted from the model development. Figure <xref ref-type="fig" rid="Ch1.F3"/> presents the exact areas used for modelling.
The spatiotemporal aggregate is calculated by temporally averaging the LES data for the first 100 s displaying stable behaviour. Furthermore, only one vertical level is considered: 4  and 0.88 <inline-formula><mml:math id="M28" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:math></inline-formula> above ground level for KU18 and KA20, respectively. This exact vertical level is chosen since it is the one at which the studied air pollutants are released in their respective simulations. In order to make the datasets more comparable, the background concentrations in KA20 are removed using the transformation <inline-formula><mml:math id="M29" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:msub><mml:mi>c</mml:mi><mml:mtext>new</mml:mtext></mml:msub><mml:mspace width="0.125em" linebreak="nobreak"/></mml:mrow></mml:math></inline-formula>=<inline-formula><mml:math id="M30" display="inline"><mml:mrow><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mi>p</mml:mi><mml:msub><mml:mi>c</mml:mi><mml:mtext>old</mml:mtext></mml:msub><mml:mo>-</mml:mo><mml:mo>min⁡</mml:mo><mml:mo>(</mml:mo><mml:mi>p</mml:mi><mml:msub><mml:mi>c</mml:mi><mml:mtext>old</mml:mtext></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
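      <p>The aggregation of the target variable can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the code used in this study (the analyses were carried out in R). The nested-list layout <monospace>pc[t][z][y][x]</monospace>, the function names and the toy values are assumptions made for the example.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
from statistics import mean


def aggregate_target(pc, z_index, t_start, t_end):
    """Average pc[t][z][y][x] over time steps t_start..t_end-1 at one vertical level.

    Returns a 2-D list indexed as [y][x].
    """
    window = pc[t_start:t_end]
    n_y = len(window[0][z_index])
    n_x = len(window[0][z_index][0])
    return [
        [mean(frame[z_index][y][x] for frame in window) for x in range(n_x)]
        for y in range(n_y)
    ]


def remove_background(pc_xy):
    """Subtract the domain-wide minimum: pc_new = pc_old - min(pc_old)."""
    background = min(min(row) for row in pc_xy)
    return [[value - background for value in row] for row in pc_xy]


# Hypothetical toy field: 3 time steps, 2 vertical levels, 2 x 2 horizontal grid.
toy_pc = [
    [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]],
    [[[3.0, 4.0], [5.0, 6.0]], [[7.0, 8.0], [9.0, 10.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[9.0, 10.0], [11.0, 12.0]]],
]

# Time-average the second vertical level, then remove the background.
pc_mean = aggregate_target(toy_pc, z_index=1, t_start=0, t_end=3)
pc_clean = remove_background(pc_mean)
```
]]></preformat>
      <p>In this sketch the background removal is applied to the aggregated two-dimensional field, which is one possible reading of the transformation above. After the transformation, the smallest value in the field is exactly zero.</p>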
</sec>
<sec id="Ch1.S2.SS2.SSS2">
  <label>2.2.2</label><title>Features</title>
      <p id="d1e702">The target variable is predicted using features that are defined for each <inline-formula><mml:math id="M31" display="inline"><mml:mrow><mml:mn mathvariant="normal">1</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow><mml:mo>×</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> surface pixel in the modelling domain. The features include direct inputs to the LES and surrogate features which are constructed from these inputs. The direct LES inputs are “height of the topography”, which includes solid obstacles such as buildings, “height of the canopy” and “amount of pollutant emissions”. The features used for training the regression models are described briefly in Table <xref ref-type="table" rid="Ch1.T1"/> and in more detail in Table <xref ref-type="table" rid="App1.Ch1.S1.T5"/>.</p>
      <p id="d1e729">The surrogate features provide information about spatial dependencies in the modelling domain, reducing the need for the regression models to explicitly model these dependencies. For instance, applying convolutions of various sizes on pollutant emissions creates surrogate features of average pollutant emission densities over a spatial neighbourhood, weighted by proximity to the point in question. Other features, such as the height-to-width ratio of a street canyon, are created based on domain knowledge. Incorporating domain knowledge is important, since well-crafted input features largely determine the quality of modelling results. Furthermore, understandable features aid experts in interpreting the physical meaning of the results.</p>
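      <p>A convolution-based surrogate feature of this kind can be sketched as follows. This is a minimal Python illustration and not the implementation used in this study. The kernel truncation at three standard deviations, the treatment of cells outside the map as zero emission, and all function names are assumptions.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math


def gaussian_kernel_1d(sigma, truncate=3.0):
    """Normalized 1-D Gaussian weights, truncated at `truncate` standard deviations."""
    radius = int(truncate * sigma + 0.5)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(grid, kernel):
    """Convolve every row with the kernel; cells outside the map count as zero."""
    radius = len(kernel) // 2
    n_x = len(grid[0])
    smoothed = []
    for row in grid:
        new_row = []
        for x in range(n_x):
            acc = 0.0
            for k, weight in enumerate(kernel):
                xx = x + k - radius
                if 0 <= xx < n_x:
                    acc += weight * row[xx]
            new_row.append(acc)
        smoothed.append(new_row)
    return smoothed


def transpose(grid):
    return [list(column) for column in zip(*grid)]


def gaussian_convolution(emissions, sigma):
    """Separable 2-D Gaussian convolution: smooth along x, then along y."""
    kernel = gaussian_kernel_1d(sigma)
    along_x = smooth_rows(emissions, kernel)
    return transpose(smooth_rows(transpose(along_x), kernel))


# Hypothetical toy map: one emitting pixel in the centre of a 9 x 9 domain.
size = 9
emissions = [[0.0] * size for _ in range(size)]
emissions[4][4] = 1.0

# One surrogate feature map per kernel width, in pixels (1 m each).
features = {sigma: gaussian_convolution(emissions, sigma) for sigma in (1, 2, 4, 8, 16)}
```
]]></preformat>
      <p>Each resulting map gives, for every pixel, a proximity-weighted average of the emissions around it. Larger values of the standard deviation spread the influence of a source over a wider neighbourhood.</p>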

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T1" specific-use="star"><?xmltex \currentcnt{1}?><label>Table 1</label><caption><p id="d1e735">List of features used in the regression models.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="2">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Feature name</oasis:entry>
         <oasis:entry colname="col2">Description</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Building height</oasis:entry>
         <oasis:entry colname="col2">Height of the closest building</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Canopy height</oasis:entry>
         <oasis:entry colname="col2">Height of the vegetation canopy</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Courtyard</oasis:entry>
         <oasis:entry colname="col2">Binary variable indicating presence of a courtyard</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Direction of closest building</oasis:entry>
         <oasis:entry colname="col2">Direction to the closest building relative to the wind direction</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Distance to building downwind</oasis:entry>
         <oasis:entry colname="col2">Distance to closest building in the same direction as the wind</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Distance to building upwind</oasis:entry>
         <oasis:entry colname="col2">Distance to closest building in the direction against the wind</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Height to width ratio</oasis:entry>
         <oasis:entry colname="col2">Height of the closest building relative to the width of the street</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Pollutant emissions</oasis:entry>
         <oasis:entry colname="col2">Emission level</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Pollutant emissions convolution, <inline-formula><mml:math id="M32" display="inline"><mml:mrow><mml:mi mathvariant="italic">σ</mml:mi><mml:mspace linebreak="nobreak" width="0.125em"/></mml:mrow></mml:math></inline-formula>=<inline-formula><mml:math id="M33" display="inline"><mml:mrow><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">2</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">4</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">8</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">16</mml:mn><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col2">Gaussian convolution of the emissions with standard deviation <inline-formula><mml:math id="M34" display="inline"><mml:mi mathvariant="italic">σ</mml:mi></mml:math></inline-formula></oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Pollutant emissions convolution upwind, <inline-formula><mml:math id="M35" display="inline"><mml:mrow><mml:mi mathvariant="italic">σ</mml:mi><mml:mspace width="0.125em" linebreak="nobreak"/></mml:mrow></mml:math></inline-formula>=<inline-formula><mml:math id="M36" display="inline"><mml:mrow><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">8</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">16</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">32</mml:mn><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col2">Convolution of the emissions upwind with standard deviation <inline-formula><mml:math id="M37" display="inline"><mml:mi mathvariant="italic">σ</mml:mi></mml:math></inline-formula></oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Street</oasis:entry>
         <oasis:entry colname="col2">Binary variable indicating presence of a street</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Street width</oasis:entry>
         <oasis:entry colname="col2">Width of the street</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <?xmltex \floatpos{t}?><fig id="Ch1.F3" specific-use="star"><?xmltex \currentcnt{3}?><?xmltex \def\figurename{Figure}?><label>Figure 3</label><caption><p id="d1e954">The map area cutout for which pollutant concentrations are modelled (in blue) for each of the city plans (V1–V4) from <xref ref-type="bibr" rid="bib1.bibx20" id="text.23"/> and for the map area from <xref ref-type="bibr" rid="bib1.bibx14" id="text.24"/> (rightmost picture).</p></caption>
            <?xmltex \igopts{width=398.338583pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021-f03.png"/>

          </fig>

</sec>
</sec>
<sec id="Ch1.S2.SS3">
  <label>2.3</label><title>Forward feature selection</title>
      <p id="d1e978">After calculating all potentially useful input features, a subset of features to be used with each model is selected using forward selection <xref ref-type="bibr" rid="bib1.bibx11" id="paren.25"/>. Forward selection is a feature selection algorithm in which a model is trained iteratively with progressively larger feature subsets.
Initially, every feature is used as a single predictor to train the model, and the best-performing feature is selected. The model is then re-trained using this feature along with every other feature as a second predictor, and the feature that yields the best-performing pair is selected. This process is repeated until either all features are selected or no additional feature improves the model.
Forward selection limits the search space of all possible feature combinations considerably. While it is not guaranteed to find the globally best subset of features, it finds a local optimum. An advantage of forward selection that is relevant to this study is that it tends to avoid selecting strongly correlated predictors, such as the different-sized convolutions of pollutant emissions (Table <xref ref-type="table" rid="Ch1.T1"/>), because a feature that is strongly correlated with an already selected feature adds little new information.</p>
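      <p>The greedy procedure described above can be sketched as follows. This is a minimal illustration in Python, not the R implementation used in this study; the function name <monospace>forward_selection</monospace>, the toy error function and the feature names are hypothetical, and <monospace>score</monospace> stands for any error measure for which lower values are better.</p>
<preformat><![CDATA[
```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    features: Sequence[str],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Greedy forward selection; `score` returns an error (lower is better)."""
    selected: List[str] = []
    best_score = float("inf")
    remaining = list(features)
    while remaining:
        # Try adding each remaining feature to the current subset.
        trials = [(score(selected + [f]), f) for f in remaining]
        trial_score, trial_feature = min(trials)
        if trial_score >= best_score:
            break  # no additional feature improves the model
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best_score = trial_score
    return selected, best_score


def toy_error(subset: List[str]) -> float:
    # Hypothetical error: two features help, one ("noise") hurts.
    gains = {"emissions": 0.5, "street_width": 0.2, "noise": -0.1}
    return 1.0 - sum(gains[f] for f in subset)


chosen, err = forward_selection(["noise", "street_width", "emissions"], toy_error)
```
]]></preformat>
      <p>In this toy case the search stops after two features, because adding the third one would increase the error.</p>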
      <?pagebreak page7415?><p id="d1e986">The best-performing features are selected according to a selection criterion. Here, the selection criterion is the cross-validation root mean squared error (RMSE). Cross-validation is a technique for evaluating generalization performance of statistical models and for detecting overfitting. In cross-validation, the data are split into <inline-formula><mml:math id="M38" display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> random subsets, out of which <inline-formula><mml:math id="M39" display="inline"><mml:mrow><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula> are used to train the model, and one is used to validate the model. Due to the spatial auto-correlation in the LES data, a random split by sampling data points from all maps does not lead to statistically independent subsets. In order to ensure maximal independence between the training and validation data, the random split is performed at the level of city blocks.
In each cross-validation split, the model is trained with three city plans under one wind direction and validated with the fourth city plan under the other wind direction. This is repeated for all four city plans using both wind directions. As an example, one split uses city plans V1, V2 and V3 with a wind direction of 90<inline-formula><mml:math id="M40" display="inline"><mml:msup><mml:mi/><mml:mo>∘</mml:mo></mml:msup></mml:math></inline-formula> as training data and V4 with a wind direction of 225<inline-formula><mml:math id="M41" display="inline"><mml:msup><mml:mi/><mml:mo>∘</mml:mo></mml:msup></mml:math></inline-formula> as validation data. Using each city plan for a given wind direction as unique validation data results in eight splits. The aggregated cross-validation error is then calculated as the error of the combined predictions of all splits.</p>
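      <p>The eight splits and the pooled error can be enumerated as in the following sketch. It is written in Python for illustration only; the function names are hypothetical, and the two wind directions (90 and 225 degrees) are taken from the example above.</p>
<preformat><![CDATA[
```python
import math
from itertools import product

CITY_PLANS = ["V1", "V2", "V3", "V4"]
WIND_DIRECTIONS = [90, 225]  # degrees, as in the example in the text


def make_splits():
    """Return (training, validation) pairs of (plan, wind) tuples."""
    splits = []
    for held_out, val_wind in product(CITY_PLANS, WIND_DIRECTIONS):
        # Train on the other three plans under the other wind direction.
        train_wind = next(w for w in WIND_DIRECTIONS if w != val_wind)
        training = [(p, train_wind) for p in CITY_PLANS if p != held_out]
        splits.append((training, (held_out, val_wind)))
    return splits


def pooled_rmse(residuals_per_split):
    """RMSE over the combined residuals of all splits."""
    pooled = [r for residuals in residuals_per_split for r in residuals]
    return math.sqrt(sum(r * r for r in pooled) / len(pooled))


splits = make_splits()
```
]]></preformat>
      <p>Pooling the residuals before taking the root mean square weights each data point equally, whereas averaging per-split RMSE values would weight each split equally.</p>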

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T2" specific-use="star"><?xmltex \currentcnt{2}?><label>Table 2</label><caption><p id="d1e1029">Models and their implementation in R (version 3.6.2). All but two models estimate pollutant concentrations (<inline-formula><mml:math id="M42" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula>) directly. For log-linear regression and logarithmic support vector regression, the model internally estimates log(<inline-formula><mml:math id="M43" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula>), where <inline-formula><mml:math id="M44" display="inline"><mml:mrow><mml:mo>+</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula> is added because <inline-formula><mml:math id="M45" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> contains zero values. The predictions are transformed back to the original scale after being computed.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="3">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Model name</oasis:entry>
         <oasis:entry colname="col2">Description</oasis:entry>
         <oasis:entry colname="col3">Implementation</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Decision tree</oasis:entry>
         <oasis:entry colname="col2">Hierarchical model separating data based on rules</oasis:entry>
         <oasis:entry colname="col3">rpart</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Gaussian process</oasis:entry>
         <oasis:entry colname="col2">A Bayesian kernel-based method for regression</oasis:entry>
         <oasis:entry colname="col3">kernlab</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Gradient boosting</oasis:entry>
         <oasis:entry colname="col2">Ensemble of decision trees trained sequentially</oasis:entry>
         <oasis:entry colname="col3">xgboost</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Linear regression</oasis:entry>
         <oasis:entry colname="col2">Ordinary least squares linear regression</oasis:entry>
         <oasis:entry colname="col3">lm</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Log-linear regression</oasis:entry>
         <oasis:entry colname="col2">Linear regression modelling <inline-formula><mml:math id="M46" display="inline"><mml:mrow><mml:mi>log⁡</mml:mi><mml:mo>(</mml:mo><mml:mi>p</mml:mi><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col3">lm</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Poisson regression</oasis:entry>
         <oasis:entry colname="col2">Generalized linear regression assuming Poisson-distributed data</oasis:entry>
         <oasis:entry colname="col3">glm</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Random forest</oasis:entry>
         <oasis:entry colname="col2">Ensemble of decision trees trained on random subsets of data and features</oasis:entry>
         <oasis:entry colname="col3">randomForest</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Support vector regression</oasis:entry>
         <oasis:entry colname="col2">Non-linear kernel-based regression method</oasis:entry>
         <oasis:entry colname="col3">e1071</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Logarithmic support vector regression</oasis:entry>
         <oasis:entry colname="col2">Support vector regression modelling <inline-formula><mml:math id="M47" display="inline"><mml:mrow><mml:mi>log⁡</mml:mi><mml:mo>(</mml:mo><mml:mi>p</mml:mi><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col3">e1071</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Zero-inflated Poisson regression</oasis:entry>
         <oasis:entry colname="col2">Combination of logistic regression and Poisson regression</oasis:entry>
         <oasis:entry colname="col3">pscl</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
<sec id="Ch1.S2.SS4">
  <label>2.4</label><title>Model descriptions</title>
      <p id="d1e1272">The applicability of 10 common regression models trained on KU18 is examined, from the simplest linear model to the powerful support vector regression (SVR) model (Table <xref ref-type="table" rid="Ch1.T2"/>).</p>
      <p id="d1e1277"><def-list>
            <def-item><term>Linear models</term><def>

      <p id="d1e1285">Four generalized linear models are considered: linear regression, log-linear regression <xref ref-type="bibr" rid="bib1.bibx4" id="paren.26"/>, Poisson regression and zero-inflated Poisson regression <xref ref-type="bibr" rid="bib1.bibx23" id="paren.27"/>. Generalized linear models describe the target variable as a function of linear combinations of features. Linear models are relatively simple, which limits their flexibility but makes them more interpretable. For example, if the features are normalized, then the regression coefficient of a feature indicates how much a one-unit change in that feature affects the mean of the target variable, given that all other features are held constant.</p>
            </def></def-item>
            <def-item><term>Tree-based models</term><def>

      <p id="d1e1300">Three tree-based models are considered: decision trees, random forest and gradient boosting with decision trees <xref ref-type="bibr" rid="bib1.bibx11" id="paren.28"/>. Decision trees model the target variable using simple if–else rules, which makes them interpretable. Random<?pagebreak page7416?> forest is an ensemble method that aggregates the predictions of multiple decision trees trained on random subsets of data and features. Gradient boosting is also an ensemble method in which multiple decision trees are trained sequentially on the results of previous trees, correcting their weaknesses. Ensemble methods achieve high prediction accuracy by pooling the predictions of multiple models. In particular, the random forest is among the best-performing regression models <xref ref-type="bibr" rid="bib1.bibx8" id="paren.29"/>. However, the complexity of ensemble models means that they are not interpretable.</p>
            </def></def-item>
            <def-item><term>Support vector regression</term><def>

      <p id="d1e1315">SVR <xref ref-type="bibr" rid="bib1.bibx6" id="paren.30"/> is a powerful regression method that implicitly transforms the data into a higher-dimensional feature space. This enables SVR to model interactions and conditional dependencies between features and hence to utilize more information than simpler models. However, as with the complex tree models, the complexity of SVR makes it difficult to understand how the model uses the relations in the data to estimate <inline-formula><mml:math id="M48" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula>. In addition to standard SVR, a log-transformed SVR (log-SVR) is also used to enforce positive predictions for <inline-formula><mml:math id="M49" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula>.</p>
            </def></def-item>
            <def-item><term>Gaussian process regression</term><def>

      <p id="d1e1347">GPR <xref ref-type="bibr" rid="bib1.bibx29" id="paren.31"/> is a non-parametric approach to regression. GPR is a Bayesian method that uses a Gaussian process with a known covariance as a prior to infer the posterior predictive distribution of the unobserved values. It can be considered a Bayesian alternative to other kernel methods, such as SVR <xref ref-type="bibr" rid="bib1.bibx29" id="paren.32"/>. We use a squared exponential kernel as the covariance function of the model.
As a result, points that are close to each other in the feature space have similar predicted values.
The generally good performance of GPR is complemented by its built-in ability to account for uncertainty in its predictions.</p>
            </def></def-item>
            <def-item><term>Dummy model</term><def>

      <p id="d1e1362">A dummy model is used as a baseline model for reference. The dummy model simply predicts the mean <inline-formula><mml:math id="M50" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> of the training data, regardless of any features.</p>
            </def></def-item>
          </def-list></p>
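      <p>To illustrate why a squared exponential kernel, as used for the GPR above, gives similar predictions for nearby points, the kernel can be sketched as follows. This Python sketch uses a common parameterization with a hypothetical length scale and variance; it is not necessarily the parameterization of the R package used in this study.</p>
<preformat><![CDATA[
```python
import math


def squared_exponential(x, y, length_scale=1.0, variance=1.0):
    """k(x, y) = variance * exp(-||x - y||^2 / (2 * length_scale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return variance * math.exp(-sq_dist / (2.0 * length_scale ** 2))


# Covariance between a point and a near or a far neighbour in feature space.
near = squared_exponential([0.0, 0.0], [0.1, 0.0])
far = squared_exponential([0.0, 0.0], [3.0, 0.0])
```
]]></preformat>
      <p>The covariance decays smoothly with distance, so the near neighbour is much more strongly coupled to the point than the far one.</p>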
</sec>
<sec id="Ch1.S2.SS5">
  <label>2.5</label><title>Performance measure</title>
      <p id="d1e1385">The RMSE is a standard performance measure for model evaluation <xref ref-type="bibr" rid="bib1.bibx34" id="paren.33"/>. Here, a modified version of RMSE is used, due to the different scales of KU18 and KA20. We define the multiplicative minimum RMSE (mRMSE) based on a linear transformation of the predictions as
<inline-formula><mml:math id="M51" display="inline"><mml:mrow><mml:mi mathvariant="normal">mRMSE</mml:mi><mml:mo>(</mml:mo><mml:mi>p</mml:mi><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mo>)</mml:mo><mml:mspace linebreak="nobreak" width="0.125em"/></mml:mrow></mml:math></inline-formula>=<inline-formula><mml:math id="M52" display="inline"><mml:mrow><mml:mspace linebreak="nobreak" width="0.125em"/><mml:msub><mml:mo>min⁡</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>∈</mml:mo><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:msub><mml:mi mathvariant="normal">RMSE</mml:mi><mml:mo>(</mml:mo><mml:mi>a</mml:mi><mml:mo>⋅</mml:mo><mml:mi>p</mml:mi><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>,
where <inline-formula><mml:math id="M53" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> is a vector of the observed pollutant concentrations, and <inline-formula><mml:math id="M54" display="inline"><mml:mover accent="true"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover></mml:math></inline-formula> is a vector with the corresponding predictions. Using mRMSE is equivalent to using RMSE after scaling pollutant concentrations with a multiplicative factor <inline-formula><mml:math id="M55" display="inline"><mml:mi>a</mml:mi></mml:math></inline-formula> that minimizes the RMSE.</p>
      <p id="d1e1490">mRMSE therefore depends only on the relative magnitudes of the pollutant concentrations and it is invariant to linear scaling of the training or evaluation data. For a new evaluation dataset, we could either use the same multiplicative constant – if the scaling in the new evaluation dataset is expected to be identical to the scaling in the old evaluation data – or find a new multiplicative constant.</p>
      <p id="d1e1493">In addition to mRMSE, two other performance measures are presented in the results tables. The cross-validation RMSE (cv-RMSE, described in Sect. <xref ref-type="sec" rid="Ch1.S2.SS3"/>) is presented as a performance measure on the training data, when using the optimal set of features selected with forward selection. Scaled bias is defined as <inline-formula><mml:math id="M56" display="inline"><mml:mrow><mml:mi mathvariant="normal">mean</mml:mi><mml:mo>(</mml:mo><mml:mi>a</mml:mi><mml:mo>⋅</mml:mo><mml:mi>p</mml:mi><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M57" display="inline"><mml:mi>a</mml:mi></mml:math></inline-formula> is the same multiplicative factor as in mRMSE. Scaled bias shows whether models over- or underestimate the pollutant concentrations.</p>
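      <p>As an illustration of the two measures defined above, the following Python sketch computes mRMSE and scaled bias. It assumes that the minimizing factor is obtained in closed form by least squares, that is, as the sum of the products of observations and predictions divided by the sum of the squared observations, and that the observations are not all zero. The function names are hypothetical and this is not the implementation used in this study.</p>
<preformat><![CDATA[
```python
import math


def optimal_scale(pc, pc_hat):
    """Least-squares factor a that minimizes RMSE(a * pc, pc_hat)."""
    numerator = sum(o * p for o, p in zip(pc, pc_hat))
    denominator = sum(o * o for o in pc)  # assumed non-zero
    return numerator / denominator


def rmse(x, y):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def mrmse(pc, pc_hat):
    """Multiplicative minimum RMSE: RMSE after optimally scaling pc."""
    a = optimal_scale(pc, pc_hat)
    return rmse([a * o for o in pc], pc_hat)


def scaled_bias(pc, pc_hat):
    """mean(a * pc - pc_hat) with the same factor a as in mRMSE."""
    a = optimal_scale(pc, pc_hat)
    return sum(a * o - p for o, p in zip(pc, pc_hat)) / len(pc)


observed = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 2.0, 4.0, 6.5]
error = mrmse(observed, predicted)
```
]]></preformat>
      <p>Multiplying the observations by any non-zero constant leaves the result unchanged, since the optimal factor absorbs the constant.</p>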
      <p id="d1e1535">Note that model performance on spatial data cannot be meaningfully summarized into a single performance measure. The spatial distribution of prediction errors can be examined using residual plots (Fig. <xref ref-type="fig" rid="Ch1.F5"/>).</p>
</sec>
</sec>
<?pagebreak page7417?><sec id="Ch1.S3">
  <label>3</label><title>Experiments</title>
      <p id="d1e1549">This section describes the training process of the models with features selected using forward selection. Based on the training results, the best-performing regression models are selected and summarized. The models are subsequently evaluated on an independent dataset (KA20). Lastly, a concept drift detection algorithm is used to assess the reliability of model predictions for cases in which LES results are not available.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F4" specific-use="star"><?xmltex \currentcnt{4}?><?xmltex \def\figurename{Figure}?><label>Figure 4</label><caption><p id="d1e1554">Feature selection results on KU18 for all regression models. </p></caption>
        <?xmltex \igopts{width=369.885827pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021-f04.png"/>

      </fig>

<sec id="Ch1.S3.SS1">
  <label>3.1</label><title>Model training</title>
      <p id="d1e1570">For each model, forward feature selection on KU18 provides the optimal set of features (as described in Sect. <xref ref-type="sec" rid="Ch1.S2.SS3"/>). Figure <xref ref-type="fig" rid="Ch1.F4"/> shows the results of feature selection. Features are iteratively added (<inline-formula><mml:math id="M58" display="inline"><mml:mi>x</mml:mi></mml:math></inline-formula> axis) and cv-RMSE is computed (<inline-formula><mml:math id="M59" display="inline"><mml:mi>y</mml:mi></mml:math></inline-formula> axis). For each model, the features are selected such that cv-RMSE is minimized. In order to limit the computation time of feature selection, SVR and GPR were trained on a random subset of 2500 data points (out of the total 472 991), rather than the whole cross-validation split.</p>
      <p id="d1e1591">After feature selection, each model is trained on the whole training data (all KU18 city plans), using the optimal features from feature selection. SVR and GPR are, as in feature selection, only trained on 2500 randomly selected data points.</p>
      <p id="d1e1594">SVR and GPR have additional parameters whose optimal values were computed using grid search. Grid search is a model tuning technique in which the model is trained using every combination of parameter values on a discrete grid of the parameter space. The selection criterion for the optimal parameter values was cv-RMSE (as with feature selection).</p>
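      <p>A grid search of this kind can be sketched in a few lines of Python. The parameter names <monospace>cost</monospace> and <monospace>gamma</monospace>, the grid values and the toy error function are hypothetical and are not the settings used in this study; in practice the error function would train the model and return its cv-RMSE.</p>
<preformat><![CDATA[
```python
import math
from itertools import product


def grid_search(grid, cv_error):
    """Evaluate every parameter combination; return the one with lowest error."""
    names = sorted(grid)
    best_params, best_error = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = cv_error(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_cv_error(params):
    # Hypothetical stand-in for training a model and computing cv-RMSE.
    return math.log10(params["cost"]) ** 2 + (params["gamma"] - 0.1) ** 2


grid = {"cost": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}
best, best_err = grid_search(grid, toy_cv_error)
```
]]></preformat>
      <p>The number of model fits grows with the product of the grid sizes, which is one reason for tuning on a subset of the data.</p>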
      <p id="d1e1597">Note that some models allow for negative predictions (negative pollutant concentrations), which are physically impossible. Negative predictions can be avoided either by using a transformation that allows only positive predictions (for example, log-linear regression instead of linear regression) or by clipping model outputs to a minimum of zero after prediction. For the purposes of this study, however, negative predictions are retained, since their magnitude is relevant for error estimation.</p>
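      <p>The two options mentioned above, together with the transformation to and from the logarithmic scale used by the logarithmic models, can be sketched in Python as follows. The function names are hypothetical, and the sketch only illustrates the transformations, not the models themselves.</p>
<preformat><![CDATA[
```python
import math


def clip_nonnegative(predictions):
    """Replace negative predictions with zero."""
    return [max(0.0, p) for p in predictions]


def to_log_scale(pc):
    """Forward transform log(pc + 1), which is defined for pc = 0."""
    return [math.log1p(v) for v in pc]


def from_log_scale(log_predictions):
    """Inverse transform exp(y) - 1 back to the original scale."""
    return [math.expm1(y) for y in log_predictions]


clipped = clip_nonnegative([-0.3, 0.0, 1.2])
round_trip = from_log_scale(to_log_scale([0.0, 0.5, 7.0]))
```
]]></preformat>
      <p>Clipping discards the magnitude of negative predictions, which is why it is not applied when that magnitude is needed for error estimation.</p>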
</sec>
<sec id="Ch1.S3.SS2">
  <label>3.2</label><title>Model selection</title>
      <p id="d1e1609">Out of the 10 models described in Sect. <xref ref-type="sec" rid="Ch1.S2.SS4"/>, three are selected as the best performing for replicating the LES outputs: logarithmic support vector regression, Gaussian process regression and log-linear regression. Table <xref ref-type="table" rid="Ch1.T3"/> compares the three selected models. Table <xref ref-type="table" rid="Ch1.T4"/> lists the performance of all 10 models with respect to cross-validation RMSE, mRMSE and scaled bias (described in Sect. <xref ref-type="sec" rid="Ch1.S2.SS5"/>).
Performance evaluation at this stage is based on the cross-validation errors computed during forward feature selection on KU18. Final model evaluation on an independent dataset (KA20) is performed in Sect. <xref ref-type="sec" rid="Ch1.S3.SS3"/>.</p>
      <p id="d1e1622">The three best-performing models are as follows:<def-list>
            <def-item><term>Logarithmic SVR</term><def>

      <p id="d1e1631">The logarithmic SVR has the smallest cross-validation RMSE, and as such, its predictions are the most accurate on the training data. Additionally, it only requires a small number of features to achieve this strong performance. Requiring low-dimensional training data means that the user will need to provide fewer features with their data, minimizing the expense of data preparation. A third advantage is that the log transformation ensures non-negative predictions for <inline-formula><mml:math id="M60" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula>.
Although the standard SVR does not ensure non-negative predictions, it requires the same number of features and offers almost the same RMSE.</p>
            </def></def-item>
            <def-item><term>Gaussian process regression</term><def>

      <p id="d1e1650">The RMSE of the GPR is close to that of the log-SVR, making it one of the strongest models as well. In addition, it has previously been used to predict simulator outputs <xref ref-type="bibr" rid="bib1.bibx10" id="paren.34"/>.</p>
            </def></def-item>
            <def-item><term>Log-linear regression</term><def>

      <p id="d1e1662">As a linear model, the log-linear regression is useful if model interpretability is required (for more details on its interpretability, see Sect. <xref ref-type="sec" rid="Ch1.S2.SS4"/>). What separates the log-linear regression from the other linear models is that it uses the fewest features, while also ensuring positive predictions through log transformation, similar to the logarithmic SVR. It works as a simpler counterpart to the more powerful methods.</p>
            </def></def-item>
          </def-list></p>

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T3" specific-use="star"><?xmltex \currentcnt{3}?><label>Table 3</label><caption><p id="d1e1672">Comparison of selected models based on KU18. Performance refers to the cross-validated RMSE; number of features refers to the optimal number selected during forward feature selection. Prior use means the model has been used for similar tasks in the literature. A “<inline-formula><mml:math id="M61" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula>” means better than average, which does not necessarily mean a higher value than average.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="6">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="center"/>
     <oasis:colspec colnum="3" colname="col3" align="center"/>
     <oasis:colspec colnum="4" colname="col4" align="center"/>
     <oasis:colspec colnum="5" colname="col5" align="center"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:thead>
       <oasis:row>
         <oasis:entry colname="col1">Model</oasis:entry>
         <oasis:entry colname="col2">Performance</oasis:entry>
         <oasis:entry colname="col3">Number of</oasis:entry>
         <oasis:entry colname="col4">Interpretable</oasis:entry>
         <oasis:entry colname="col5">Positive</oasis:entry>
         <oasis:entry colname="col6">Prior use</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3">features</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">predictions</oasis:entry>
         <oasis:entry colname="col6"/>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Logarithmic SVR</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M62" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col3"><inline-formula><mml:math id="M63" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"><inline-formula><mml:math id="M64" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M65" display="inline"><mml:mi mathvariant="italic">✓</mml:mi></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col6"><inline-formula><mml:math id="M66" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula></oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Gaussian process</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M67" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col3"><inline-formula><mml:math id="M68" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"><inline-formula><mml:math id="M69" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M70" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col6"><inline-formula><mml:math id="M71" display="inline"><mml:mi mathvariant="italic">✓</mml:mi></mml:math></inline-formula></oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Log-linear regression</oasis:entry>
         <oasis:entry colname="col2">–</oasis:entry>
         <oasis:entry colname="col3">–</oasis:entry>
         <oasis:entry colname="col4"><inline-formula><mml:math id="M72" display="inline"><mml:mi mathvariant="italic">✓</mml:mi></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M73" display="inline"><mml:mi mathvariant="italic">✓</mml:mi></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col6"><inline-formula><mml:math id="M74" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula></oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
<sec id="Ch1.S3.SS3">
  <label>3.3</label><title>Model evaluation</title>
      <p id="d1e1885">As a final test, the models are evaluated on an independent dataset to obtain a more accurate estimate of their performance in a real-world urban planning situation. The models cannot be evaluated solely based on the training data, since model overfitting would not be detected, and the error estimation in real-world situations would be poor. For the evaluation, we use the models trained on both wind directions and all city plans of KU18 (as described in Sect. <xref ref-type="sec" rid="Ch1.S3.SS1"/>) and we evaluate them on both wind directions of KA20.</p>
      <p id="d1e1890">We use the mRMSE and the scaled bias defined in Sect. <xref ref-type="sec" rid="Ch1.S2.SS5"/> for KA20, with the results for PM<inline-formula><mml:math id="M75" display="inline"><mml:msub><mml:mi/><mml:mn mathvariant="normal">2.5</mml:mn></mml:msub></mml:math></inline-formula> concentrations listed in Table <xref ref-type="table" rid="Ch1.T4"/>. Due to the scaling, the mRMSE does not allow direct comparisons between the errors on KU18 and KA20. On the evaluation data, we can however compare the models to the dummy model and see that, with the exception of the Poisson regressions, all models clearly outperform the dummy model, although the performance is more varied than with the training data. The best-performing model is the log-linear model (mRMSE <inline-formula><mml:math id="M76" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> 0.76). The log-SVR and GPR also performed well (mRMSE <inline-formula><mml:math id="M77" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> 0.87 and mRMSE <inline-formula><mml:math id="M78" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> 0.91, respectively) when compared to the dummy model (mRMSE <inline-formula><mml:math id="M79" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> 1.78).</p>
      <p id="d1e1935">Out of the selected models, the log-linear model is preferable. Not only does it perform the best out of the three<?pagebreak page7418?> selected models, but it is also the simplest and it runs the fastest. Its simplicity is likely the reason for its good performance given the notable differences between the training and evaluation data.</p>
      <p id="d1e1938">Although log-SVR and GPR both perform better than the average model, their performance relative to the other models is worse on the evaluation data than in cross-validation.
This could be a sign of slight overfitting to the training data or of sensitivity to the different distribution of the data in the evaluation dataset.</p>
      <p id="d1e1942">The interpretation of the scaled bias is not straightforward due to the scaling, but it shows that all of the models overestimate <inline-formula><mml:math id="M80" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> in the evaluation data with the SVR overestimating the least. The dummy model is, unsurprisingly, the most biased.</p>
      <p id="d1e1955">The scaled residuals can be seen in Fig. <xref ref-type="fig" rid="Ch1.F5"/> for the selected models. At a glance, they may seem to be similar for different models, but there are differences between the predictions, as indicated by the differing mRMSE values. The differences are due to the dissimilar ways of building a model and the features selected in Sect. <xref ref-type="sec" rid="Ch1.S2.SS3"/>. All three selected models incur most of their error on the boulevard, while the outskirts are predicted more consistently. This is not a surprise since <inline-formula><mml:math id="M81" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> is much lower outside of the boulevard. The figure also shows that the log-linear model has more balanced residuals, while the more complex SVR and GPR are able to achieve lower residuals on the outskirts but perform relatively poorly on the boulevard. The residuals also show that none of the models are fully capable of capturing details in the pollutant dispersion that arise due to fine-scale flow patterns.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F5"><?xmltex \currentcnt{5}?><?xmltex \def\figurename{Figure}?><label>Figure 5</label><caption><p id="d1e1974">Model residuals of PM<inline-formula><mml:math id="M82" display="inline"><mml:msub><mml:mi/><mml:mn mathvariant="normal">2.5</mml:mn></mml:msub></mml:math></inline-formula> predictions in KA20 for <bold>(a)</bold> the log-linear regression, <bold>(b)</bold> the logarithmic SVR and <bold>(c)</bold> the Gaussian process. The wind is coming from the left. Residuals are calculated after scaling with the respective optimal <inline-formula><mml:math id="M83" display="inline"><mml:mi>a</mml:mi></mml:math></inline-formula> for mRMSE calculation. Notice that although the residuals look similar, the predictions have notable differences. The scaled <inline-formula><mml:math id="M84" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> ranges up to a maximum of 7.28. The residuals can also be contrasted to the mean squared error of the dummy model, which is <inline-formula><mml:math id="M85" display="inline"><mml:mn mathvariant="normal">1.78</mml:mn></mml:math></inline-formula>.</p></caption>
          <?xmltex \igopts{width=227.622047pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021-f05.png"/>

        </fig>

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T4" specific-use="star"><?xmltex \currentcnt{4}?><label>Table 4</label><caption><p id="d1e2029">Cross-validation and evaluation errors for all models, obtained with their respective optimal set of features. Cross-validation RMSE is on KU18, while mRMSE and scaled bias are calculated on PM<inline-formula><mml:math id="M86" display="inline"><mml:msub><mml:mi/><mml:mn mathvariant="normal">2.5</mml:mn></mml:msub></mml:math></inline-formula> in KA20. In bold are the three models that were ultimately selected for performance (based on cross-validation RMSE) or interpretability.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="5">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Model</oasis:entry>
         <oasis:entry colname="col2">Number of features</oasis:entry>
         <oasis:entry colname="col3">Cross-validation RMSE</oasis:entry>
         <oasis:entry colname="col4">mRMSE (PM<inline-formula><mml:math id="M87" display="inline"><mml:msub><mml:mi/><mml:mn mathvariant="normal">2.5</mml:mn></mml:msub></mml:math></inline-formula>)</oasis:entry>
         <oasis:entry colname="col5">Scaled bias (PM<inline-formula><mml:math id="M88" display="inline"><mml:msub><mml:mi/><mml:mn mathvariant="normal">2.5</mml:mn></mml:msub></mml:math></inline-formula>)</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Dummy model</oasis:entry>
         <oasis:entry colname="col2">0</oasis:entry>
         <oasis:entry colname="col3">2.3</oasis:entry>
         <oasis:entry colname="col4">1.78</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M89" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>1.47</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Decision tree</oasis:entry>
         <oasis:entry colname="col2">4</oasis:entry>
         <oasis:entry colname="col3">1.64</oasis:entry>
         <oasis:entry colname="col4">1.35</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M90" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>0.60</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1"><bold>Gaussian process</bold></oasis:entry>
         <oasis:entry colname="col2"><bold>6</bold></oasis:entry>
         <oasis:entry colname="col3"><bold>1.55</bold></oasis:entry>
         <oasis:entry colname="col4"><bold>0.91</bold></oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M91" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula><bold>0.50</bold></oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Gradient boosting</oasis:entry>
         <oasis:entry colname="col2">10</oasis:entry>
         <oasis:entry colname="col3">1.71</oasis:entry>
         <oasis:entry colname="col4">0.84</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M92" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>0.47</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Linear regression</oasis:entry>
         <oasis:entry colname="col2">14</oasis:entry>
         <oasis:entry colname="col3">1.58</oasis:entry>
         <oasis:entry colname="col4">1.11</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M93" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>0.47</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1"><bold>Log-linear regression</bold></oasis:entry>
         <oasis:entry colname="col2"><bold>10</bold></oasis:entry>
         <oasis:entry colname="col3"><bold>1.61</bold></oasis:entry>
         <oasis:entry colname="col4"><bold>0.76</bold></oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M94" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula><bold>0.47</bold></oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1"><bold>Logarithmic support vector regression</bold></oasis:entry>
         <oasis:entry colname="col2"><bold>5</bold></oasis:entry>
         <oasis:entry colname="col3"><bold>1.53</bold></oasis:entry>
         <oasis:entry colname="col4"><bold>0.87</bold></oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M95" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula><bold>0.42</bold></oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Poisson regression</oasis:entry>
         <oasis:entry colname="col2">13</oasis:entry>
         <oasis:entry colname="col3">1.58</oasis:entry>
         <oasis:entry colname="col4">2.12</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M96" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>0.70</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Random forest</oasis:entry>
         <oasis:entry colname="col2">9</oasis:entry>
         <oasis:entry colname="col3">1.58</oasis:entry>
         <oasis:entry colname="col4">1.06</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M97" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>0.60</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Support vector regression</oasis:entry>
         <oasis:entry colname="col2">5</oasis:entry>
         <oasis:entry colname="col3">1.53</oasis:entry>
         <oasis:entry colname="col4">0.83</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M98" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>0.40</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Zero-inflated Poisson regression</oasis:entry>
         <oasis:entry colname="col2">13</oasis:entry>
         <oasis:entry colname="col3">1.57</oasis:entry>
         <oasis:entry colname="col4">1.80</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M99" display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>0.65</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
<sec id="Ch1.S3.SS4">
  <label>3.4</label><title>Concept drift detection</title>
      <?pagebreak page7419?><p id="d1e2387">It is important to know whether the results obtained generalize to different city plans. Because of the high computational cost of running LES, we often do not have access to the LES output values of a new plan, and hence we cannot assess the prediction error directly in such a case. We can, however, use the Drifter algorithm by <xref ref-type="bibr" rid="bib1.bibx31" id="text.35"/> to estimate whether the RMSE of a model prediction is high or not. The Drifter algorithm is designed for detecting “virtual concept drift” <xref ref-type="bibr" rid="bib1.bibx9" id="paren.36"/>, i.e. detecting changes in the distribution of the features that affect the performance of the model. The idea behind Drifter is that we define a distance measure <inline-formula><mml:math id="M100" display="inline"><mml:mrow><mml:mi mathvariant="normal">d</mml:mi><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> that measures how far a covariate vector <inline-formula><mml:math id="M101" display="inline"><mml:mi>x</mml:mi></mml:math></inline-formula> is from the data that have been used to train the model. Small values of <inline-formula><mml:math id="M102" display="inline"><mml:mrow><mml:mi mathvariant="normal">d</mml:mi><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, which is called the “concept drift indicator”, mean that we are close to the training data and the model should be reliable, while a large value of <inline-formula><mml:math id="M103" display="inline"><mml:mrow><mml:mi mathvariant="normal">d</mml:mi><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> means that we have moved away from the training data, after which the regression estimate may be inaccurate.
The distance measure <inline-formula><mml:math id="M104" display="inline"><mml:mrow><mml:mi mathvariant="normal">d</mml:mi><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is defined as follows. First, a family of so-called “segment models” <inline-formula><mml:math id="M105" display="inline"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is trained, each using only a part of the training data. Then, for each of the segment models <inline-formula><mml:math id="M106" display="inline"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, we can compute an estimate of the generalization error by using the terms <inline-formula><mml:math id="M107" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:mi>f</mml:mi><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> instead of the terms <inline-formula><mml:math id="M108" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:mi>f</mml:mi><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:mi>p</mml:mi><mml:mi>c</mml:mi><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M109" display="inline"><mml:mrow><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> is the LES output value, when computing the RMSE. For a simple linear model, this kind of measure is monotonically related to the expected quadratic error of the model <xref ref-type="bibr" rid="bib1.bibx31" id="paren.37"/>. The ensemble of estimates for the segment models then allows us to compute a statistic for estimating the RMSE.
In <xref ref-type="bibr" rid="bib1.bibx31" id="text.38"/>, the statistic (i.e. the concept drift indicator <inline-formula><mml:math id="M110" display="inline"><mml:mrow><mml:mi mathvariant="normal">d</mml:mi><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>) was chosen to be the second smallest error estimate.</p>
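      <p>As an illustration only, the indicator described above can be sketched in Python with NumPy as follows. The helper names fit_linear and drift_indicator, the synthetic data and the split of the training data into four chunks are assumptions made for this sketch and are not taken from the Drifter implementation; ordinary least squares stands in for both the complete model and the segment models.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear model with intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef


def drift_indicator(full_model, segment_models, X_new, n=2):
    """n-th smallest RMSE between the complete model f and the segment models f_i."""
    f = full_model(X_new)
    errors = [np.sqrt(np.mean((f - f_i(X_new)) ** 2)) for f_i in segment_models]
    return np.sort(errors)[n - 1]


# Synthetic, mildly non-linear training data (hypothetical, for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.01, 400)

full = fit_linear(X, y)

# Hypothetical segmentation: four contiguous chunks after sorting by the first feature.
order = np.argsort(X[:, 0])
segments = [fit_linear(X[idx], y[idx]) for idx in np.array_split(order, 4)]

# No target values are needed for the new points, only their features.
d_near = drift_indicator(full, segments, rng.uniform(0.0, 1.0, size=(50, 2)))
d_far = drift_indicator(full, segments, rng.uniform(3.0, 4.0, size=(50, 2)))
```
]]></preformat>
      <p>In this toy setting, d_far is computed on points far outside the training range and should come out larger than d_near, which is computed on points drawn from the training range.</p>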

      <?xmltex \floatpos{t}?><fig id="Ch1.F6"><?xmltex \currentcnt{6}?><?xmltex \def\figurename{Figure}?><label>Figure 6</label><caption><p id="d1e2576">Example of a single evaluation segment (light blue square) applied in the concept drift analysis for KA20 using the Drifter algorithm.</p></caption>
          <?xmltex \igopts{width=227.622047pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021-f06.png"/>

        </fig>

      <p id="d1e2585">The Drifter algorithm was originally used for time series data. We adapt it for spatial data here by choosing the segments of the training data to be spatially close-by data points rather than temporally close-by ones. We hence divide the map into <inline-formula><mml:math id="M111" display="inline"><mml:mrow><mml:mi>k</mml:mi><mml:mo>×</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:math></inline-formula> squares that are used as the segments. An example of such a segment can be seen in Fig. <xref ref-type="fig" rid="Ch1.F6"/>. Drifter is run with each of the models considered in Sect. <xref ref-type="sec" rid="Ch1.S3.SS2"/> as the complete model.
The original article suggests the usage of a simple linear model as the segment model, but here a ridge regression with logarithmic transformation and with a small regularization parameter of <inline-formula><mml:math id="M112" display="inline"><mml:mrow><mml:mi mathvariant="italic">λ</mml:mi><mml:mspace linebreak="nobreak" width="0.125em"/></mml:mrow></mml:math></inline-formula>=<inline-formula><mml:math id="M113" display="inline"><mml:mrow><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mn mathvariant="normal">0.01</mml:mn></mml:mrow></mml:math></inline-formula> is chosen for practical<?pagebreak page7420?> reasons. This gives similar results to a log-linear model while also being well defined on the segments where some features are constant.</p>
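      <p>A minimal sketch of why a small ridge penalty keeps the segment model well defined when a feature is constant within a segment is given below. It assumes that the logarithmic transformation is applied to the target, that the intercept is left unpenalized and that the penalty enters the normal equations unscaled; the function name fit_log_ridge and the synthetic data are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def fit_log_ridge(X, y, lam=0.01):
    """Ridge regression on log(y) with an unpenalized intercept (an assumption)."""
    x_mean = X.mean(axis=0)
    z = np.log(y)
    z_mean = z.mean()
    Xc = X - x_mean
    # For lam > 0 the matrix below is invertible even if a column of Xc is all zeros,
    # which is the case when a feature is constant within the segment.
    gram = Xc.T @ Xc + lam * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (z - z_mean))
    return lambda Xn: np.exp(z_mean + (Xn - x_mean) @ w)


# Hypothetical segment in which the second feature is constant.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(0.0, 1.0, 200), np.full(200, 3.0)])
y = np.exp(0.5 + 1.5 * X[:, 0])

model = fit_log_ridge(X, y)
pred = model(X)
```
]]></preformat>
      <p>Without the penalty term, the normal equations of this example would be singular because of the constant column; with it, the constant feature simply receives a zero weight.</p>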

      <?xmltex \floatpos{t}?><fig id="Ch1.F7" specific-use="star"><?xmltex \currentcnt{7}?><?xmltex \def\figurename{Figure}?><label>Figure 7</label><caption><p id="d1e2625">The relationship between the concept drift indicator and observed RMSE in a given segment in KU18 (circles) and KA20 (crosses) with <bold>(a)</bold> log-linear regression, <bold>(b)</bold> logarithmic support vector regression, <bold>(c)</bold> Gaussian process and <bold>(d)</bold> log-linear regression with all features used as the complete model. The opacity of the training data has been reduced by <inline-formula><mml:math id="M114" display="inline"><mml:mrow><mml:mn mathvariant="normal">80</mml:mn><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mi mathvariant="italic">%</mml:mi></mml:mrow></mml:math></inline-formula> to make the figure clearer.
</p></caption>
          <?xmltex \igopts{width=455.244094pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021-f07.png"/>

        </fig>

      <?xmltex \floatpos{t}?><fig id="Ch1.F8" specific-use="star"><?xmltex \currentcnt{8}?><?xmltex \def\figurename{Figure}?><label>Figure 8</label><caption><p id="d1e2659">A map showing the concept drift indicator <inline-formula><mml:math id="M115" display="inline"><mml:mrow><mml:mi mathvariant="normal">d</mml:mi><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> in a given segment of KA20 for <bold>(a)</bold> log-linear regression, <bold>(b)</bold> logarithmic support vector regression, <bold>(c)</bold> Gaussian process and <bold>(d)</bold> log-linear regression with all features used as the complete model. Notice how the last model has a high concept drift indicator on the boulevard.</p></caption>
          <?xmltex \igopts{width=341.433071pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/14/7411/2021/gmd-14-7411-2021-f08.png"/>

        </fig>

      <p id="d1e2694">To keep the results comparable to Sect. <xref ref-type="sec" rid="Ch1.S3"/>, we also scale the evaluation dataset (KA20) by multiplying it by the constant that minimizes the RMSE. If the evaluation data are not segmented, this is equivalent to using the same multiplicative minimum-RMSE error measure as in Sect. <xref ref-type="sec" rid="Ch1.S3"/>. This decision does not affect the concept drift indicator, but it makes the simulation errors of the training and evaluation data more comparable.
We train Drifter using all eight simulation setups of KU18 (i.e. four city plans and two wind directions) and evaluate it on both wind directions of KA20.
We use <inline-formula><mml:math id="M116" display="inline"><mml:mrow><mml:mn mathvariant="normal">100</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow><mml:mo>×</mml:mo><mml:mn mathvariant="normal">100</mml:mn><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> squares
as our segments in the training data, with each square overlapping <inline-formula><mml:math id="M117" display="inline"><mml:mrow><mml:mn mathvariant="normal">25</mml:mn><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mi mathvariant="italic">%</mml:mi></mml:mrow></mml:math></inline-formula> of four other training segments, and non-overlapping <inline-formula><mml:math id="M118" display="inline"><mml:mrow><mml:mn mathvariant="normal">25</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow><mml:mo>×</mml:mo><mml:mn mathvariant="normal">25</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> squares in the evaluation data.
The concept drift indicator <inline-formula><mml:math id="M119" display="inline"><mml:mrow><mml:mi mathvariant="normal">d</mml:mi><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is chosen to be the tenth smallest error estimate. The overlapping scheme used is the same as in the original article, but since the data are two-dimensional, it is applied for both dimensions (<inline-formula><mml:math id="M120" display="inline"><mml:mrow><mml:mn mathvariant="normal">50</mml:mn><mml:mi mathvariant="italic">%</mml:mi><mml:mo>⋅</mml:mo><mml:mn mathvariant="normal">50</mml:mn><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mi mathvariant="italic">%</mml:mi><mml:mspace width="0.125em" linebreak="nobreak"/></mml:mrow></mml:math></inline-formula>=<inline-formula><mml:math id="M121" display="inline"><mml:mrow><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mn mathvariant="normal">25</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mi mathvariant="italic">%</mml:mi></mml:mrow></mml:math></inline-formula>). These
parameters are chosen with a grid search for the log-linear model with all features (a situation exhibiting concept drift).</p>
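      <p>The segmentation described above can be illustrated with the following sketch, which assumes a hypothetical 400 m by 400 m domain and a step of half the side length for the training squares. The function names are introduced here for illustration only and do not come from the published code.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def square_segments(width, height, side, stride):
    """Lower-left corners of side x side squares stepped by `stride` in both directions."""
    xs = np.arange(0, width - side + 1, stride)
    ys = np.arange(0, height - side + 1, stride)
    return [(int(x), int(y)) for x in xs for y in ys]


def overlap_fraction(a, b, side):
    """Shared area of two equally sized axis-aligned squares, as a fraction of one square."""
    dx = max(0, side - abs(a[0] - b[0]))
    dy = max(0, side - abs(a[1] - b[1]))
    return dx * dy / side**2


# Training segments: 100 m squares shifted by 50 % of the side in each dimension.
train = square_segments(400, 400, side=100, stride=50)
# Evaluation segments: 25 m squares without overlap.
evaluation = square_segments(400, 400, side=25, stride=25)

# Two training squares shifted by half a side in both dimensions share 50 % x 50 % of their area.
diag = overlap_fraction((100, 100), (150, 150), side=100)
```
]]></preformat>
      <p>For an interior training square, exactly four other squares (the diagonal neighbours) share a quarter of its area under this assumed layout.</p>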
      <p id="d1e2797">In Fig. <xref ref-type="fig" rid="Ch1.F7"/>a–c, we show the concept drift indicator value and RMSE for each <inline-formula><mml:math id="M122" display="inline"><mml:mrow><mml:mn mathvariant="normal">25</mml:mn><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow><mml:mo>×</mml:mo><mml:mn mathvariant="normal">25</mml:mn><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> segment in the evaluation data, alongside the same-sized segments in the training data, for our final models. For all three models, the concept drift indicator of the evaluation data is indeed correlated with the RMSE and can thus be used to estimate it. We also notice that for all three models the evaluation segments lie in the same area as the training segments, indicating a lack of concept drift.
In addition, we notice that the concept drift indicator values are larger on the boulevard, which corresponds to the fact that the RMSE is indeed larger there. Therefore, the concept drift indicator is useful in detecting areas of large RMSE even when the ground truth (LES output) is not known.</p>
      <p id="d1e2822">Another example of concept drift detection is given by testing Drifter with a sub-optimally performing model. If the same procedure is carried out for the log-linear model with all of the surrogate features rather than with the set selected by forward selection, the model overfits. Unlike for the models above, we see from Fig. <xref ref-type="fig" rid="Ch1.F7"/>d that multiple evaluation segments lie to the right of the cluster of training segments, indicating concept drift. These points also have a higher RMSE, which shows that Drifter is working as intended.
From Fig. <xref ref-type="fig" rid="Ch1.F8"/>d, we see that concept drift is detected only on the boulevard, which is as expected because the RMSEs are also higher there.</p>
</sec>
</sec>
<sec id="Ch1.S4" sec-type="conclusions">
  <label>4</label><title>Discussion and conclusions</title>
      <p id="d1e2838">This study demonstrates that machine-learning methods trained with LES data can be used to model street-level pollutant concentrations in a city-boulevard-type urban neighbourhood.
The accuracy of the models is explored with an independent evaluation dataset to ensure their applicability in urban planning for new, similar types of neighbourhoods.</p>
      <p id="d1e2841">The log-linear regression has the greatest potential for replicating LES results even with a relatively small amount of data. It also has much potential in helping to understand which urban features govern local pollutant concentrations. The kernelized methods, SVR and GPR, show moderate performance and perform generally well in capturing the local mean concentration. However, all three have trouble in representing smaller-scale details in the concentration fields linked to turbulence. Furthermore, all models perform worse on the boulevard than in its surroundings. Still, these models beat the dummy model by a notable margin (e.g. mRMSE <inline-formula><mml:math id="M123" display="inline"><mml:mrow><mml:mo>≤</mml:mo><mml:mn mathvariant="normal">0.91</mml:mn></mml:mrow></mml:math></inline-formula> for the selected models compared to mRMSE <inline-formula><mml:math id="M124" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> <inline-formula><mml:math id="M125" display="inline"><mml:mn mathvariant="normal">1.78</mml:mn></mml:math></inline-formula> for the dummy model). In general, bias in the models is slightly negative.</p>
      <p id="d1e2868">No previous studies on applying LES air pollution data to train a machine-learning model exist, and therefore direct comparison is unfeasible. Comparison with studies applying spatial air quality measurements <xref ref-type="bibr" rid="bib1.bibx1 bib1.bibx13 bib1.bibx18 bib1.bibx37" id="paren.39"/> is also difficult, as the spatial resolution of their training data is of a lower order of magnitude. Still, some linkage can be found. <xref ref-type="bibr" rid="bib1.bibx1" id="text.40"/>, <xref ref-type="bibr" rid="bib1.bibx18" id="text.41"/> and <xref ref-type="bibr" rid="bib1.bibx37" id="text.42"/> used mobile air quality measurements to train their models to produce spatial air quality predictions and concluded that the models tend to underestimate the localized peak values, as also shown in this study. <xref ref-type="bibr" rid="bib1.bibx32" id="text.43"/> also found that linear models lead to a smaller bias for a short data record when applying modelled air quality data to conduct up to 48 <inline-formula><mml:math id="M126" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">h</mml:mi></mml:mrow></mml:math></inline-formula> air quality predictions. In contrast, SVR was shown to outperform all other methods in <xref ref-type="bibr" rid="bib1.bibx13" id="text.44"/>. However, their data had a spatial resolution of 15 <inline-formula><mml:math id="M127" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">m</mml:mi></mml:mrow></mml:math></inline-formula> or more, which can explain why the other models had trouble generalizing from the data. To achieve the best generalization and to avoid overfitting, <xref ref-type="bibr" rid="bib1.bibx37" id="text.45"/> stresses the importance of high-quality model predictors, high temporal and spatial resolution of training data and evaluation against an independent dataset, which are all fulfilled in this study.</p>
      <p id="d1e2909">A downside of the developed models is that, when applied to a new city plan, they require the plan to lead to a statistical distribution of the pollutant concentrations similar to that of the plans used for training, which means that the models have to cope with moderate amounts of concept drift. We can, however, detect this to avoid potentially false predictions that may occur when applying the models to data that do not follow the training distribution. Another limitation is the small number of simulation setups in the data: there are only eight different simulation setups (i.e. four city plans and two wind directions) to train with, which rules out methods requiring large annotated training datasets, such as some applications of deep learning. With more data, the models<?pagebreak page7421?> could potentially reach an altogether higher level of accuracy. Another way to improve the accuracy would be to have better and more informative features. If very accurate results are needed, running a new LES is still the best approach. Still, model predictions are accurate enough for many purposes, such as studying pollutant exposure and supporting urban planning.</p>
      <p id="d1e2913">The developed methods can be used to further probe new meteorological conditions and city plans. In a future study, these models could help us to understand how simple changes to the layout of the city plan, e.g. a new building, affect the local air pollutant concentrations. Eventually, similar models could also be used to understand the complicated phenomena in simple urban areas.</p>
      <p id="d1e2916">To conclude, we have explored how different machine-learning models can emulate air pollutant concentrations simulated using LES. We use LESs made over two different boulevard-type street canyons to study the impact of building-block layouts on air pollutant concentrations. We examine the performance of 10 machine-learning methods by using site-specific features to predict the surface-level concentrations over the neighbourhoods. A total of 20 features are determined from the LES inputs and outputs. The results show that the studied machine-learning methods are able to reproduce the mean pollutant concentrations. Further, concept drift detection is used to detect areas where the model cannot be trusted and more simulation runs may be needed.</p>
</sec>

      
      </body>
    <back><app-group>

<app id="App1.Ch1.S1">
  <?xmltex \currentcnt{A}?><label>Appendix A</label><title>Detailed descriptions of the features used in the regression models</title>

<?xmltex \floatpos{t}?><table-wrap id="App1.Ch1.S1.T5" specific-use="star"><?xmltex \currentcnt{A1}?><label>Table A1</label><caption><p id="d1e2934">Detailed descriptions of features in Table <xref ref-type="table" rid="Ch1.T1"/>. The Supplement contains visualizations of these features. Note that these features were computed with the help of auxiliary features that are not listed here, such as is_building, is_inhabited and is_intersection. For more details, see the supplied software code.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="2">
     <oasis:colspec colnum="1" colname="col1" align="justify" colwidth="4.5cm"/>
     <oasis:colspec colnum="2" colname="col2" align="justify" colwidth="12cm"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Feature</oasis:entry>
         <oasis:entry colname="col2">Description</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Building height</oasis:entry>
         <oasis:entry colname="col2">Height over ground at the nearest point of the nearest building.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Canopy height</oasis:entry>
         <oasis:entry colname="col2">Height of the vegetation canopy at point (<inline-formula><mml:math id="M128" display="inline"><mml:mi>x</mml:mi></mml:math></inline-formula>, <inline-formula><mml:math id="M129" display="inline"><mml:mi>y</mml:mi></mml:math></inline-formula>).</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Courtyard</oasis:entry>
         <oasis:entry colname="col2">Equal to 1 if there exists a cross laid on a point (<inline-formula><mml:math id="M130" display="inline"><mml:mi>x</mml:mi></mml:math></inline-formula>, <inline-formula><mml:math id="M131" display="inline"><mml:mi>y</mml:mi></mml:math></inline-formula>) that intersects the same building on all four sides, else 0.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Direction of closest building</oasis:entry>
         <oasis:entry colname="col2">Direction of the closest building with respect to the mean wind direction. Due to symmetry, it takes values between 0 and <inline-formula><mml:math id="M132" display="inline"><mml:mrow><mml:mi mathvariant="italic">π</mml:mi><mml:mo>(</mml:mo><mml:mo>=</mml:mo><mml:mn mathvariant="normal">180</mml:mn><mml:msup><mml:mi/><mml:mo>∘</mml:mo></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Distance to building downwind</oasis:entry>
         <oasis:entry colname="col2">Distance to the next building when walking directly with the mean wind direction. Not defined in areas that do not have buildings in the direction of the mean wind.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Distance to building upwind</oasis:entry>
         <oasis:entry colname="col2">Same as “distance to building downwind”, but when walking against the mean wind direction.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Height to width ratio</oasis:entry>
         <oasis:entry colname="col2">Height to width ratio of the area that the point is in (e.g. a street), calculated as “building height” divided by “street width”.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Pollutant emissions</oasis:entry>
         <oasis:entry colname="col2">Pollutant emission factor weighted by the street type (see <xref ref-type="bibr" rid="bib1.bibx20" id="altparen.46"/> for details). Takes values 0, 1, 2, 4.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Pollutant emissions convolution,<?xmltex \hack{\hfill\break}?> <inline-formula><mml:math id="M133" display="inline"><mml:mrow><mml:mi mathvariant="italic">σ</mml:mi><mml:mo>=</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">2</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">4</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">8</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">16</mml:mn><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col2">Weighted pollutant emissions in the surroundings, computed as a convolution of the weighted pollutant emissions with Gaussian kernels of varying size <inline-formula><mml:math id="M134" display="inline"><mml:mi mathvariant="italic">σ</mml:mi></mml:math></inline-formula>.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Pollutant emissions convolution<?xmltex \hack{\hfill\break}?>upwind, <inline-formula><mml:math id="M135" display="inline"><mml:mrow><mml:mi mathvariant="italic">σ</mml:mi><mml:mo>=</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">8</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">16</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">32</mml:mn><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col2">Same as “pollutant emissions convolution”, but restricted to the area upwind of the point. The upwind area is approximated by a cone covering a triangular region in the upwind direction: the kernel is a normalized cutout of an ordinary Gaussian kernel that resembles a cone pointing away from the mean wind source.</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Street</oasis:entry>
         <oasis:entry colname="col2">Equal to 1 if point (<inline-formula><mml:math id="M136" display="inline"><mml:mi>x</mml:mi></mml:math></inline-formula>, <inline-formula><mml:math id="M137" display="inline"><mml:mi>y</mml:mi></mml:math></inline-formula>) is in an inhabited area (is_inhabited <inline-formula><mml:math id="M138" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> 1), is not a courtyard or an intersection (is_courtyard <inline-formula><mml:math id="M139" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> is_intersection <inline-formula><mml:math id="M140" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula> 0) and has a street width of less than 100 (street_width <inline-formula><mml:math id="M141" display="inline"><mml:mrow><mml:mo>&lt;</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>), else 0.</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Street width</oasis:entry>
         <oasis:entry colname="col2">The sum of the distance from point (<inline-formula><mml:math id="M142" display="inline"><mml:mi>x</mml:mi></mml:math></inline-formula>, <inline-formula><mml:math id="M143" display="inline"><mml:mi>y</mml:mi></mml:math></inline-formula>) to the nearest building and the distance from point (<inline-formula><mml:math id="M144" display="inline"><mml:mi>x</mml:mi></mml:math></inline-formula>, <inline-formula><mml:math id="M145" display="inline"><mml:mi>y</mml:mi></mml:math></inline-formula>) to the second nearest building. For courtyards, the minimum is taken (e.g. the street width in a courtyard of size 10 m <inline-formula><mml:math id="M146" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 20 m is defined to be 10 m).</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>
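<p>The emission-related and street-type features defined in the table above can be illustrated with a short sketch. It is not the code used in this study (that code is available from the repository cited under "Code and data availability"); it is a minimal Python illustration under stated assumptions: the emission field is a regular two-dimensional grid, the kernel size is given in grid cells, the Gaussian kernel is truncated at three kernel sizes, and cells outside the grid contribute zero emissions. All function names and the example grid are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math

# Kernel sizes listed in the table for "pollutant emissions convolution".
SIGMAS = (1, 2, 4, 8, 16)


def gaussian_kernel(sigma, truncate=3.0):
    """Normalized 2-D Gaussian kernel (assumption: truncated at truncate * sigma cells)."""
    radius = int(math.ceil(truncate * sigma))
    offsets = range(-radius, radius + 1)
    weights = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
                for dx in offsets] for dy in offsets]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def convolve_emissions(emissions, sigma):
    """Gaussian-weighted emissions around each cell (assumption: zero outside the grid)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    ny, nx = len(emissions), len(emissions[0])
    out = [[0.0] * nx for _ in range(ny)]
    for y in range(ny):
        for x in range(nx):
            acc = 0.0
            for ky, row in enumerate(kernel):
                yy = y + ky - radius
                if yy < 0 or yy >= ny:
                    continue  # kernel row falls outside the grid
                for kx, w in enumerate(row):
                    xx = x + kx - radius
                    if 0 <= xx < nx:
                        acc += w * emissions[yy][xx]
            out[y][x] = acc
    return out


def street_indicator(is_inhabited, is_courtyard, is_intersection, street_width):
    """1 if the point meets the "Street" definition in the table, else 0."""
    return int(is_inhabited == 1 and is_courtyard == 0
               and is_intersection == 0 and street_width < 100)


def height_to_width_ratio(building_height, street_width):
    """"Building height" divided by "street width"."""
    return building_height / street_width


# Hypothetical 9 x 9 emission grid with one column of street cells of weight 4.
grid = [[4 if x == 4 else 0 for x in range(9)] for _ in range(9)]
features = {sigma: convolve_emissions(grid, sigma) for sigma in SIGMAS}
```
]]></preformat>
<p>Each entry of <monospace>features</monospace> then holds one smoothed emission field per kernel size, so that a regression model sees emissions at several spatial scales. The upwind variant in the table would additionally mask the kernel to a cone before normalizing; it is omitted here because the cone geometry is not specified in enough detail to reproduce.</p>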

<?xmltex \hack{\newpage}?>
</app>
  </app-group><notes notes-type="codedataavailability"><title>Code and data availability</title>

      <p id="d1e3277">The code to reproduce the regression models is available at <ext-link xlink:href="https://doi.org/10.5281/zenodo.3999302" ext-link-type="DOI">10.5281/zenodo.3999302</ext-link> <xref ref-type="bibr" rid="bib1.bibx24" id="paren.47"/>. The input and output data for KU18 are available at <uri>http://urn.fi/urn:nbn:fi:att:cfe1bd77-6697-44b5-bdd7-ee74f36c7dcd</uri> <xref ref-type="bibr" rid="bib1.bibx22" id="paren.48"/>. The input data for KA20 are available at <ext-link xlink:href="https://doi.org/10.5281/zenodo.3556287" ext-link-type="DOI">10.5281/zenodo.3556287</ext-link> <xref ref-type="bibr" rid="bib1.bibx15" id="paren.49"/> and the output data are available at <uri>http://urn.fi/urn:nbn:fi:att:ee275362-3f56-477c-bbbc-6fcacd9c7f95</uri> <xref ref-type="bibr" rid="bib1.bibx16" id="paren.50"/>.</p>
  </notes><app-group>
        <supplementary-material position="anchor"><p id="d1e3305">The supplement related to this article is available online at: <inline-supplementary-material xlink:href="https://doi.org/10.5194/gmd-14-7411-2021-supplement" xlink:title="pdf">https://doi.org/10.5194/gmd-14-7411-2021-supplement</inline-supplementary-material>.</p></supplementary-material>
        </app-group><notes notes-type="authorcontribution"><title>Author contributions</title>

      <p id="d1e3314">LJ, MK, KP and EO designed the concept of the study. ML and HS pre-processed the LES outputs and prepared and conducted the machine-learning simulations and statistical analyses, with contributions from EO, RS and KP. LJ and MK provided expert advice on the LES model inputs and outputs. All co-authors participated in writing the manuscript.</p>
  </notes><notes notes-type="competinginterests"><title>Competing interests</title>

      <p id="d1e3320">The authors declare that they have no conflict of interest.</p>
  </notes><notes notes-type="disclaimer"><title>Disclaimer</title>

      <p id="d1e3326">Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
  </notes><ack><title>Acknowledgements</title><p id="d1e3332">For financial support, we would like to thank the Academy of Finland (profiling action 3 and decisions 326280 and 326339), the Helsinki Institute for Information Technology HIIT, the Doctoral Programme in Atmospheric Sciences (ATM-DP) and the Doctoral Programme in Computer Science (DoCS) at the University of Helsinki, and SMart URBan Solutions for air quality,<?pagebreak page7423?> disasters and city growth (SMURBS, no. 689443), funded by the ERA-NET-Cofund project under ERA-PLANET.</p></ack><notes notes-type="financialsupport"><title>Financial support</title>

      <p id="d1e3337">This research has been supported by the Academy of Finland (grant nos. 326280, 326339, 320182), the Helsinki Institute for Information Technology HIIT, the Doctoral Programme in Atmospheric Sciences (ATM-DP), University of Helsinki, the Doctoral Programme in Computer Science (DoCS), University of Helsinki, and the ERA-NET-Cofund, ERA-PLANET (grant no. 689443).<?xmltex \hack{\newline}?><?xmltex \hack{\newline}?>Open-access funding was provided by the Helsinki<?xmltex \notforhtml{\newline}?> University Library.</p>
  </notes><notes notes-type="reviewstatement"><title>Review statement</title>

      <p id="d1e3348">This paper was edited by Adrian Sandu and reviewed by two anonymous referees.</p>
  </notes><ref-list>
    <title>References</title>

      <ref id="bib1.bibx1"><?xmltex \def\ref@label{{Adams and Kanaroglou(2016)}}?><label>Adams and Kanaroglou(2016)</label><?label adams2016?><mixed-citation>Adams, M. D. and Kanaroglou, P. S.: Mapping real-time air pollution health risk
for environmental management: Combining mobile and stationary air pollution
monitoring with neural network models, J. Environ. Manag.,
168, 133–141, <ext-link xlink:href="https://doi.org/10.1016/j.jenvman.2015.12.012" ext-link-type="DOI">10.1016/j.jenvman.2015.12.012</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx2"><?xmltex \def\ref@label{{Araki et~al.(2018)Araki, Shima, and Yamamoto}}?><label>Araki et al.(2018)Araki, Shima, and Yamamoto</label><?label araki2018?><mixed-citation>Araki, S., Shima, M., and Yamamoto, K.: Spatiotemporal land use random forest
model for estimating metropolitan NO<inline-formula><mml:math id="M147" display="inline"><mml:msub><mml:mi/><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:math></inline-formula> exposure in Japan, Sci. Total
Environ., 634, 1269–1277, <ext-link xlink:href="https://doi.org/10.1016/j.scitotenv.2018.03.324" ext-link-type="DOI">10.1016/j.scitotenv.2018.03.324</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx3"><?xmltex \def\ref@label{{Auvinen et~al.(2020)Auvinen, Boi, Hellsten, Tanhuanp{\"{a}}{\"{a}},
and J{\"{a}}rvi}}?><label>Auvinen et al.(2020)Auvinen, Boi, Hellsten, Tanhuanpää,
and Järvi</label><?label auvinen2020?><mixed-citation>Auvinen, M., Boi, S., Hellsten, A., Tanhuanpää, T., and
Järvi, L.: Study of realistic urban boundary layer turbulence with
high-resolution large-eddy simulation, Atmosphere, 11, 201,
<ext-link xlink:href="https://doi.org/10.3390/atmos11020201" ext-link-type="DOI">10.3390/atmos11020201</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx4"><?xmltex \def\ref@label{{Benoit(2011)}}?><label>Benoit(2011)</label><?label benoit2011linear?><mixed-citation>
Benoit, K.: Linear regression models with logarithmic transformations, London
School of Economics, London, 22, 23–36, 2011.</mixed-citation></ref>
      <ref id="bib1.bibx5"><?xmltex \def\ref@label{{Britter and Hanna(2003)}}?><label>Britter and Hanna(2003)</label><?label britter2003?><mixed-citation>Britter, R. E. and Hanna, S. R.: Flow and dispersion in urban areas, Ann.
Rev. Fluid Mech., 35, 469–496,
<ext-link xlink:href="https://doi.org/10.1146/annurev.fluid.35.101101.161147" ext-link-type="DOI">10.1146/annurev.fluid.35.101101.161147</ext-link>, 2003.</mixed-citation></ref>
      <ref id="bib1.bibx6"><?xmltex \def\ref@label{{Cristianini and Shawe-Taylor(2000)}}?><label>Cristianini and Shawe-Taylor(2000)</label><?label cristianini2000?><mixed-citation>Cristianini, N. and Shawe-Taylor, J.: An Introduction to Support Vector
Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, United Kingdom,
<ext-link xlink:href="https://doi.org/10.1017/CBO9780511801389" ext-link-type="DOI">10.1017/CBO9780511801389</ext-link>, 2000.</mixed-citation></ref>
      <ref id="bib1.bibx7"><?xmltex \def\ref@label{{Feng et~al.(2019)Feng, Zheng, Gao, Zhang, Huang, Zhang, Luo, and
Fan}}?><label>Feng et al.(2019)Feng, Zheng, Gao, Zhang, Huang, Zhang, Luo, and
Fan</label><?label feng2019?><mixed-citation>Feng, R., Zheng, H.-J., Gao, H., Zhang, A.-R., Huang, C., Zhang, J.-X., Luo,
K., and Fan, J.-R.: Recurrent Neural Network and random forest for analysis
and accurate forecast of atmospheric pollutants: A case study in Hangzhou,
China, J. Cleaner Product., 231, 1005–1015,
<ext-link xlink:href="https://doi.org/10.1016/j.jclepro.2019.05.319" ext-link-type="DOI">10.1016/j.jclepro.2019.05.319</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx8"><?xmltex \def\ref@label{{Fern\'{a}ndez-Delgado et~al.(2014)Fern\'{a}ndez-Delgado, Cernadas,
Barro, and Amorim}}?><label>Fernández-Delgado et al.(2014)Fernández-Delgado, Cernadas,
Barro, and Amorim</label><?label delgado2014?><mixed-citation>
Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D.: Do We Need
Hundreds of Classifiers to Solve Real World Classification Problems?, J.
Mach. Learn. Res., 15, 3133–3181, 2014.</mixed-citation></ref>
      <ref id="bib1.bibx9"><?xmltex \def\ref@label{{Gama et~al.(2014)Gama, \v{Z}liobait\.{e}, Bifet, Pechenizkiy, and
Bouchachia}}?><label>Gama et al.(2014)Gama, Žliobaitė, Bifet, Pechenizkiy, and
Bouchachia</label><?label gama2014?><mixed-citation>Gama, J. A., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and
Bouchachia, A.: A Survey on Concept Drift Adaptation, ACM Comput. Surv., 46, 44,
<ext-link xlink:href="https://doi.org/10.1145/2523813" ext-link-type="DOI">10.1145/2523813</ext-link>, 2014.</mixed-citation></ref>
      <ref id="bib1.bibx10"><?xmltex \def\ref@label{{Gómez-Dans et~al.(2016)Gómez-Dans, Lewis, and
Disney}}?><label>Gómez-Dans et al.(2016)Gómez-Dans, Lewis, and
Disney</label><?label gomezdans2016?><mixed-citation>Gómez-Dans, J. L., Lewis, P. E., and Disney, M.: Efficient Emulation of
Radiative Transfer Codes Using Gaussian Processes and Application to Land
Surface Parameter Inferences, Remote Sensing, 8, 119, <ext-link xlink:href="https://doi.org/10.3390/rs8020119" ext-link-type="DOI">10.3390/rs8020119</ext-link>,
2016.</mixed-citation></ref>
      <ref id="bib1.bibx11"><?xmltex \def\ref@label{{Hastie et~al.(2009)Hastie, Tibshirani, and Friedman}}?><label>Hastie et al.(2009)Hastie, Tibshirani, and Friedman</label><?label ESL2009?><mixed-citation>
Hastie, T., Tibshirani, R., and Friedman, J.: The Elements of Statistical
Learning, Springer-Verlag, 2009.</mixed-citation></ref>
      <ref id="bib1.bibx12"><?xmltex \def\ref@label{{Heus et~al.(2010)Heus, van Heerwaarden, Jonker, Pier~Siebesma,
Axelsen, van~den Dries, Geoffroy, Moene, Pino, de~Roode, and Vil\`{a}-Guerau~de
Arellano}}?><label>Heus et al.(2010)Heus, van Heerwaarden, Jonker, Pier Siebesma,
Axelsen, van den Dries, Geoffroy, Moene, Pino, de Roode, and Vilà-Guerau de
Arellano</label><?label heus2010?><mixed-citation>Heus, T., van Heerwaarden, C. C., Jonker, H. J. J., Pier Siebesma, A., Axelsen,
S., van den Dries, K., Geoffroy, O., Moene, A. F., Pino, D., de Roode, S. R.,
and Vilà-Guerau de Arellano, J.: Formulation of the Dutch Atmospheric
Large-Eddy Simulation (DALES) and overview of its applications, Geosci. Model
Dev., 3, 415–444, <ext-link xlink:href="https://doi.org/10.5194/gmd-3-415-2010" ext-link-type="DOI">10.5194/gmd-3-415-2010</ext-link>, 2010.</mixed-citation></ref>
      <ref id="bib1.bibx13"><?xmltex \def\ref@label{{Hu et~al.(2017){Hu}, {Rahman}, {Bhrugubanda}, and
{Sivaraman}}}?><label>Hu et al.(2017)Hu, Rahman, Bhrugubanda, and
Sivaraman</label><?label hu2017?><mixed-citation>Hu, K., Rahman, A., Bhrugubanda, H., and Sivaraman, V.: HazeEst:
Machine Learning Based Metropolitan Air Pollution Estimation From Fixed and
Mobile Sensors, IEEE Sensors J., 17, 3517–3525,
<ext-link xlink:href="https://doi.org/10.1109/JSEN.2017.2690975" ext-link-type="DOI">10.1109/JSEN.2017.2690975</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx14"><?xmltex \def\ref@label{{Karttunen et~al.(2020)Karttunen, Kurppa, Auvinen, Hellsten, and
J{\"{a}}rvi}}?><label>Karttunen et al.(2020)Karttunen, Kurppa, Auvinen, Hellsten, and
Järvi</label><?label karttunen2020?><mixed-citation>Karttunen, S., Kurppa, M., Auvinen, M., Hellsten, A., and Järvi, L.:
Large-eddy simulation of the optimal street-tree layout for pedestrian-level
aerosol particle concentrations – A case study from a city-boulevard, Atmos.
Environ. X, 6, 100073, <ext-link xlink:href="https://doi.org/10.1016/j.aeaoa.2020.100073" ext-link-type="DOI">10.1016/j.aeaoa.2020.100073</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx15"><?xmltex \def\ref@label{Karttunen and Kurppa(2021a)}?><label>Karttunen and Kurppa(2021a)</label><?label Karttunen2021a?><mixed-citation>Karttunen, S. and Kurppa, M.: Input data for article “Large eddy simulation of the optimal street-tree layout for pedestrian-level aerosol particle concentrations”, Zenodo [data set], <ext-link xlink:href="https://doi.org/10.5281/zenodo.3556287" ext-link-type="DOI">10.5281/zenodo.3556287</ext-link>, 2021a.</mixed-citation></ref>
      <ref id="bib1.bibx16"><?xmltex \def\ref@label{Karttunen and Kurppa(2021b)}?><label>Karttunen and Kurppa(2021b)</label><?label Karttunen2021b?><mixed-citation>Karttunen, S. and Kurppa, M.: Input and output files and datasets for a LES case study of city-boulevard ventilation, Fairdata [data set], available at: <uri>http://urn.fi/urn:nbn:fi:att:ee275362-3f56-477c-bbbc-6fcacd9c7f95</uri>, last access: 22 November 2021.</mixed-citation></ref>
      <ref id="bib1.bibx17"><?xmltex \def\ref@label{{King et~al.(2018)King, Adcock, Annoni, and Dykes}}?><label>King et al.(2018)King, Adcock, Annoni, and Dykes</label><?label king2018?><mixed-citation>King, R. N., Adcock, C., Annoni, J., and Dykes, K.: Data-Driven Machine
Learning for Wind Plant Flow Modeling, J. Phys. Conf. Ser.,
1037, 072004, <ext-link xlink:href="https://doi.org/10.1088/1742-6596/1037/7/072004" ext-link-type="DOI">10.1088/1742-6596/1037/7/072004</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx18"><?xmltex \def\ref@label{{Krecl et~al.(2019)Krecl, Cipoli, Targino, de~Oliveira~Toloto,
Segersson, Parra, Polezer, Godoi, and Gidhagen}}?><label>Krecl et al.(2019)Krecl, Cipoli, Targino, de Oliveira Toloto,
Segersson, Parra, Polezer, Godoi, and Gidhagen</label><?label krecl2019?><mixed-citation>Krecl, P., Cipoli, Y. A., Targino, A. C., de Oliveira Toloto, M., Segersson,
D., Parra, Á., Polezer, G., Godoi, R. H. M., and Gidhagen, L.: Modelling
urban cyclists' exposure to black carbon particles using high spatiotemporal
data: A statistical approach, Sci. Total Environ., 679,
115–125, <ext-link xlink:href="https://doi.org/10.1016/j.scitotenv.2019.05.043" ext-link-type="DOI">10.1016/j.scitotenv.2019.05.043</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx19"><?xmltex \def\ref@label{{Kumar et~al.(2011)Kumar, Ketzel, Vardoulakis, Pirjola, and
Britter}}?><label>Kumar et al.(2011)Kumar, Ketzel, Vardoulakis, Pirjola, and
Britter</label><?label kumar2011?><mixed-citation>Kumar, P., Ketzel, M., Vardoulakis, S., Pirjola, L., and Britter, R.: Dynamics
and dispersion modelling of nanoparticles from road traffic in the urban
atmospheric environment – A review, J. Aerosol Sci., 42, 580–603,
<ext-link xlink:href="https://doi.org/10.1016/j.jaerosci.2011.06.001" ext-link-type="DOI">10.1016/j.jaerosci.2011.06.001</ext-link>, 2011.</mixed-citation></ref>
      <ref id="bib1.bibx20"><?xmltex \def\ref@label{{Kurppa et~al.(2018)Kurppa, Hellsten, Auvinen, Raasch, Vesala, and
J{\"{a}}rvi}}?><label>Kurppa et al.(2018)Kurppa, Hellsten, Auvinen, Raasch, Vesala, and
Järvi</label><?label kurppa2018?><mixed-citation>Kurppa, M., Hellsten, A., Auvinen, M., Raasch, S., Vesala, T., and Järvi,
L.: Ventilation and Air Quality in City Blocks Using Large-Eddy
Simulation—Urban Planning Perspective, Atmosphere, 9, 65,
<ext-link xlink:href="https://doi.org/10.3390/atmos9020065" ext-link-type="DOI">10.3390/atmos9020065</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx21"><?xmltex \def\ref@label{{Kurppa et~al.(2019)Kurppa, Hellsten, Roldin, Kokkola, Tonttila,
Auvinen, Kent, Kumar, Maronga, and J\"{a}rvi}}?><label>Kurppa et al.(2019)Kurppa, Hellsten, Roldin, Kokkola, Tonttila,
Auvinen, Kent, Kumar, Maronga, and Järvi</label><?label kurppa2019?><mixed-citation>Kurppa, M., Hellsten, A., Roldin, P., Kokkola, H., Tonttila, J., Auvinen, M.,
Kent, C., Kumar, P., Maronga, B., and Järvi, L.: Implementation of the
sectional aerosol module SALSA2.0 into the PALM model system 6.0: model
development and first evaluation, Geosci. Model Dev., 12,
1403–1422, <ext-link xlink:href="https://doi.org/10.5194/gmd-12-1403-2019" ext-link-type="DOI">10.5194/gmd-12-1403-2019</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx22"><?xmltex \def\ref@label{Kurppa et al.(2021)}?><label>Kurppa et al.(2021)</label><?label Kurppa2021?><mixed-citation>Kurppa, M., Hellsten, A., Auvinen, M., and Järvi, L.: Assessing pollutant ventilation in a city-boulevard using large-eddy simulation, Fairdata [data set], available at: <uri>http://urn.fi/urn:nbn:fi:att:cfe1bd77-6697-44b5-bdd7-ee74f36c7dcd</uri>, last access: 22 November 2021.</mixed-citation></ref>
      <?pagebreak page7424?><ref id="bib1.bibx23"><?xmltex \def\ref@label{{Lambert(1992)}}?><label>Lambert(1992)</label><?label Diane_ZIP?><mixed-citation>
Lambert, D.: Zero-Inflated Poisson Regression, With an Application to Defects
in Manufacturing, Technometrics, 34, 1–14, 1992.</mixed-citation></ref>
      <ref id="bib1.bibx24"><?xmltex \def\ref@label{Lange et al.(2021)}?><label>Lange et al.(2021)</label><?label Lange2021?><mixed-citation>Lange, M., Suominen, H., Kurppa, M., Järvi, L., Oikarinen, E., Savvides, R., and Puolamäki, K.: Datasets of Air Pollutants on Boulevard Type Streets and Software to Replicate Large-Eddy Simulations of Air Pollutant Concentrations Along Boulevard-Type Streets (1.0.0), Zenodo [data set and code], <ext-link xlink:href="https://doi.org/10.5281/zenodo.3999302" ext-link-type="DOI">10.5281/zenodo.3999302</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx25"><?xmltex \def\ref@label{{Lelieveld et~al.(2015)Lelieveld, Evans, Fnais, Giannadaki, and
Pozzer}}?><label>Lelieveld et al.(2015)Lelieveld, Evans, Fnais, Giannadaki, and
Pozzer</label><?label lelieveld2015?><mixed-citation>Lelieveld, J., Evans, J. S., Fnais, M., Giannadaki, D., and Pozzer, A.: The
contribution of outdoor air pollution sources to premature mortality on a
global scale, Nature, 525, 367–371, <ext-link xlink:href="https://doi.org/10.1038/nature15371" ext-link-type="DOI">10.1038/nature15371</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx26"><?xmltex \def\ref@label{{Lelieveld et~al.(2019)Lelieveld, Klingm{\"{u}}ller, Pozzer,
P{\"{o}}schl, Fnais, Daiber, and M{\"{u}}nzel}}?><label>Lelieveld et al.(2019)Lelieveld, Klingmüller, Pozzer,
Pöschl, Fnais, Daiber, and Münzel</label><?label lelieveld2019?><mixed-citation>Lelieveld, J., Klingmüller, K., Pozzer, A., Pöschl, U., Fnais, M.,
Daiber, A., and Münzel, T.: Cardiovascular disease burden from ambient
air pollution in Europe reassessed using novel hazard ratio functions, Eur.
Heart J., 40, 1590–1596, <ext-link xlink:href="https://doi.org/10.1093/eurheartj/ehz135" ext-link-type="DOI">10.1093/eurheartj/ehz135</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx27"><?xmltex \def\ref@label{{Maronga et~al.(2015)Maronga, Gryschka, Heinze, Hoffmann,
Kanani-S{\"{u}}hring, Keck, Ketelsen, Letzel, S{\"{u}}hring, and
Raasch}}?><label>Maronga et al.(2015)Maronga, Gryschka, Heinze, Hoffmann,
Kanani-Sühring, Keck, Ketelsen, Letzel, Sühring, and
Raasch</label><?label maronga2015?><mixed-citation>Maronga, B., Gryschka, M., Heinze, R., Hoffmann, F., Kanani-Sühring, F.,
Keck, M., Ketelsen, K., Letzel, M. O., Sühring, M., and Raasch, S.: The
Parallelized Large-Eddy Simulation Model (PALM) version 4.0 for atmospheric
and oceanic flows: model formulation, recent developments, and future
perspectives, Geosci. Model Dev., 8, 2515–2551,
<ext-link xlink:href="https://doi.org/10.5194/gmd-8-2515-2015" ext-link-type="DOI">10.5194/gmd-8-2515-2015</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx28"><?xmltex \def\ref@label{{Maronga et~al.(2020)Maronga, Banzhaf, Burmeister, Esch, Forkel,
Fr\"{o}hlich, Fuka, Gehrke, Geleti\v{c}, Giersch, Gronemeier, Gro{\ss}, Heldens,
Hellsten, Hoffmann, Inagaki, Kadasch, Kanani-S\"{u}hring, Ketelsen, Khan,
Knigge, Knoop, Kr\v{c}, Kurppa, Maamari, Matzarakis, Mauder, Pallasch, Pavlik,
Pfafferott, Resler, Rissmann, Russo, Salim, Schrempf, Schwenkel, Seckmeyer,
Schubert, S\"{u}hring, von Tils, Vollmer, Ward, Witha, Wurps, Zeidler, and
Raasch}}?><label>Maronga et al.(2020)Maronga, Banzhaf, Burmeister, Esch, Forkel,
Fröhlich, Fuka, Gehrke, Geletič, Giersch, Gronemeier, Groß, Heldens,
Hellsten, Hoffmann, Inagaki, Kadasch, Kanani-Sühring, Ketelsen, Khan,
Knigge, Knoop, Krč, Kurppa, Maamari, Matzarakis, Mauder, Pallasch, Pavlik,
Pfafferott, Resler, Rissmann, Russo, Salim, Schrempf, Schwenkel, Seckmeyer,
Schubert, Sühring, von Tils, Vollmer, Ward, Witha, Wurps, Zeidler, and
Raasch</label><?label maronga2019?><mixed-citation>Maronga, B., Banzhaf, S., Burmeister, C., Esch, T., Forkel, R., Fröhlich, D., Fuka, V., Gehrke, K. F., Geletič, J., Giersch, S., Gronemeier, T., Groß, G., Heldens, W., Hellsten, A., Hoffmann, F., Inagaki, A., Kadasch, E., Kanani-Sühring, F., Ketelsen, K., Khan, B. A., Knigge, C., Knoop, H., Krč, P., Kurppa, M., Maamari, H., Matzarakis, A., Mauder, M., Pallasch, M., Pavlik, D., Pfafferott, J., Resler, J., Rissmann, S., Russo, E., Salim, M., Schrempf, M., Schwenkel, J., Seckmeyer, G., Schubert, S., Sühring, M., von Tils, R., Vollmer, L., Ward, S., Witha, B., Wurps, H., Zeidler, J., and Raasch, S.: Overview of the PALM model system 6.0, Geosci. Model Dev., 13, 1335–1372, <ext-link xlink:href="https://doi.org/10.5194/gmd-13-1335-2020" ext-link-type="DOI">10.5194/gmd-13-1335-2020</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx29"><?xmltex \def\ref@label{{Murphy(2012)}}?><label>Murphy(2012)</label><?label murphy2013machine?><mixed-citation>
Murphy, K. P.: Machine Learning: A Probabilistic Perspective, The MIT Press, Cambridge, Massachusetts,
2012.</mixed-citation></ref>
      <ref id="bib1.bibx30"><?xmltex \def\ref@label{{Nosek et~al.(2016)Nosek, Kuka{\v{c}}ka, Kellnerov{\'{a}}, Jur{\v{c}}{\'{a}}kov{\'{a}}, and Ja{\v{n}}our}}?><label>Nosek et al.(2016)Nosek, Kukačka, Kellnerová, Jurčáková, and Jaňour</label><?label nosek2016?><mixed-citation>Nosek, Š., Kukačka, L., Kellnerová, R., Jurčáková,
K., and Jaňour, Z.: Ventilation Processes in a Three-Dimensional Street
Canyon, Bound.-Lay. Meteorol., 159, 259–284,
<ext-link xlink:href="https://doi.org/10.1007/s10546-016-0132-2" ext-link-type="DOI">10.1007/s10546-016-0132-2</ext-link>, 2016.
</mixed-citation></ref><?xmltex \hack{\newpage}?>
      <ref id="bib1.bibx31"><?xmltex \def\ref@label{{Oikarinen et~al.(2021)Oikarinen, Tiittanen, Henelius, and
Puolam\"{a}ki}}?><label>Oikarinen et al.(2021)Oikarinen, Tiittanen, Henelius, and
Puolamäki</label><?label tiittanen2019estimating?><mixed-citation>Oikarinen, E., Tiittanen, H., Henelius, A., and Puolamäki, K.: Detecting
virtual concept drift of regressors without ground truth values, Data Min. Knowl. Disc., 35, 726–747, <ext-link xlink:href="https://doi.org/10.1007/s10618-021-00739-7" ext-link-type="DOI">10.1007/s10618-021-00739-7</ext-link>,
2021.</mixed-citation></ref>
      <ref id="bib1.bibx32"><?xmltex \def\ref@label{{Peng et~al.(2017)Peng, Lima, Teakles, Jin, Cannon, and
Hsieh}}?><label>Peng et al.(2017)Peng, Lima, Teakles, Jin, Cannon, and
Hsieh</label><?label peng2017?><mixed-citation>Peng, H., Lima, A. R., Teakles, A., Jin, J., Cannon, A. J., and Hsieh, W. W.:
Evaluating hourly air quality forecasting in Canada with nonlinear updatable
machine learning methods, Air Quality, Atmos. Health, 10, 195–211,
<ext-link xlink:href="https://doi.org/10.1007/s11869-016-0414-3" ext-link-type="DOI">10.1007/s11869-016-0414-3</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx33"><?xmltex \def\ref@label{{R Core Team(2020)}}?><label>R Core Team(2020)</label><?label rcoreteam2020?><mixed-citation>R Core Team: R: A Language and Environment for Statistical Computing, R
Foundation for Statistical Computing, Vienna, Austria,
available at: <uri>https://www.R-project.org/</uri> (last access: 22 November 2021), 2020.</mixed-citation></ref>
      <ref id="bib1.bibx34"><?xmltex \def\ref@label{{Rybarczyk and Zalakeviciute(2018)}}?><label>Rybarczyk and Zalakeviciute(2018)</label><?label rybarczyk2018?><mixed-citation>Rybarczyk, Y. and Zalakeviciute, R.: Machine Learning Approaches for Outdoor
Air Quality Modelling: A Systematic Review, Appl. Sci., 8, 2570,
<ext-link xlink:href="https://doi.org/10.3390/app8122570" ext-link-type="DOI">10.3390/app8122570</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx35"><?xmltex \def\ref@label{{Salim et~al.(2011)Salim, Buccolieri, Chan, and Sabatino}}?><label>Salim et al.(2011)Salim, Buccolieri, Chan, and Sabatino</label><?label salim2011?><mixed-citation>Salim, S. M., Buccolieri, R., Chan, A., and Sabatino, S. D.: Numerical
simulation of atmospheric pollutant dispersion in an urban street canyon:
Comparison between RANS and LES, J. Wind Eng. Ind. Aerod., 99, 103–113, <ext-link xlink:href="https://doi.org/10.1016/j.jweia.2010.12.002" ext-link-type="DOI">10.1016/j.jweia.2010.12.002</ext-link>, 2011.</mixed-citation></ref>
      <ref id="bib1.bibx36"><?xmltex \def\ref@label{{Tominaga and Stathopoulos(2011)}}?><label>Tominaga and Stathopoulos(2011)</label><?label tominaga2011?><mixed-citation>Tominaga, Y. and Stathopoulos, T.: CFD modeling of pollution dispersion in a
street canyon: Comparison between LES and RANS, J. Wind Eng. Ind. Aerod.,
99, 340–348, <ext-link xlink:href="https://doi.org/10.1016/j.jweia.2010.12.005" ext-link-type="DOI">10.1016/j.jweia.2010.12.005</ext-link>, 2011.</mixed-citation></ref>
      <ref id="bib1.bibx37"><?xmltex \def\ref@label{{{Van den Bossche} et~al.(2018){Van den Bossche}, Baets, Verwaeren,
Botteldooren, and Theunis}}?><label>Van den Bossche et al.(2018)Van den Bossche, Baets, Verwaeren,
Botteldooren, and Theunis</label><?label vandenbossche2018?><mixed-citation>Van den Bossche, J., Baets, B. D., Verwaeren, J., Botteldooren, D., and
Theunis, J.: Development and evaluation of land use regression models for
black carbon based on bicycle and pedestrian measurements in the urban
environment, Environ. Model. Softw., 99, 58–69,
<ext-link xlink:href="https://doi.org/10.1016/j.envsoft.2017.09.019" ext-link-type="DOI">10.1016/j.envsoft.2017.09.019</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx38"><?xmltex \def\ref@label{{{WHO}(2016)}}?><label>WHO(2016)</label><?label who2016?><mixed-citation>WHO (World Health Organization): Ambient air pollution: A global assessment of exposure and burden of
disease, available at: <uri>https://apps.who.int/iris/handle/10665/250141</uri> (last access: 22 November 2021), 2016.</mixed-citation></ref>
      <ref id="bib1.bibx39"><?xmltex \def\ref@label{{Yang et~al.(2018)Yang, Deng, Xu, and Wang}}?><label>Yang et al.(2018)Yang, Deng, Xu, and Wang</label><?label yang2018?><mixed-citation>Yang, W., Deng, M., Xu, F., and Wang, H.: Prediction of hourly PM<inline-formula><mml:math id="M148" display="inline"><mml:msub><mml:mi/><mml:mn mathvariant="normal">2.5</mml:mn></mml:msub></mml:math></inline-formula> using a
space-time support vector regression model, Atmos. Environ., 181, 12–19, <ext-link xlink:href="https://doi.org/10.1016/j.atmosenv.2018.03.015" ext-link-type="DOI">10.1016/j.atmosenv.2018.03.015</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx40"><?xmltex \def\ref@label{{Yuan et~al.(2014)Yuan, Ng, and Norford}}?><label>Yuan et al.(2014)Yuan, Ng, and Norford</label><?label yuan2014?><mixed-citation>Yuan, C., Ng, E., and Norford, L. K.: Improving air quality in high-density
cities by understanding the relationship between air pollutant dispersion and
urban morphologies, Build. Environ., 71, 245–258,
<ext-link xlink:href="https://doi.org/10.1016/j.buildenv.2013.10.008" ext-link-type="DOI">10.1016/j.buildenv.2013.10.008</ext-link>, 2014.</mixed-citation></ref>

  </ref-list></back>
</article>
