Running large-eddy simulations (LESs) can be burdensome and computationally too expensive for practical applications such as supporting urban planning. In this study, regression models are used to replicate modelled air pollutant concentrations from LES in urban boulevards. We study the performance of the regression models and discuss how to detect situations where the models are applied outside their training domain and their outputs cannot be trusted.
Regression models from 10 different model families are trained and a cross-validation methodology is used to evaluate their performance and to find the best set of features needed to reproduce the LES outputs. We also test the regression models on an independent testing dataset.
Our results suggest that, in general, log-linear regression gives the best and most robust performance on new independent data. It clearly outperforms the dummy model, which would predict constant concentrations for all locations (multiplicative minimum RMSE (mRMSE) of

Exposure to ambient air pollution leads to cardiovascular and pulmonary diseases, and is estimated to cause 3 million premature deaths worldwide every year

Successful modelling of urban air pollutant dispersion necessitates taking into account the detailed properties of adjacent buildings and vegetation in the area of interest as well as in its surroundings. To date, high-resolution dispersion modelling has mainly been based on physical modelling techniques, of which computational fluid dynamics (CFD) models, notably Reynolds-averaged Navier–Stokes (RANS) equations and large-eddy simulation (LES), are the most applicable tools for the purpose. CFD models solve the flow and dispersion around individual buildings, and with constantly increasing computational resources the modelling domains can currently be extended to cover entire neighbourhoods and even cities

Machine learning allows finding a relationship between a target variable, e.g. the concentration of air pollutants at a certain location, and its predictors, which are often called features. These types of machine-learning models are called regression models. The models are trained on specified training data from which, by some rule, e.g. maximizing the likelihood under the assumption that the relationship is linear with normally distributed noise, they learn a relationship between the target variable and its features. To evaluate the trustworthiness of the results, part of the available data is often withheld from training and used for evaluation instead. These “unseen” evaluation data give an estimate of the model performance in a realistic urban planning scenario. Perhaps the largest advantage of regression models compared to CFD models is their speed. However, the increase in speed comes at a cost in accuracy. Another disadvantage is that accurate predictions require the predicted data to follow approximately the same distribution as the training data, restricting their use to modelling setups similar to those they were trained on.
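This train-and-evaluate workflow can be sketched in a few lines. The following is a toy illustration with synthetic data and ordinary least squares (the study itself uses R; this sketch uses Python, and all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: the target depends linearly on two features plus noise.
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out a quarter of the data as "unseen" evaluation data.
n_train = 150
X_tr, y_tr = X[:n_train], y[:n_train]
X_ev, y_ev = X[n_train:], y[n_train:]

# Ordinary least squares = maximum likelihood under Gaussian noise.
beta, *_ = np.linalg.lstsq(np.c_[np.ones(n_train), X_tr], y_tr, rcond=None)

# Evaluate on the held-out data only.
pred = np.c_[np.ones(len(X_ev)), X_ev] @ beta
rmse = float(np.sqrt(np.mean((pred - y_ev) ** 2)))
```

The error on the held-out quarter, not on the training part, is what estimates performance in a realistic planning scenario.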

Most of the previous studies on developing a statistical air pollution model using machine learning have been based on field measurements, and the spatiotemporal distribution of pollutants has been assessed by utilizing multiple stationary sites in model training

In this study, the application of machine learning for emulating LES outputs of local-scale air pollutant dispersion in urban areas is investigated. We use LES outputs from two different studies conducted in different boulevard-type street canyons in Helsinki. Specifically, we create appropriate features for the neighbourhoods around the street canyons from the inputs of the LES that are used to train the machine-learning models and then evaluate the performance and reliability of different algorithms. The motivation is to approximate the computationally expensive simulations with machine-learning models that are faster to evaluate. The ultimate goal is to develop a model that can easily be applied to support urban planning.

This study is structured as follows: first, the LES datasets and feature construction are described. Then brief descriptions of the used machine-learning models and the training and evaluation process are provided. Finally, the applications, limitations and future work are discussed.

In this section, we introduce the LES datasets used in this study and explain our pre-processing steps. Then, we present the features available to the regression models. Finally, we describe how the optimal set of features is chosen by forward feature selection and describe the performance measures used.
All analyses were carried out in R version 3.6.2

Simulation domains of the LES output data applied in the study:

LES models resolve the three-dimensional prognostic equations for momentum and scalar variables. In LES, all turbulence scales larger than a chosen filter width are resolved directly. The smaller scales, which should represent less than 10 % of the turbulence energy

The first study by

The second study by

Both studies apply a grid spacing of 1.0

The machine-learning models in this study are trained with KU18 and evaluated with KA20. Using KA20 for evaluation mimics a realistic urban planning scenario, where KA20 would correspond to a new city plan considered by the urban planner.
KU18 is selected as the training dataset due to its greater variety of building layouts compared to KA20, in which the variation is mainly limited to different street-tree scenarios. Clear differences in the building layouts lead to differing pollutant dispersion and concentration distributions, which improves the generalization performance. An alternative approach would be to create training and evaluation data from random samples of both KU18 and KA20. However, random sampling is impractical here due to significant differences between KU18 and KA20. KU18 simulates dispersion qualitatively and assumes weightless, inert particles that imitate air pollutants in general. In contrast, KA20 models realistic aerosol particle concentration values and includes realistic model physics for aerosol dry deposition. These differences affect the scaling and distributions of the simulated particle concentrations (Fig.

Histograms of air pollutant concentrations in KU18 and KA20. Note the log scale of the counts on the

The LES outputs KU18 and KA20 are pre-processed into a suitable format for training the regression models. The pre-processing is divided into two parts: aggregating the target variable over time and height, and constructing expressive features from the LES inputs.

In regression, the target variable is predicted using predictor variables (features). Here, the target variable is the (air) pollutant concentration

In addition to time averaging, we also restrict the spatial extent in which the regression models are trained. The area of interest in this study is the city boulevard in the middle of the maps, which is the same as in the original studies that generated KU18 and KA20. In KA20, the boulevard is surrounded by artificial buildings to imitate the aerodynamic roughness of a suburban environment; these, however, are of no interest here and are therefore omitted from the model development. Figure

The target variable is predicted using features that are defined for each

The surrogate features provide information about spatial dependencies in the modelling domain, reducing the need for the regression models to explicitly model these dependencies. For instance, applying convolutions of various sizes on pollutant emissions creates surrogate features of average pollutant emission densities over a spatial neighbourhood, weighted by proximity to the point in question. Other features, such as the height-to-width ratio of a street canyon, are created based on domain knowledge. Incorporating domain knowledge is important, since well-crafted input features largely determine the quality of modelling results. Furthermore, understandable features aid experts in interpreting the physical meaning of the results.
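As an illustration of the convolution-based surrogate features, the following sketch averages a hypothetical emission map over square neighbourhoods of two sizes. This is a plain unweighted box filter; the actual features may additionally weight cells by proximity, and the emission map here is invented for the example:

```python
import numpy as np

def box_average(field, size):
    """Average of `field` over a (2*size+1)^2 neighbourhood around each cell
    (an unweighted box convolution with edge padding)."""
    padded = np.pad(field, size, mode="edge")
    out = np.zeros_like(field, dtype=float)
    n = 2 * size + 1
    for di in range(n):
        for dj in range(n):
            out += padded[di:di + field.shape[0], dj:dj + field.shape[1]]
    return out / n ** 2

# Hypothetical emission map: one line source (a street) in a 20 x 20 grid.
emissions = np.zeros((20, 20))
emissions[10, :] = 1.0

# Surrogate features: mean emission density at two neighbourhood scales.
feat_small = box_average(emissions, 1)  # 3 x 3 neighbourhood
feat_large = box_average(emissions, 4)  # 9 x 9 neighbourhood
```

Each output grid can then be flattened and attached as one feature column per neighbourhood scale.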

List of features used in the regression models.

The map area cutout for which pollutant concentrations are modelled (in blue) for each of the city plans (V1–V4) from

After calculating all potentially useful input features, a subset of features to be used with each model is selected using forward selection

The best-performing features are selected according to a selection criterion. Here, the selection criterion is the cross-validation root mean squared error (RMSE). Cross-validation is a technique for evaluating generalization performance of statistical models and for detecting overfitting. In cross-validation, the data are split into
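The combination of forward selection and cross-validation can be sketched as follows. This is a simplified Python illustration using ordinary least squares on synthetic data; the function names and the stopping rule are ours, not the study's:

```python
import numpy as np

def cv_rmse(X, y, features, k=5):
    """k-fold cross-validation RMSE of least squares on the given feature columns."""
    folds = np.array_split(np.arange(len(y)), k)
    sq_errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        A_tr = np.c_[np.ones(len(train)), X[train][:, features]]
        A_te = np.c_[np.ones(len(test)), X[test][:, features]]
        beta, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
        sq_errors.append((A_te @ beta - y[test]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_errors))))

def forward_selection(X, y, k=5):
    """Greedily add the feature that lowers cv-RMSE the most; stop when none helps."""
    selected, best = [], np.inf
    while True:
        candidates = [f for f in range(X.shape[1]) if f not in selected]
        if not candidates:
            return selected, best
        scores = {f: cv_rmse(X, y, selected + [f], k) for f in candidates}
        f_best = min(scores, key=scores.get)
        if scores[f_best] >= best:
            return selected, best
        selected.append(f_best)
        best = scores[f_best]

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 2] + rng.normal(scale=0.1, size=300)  # only feature 2 is informative
chosen, score = forward_selection(X, y)
```

Because each candidate set is scored on held-out folds, uninformative features generally fail to reduce the cv-RMSE and are left out.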

Models and their implementation in R (version 3.6.2), with all but two modelling pollutant concentrations (

The applicability of 10 common regression models trained on KU18 is examined, from the simplest linear model to the powerful support vector regression (SVR) model (Table

Four generalized linear models are considered: linear regression, log-linear regression
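As a sketch of the log-linear variant, ordinary least squares is fitted to the logarithm of the target, so that exponentiated predictions are always strictly positive. The data below are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical multiplicative relationship: c = exp(1 + 0.5 * x) * noise.
x = rng.uniform(0.0, 2.0, size=200)
c = np.exp(1.0 + 0.5 * x) * rng.lognormal(sigma=0.05, size=200)

# Log-linear regression: ordinary least squares on the log-concentrations.
A = np.c_[np.ones_like(x), x]
beta, *_ = np.linalg.lstsq(A, np.log(c), rcond=None)

# Exponentiating the fitted values guarantees strictly positive predictions.
pred = np.exp(A @ beta)
```

The fitted coefficients remain directly interpretable: each unit increase in a feature multiplies the predicted concentration by a fixed factor.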

Three tree-based models are considered: decision trees, random forest and gradient boosting with decision trees

SVR

GPR

A dummy model is used as a baseline model for reference. The dummy model simply predicts mean

The RMSE is a standard performance measure for model evaluation

mRMSE therefore depends only on the relative magnitudes of the pollutant concentrations and it is invariant to linear scaling of the training or evaluation data. For a new evaluation dataset, we could either use the same multiplicative constant – if the scaling in the new evaluation dataset is expected to be identical to the scaling in the old evaluation data – or find a new multiplicative constant.
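Assuming mRMSE is the RMSE minimized over a single multiplicative constant applied to the predictions (our reading of the description above, not a verbatim definition from the study), the constant has a closed form and the invariance to rescaling the predictions is easy to verify:

```python
import numpy as np

def mrmse(pred, obs):
    """RMSE after rescaling the predictions by the best multiplicative constant c.
    Minimizing sum((c * pred - obs)^2) over c gives c = <pred, obs> / <pred, pred>."""
    c = float(np.dot(pred, obs) / np.dot(pred, pred))
    return c, float(np.sqrt(np.mean((c * pred - obs) ** 2)))

obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = 10.0 * obs                  # perfect prediction up to an overall scale
c, err = mrmse(pred, obs)          # the optimal c undoes the factor of 10

# Rescaling the predictions leaves the mRMSE unchanged.
c2, err2 = mrmse(5.0 * pred, obs)
```

This is what makes the measure depend only on the relative magnitudes of the predicted concentrations.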

In addition to mRMSE, two other performance measures are presented in the results tables. The cross-validation RMSE (cv-RMSE, described in Sect.

Note that model performance on spatial data cannot be meaningfully summarized into a single performance measure. The spatial distribution of prediction errors can be examined using residual plots (Fig.

This section describes the training process of the models with features selected using forward selection. Based on the training results, the best-performing regression models are selected and summarized. The models are subsequently evaluated on an independent dataset (KA20). Lastly, the reliability of model predictions in cases where LES results are not available is assessed with a concept drift detection algorithm.

Feature selection results on KU18 for all regression models.

For each model, forward feature selection on KU18 provides the optimal set of features (as described in Sect.

After feature selection, each model is trained on the whole training data (all KU18 city plans), using the optimal features from feature selection. SVR and GPR are, as in feature selection, only trained on 2500 randomly selected data points.

SVR and GPR have additional parameters whose optimal values were computed using grid search. Grid search is a model tuning technique in which the model is trained using all parameter values on a discrete grid of the parameter space. The selection criterion for the optimal parameter values was cv-RMSE (as with feature selection).
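Grid search itself is straightforward. The sketch below tunes a single penalty parameter of ridge regression as a stand-in for the SVR/GPR hyperparameters (illustrative Python on synthetic data, not the study's R setup), with cv-RMSE as the selection criterion:

```python
import numpy as np

def cv_rmse_ridge(X, y, alpha, k=5):
    """k-fold cross-validation RMSE of ridge regression with penalty alpha."""
    folds = np.array_split(np.arange(len(y)), k)
    sq = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        A = X[tr]
        beta = np.linalg.solve(A.T @ A + alpha * np.eye(X.shape[1]), A.T @ y[tr])
        sq.append((X[te] @ beta - y[te]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq))))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Grid search: evaluate every candidate value on a discrete grid, keep the best.
grid = [0.001, 0.01, 0.1, 1.0, 10.0]
best_alpha = min(grid, key=lambda a: cv_rmse_ridge(X, y, a))
```

With several hyperparameters, the same loop runs over the Cartesian product of the per-parameter grids.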

Note that some models allow for negative predictions (negative pollutant concentrations), which is physically impossible. A possible approach for avoiding negative predictions is either to use a transformation that allows only positive predictions (for example, log-linear regression instead of linear regression) or to clip model outputs to a minimum of zero after prediction. For the purposes of this study, however, negative predictions are retained, since their magnitude is relevant for error estimation.

Out of the 10 models described in Sect.

The three best-performing models are as follows:

The logarithmic SVR has the smallest cross-validation RMSE, and as such, its predictions are the most accurate on the training data. Additionally, it only requires a small number of features to achieve this strong performance. Requiring low-dimensional input data means that the user needs to provide fewer features, minimizing the expense of data preparation. A third advantage is that the log transformation ensures non-negative predictions for

The RMSE of the GPR is close to that of the log-SVR, making it one of the strongest models as well. In addition, it has previously been used to predict simulator outputs

As a linear model, the log-linear regression is useful if model interpretability is required (for more details on its interpretability, see Sect.

Comparison of selected models based on KU18. Performance refers to the cross-validated RMSE; number of features refers to the optimal number selected during forward feature selection. Prior use means the model has been used for similar tasks in literature. A “

As a final test, the models are evaluated on an independent dataset to obtain a more accurate estimate of their performance in a real-world urban planning situation. The models cannot be evaluated solely based on the training data, since model overfitting would not be detected, and the error estimation in real-world situations would be poor. For the evaluation, we use the models trained on both wind directions and all city plans of KU18 (as described in Sect.

We use the mRMSE and the scaled bias defined in Sect.

Out of the selected models, the log-linear model is preferable. Not only does it perform best of the three selected models, but it is also the simplest and the fastest to run. Its simplicity is likely the reason for its good performance given the notable differences between the training and evaluation data.

Although log-SVR and GPR both perform better than the dummy model, their performance relative to the other models is worse on the evaluation data than in the cross-validation procedure. This could be a sign of slight overfitting to the training data or of sensitivity to the different data distribution of the testing dataset.

The interpretation of the scaled bias is not straightforward due to the scaling, but it shows that all of the models overestimate

The scaled residuals can be seen in Fig.

Model residuals of PM

Cross validation and evaluation error for all models, obtained with their respective optimal set of features. Cross-validation RMSE is on KU18, while mRMSE and scaled bias are calculated on PM

It is important to know whether the results obtained generalize to different city plans. Because of the high computational cost of running LES, we often do not have access to the LES output values of a new plan, and hence we cannot assess the prediction error directly in such a case. We can, however, use the Drifter algorithm by

Example of a single evaluation segment (light blue square) applied in the concept drift analysis for KA20 using the Drifter algorithm.

Originally, the Drifter algorithm was developed for time series data. We adapt it to spatial data here by defining the segments of the training data as spatially, rather than temporally, close-by points. We hence divide the map into
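A simplified, hypothetical reading of this segment-based idea can be sketched as follows: train one model per spatial segment of the training data, and use the disagreement between the segment models and the full model on new inputs as a drift indicator. This sketch is our own illustration on synthetic 1-D data, not the Drifter implementation:

```python
import numpy as np

def fit(X, y):
    """Ordinary least squares with intercept."""
    A = np.c_[np.ones(len(X)), X]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    return np.c_[np.ones(len(X)), X] @ beta

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(400, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=400)

# Spatial segments: contiguous chunks along the (here 1-D) spatial coordinate.
segments = np.array_split(np.argsort(X[:, 0]), 4)
full_model = fit(X, y)
segment_models = [fit(X[s], y[s]) for s in segments]

def drift_indicator(X_new):
    """Mean disagreement between the full model and the segment models.
    Large values suggest X_new lies outside the training distribution."""
    p_full = predict(full_model, X_new)
    return float(np.mean([np.sqrt(np.mean((predict(b, X_new) - p_full) ** 2))
                          for b in segment_models]))

d_in = drift_indicator(rng.uniform(-1, 1, size=(50, 1)))   # inside training range
d_out = drift_indicator(rng.uniform(9, 10, size=(50, 1)))  # far outside it
```

The key property is that the indicator requires no ground-truth target values for the new inputs, only the trained models.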

The relationship between the concept drift indicator and observed RMSE in a given segment in KU18 (circles) and KA20 (crosses) with

A map showing the concept drift indicator

In order to keep the results comparable to Sect.

In Fig.

Another example of concept drift detection is given by testing Drifter with a sub-optimally performing model. If the same procedure is repeated for the log-linear model with all of the surrogate features, rather than with the set selected by forward selection, the model overfits. Unlike with the models above, we see from Fig.

This study demonstrates that machine-learning methods trained with LES data can be used to model street-level pollutant concentrations in a city-boulevard-type urban neighbourhood. The accuracy of the models is explored with an independent evaluation dataset to ensure their applicability in urban planning for new, similar types of neighbourhoods.

The log-linear regression has the greatest potential for replicating LES results even with a relatively small amount of data. It also has much potential for helping to understand which urban features govern local pollutant concentrations. The kernelized methods, SVR and GPR, show moderate performance and generally capture the local mean concentration well. However, all three have trouble representing smaller-scale details in the concentration fields linked to turbulence. Furthermore, all models perform worse on the boulevard than in its surroundings. Still, these models beat the dummy model by a notable margin (e.g. RMSE

No previous studies exist on applying LES air pollution data to train a machine-learning model, and therefore direct comparison is infeasible. Also, comparing to studies applying spatial air quality measurements

A downside of the developed models is that, when applied to a new city plan, they require the plan to lead to a statistical distribution of pollutant concentrations similar to that of the training data, which means that the models have to cope with moderate amounts of concept drift. We can, however, detect such drift to avoid potentially false predictions that may occur when applying the models to data that do not follow the training distribution. Yet another limitation is the small number of simulation setups in the data: there are only eight (i.e. four city plans and two wind directions) to train with, which rules out methods requiring large annotated training datasets, such as some applications of deep learning. With more data, the models could potentially reach a substantially higher level of accuracy. Another way to improve the accuracy would be to construct even better, more expressive features. If very accurate results are needed, running a new LES is still the best approach. Still, the model predictions are accurate enough for many purposes, such as studying pollutant exposure and supporting urban planning.

The developed methods can be used to further probe new meteorological conditions and city plans. In a future study, these models could help us understand how simple changes to the layout of a city plan, e.g. a new building, affect local air pollutant concentrations. Eventually, similar models could also be used to understand complicated phenomena in simple urban areas.

To conclude, we have explored how different machine-learning models can emulate air pollutant concentrations simulated using LES. We use LESs conducted over two different boulevard-type street canyons to study the impact of building-block layouts on air pollutant concentrations. We examine the performance of 10 machine-learning methods that use site-specific features to predict surface-level concentrations over the neighbourhoods. A total of 20 features are determined from the LES inputs and outputs. The results show that the studied machine-learning methods are able to reproduce the mean pollutant concentrations. Further, concept drift detection is used to identify areas where the model cannot be trusted and more simulation runs may be needed.

Detailed descriptions of features in Table

The code to reproduce the regression models is available at

The supplement related to this article is available online at:

LJ, MK, KP and EO designed the concept of the study. ML and HS pre-processed the LES outputs and prepared and conducted the machine-learning simulations and statistical analyses with contributions from EO, RS and KP. LJ and MK provided expert advice on the LES model inputs and outputs. All co-authors participated in writing the manuscript.

The authors declare that they have no conflict of interest.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

For financial support we would like to thank the Academy of Finland (profiling action 3 and decisions 326280 and 326339), Helsinki Institute for Information Technology HIIT, the Doctoral Programme in Atmospheric Sciences (ATM-DP), and the Doctoral Programme in Computer Science (DoCS) at the University of Helsinki, and SMart URBan Solutions for air quality, disasters and city growth (SMURBS, no. 689443) funded by ERA-NET-Cofund project under ERA-PLANET.

This research has been supported by the Academy of Finland (grant nos. 326280, 326339, 320182), the Helsinki Institute for Information Technology HIIT, the Doctoral Programme in Atmospheric Sciences (ATM-DP), University of Helsinki, the Doctoral Programme in Computer Sciences (DoCS), University of Helsinki, and the ERA-NET-Cofund, ERA-PLANET (grant no. 689443). Open-access funding was provided by the Helsinki University Library.

This paper was edited by Adrian Sandu and reviewed by two anonymous referees.