Machine learning models to replicate large-eddy simulations of air pollutant concentrations along boulevard-type streets

Running large-eddy simulations (LES) can be burdensome and computationally too expensive from the application point of view, for example to support urban planning. In this study, regression models are used to replicate modelled air pollutant concentrations from LES in urban boulevards. We study the performance of regression models and discuss how to detect situations where the models are applied outside their training domain and their outputs cannot be trusted. Regression models from 10 different model families are trained, and a cross-validation methodology is used to evaluate their performance and to find the best set of features needed to reproduce the LES outputs. We also test the regression models on an independent testing dataset. Our results suggest that, in general, log-linear regression gives the best and most robust performance on new independent data. It clearly outperforms the dummy model, which would predict constant concentrations for all locations (mRMSE of 0.76 vs. 1.78 for the dummy model). Furthermore, we demonstrate that it is possible to detect concept drift, i.e., situations where the model is applied outside its training domain and a new LES run may be necessary to obtain reliable results. Regression models can be used to replace LES simulations in estimating air pollutant concentrations, unless higher accuracy is needed. To obtain reliable results, however, it is important to carry out the model and feature selection carefully to avoid over-fitting, and to use methods to detect concept drift.

[Figure: Simulation domains of the LES output data applied in the study: a) four city-planning alternatives V1-V4 investigated in KU18 (Kurppa et al., 2018), and b) city-boulevard scenario S1 and its surroundings studied in KA20 (Karttunen et al., 2020). Green dots illustrate trees. The city boulevard is 54 m and 58 m wide in KU18 and KA20, respectively.]

Machine learning studies applying LES data have so far been mainly restricted to turbulence closure modelling (e.g., King et al., 2018).
In this study, the application of machine learning for emulating LES outputs of local-scale air pollutant dispersion in urban areas is investigated. We use LES outputs from two different studies conducted in different boulevard-type street canyons in Helsinki. Specifically, we create appropriate features for the neighbourhoods around the street canyons from the inputs of the LES that are used to train the machine learning models, and then evaluate the performance and reliability of different algorithms. The motivation is to approximate the computationally expensive simulations with machine learning models that are faster to evaluate. The ultimate goal is to develop a model that can easily be applied to support urban planning.
This study is structured as follows: First, the LES datasets and feature construction are described. Then brief descriptions of the used machine learning models and the training and evaluation process are provided. Finally, the applications and limitations of the approach are discussed.

Forward feature selection
After calculating all potentially useful input features, a subset of features to be used with each model is selected using forward selection (Hastie et al., 2009). Forward selection is a feature selection algorithm in which a model is trained iteratively with progressively larger feature subsets. Initially, every feature is used as a single predictor to train the model, and the best performing feature is selected. The model is then re-trained using this feature along with every other feature as a second predictor, and the best performing second feature is selected. This process is repeated until either all features are selected or no additional feature improves the model. Forward selection limits the search space of all possible feature combinations considerably, and while it does not guarantee finding the globally best subset of features, it finds a local optimum. An advantage of forward selection that is relevant to this study is that it avoids selecting strongly correlated predictors, such as the different-sized convolutions of pollutant emissions (Table 1).
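As an illustration, forward selection with a cross-validation RMSE criterion can be sketched as follows. This is a minimal sketch assuming an ordinary-least-squares base model and pre-computed cross-validation folds; the function names are ours, not from the study's code.

```python
import numpy as np

def cv_rmse(X, y, cols, folds):
    """RMSE of an ordinary-least-squares fit, aggregated over CV folds."""
    errs = []
    for train_idx, val_idx in folds:
        # design matrices with an intercept column
        A = np.c_[np.ones(len(train_idx)), X[np.ix_(train_idx, cols)]]
        coef, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
        B = np.c_[np.ones(len(val_idx)), X[np.ix_(val_idx, cols)]]
        errs.append(B @ coef - y[val_idx])
    e = np.concatenate(errs)
    return float(np.sqrt(np.mean(e ** 2)))

def forward_selection(X, y, folds):
    """Greedily add the feature that most reduces the cv-RMSE."""
    remaining, selected = list(range(X.shape[1])), []
    best_err = np.inf
    while remaining:
        scores = [(cv_rmse(X, y, selected + [j], folds), j) for j in remaining]
        err, j = min(scores)
        if err >= best_err:          # no additional feature improves the model
            break
        best_err = err
        selected.append(j)
        remaining.remove(j)
    return selected, best_err
```

The greedy loop evaluates at most p + (p-1) + ... candidate subsets instead of all 2^p combinations, which is what limits the search space in practice.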

The best performing features are selected according to a selection criterion; here, the criterion is the cross-validation root mean squared error (RMSE). Cross-validation is a technique for evaluating the generalisation performance of statistical models and for detecting over-fitting. In cross-validation, the data are split into k random subsets, of which k − 1 are used to train the model and one is used to validate it. Due to the spatial auto-correlation in the LES data, a random split by sampling data points from all maps does not lead to statistically independent subsets. In order to ensure maximal independence between the training and validation data, the random split is performed at the level of city blocks. Each cross-validation split is trained with three different city plans under a given wind direction and validated with the fourth city plan under the other wind direction. This is repeated for all four city plans using both wind directions. As an example, one split uses city plans V1, V2 and V3 with the wind direction 90° as training data, and V4 with the wind direction 225° as validation data. Using each city plan with a given wind direction as unique validation data results in eight splits. The aggregated cross-validation error is then calculated as the error of all combined predictions of all splits.
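The split logic described above can be enumerated explicitly. A minimal sketch; the plan and wind-direction labels follow the text, while the function name is ours:

```python
from itertools import product

plans = ["V1", "V2", "V3", "V4"]
winds = [90, 225]

def les_cv_splits():
    """Enumerate the eight block-level CV splits: train on three city plans
    under one wind direction, validate on the held-out plan under the other."""
    splits = []
    for val_plan, val_wind in product(plans, winds):
        train_wind = winds[1] if val_wind == winds[0] else winds[0]
        train = [(p, train_wind) for p in plans if p != val_plan]
        splits.append((train, (val_plan, val_wind)))
    return splits
```

Each of the 4 plans × 2 wind directions appears exactly once as validation data, giving the eight splits mentioned in the text.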

Model descriptions
The applicability of ten common regression models trained on KU18 is examined, from the simplest linear model to the powerful support vector regression (SVR) model (Table 2).

Linear models Four generalised linear models are considered: linear regression, log-linear regression (Benoit, 2011), Poisson regression, and zero-inflated Poisson regression (Lambert, 1992). Generalised linear models model the target variable as a function of linear combinations of features. Linear models are relatively simple, which limits their flexibility but makes them more interpretable. For example, if the features are normalised, then the regression coefficient of a feature communicates how much a change of one unit affects the mean of the target variable, given that all other features are constant.
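As an illustration of the log-linear idea, one can fit ordinary least squares to log(pc + 1) and back-transform the predictions. This is a sketch assuming a +1 offset to handle zero concentrations; the study's own implementation may differ:

```python
import numpy as np

def fit_log_linear(X, y):
    """OLS fit on log(y + 1); the offset handles zero concentrations."""
    A = np.c_[np.ones(len(X)), X]
    coef, *_ = np.linalg.lstsq(A, np.log1p(y), rcond=None)
    return coef

def predict_log_linear(coef, X):
    """Back-transform predictions to the original concentration scale."""
    A = np.c_[np.ones(len(X)), X]
    return np.expm1(A @ coef)
```

Because the exponential back-transform is bounded below, the predictions cannot become strongly negative, unlike with plain linear regression.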

Tree-based models Three tree-based models are considered: decision trees, random forest, and gradient boosting with decision trees (Hastie et al., 2009). Decision trees model the target variable using simple if-else rules, which makes them interpretable. Random forest is an ensemble method that aggregates the predictions of multiple decision trees trained on random subsets of data and features. Gradient boosting is also an ensemble method, in which multiple decision trees are trained sequentially on the results of previous trees, correcting their weaknesses. Ensemble methods achieve high prediction accuracy by pooling the predictions of multiple models. In particular, the random forest is among the best performing regression models (Fernández-Delgado et al., 2014). However, the complexity of ensemble models means that they are not interpretable.
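The bootstrap-aggregation idea behind the random forest can be illustrated with a toy ensemble of single-split regression "stumps". This is a deliberately simplified sketch; real random forests use deep trees and random feature subsets:

```python
import numpy as np

def fit_stump(x, y):
    """Best single-split regression 'tree' on one feature."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, xs[0], ys.mean(), ys.mean())
    for i in range(1, len(xs)):
        left, right = ys[:i].mean(), ys[i:].mean()
        sse = ((ys[:i] - left) ** 2).sum() + ((ys[i:] - right) ** 2).sum()
        if sse < best[0]:
            best = (sse, xs[i], left, right)
    _, thr, left, right = best
    return thr, left, right

def predict_stump(stump, x):
    thr, left, right = stump
    return np.where(x < thr, left, right)

def fit_forest(x, y, n_trees=30, seed=0):
    """Bootstrap-aggregated stumps: a toy analogue of a random forest."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(x), len(x))   # bootstrap resample
        stumps.append(fit_stump(x[idx], y[idx]))
    return stumps

def predict_forest(stumps, x):
    """Pool the ensemble by averaging the member predictions."""
    return np.mean([predict_stump(s, x) for s in stumps], axis=0)
```

Averaging over the bootstrapped members is exactly the pooling step that gives ensembles their accuracy, at the cost of interpretability.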

Support vector regression (SVR, Cristianini and Shawe-Taylor, 2000) is a powerful regression method that implicitly transforms the data into a higher-dimensional feature space. This enables SVR to model interactions and conditional dependencies between features, and hence utilise more information compared to simpler models. However, as with the complex tree models, the complexity of SVR leads to difficulties in understanding how the relations in the data are utilised by the model to estimate pc. In addition to standard SVR, a log-transformed SVR is also used to enforce positive predictions for pc.
Gaussian process regression (GPR, Murphy, 2012) is a non-parametric approach to regression. GPR is a Bayesian method that uses a Gaussian process with a known covariance as a prior to infer the posterior predictive distribution of the unobserved values. It can be considered a Bayesian alternative to other kernel methods, such as the SVR (Murphy, 2012). We use a squared exponential kernel as the covariance matrix for the model. This results in points having similar predicted values if they are close to each other in the feature space. The generally good performance of the GPR is also complemented by its built-in ability to account for uncertainty.
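A minimal GPR sketch with a squared exponential kernel, assuming a zero-mean prior and a small noise term; the hyperparameter values are purely illustrative:

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0, var=1.0):
    """Squared exponential covariance between two sets of points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ell ** 2)

def gpr_predict(Xtr, ytr, Xte, noise=1e-4, ell=1.0):
    """Posterior mean and variance of a zero-mean GP at the test points."""
    K = sq_exp_kernel(Xtr, Xtr, ell) + noise * np.eye(len(Xtr))
    Ks = sq_exp_kernel(Xte, Xtr, ell)
    Kss = sq_exp_kernel(Xte, Xte, ell)
    alpha = np.linalg.solve(K, ytr)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)
```

The posterior variance is the "built-in ability to account for uncertainty" mentioned above: it is near the noise level at training points and grows back towards the prior variance far from the data.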

Dummy model A dummy model is used as a baseline for reference. The dummy model simply predicts the mean pc of the training data, regardless of any features.

Performance measure
The root mean squared error (RMSE) is a standard performance measure for model evaluation (Rybarczyk and Zalakeviciute, 2018). Here, a modified version of the RMSE is used due to the different scales of KU18 and KA20. We define the multiplicative minimum-RMSE based on a linear transformation of the predictions as

$\mathrm{mRMSE}(\hat{p}_c, p_c) = \min_{a \in \mathbb{R}} \mathrm{RMSE}(a \cdot \hat{p}_c, p_c),$

where $p_c$ is a vector of the observed pollutant concentrations and $\hat{p}_c$ is a vector with the corresponding predictions. Using mRMSE is equivalent to using RMSE after scaling the pollutant concentrations with a multiplicative factor $a$ that minimises the RMSE.
In addition to mRMSE, two other performance measures are presented in the results tables. The cross-validation RMSE (cv-RMSE, described in Section 2.3) is presented as a performance measure on the training data, using the optimal set of features selected with forward selection. The scaled bias is defined as $\mathrm{mean}(a \cdot \hat{p}_c - p_c)$, where $a$ is the same multiplicative factor as in mRMSE. The scaled bias shows whether models over- or underestimate the pollutant concentrations.
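The minimisation over $a$ has a closed-form solution: for a least-squares objective, the optimal scaling factor is $a = (\hat{p}_c \cdot p_c)/(\hat{p}_c \cdot \hat{p}_c)$. A minimal sketch of both measures:

```python
import numpy as np

def mrmse(pred, obs):
    """Multiplicative minimum-RMSE: RMSE after optimally rescaling predictions."""
    a = float(pred @ obs) / float(pred @ pred)   # closed-form minimiser of RMSE(a*pred, obs)
    rmse = float(np.sqrt(np.mean((a * pred - obs) ** 2)))
    return rmse, a

def scaled_bias(pred, obs):
    """Mean of (a*pred - obs), using the same optimal scaling factor a."""
    _, a = mrmse(pred, obs)
    return float(np.mean(a * pred - obs))
```

A model whose predictions are correct up to a constant multiplicative factor thus achieves mRMSE = 0 and zero scaled bias.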
Note that model performance on spatial data cannot be meaningfully summarised into a single performance measure. The spatial distribution of prediction errors can be examined using residual plots (Figure 5).

This section describes the training process of the models with features selected using forward selection. Based on the training results, the best performing regression models are selected and summarised. The models are subsequently evaluated on an independent dataset (KA20). Lastly, the reliability of model predictions is assessed with a concept drift detection algorithm for cases when LES results are not available.

Model training
For each model, forward feature selection on KU18 provides the optimal set of features (as described in Section 2.3). Figure 4 shows the results of feature selection: features are iteratively added (x-axis) and the cv-RMSE is computed (y-axis). For each model, the features are selected such that the cv-RMSE is minimised. In order to limit the computation time of feature selection, SVR and GPR were trained on a random subset of 2,500 data points (out of the total 472,991), rather than the whole cross-validation split.

After feature selection, each model is trained on the whole training data (all KU18 city plans), using the optimal features from feature selection. SVR and GPR are, as in feature selection, only trained on 2,500 randomly selected data points.
SVR and GPR have additional parameters, whose optimal values were computed using grid search. Grid search is a model tuning technique in which the model is trained using all parameter values on a discrete grid of the parameter space. The selection criterion for the optimal parameter values was cv-RMSE (as with feature selection).
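Grid search with a cv-RMSE selection criterion can be sketched as follows. For brevity, the illustration tunes a ridge-regression penalty rather than the SVR parameters actually tuned in the study; the penalty here is applied to all coefficients including the intercept for simplicity:

```python
import numpy as np

def cv_rmse_ridge(X, y, alpha, folds):
    """cv-RMSE of ridge regression with penalty strength alpha."""
    errs = []
    for tr, va in folds:
        A = np.c_[np.ones(len(tr)), X[tr]]
        # closed-form ridge solution (penalises all coefficients for simplicity)
        coef = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y[tr])
        B = np.c_[np.ones(len(va)), X[va]]
        errs.append(B @ coef - y[va])
    e = np.concatenate(errs)
    return float(np.sqrt(np.mean(e ** 2)))

def grid_search(X, y, folds, alphas):
    """Evaluate every grid point and keep the one with the lowest cv-RMSE."""
    scores = {a: cv_rmse_ridge(X, y, a, folds) for a in alphas}
    best = min(scores, key=scores.get)
    return best, scores
```

With several parameters (e.g. the SVR cost and kernel width), the same loop runs over every combination on the discrete grid.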

Note that some models allow for negative predictions (negative pollutant concentrations), which is physically impossible.
One approach for avoiding negative predictions is to clip model outputs to a minimum of zero after prediction. For the purposes of this study, however, negative predictions are retained, since their magnitude is relevant for error estimation.

Model selection
Out of the ten models described in Section 2.4, three are selected as the best performing for replicating the LES outputs: logarithmic support vector regression, Gaussian process regression, and log-linear regression. Table 3 compares the three selected models. Table 4 lists the performance of all ten models with respect to cross-validation RMSE, mRMSE, and scaled bias (described in Section 2.5). Performance evaluation at this stage is based on the cross-validation errors computed during forward feature selection on KU18. Final model evaluation on an independent dataset (KA20) is performed in Section 3.3.
The three best performing models are:

Logarithmic support vector regression Requiring low-dimensional training data means that the user needs to provide fewer features with their data, minimising the expense of data preparation. A further advantage is that the log transformation ensures non-negative predictions for pc. Although the standard SVR does not ensure non-negative predictions, it requires the same number of features and offers almost the same RMSE.
Gaussian process regression The RMSE of the GPR is close to that of the log-SVR, making it one of the strongest models as well. In addition, it has previously been used to predict simulator outputs (Gómez-Dans et al., 2016).
Log-linear regression As a linear model, the log-linear regression is useful if model interpretability is required (for more details on its interpretability, see Section 2.4). What separates the log-linear regression from the other linear models is that it uses the fewest features, while also ensuring positive predictions through the log-transformation, similar to the logarithmic SVR. It works as a simpler counterpart to the more powerful methods.

Model evaluation
As a final test, the models are evaluated on an independent dataset to obtain a more accurate estimate of their performance in a real-world urban planning situation. The models cannot be evaluated solely on the training data, since model over-fitting would not be detected and the error estimation in real-world situations would be poor. For the evaluation, we use the models trained on both wind directions and all city plans of KU18 (as described in Section 3.1) and evaluate them on both wind directions of KA20.
We use the mRMSE and the scaled bias defined in Section 2.5 for KA20, with the results for PM2.5 concentrations listed in Table 4. Due to the scaling, the mRMSE does not allow direct comparisons between the errors on KU18 and KA20. On the evaluation data, we can however compare the models to the dummy model and see that, with the exception of the Poisson regressions, all models clearly outperform the dummy model, although the performance is more varied than with the training data. The best performing model is the log-linear model (mRMSE = 0.76). The log-SVR and GPR also performed well (mRMSE = 0.87 and mRMSE = 0.91, respectively) when compared to the dummy model (mRMSE = 1.78).
Out of the selected models, the log-linear model is preferable. Not only does it perform the best of the three selected models, but it is also the simplest and runs the fastest. Its simplicity is likely the reason for its good performance given the notable differences between the training and evaluation data. Although the log-SVR and GPR both perform better than the average model, their performance on the evaluation data is worse relative to the cross-validation results. This could be a sign of slight over-fitting to the training data, or of sensitivity to the different data distribution in the testing dataset.

The interpretation of the scaled bias is not straightforward due to the scaling, but it shows that all of the models overestimate pc in the evaluation data with the SVR overestimating the least. The dummy model is, unsurprisingly, the most biased.
The scaled residuals for the selected models are shown in Figure 5. At a glance they may seem similar for the different models, but there are differences between the predictions, as also indicated by the different mRMSE values. The differences are due to the dissimilar ways of building a model and the features selected in Section 2.3. All three selected models acquire most of their error on the boulevard, while the outskirts are predicted more consistently. This is not a surprise, since pc is much lower outside the boulevard. The figure also shows that the log-linear model has more balanced residuals, while the more complex SVR and GPR achieve lower residuals on the outskirts but perform relatively poorly on the boulevard. The residuals also show that none of the models is fully capable of capturing details in the pollutant dispersion that arise due to fine-scale flow patterns. In Figure 5, the wind is coming from the left, and the residuals are calculated after scaling with the respective optimal a used in the mRMSE calculation. Notice that although the residuals look similar, the predictions have notable differences. The scaled pc ranges up to a maximum of 7.28. The residuals can also be contrasted with the error of the dummy model, which is 1.78.

Concept drift detection
It is important to know whether the results obtained generalise to different city plans. Originally, Drifter has been used for time series data. We adapt it for spatial data here by selecting the segments of the training data not as temporally close-by data points, but as spatially close-by points. We hence divide the map into k × k squares that are used as the segments. An example of such a segment can be seen in Fig. 6. Drifter is run for all the models considered. In order to keep the results comparable with Section 3, we also scale the evaluation dataset (KA20) by multiplying it with the constant that minimises the RMSE. If the evaluation data is not segmented, this is equivalent to using the same multiplicative minimum-RMSE error measure as in Section 3. This decision does not affect the concept drift indicator, but it makes the simulation error of the training and evaluation data more comparable. We train Drifter using all eight simulation set-ups of KU18 (i.e., four city plans and two wind directions) and evaluate it on both wind directions of KA20. We use 100 m × 100 m squares as our segments in the training data, with each square sharing a 25 % overlap with four other training segments, and 25 m × 25 m squares in the evaluation data with no overlap. The concept drift indicator d(x) is chosen to be the tenth smallest error estimate.
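The segment-based indicator idea can be sketched roughly as follows. This is a loose illustration using linear segment models, in which the indicator is the k-th smallest disagreement between a full model and per-segment models on the new data; the actual Drifter algorithm differs in detail:

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary-least-squares fit with an intercept."""
    A = np.c_[np.ones(len(X)), X]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.c_[np.ones(len(X)), X] @ coef

def drift_indicator(segments, full_model, X_new, k=10):
    """Concept drift indicator: k-th smallest disagreement between the full
    model and models fitted on individual training segments."""
    full_pred = predict_ols(full_model, X_new)
    estimates = []
    for Xs, ys in segments:
        seg_model = fit_ols(Xs, ys)
        seg_pred = predict_ols(seg_model, X_new)
        estimates.append(float(np.sqrt(np.mean((seg_pred - full_pred) ** 2))))
    return sorted(estimates)[min(k, len(estimates)) - 1]
```

When the new data lie inside the training domain, the segment models agree with the full model and the indicator stays small; far outside the domain, the segment models extrapolate differently and the indicator grows.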

The overlapping scheme used is the same as in the original article, but since the data are two-dimensional, it is applied in both dimensions (50 % · 50 % = 25 %). These parameters were chosen with a grid search for the log-linear model with all features (a situation exhibiting concept drift).
In Figure 7a-c we show the concept drift indicator value and the RMSE for each 25 m × 25 m segment in the evaluation data, alongside the same-sized segments in the training data, for our final models. For all three models, the evaluation-data concept drift indicator is indeed correlated with its RMSE and can thus be used to estimate it. We also notice that for all three models the evaluation segments lie in the same area as the training segments, indicating a lack of concept drift. In addition, the concept drift indicator values are larger on the boulevard, which corresponds to the fact that the RMSE is indeed larger there. Therefore, the concept drift indicator is useful in detecting areas of large RMSE even when the ground truth (LES output) is not known.