Submitted as: model description paper 31 Aug 2020

Review status: this preprint is currently under review for the journal GMD.

Machine learning models to replicate large-eddy simulations of air pollutant concentrations along boulevard-type streets

Moritz Lange1, Henri Suominen1, Mona Kurppa2, Leena Järvi2,3, Emilia Oikarinen1, Rafael Savvides1, and Kai Puolamäki1,2
  • 1Department of Computer Science, University of Helsinki, Finland
  • 2Institute of Atmospheric and Earth System Research (INAR)/Physics, Faculty of Science, University of Helsinki, Finland
  • 3Helsinki Institute of Sustainability Science, Faculty of Science, University of Helsinki, Finland

Abstract. Running large-eddy simulations (LES) can be burdensome and computationally too expensive for practical applications such as supporting urban planning. In this study, regression models are used to replicate modelled air pollutant concentrations from LES in urban boulevards. We study the performance of the regression models and discuss how to detect situations where the models are applied outside their training domain and their outputs cannot be trusted. Regression models from 10 different model families are trained, and a cross-validation methodology is used to evaluate their performance and to find the best set of features needed to reproduce the LES outputs. We also test the regression models on an independent testing dataset. Our results suggest that, in general, log-linear regression gives the best and most robust performance on new independent data. It clearly outperforms the dummy model, which predicts constant concentrations for all locations (mRMSE of 0.76 vs. 1.78 for the dummy model). Furthermore, we demonstrate that it is possible to detect concept drift, i.e., situations where the model is applied outside its training domain and a new LES run may be necessary to obtain reliable results. Regression models can be used to replace LES in estimating air pollutant concentrations, unless higher accuracy is needed. To obtain reliable results, however, it is important to do the model and feature selection carefully to avoid over-fitting and to use methods to detect concept drift.
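The comparison between a log-linear regression and a constant "dummy" predictor under cross-validation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic rather than LES output, a single feature is used, the folds are a plain random k-fold split, and the error is an ordinary RMSE on log concentrations, which is not necessarily the mRMSE reported in the abstract. The function names are hypothetical and do not come from the authors' software.

```python
import math
import random


def fit_log_linear(x, y):
    """Least-squares fit of log(y) = a + b * x for a single feature x."""
    log_y = [math.log(v) for v in y]
    mean_x = sum(x) / len(x)
    mean_ly = sum(log_y) / len(log_y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_ly) for xi, li in zip(x, log_y))
    slope = sxy / sxx
    return mean_ly - slope * mean_x, slope


def rmse_log(y_true, y_pred):
    """RMSE between log concentrations (an illustrative metric only)."""
    sq = [(math.log(t) - math.log(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


def cross_validate(x, y, k=5, seed=0):
    """Random k-fold CV; returns mean RMSE of the model and of the dummy."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    model_err, dummy_err = [], []
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        a, b = fit_log_linear([x[j] for j in train], [y[j] for j in train])
        y_test = [y[j] for j in test]
        pred = [math.exp(a + b * x[j]) for j in test]
        # Dummy model: one constant (geometric mean of the training targets)
        const = math.exp(sum(math.log(y[j]) for j in train) / len(train))
        model_err.append(rmse_log(y_test, pred))
        dummy_err.append(rmse_log(y_test, [const] * len(test)))
    return sum(model_err) / k, sum(dummy_err) / k


# Synthetic stand-in data (NOT LES output): log concentration linear in x
rng = random.Random(42)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [math.exp(1.0 + 0.8 * xi + rng.gauss(0.0, 0.2)) for xi in x]

model_rmse, dummy_rmse = cross_validate(x, y)
print(f"log-linear: {model_rmse:.2f}, dummy: {dummy_rmse:.2f}")
```

Scoring both predictors on the same held-out folds is what makes the dummy a meaningful baseline: a regression model is only worth using if its held-out error is clearly below that of the constant prediction.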

Status: open (extended)

Data sets

Input data for article "Large eddy simulation of the optimal street-tree layout for pedestrian-level aerosol particle concentrations" Sasu Mikael Karttunen and Mona Liisa Vilhelmiina Kurppa

Model code and software

Datasets of Air Pollutants on Boulevard Type Streets and Software to Replicate Large-Eddy Simulations of Air Pollutant Concentrations Along Boulevard-Type Streets Moritz Lange, Henri Suominen, Mona Kurppa, Leena Järvi, Emilia Oikarinen, Rafael Savvides, and Kai Puolamäki


Short summary
This study aims to replicate computationally expensive high-resolution large-eddy simulations (LES) with regression models to simulate urban air quality and pollutant dispersion. The model development, including feature selection, model training and cross-validation, and detection of concept drift, has been described in detail. Of the models applied, log-linear regression shows the best performance. A regression model can replace LES unless high accuracy is needed.