Submitted as: development and technical paper
06 Jul 2022
Status: this preprint is currently under review for the journal GMD.

Development of a regional feature selection-based machine learning system (RFSML v1.0) for air pollution forecasting over China

Li Fang1, Jianbing Jin1, Arjo Segers2, Hai Xiang Lin3,4, Mijie Pang1, Cong Xiao5, Tuo Deng4, and Hong Liao1
  • 1Jiangsu Key Laboratory of Atmospheric Environment Monitoring and Pollution Control, Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, School of Environmental Science and Engineering, Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China
  • 2TNO, Department of Climate, Air and Sustainability, The Netherlands
  • 3Institute of Environmental Sciences, Leiden University, The Netherlands
  • 4Delft Institute of Applied Mathematics, Delft University of Technology, Delft, the Netherlands
  • 5Key Laboratory of Petroleum Engineering, Ministry of Education, China University of Petroleum, Beijing, China

Abstract. With the explosive growth of atmospheric data, machine learning models have achieved great success in air pollution forecasting because of their higher computational efficiency compared with traditional chemical transport models. However, in previous studies, new prediction algorithms have been tested only at individual stations or in a small region; a large-scale air quality forecasting model remains lacking to date. Huge dimensionality also means that redundant input data may lead to increased complexity and therefore to over-fitting of machine learning models. Feature selection is a key topic in machine learning development, but it has not yet been explored in atmosphere-related applications. In this work, a regional feature selection-based machine learning (RFSML) system was developed, which is capable of predicting air quality in the short term with high accuracy at the national scale. Ensemble-Shapley additive global importance analysis is combined with the RFSML system to extract significant regional features and eliminate redundant variables at an affordable computational expense. The significance of the regional features is also explained physically. Compared with a standard machine learning system fed with relative features, the RFSML system driven by the selected key features results in superior interpretability, less training time, and more accurate predictions. This study also provides insights into the difference in interpretability among machine learning models (i.e., random forest, gradient boosting, and multi-layer perceptron models).

Status: open (until 03 Sep 2022)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on gmd-2022-134', Anonymous Referee #1, 03 Aug 2022
  • RC2: 'Comment on gmd-2022-134', Anonymous Referee #2, 10 Aug 2022

Total article views: 284 (including HTML, PDF, and XML)
  • HTML: 220
  • PDF: 57
  • XML: 7
  • Total: 284
  • Supplement: 20
  • BibTeX: 2
  • EndNote: 2
Views and downloads (calculated since 06 Jul 2022)

Viewed (geographical distribution)

Total article views: 253 (including HTML, PDF, and XML), of which 253 have a defined geography and 0 are of unknown origin.
Latest update: 10 Aug 2022
Short summary
This study proposes a regional feature selection-based machine learning system to predict short-term air quality in China. The system includes a tool that determines the importance of input data for better prediction. It provides large-scale air quality prediction with improved interpretability, lower training costs, and higher accuracy compared with a standard machine learning system. It can act as an early warning for citizens and reduce exposure to PM2.5 and other air pollutants.