Development of a regional feature selection-based machine learning system (RFSML v1.0) for air pollution forecasting over China

. With the explosive growth of atmospheric data, machine learning models have achieved great success in air pollution forecasting because of their higher computational efﬁciency than the traditional chemical transport models. However, in previous studies, new prediction algorithms have only been tested at stations or in a small region; a large-scale air quality forecasting model remains lacking to date. Huge dimensionality also means that redundant input data may lead to increased complexity and therefore the over-ﬁtting of machine learning models. Feature selection is a key topic in machine learning development, but it has not yet been explored in atmosphere-related applications. In this work, a regional feature selection-based machine learning (RFSML) system was developed, which is capable of predicting air quality in the short term with high accuracy at the national scale. Ensemble-Shapley additive global importance analysis is combined with the RFSML system to extract signiﬁcant regional features and eliminate redundant variables at an affordable computational expense. The signiﬁcance of the regional features is also explained physically. Compared with a standard machine learning system fed with relative features, the RFSML system driven by the selected key features results in superior interpretability, less training time, and more accurate predictions. This study also provides insights into the difference in interpretability among machine learning models (i.e., random forest, gradient boosting, and multi-layer perceptron models)


Introduction
With ongoing economic development and modern industrialization, the subsequent air pollution poses serious threats to resident health (Liu and Diamond, 2005;Li et al., 2014). After tobacco and high blood pressure, air pollution has ranked third in risk factors for death and disability in China over the past few decades (Murray et al., 2020). The primary air pollutants in China are particulate matter (PM), sulfur dioxide (SO 2 ), carbon monoxide (CO), nitrogen oxides (NO x ) and ozone (O 3 ) (Song et al., 2017b). PM 2.5 or respirable PM in air with an aerodynamic diameter below 2.5 µm is the primary air pollutant, and it has attracted considerable attention from researchers (Zhai et al., 2019). Exposure to either long-term or short-term PM 2.5 is related to respiratory symptoms, lung disease, cardiovascular disease, premature death, and other adverse health effects (Pui et al., 2014;Di et al., 2019). Burnett et al. (2018) and Song et al. (2017a) reported that PM 2.5 pollution in winter, particularly in northern China, is severe. It accounted for 15.5 % (1.7 million) of all deaths in China in 2015, despite an improvement in air quality since 2013. In recent studies, the global exposure mortality model has estimated that 140 200 premature deaths from 2015 to 2019 can be attributed to long-term exposure to PM 2.5 (Hao et al., 2021). An accurate air quality forecast (e.g., forecasting PM 2.5 ) is therefore valuable to policy makers and health professionals for epidemiological control (Xue et al., 2019).
In addition, it can provide an early warning for residents, particularly for children, the elderly, and people with respiratory or cardiovascular problems .
The development of an air pollution forecasting model is possible, as atmospheric chemistry and physical rules have been explored and are now understood in depth (Sun and Li, 2020b). In addition, our ever-increasing computational power can support the complex and heavy computational tasks required for this type of model (Reichstein et al., 2019). Deterministic models, such as chemical transport models (CTMs), and data-driven methods are commonly employed in forecasting (Cobourn, 2010;Xu et al., 2021). In several studies, air pollution forecasting has been performed using mainstream air quality CTMs, such as the Weather Research and Forecasting model with Chemistry (Grell et al., 2005;Zhou et al., 2017), Community Multiscale Air Quality model , and GEOS-Chem (Bey et al., 2001;Jeong and Park, 2018). These CTMs can reproduce real atmospheric situations (Hutzell and Luecken, 2008;Shtein et al., 2020); however, they exhibit several shortcomings. One of the most difficult setbacks is the high uncertainty in emission inventories (Huang et al., 2021), which is a great challenge given the variety of contributing sources, complexity of the spatial-temporal profiles, and lack of reliable in situ measurements (M. . Additionally, an idealistic deterministic model requires a delicate and thorough understanding of physical and chemical processes in the atmosphere (Sun and Li, 2020a) and an enormous computational capacity to resolve fine-scale variabilities. Therefore, CTMs alone have failed to meet the requirements for an effective air quality early warning system.
In contrast to CTMs, data-driven methods that do not require a profound knowledge of the complex composition or structure of the atmosphere are also widely utilized in atmospheric modeling (Ziomas et al., 1995;Xi et al., 2015). Many of these methods have been employed for air pollution forecasting, including multiple linear regression (Sawaragi et al., 1979), nonlinear regression models such as principal component regression (Shishegaran et al., 2020), hidden Markov models (Sun et al., 2013), support vector machine (Abu Awad et al., 2017), and artificial neural networks (Fernando et al., 2012). Of these methods, machine learning models have gained the greatest popularity because of their capacity to learn complex and nonlinear relationships by assimilating "big" training datasets (Masih, 2019;Leufen et al., 2021). Machine learning has brought great opportunities and challenges to the geophysical research community (Yu and Ma, 2021).
With the explosive growth of data in earth science, the superiority of machine learning for massive data applications has become increasingly prominent (Reichstein et al., 2019). The most representative example is to perform predictions for a target site using a machine learning model trained via a long-term series of in situ historical measurements. The China Ministry of Environmental Protection (MEP) has es-tablished many ground-based stations measuring the primary pollutants since 2013 (Zhai et al., 2019). At present, the monitoring network comprises more than 1500 field stations covering all of China, as can be seen in Fig. 1. The richness of air quality observations from the monitoring network provides valuable training data and stimulates the development of machine learning air quality forecasting in China (Xu et al., 2021). Previous studies on air pollution forecasting in China have utilized various machine learning models with the ground-based MEP air quality dataset. For example, X.  utilized a long short-term memory (LSTM) neural network extended model to predict PM 2.5 concentrations at a maximum forecast horizon of 24 h for air quality monitoring stations in Beijing, China. Wu et al. (2020) proposed a composite prediction system based on an LSTM neural network to predict daily PM 2.5 / PM 10 in Wuhan. Zhang et al. (2020) established a hybrid model, integrating deep learning with multi-task learning, to predict hourly PM 2.5 concentrations in three different districts of Lanzhou. Ke et al. (2021) utilized four machine learning models to develop an air quality forecasting system that can automatically find the best model and hyperparameter combination for the next 3 d air quality forecast in seven megacities of China. These works are highly valuable in exploring novel methods of air quality prediction relative to conventional CTMs. To the best of our knowledge, the aforementioned studies on air pollution forecasting solely focused on a few monitoring stations, typical cities, or small regions, while national-level air quality predictions remain lacking. The challenges in national-level forecasting include substantial temporal and spatial variances (Song et al., 2017b) in air pollution and enormous computational power requirements.
The curse of dimensionality is a common obstruction in modeling; i.e., an increasing amount of input data leads to rapidly increasing complexity, and prediction algorithms are susceptible to over-fitting (Rodriguez-Galiano et al., 2012). Therefore, considerable research has focused on reducing the dimensionality of input data by selecting only significant variables and eliminating redundancy. The methods of this research can be classified into three categories: the filter method (e.g., a correlation matrix using the Pearson Correlation), wrapper method (e.g., recursive feature elimination), and embedded method (e.g., Lasso regularization) (Chandrashekar and Sahin, 2014). These methods can reduce the adverse effects of irregular variables or noise while retaining prediction performance (Guyon and Elisseeff, 2003). They also save computing resources for model training. However, in previous studies on air quality forecasting, filter methods such as Pearson correlation coefficients or the maximal information coefficient (MIC) (Kinney and Atwal, 2014) were commonly utilized for input selection. These input selection methods can help improve the performance of machine learning models; however, they all have serious limitations. For example, universal meteorological variables that highly correlate with PM 2.5 in a large region of China are difficult to find using Pearson correlation coefficients because they can vary substantially both spatially and temporally (Zhai et al., 2019). MIC is the most employed method for capturing linear and nonlinear correlations between variable pairs (Chen et al., 2016). However, it cannot consider relevance and redundancy simultaneously (Sun et al., 2018). Furthermore, MIC is computationally intensive .
Machine learning algorithms are often considered "black box" models that learn the input-output relationship from immense training samples (Casalicchio et al., 2019). Many researchers have devoted enormous efforts to developing and implementing tools to interpret machine learning models. Among these tools, game-theoretic formulations of feature significance are the most widely utilized because they can capture the interactions among features (Shapley, 1952), and they may be the only solution satisfying the four "favorable and fair" axioms (Fryer et al., 2021). Several scholars have conducted in-depth studies on distinguishing feature significance based on the Shapley value (Shapley, 1952). For example, Park and Park (2021) utilized the Shapley additive explanation (SHAP) approach (Lundberg and Lee, 2017a) to interpret multiple machine learning models and found that most of the models have similar features. Golizadeh Akhlaghi et al. (2021) successfully interpreted the feature contributions of the guideless irregular dew-point cooler on the predicted parameters based on SHAP. In addition to SHAP, which explains individual predictions, Covert et al. (2020) proposed a novel method that can explain model behavior across the entire dataset (global interpretability), called Shapley additive global importance (SAGE). SHAP and SAGE both utilize the Shapley value; however, compared with SHAP, SAGE can simultaneously eliminate larger subsets of redundant features (Covert et al., 2020). Additionally, SAGE extracts features from the conditional distribution instead of the marginal distribution because the latter may lead to breaking feature dependencies and producing unlikely feature combinations (Lundberg and Lee, 2017a). Furthermore, investigating the feature importance based on model performance (Jothi et al., 2021) has been verified as a meaningful and effective approach for interpreting data-driven models and is popular in computer science (Altmann et al., 2010). However, this method has rarely been applied to air quality forecasting using machine learning tools.
In the present study, the first version of a regional feature selection-based machine learning system (RFSML v1.0) is developed. The system can predict short-term air quality with high accuracy in China. In this study, the RFSML system predicts the primary air pollutant (PM 2.5 ) concentration over every target site from the China MEP air quality monitoring network by learning its implicit trend from long-term series records. This method can be extended to other airborne pollutant predictions in future studies. SAGE analysis is adopted to interpret valuable features and exclude redundant inputs to avoid over-fitting the model during training. Because the SAGE calculations are more time-consuming than the model training, as explained in Sect. 3.1, they are not repeated for every target site but are implemented in limited ensemble sites that are randomly selected in a given region. China was divided into five densely populated regions, according to the air pollution patterns, which are consistent with the Clean Air Action target regions released by the Chinese State Council, as discussed in Sect. 2.2.3. The top three critical features in the ensemble SAGE calculations were utilized as the input features for the implicit trend model training for each site. The robustness of the regional feature selection was tested over three widely utilized machine learning models, i.e., random forest (RF), gradient boosting (GB), and multi-layer perceptron (MLP) models, and four forecasting horizons (6, 12, 18, and 24 h).
The remainder of the paper is organized as follows: the composition of the data used in this study and the preprocessing method are introduced in Sect. 2. Then, the three machine learning models and their hyperparameter choices utilized in this study are described. The principles of SAGE and the details of the SAGE ranking-based regional feature selection are described at the end of Sect. 2. In Sect. 3, the computational costs of SAGE and machine learning model training are detailed. Then, the results of feature selection in each region are presented and analyzed. The prediction performance of RFSML is evaluated and compared with that of the standard machine learning process. Finally, the conclusions and future prospects are provided in Sect. 4.

Model, data, and methods
The components of the RFSML method that are used to forecast PM 2.5 concentrations are described in the following sections.

Model domain and data
The RFSML system forecasts air pollution levels in the vicinity of a monitoring station. This forecasting uses machine learning by examining the variability in the available station datasets. The monitoring network consisted of 1588 stations, for data collected in 2019, at the locations displayed in Fig. 1. Because the station network is dense, pollution-level forecasting can be performed for nearly any location in eastern China.
The input data for machine learning consisted of hourly averaged air pollutant measurements (e.g., PM 2.5 , CO, SO 2 , and NO 2 ) from the Chinese MEP monitoring network, meteorological reanalysis data from ERA5-Land (Muñoz Sabater et al., 2021), atmospheric composition data from the Copernicus Atmosphere Monitoring Service (CAMS) reanalysis (Inness et al., 2019) provided by the European Centre for Medium-Range Weather Forecasts (ECMWF), and emission data from the Multi-resolution Emission Inventory for China (MEIC) inventory, with time factors applied at an hourly res- olution. The input data are summarized in Table 1. The variables in the datasets are correlated with and may drive the PM 2.5 concentration and are therefore useful predictors.
Data from 2018 and 2019 were used in the experiments. The first 15 690 h (from 1 January 2018 to 15 October 2019) was used for model training and cross-validation, and the actual tests were performed using the remaining 1824 h of data from 15 October to 30 December 2019. Our RFSML system can of course operate in a rolling way; additional forecasts in a less polluted period, April 2020, are performed with the models similarly trained using the recent 2-year data.

Air pollutant observations
The observed air pollutant concentrations at the stations were used as inputs (NO 2 , SO 2 , and CO) and target variables (PM 2.5 ) in the model training. The available time series of PM 10 observations was missing many values and was therefore excluded from the model. Additionally, O 3 observations were excluded because these data exhibit a diurnal cycle that substantially differs from the PM 2.5 target concentrations.
Missing data occurred for each of the studied pollutants because of equipment failure, incorrect sensor readings, and improper operation.  2019), it was shown that the observations from surrounding monitoring stations can be utilized to insert suitable values for missing data through imputation. Data imputation tools, such as cubic interpolation, have gained popularity for enhancing monotone data (Fritsch and Carlson, 1980).
In this study, the K-nearest neighbor (KNN) classification method (Zhang, 2012) and cubic imputation were combined to create an uninterrupted time series. The KNN algorithm is illustrated as Algorithm S1 in the Supplement. The KNN algorithm was implemented using the following steps: 1. Monitoring stations with over 20 % missing data were excluded from the training and prediction because such large amounts of missing data are not believed to be filled with sufficient accuracy.
2. For each station, the number of monitoring stations within a radius of 0.8 • was calculated (following the empirical choices suggested in Jin et al., 2019). If fewer than three surrounding stations were available, the station was excluded from the training and forecasting. If three or four surrounding stations were found, these were all selected, while four stations were selected randomly if more than four surrounding stations were found.
3. For each target station, a geographic inverse distance weighting technique (Bartier and Keller, 1996) was used to estimate the missing values using the observed values from the surrounding stations.
After KNN interpolation, the amount of missing data in the PM 2.5 time series was reduced to approximately 4.5 %, as illustrated in Fig. 2b. Because there were cases where nearby stations and the target station simultaneously exhibited missing data, some instances of missing values remained after KNN interpolation. Therefore, cubic imputation (Kincaid et al., 2009) was employed to insert values for the remaining missing data. Outliers generated by the cubic imputation were replaced with the minimum or maximum of the original series. A total of 1263 monitoring stations exhibited no missing data after applying interpolation. Both the mean and standard deviation of the homogenized time series were similar to those of the original data, as illustrated in Fig. S1 in the Supplement.

Air pollutant forecast product and meteorological variables
The CAMS reanalysis (Inness et al., 2019) provides threedimensional simulations of the atmospheric composition obtained by combining a global atmospheric chemistry model and observations. Therefore, it is expected to surpass pure model-based prediction accuracy. Selected concentrations of trace gases and aerosols from the CAMS reanalysis were inputs for the PM 2.5 predictor. The PM 2.5 simulations in this dataset were also used as a benchmark for the RFSML prediction. We obtained the 3-hourly reanalysis data of four air pollutant concentrations (pm2p5, no2, so2, and co), which are reanalysis data (pm2p5, no2, so2, and co) of four ground ob-servations mentioned above, in China from 2018 to 2019. The 3 h temporal resolution of the CAMS reanalysis data is firstly interpolated into 1 h resolution by cubic imputation. Then continuous time series of features at the monitoring stations are extracted from the interpolated 1 h data at a resolution of 0.75 • × 0.75 • using the nearest mapping. Meteorological variables, as illustrated in Table 2, were obtained from ERA5-Land data (Copernicus Climate Change Service (C3S), 2017) at a horizontal resolution of 0.1 • × 0.1 • and an hourly temporal resolution for 2018 and 2019. The data are available from the Climate Data Store via https://doi.org/10.24381/cds.e2161bac (Muñoz Sabater, 2021). The time series of meteorological variables used for the machine learning are extracted from this product using the nearest mapping method.

Emission inventory
MEIC, the most popular anthropogenic emission inventory in China (M. , has been validated to provide consistent aerosol precursor loading for satellite observations (Fan et al., 2018). It has been widely employed to quantify the air pollution in multi-atmosphere chemical models. The latest inventory of 2017 from MEIC version 1.3 was obtained via http://meicmodel.org/index.html (last access: October 2021) for use in this study. Based on the emission source height distribution, 24 h distribution, and  version 2 of the Regional Acid Deposition Model (Zimmermann and Poppe, 1996) chemical reaction scheme, the original monthly emission data were processed into hourly emission rates. Considering their correlation with PM 2.5 , nine pollutant species were selected as machine learning predictor inputs, as displayed in Table 3.
2.2 The RFSML system 2.2.1 System framework Figure 3 displays the framework of the proposed RFSML and a standard machine learning system. Note that a standard machine learning system refers to a machine learning system without any feature selection. Standard machine learning is conducted as follows. First, all observations and datasets related to PM 2.5 are collected, and then the missing values are interpolated into the original dataset. Next, the appropriate machine learning model is selected, and the continuous data time series is reformed into the required input structure. The model is then trained repeatedly until the appropriate hyperparameters are obtained, and finally, predictions are made with the trained model. Given an input x n that consists of individual features (x 1 , x 2 , . . ., x n ), a predictor F is utilized in a supervised learning task to predict the target variable y.
A time series regression, such as rolling forecast, can be ex-pressed as follows: where, at any instant t, the input vector storing n individual features in the previous t p h is utilized to forecast the target PM 2.5 concentrationŷ with a horizon of h h. The forecast predictor F represents the machine learning model (RF, GB, or MLP) trained using the historical data. Details on the selection of t p and h are provided in Sect. 2.2.2. As mentioned before, some features are residual, and the feature subset can provide sufficient predictive power and less noise for F. Thus, the proposed RFSML utilized SAGE to obtain the optimal feature subsets. Considering the computational efficiency, we divided the total national air quality monitoring stations into six types, each of which randomly selected the air quality monitoring stations for feature selection. Given any feature subset x s = {x 1 , x 2 , . . ., x s }, the machine learning models can be described as follows:

Machine learning models
Different machine learning algorithms have been used to forecast PM 2.5 because they can provide promising approaches to handle complex nonlinear relationships. Each algorithm exhibits advantages and drawbacks. Of the machine learning models, typical boosting (e.g., GB) and bagging (e.g., RF) algorithms are widely applied in regression analysis using a set of decision trees. Additionally, artificial neural network models (e.g., MLP) that are composed of many processing elements can successfully perform nonlinear mapping. Thus, to evaluate the robustness of the feature selection, all of the prediction algorithms mentioned above were tested in the present study. The original data in Fig. 3 were converted into a 27dimensional matrix (n = 27) after preprocessing. On the basis of the auto-correlation and partial auto-correlation results, a time step t p = 9 h was selected for the forecast. The prediction horizon h spans from 1 to 24 h. Then, the matrix was converted into supervised learning based on t p and h. The model hyperparameters (Table 4) were designed for each predicting algorithm using 10-fold cross-validation and then fit to each predicting algorithm. Note that "none" for the max depth of RF means "nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples" in Scikit learn (Pedregosa et al., 2011).

SAGE-based regional feature selection
Many methods have been utilized to investigate the significance of features for machine learning models. The gametheoretic method based on the Shapley value is the most   (Molnar, 2020), which is useful for finding the typical features of each region. There are two outstanding added values of SAGE (Covert et al., 2020). The first is its ability to remove large subsets of features because only removing individual features gives too little significance to features with sufficient proxies, such as in permutation tests. The other advantage of SAGE is its ability to select notable features from their conditional distribution instead of their marginal distribution, reducing unlikely feature combinations. Given the function W F , which represents the predictive power of a machine learning model F with subsets of features x s ⊆ x n , the SAGE algorithm can be written as follows: where means the loss function that measures the root mean squared error (RMSE) or mean absolute error (MAE);ŷ is the prediction from F; y represents the target variable; sets S and N store {1, 2, 3, . . ., s} and {1, 2, 3, . . ., n}, respectively; i is each single variable where i ∈ N; and n is the length of N. W F increases with a decline in the loss function for any subset S ⊆ N (note the minus sign in front of the loss function in Eq. 3). Equation (4) represents the Shapley value that is the weighted average of the incremental changes from adding i to subsets S ⊆ N\{i} (Covert et al., 2020). The more a feature contributes to the prediction from F, the larger the positive values φ i (W F ) would become. The computational costs of the SAGE analysis over machine learning models including RF, GB, and MLP are presented in Sect. 3.1. They are much more expensive than the model training and therefore cannot be repeated over all sites. Meanwhile, air pollution in nearby monitoring stations has inherent similarities because its forcing factors, i.e., meteorological and emission variables, are closely related in a given region. As in Zhai et al. (2019), all the available sites were partitioned into six categories in the present study:  Fig. 1. Therefore, we propose the regional future selection in which SAGE is only implemented in limited ensemble sites that are randomly selected in a given region, and the selected features would be used for model training and prediction at each regional site.
The framework of regional feature selection, as illustrated by Algorithm 1, is as follows. A total of 15 ensemble sta- for e = 1 to ensemble size do 6: for f = 1 to len(F) do 7: for g = 1 to len(h) do 8: Employ SAGE algorithm 9: Rank importance of input (A) for each h, F and z 10: end for 11: Re-rank A's importance (B) for each F and z 12: end for 13: end for 14: Re-rank B's importance (C) for each z 15: Take the three most important variables as features for each z 16: end for tions are randomly selected in each of the six regions. Taking NCP as an example, the significance of the features of the ensemble monitoring stations with four prediction horizons (6, 12, 18, and 24 h) and three prediction algorithms is analyzed using SAGE algorithms. Then, the outcomes of the ensemble-SAGE model are ranked, as displayed in the heatmap in Fig. 4. The heatmap highlights the significant features. PM 2.5 , CO, and v10 typically exhibit higher ranks in the 15 random monitoring stations and four prediction horizons. The heatmaps of the ensemble-SAGE analyses of the other five regions can be found in Figs. S2-S18 in the Supplement. The feature significance in different regions is ranked by the sum of the SAGE values in the ensemble monitoring stations and four prediction horizons, as displayed in Fig. 5. The feature selection based on the ensemble-SAGE analysis concerning Fig. 5 is explained in Sect. 3.2. Note that we also tried random ensemble numbers 10 and 20 in NCP key feature extraction using the MLP model at several prediction horizons. The choice of 15 is shown to give the most robust result with a minimum computation cost, and it is therefore used for all regional feature selections in this study.

Computational complexity analysis
Instead of performing feature selection for every forecast model independently, our proposed ensemble-SAGE analysis successfully interprets the important regional features for PM 2.5 prediction with substantially less computation complexity. In addition, the regional feature selection improves the forecast accuracy and saves significant computing power for the machine learning model training by excluding redundant inputs and speeding up the model convergence. In this study, all computations concerning the SAGE-based feature selection and machine learning model training were conducted on several nodes configured with 4×16-core 2.1 GHz Intel Xeon E5-2620 v4 CPUs and with 64 GB of memory.
The computational cost of SAGE varies significantly, with average times of 7353, 3891, and 3325 s when using GB, RF, and MLP, respectively. The maximum time costs reach 61 209, 65 127, and 23 931 s with GB, RF, and MLP, respectively. Thus, using SAGE for each PM 2.5 forecasting model with different air quality monitoring stations, prediction horizons, and machine learning models is time-consuming. As illustrated in Table 5, the time cost for the three machine learning model training sessions is greatly reduced using the inputs from SAGE-based regional feature selection.

Regional feature selection analysis
The results of the SAGE-based regional feature selection concerning the three machine learning models, six partitioned regions, and four forecasting horizons are discussed in this section. Taking the NCP as an example, critical features that govern the performance of PM 2.5 forecasting vary across stations and prediction horizons, as illustrated in Fig. 4. However, PM 2.5 , CO, and v10 (features) play an overwhelmingly positive role in MLP-based PM 2.5 forecasting for most selected stations and predicting horizons. This result indicates that these three features suit all of the stations in the NCP. SAGE analysis heatmaps for other regions or using different prediction algorithms can be found in Figs. S2-S18. Consistently critical features can be easily extracted from the five megacity cluster regions using our selection method. However, they are difficult to extract from the remaining area of China. This area does not have universal key features because its stations are spread widely across China and therefore exhibit substantially different air quality patterns. An improved station clustering method can help solve this issue and will be explored in our future research.
To extract the important robust features that fit all stations in a given region, we summed and ranked the SAGE analysis values in the ensemble monitoring stations and prediction horizons. The ensemble-SAGE ranking is displayed in Fig. 5. There are consistent, crucial features in the six cluster regions, regardless of the prediction algorithm or horizon. PM 2.5 is the most critical feature for predicting its trend at a particular region and with a particular prediction algorithm. In addition to PM 2.5 , two variables from CAMS reanalysis, co and pm2p5, are critical across all regions. This result suggests that the forecast of these variables from CAMS reanalysis can help capture the varying trend in the machine learning models, even though the predictions are different from the actual values. By contrast, time factors (week and hour information) are the least important features for short-term prediction. This result is consistent with that of Hu et al. (2014), where no distinct weekday/weekend difference was observed for PM 2.5 in the NCP and PRD.  Considering their generality and robustness, we selected the top three critical features for each region, as illustrated in Table 6. Note that the ensemble SAGE analysis selected different key features in different regions. In the NCP, the simulation of CO from CAMS reanalysis, which is a representative air pollutant, includes valuable information other than CO observations. This result implies that local precursor emissions are a major contributor to PM pollution (Guo et al., 2016), and non-point source pollution may be more favorable for PM 2.5 forecasting. Additionally, v10, which represents regional transmission, is a critical feature for PM 2.5 forecasting in the NCP, PRD, and YRD. This result indicates that regional transmission plays a vital role in these three regions (Chen et al., 2017;Liu et al., 2017). This finding is consistent with findings reported in recent studies.  found that the anomalously high, normalized, and near-surface meridional wind is typically the primary cause of the severe haze in the NCP using a chemical transport model. Huang et al. (2018) illustrated that regional transport accounts for over half of PM 2.5 under the polluted northerly airflow in winter. T.  discovered that the regional PM 2.5 pollution in winter is primarily from northern and eastern China using a trajectory model. However, v10 is less significant in the SCB. This result is because of the blocking effect of the plateau terrain on the northeasterly winds (Shu et al., 2021); hence, winds are frequently static, particularly in winter and autumn (Liao et al., 2017). By contrast, d2m and tp are crucial features for hourly PM 2.5 forecasting in the SCB. This finding may be because polluted weather patterns are typically associated with higher relative humidity in that area, and tp, representing rainfall, is vital to eliminate air pollution in a basin (Zhan et al., 2019).

Performance of RFSML
This section presents the forecasting skill of the proposed RFSML system driven by regional features selected by the ensemble-SAGE-based model. The results are also compared with those of a standard machine learning forecasting model and fourth-generation ECMWF global reanalysis data. The latter is referred to as the benchmark of chemical transport models. To highlight the improvements by using the selected key features, the regional performance which represents the average of the forecasting performance in all sites of the given region is introduced. Figure 6 displays the times series of the simulated PM 2.5 for the three forecasting systems (MLP model and at a predicting horizon of 12 h) versus observational data. Each subplot represents a random monitoring station in the corresponding five megacity cluster regions. The subplots illus-trate the typical behaviors observed for the other monitoring stations, machine learning models, and prediction horizons. Both the standard machine learning and RFSML models outperform the simple CTM. This result indicates that the machine learning algorithms are superior in air pollution prediction (Pérez et al., 2000). Additionally, PM 2.5 predictions with selected key features perform better than the standard machine learning forecast that uses all related features. The GB and RF machine learning models used in this study also show steady improvements.
Both the RFSML and standard machine learning predictions typically underestimate high PM 2.5 concentrations as the prediction horizon increases. This underestimation can be ascribed to three primary possible reasons. First, the correct features are difficult to obtain, and unsuitable features can bring significant bias and noise into the prediction algorithms. Second, the construction of our prediction algorithm network may be insufficiently complex or deep to determine the actual relationship between features and the PM 2.5 . However, considering the purpose of a real-time forecast, the time to forecast, which is closely related to the complexity of the prediction algorithm network, cannot be too long. Third, considering our test period only included late autumn and early winter of 2019, the training and validation periods only included autumn and winter of 2018, which is too short for Figure 6. Time series of test time in five megacity cluster regions. The black dots and red pentacles represent original and interpolated PM 2.5 respectively. The solid lines in light gray, light blue, and dark violet represent prediction of CAMS reanalysis, the standard machine learning system and RFSML respectively. Panels (a)-(e) represent a random site in NCP, YRD, PRD, SCB, and FWP respectively. Note that those are predictions 12 h in advance, which are in parallel with CAMS reanalysis's predicting horizon, and the machine learning model used here is MLP. a prediction algorithm to learn the complex relationship for hourly PM 2.5 forecasting. Seasonal training and validation may obtain satisfactory outcomes for a particular seasonal forecast (Bai et al., 2019). Figure 7 displays the spatial distribution of the RMSEs (columns a and b) and MAEs (columns c and d) of the PM 2.5 forecast for all stations either using the standard machine learning or RFSML system at a forecasting horizon of 12 h. The RMSEs and MAEs significantly decreased when using the selected key features for all three machine learning models, particularly in regions with severe PM 2.5 pollution, e.g., the NCP and FWP. This consistent improvement also occurs when the forecast horizon changes to 6, 18, and 24 h, and the results are illustrated in Figs. S19-S21 in the Supplement.
A modified Taylor diagram (Taylor, 2005) is plotted in Fig. 8 to show the overall outcome. RFSML forecasts with selected features typically exhibit a lower RMSE and higher R than the standard forecasts. The best improvement is ob-tained when the deep learning (MLP) model is used, while forecasts with the selected new features in the RF model are not significantly improved and even not as good as with forecasts that use all features.
This result can be explained by the characteristics of the two types of prediction algorithms. RF increases the diversity of the trees through the bootstrapped aggregation of several regression trees (bagging) (Brokamp et al., 2017). It has the advantage of maintaining low bias because tree-based methods with bagging can reduce the variance of an estimated prediction function. Some uninformative features can be ignored through bagging; i.e., RF reduces the high variance by growing the individual trees to a deep level and then making their predictions, typically through averaging (Liaw and Wiener, 2002). By contrast, MLP, which implements the global approximation strategy (Osowski et al., 2004), may face problems of multicollinearity and noise caused by uninformative features. The RMSE increases and R declines with an increase in the prediction horizons across all regions and machine learning models in general. The average coefficient of determination (R 2 ) of the 24 h forecast (the maximum horizon set in this study) based on the three machine learning models increases from 0.47 to 0.65 in the NCP, from 0.41 to 0.52 in the PRD, from 0.62 to 0.67 in the SCB, from 0.44 to 0.57 in the YRD, and from 0.62 to 0.65 in the FWP when using the ensemble-SAGE analysis-based feature selection. This results indicate that the RFSML system can provide the operational PM 2.5 forecast with a maximum horizon of 24 h.
To further confirm the predictive capability in a rolling way, we make forecasts over a less polluted month, April 2020. Specific results can be found in Table S1 in the Supplement. Steady improvement of predicting performance is still achieved by RFSML. Time series, as given in Fig. S22 in the Supplement, show similar results as the main text that RFSML has better predictive ability than standard machine learning. As is illustrated in Figs. S23-S24 in the Supplement, RFSML has both lower RMSE and MAE values than standard machine learning, which implies the advantage of RFSML.

Conclusions and future work
Machine learning models have been successfully utilized in air quality forecasts worldwide because of their high computational efficiency and accuracy. However, substantial room for improvement remains. In this study, we developed the RFSML v1.0 system, which can predict national air quality with high accuracy in real time in China.
In a standard machine learning system, all related features are typically utilized in model training and prediction. However, the high dimensionality and redundant input data may lead to increased complexity and machine learning model over-fitting. To overcome this obstacle, we combined an ensemble-SAGE analysis with our RFSML system. This method extracts the key features in a given region at an affordable extra cost, and the significance of these regional selected features are explained physically. Compared with the standard machine learning system that was fed with all relative features, the RFSML system driven by the selected key features resulted in superior interpretability, less training time, and more accurate predictions. Statistically, the average RMSE and MAE of predictions were reduced from 24.74 and 16.54 µg m −3 to 21.54 and 13.7 µg m −3 , respectively, with RFSML. Additionally, R 2 increased from 0.6 to 0.7, and the average forecasting model training cost was reduced from 129.36 to 17.66 s. Among the three machine learning models studied, the prediction performance of RFSML with MLP exhibited the greatest increase, with R 2 increasing from 0.55 to 0.72. By contrast, RF exhibited the least improvement, with R 2 increasing from 0.61 to 0.66. In addition, RF and GB were more robust than MLP for certain underlying uninformative features, while MLP was more susceptible to over-fitting. Meanwhile, RFSML provides only predictions over the air quality monitoring sites where historical data are available for machine learning model training, instead of a gridded forecast. A Bayesian-theory-based prediction fusion is being explored now to extend the RFSML forecast available at single stations to a gridded one.
The six-region partition used here was empirical and not based on science. Additionally, stations in a given region may exhibit different air quality patterns, particularly in the "remainder" region. Therefore, our ensemble-SAGE analysis does not always select the representative feature, limiting the machine model interpretability and prediction ability. A more scientific station partition like spatial clustering would be determined for future studies.
Based on the results of this study, the RFSML system can accurately predict air quality in the short term at the national scale; this renders it valuable for health professionals and policy makers in terms of providing early warning to population categories more susceptible to air pollution (e.g., children, elderly, and people with respiratory or cardiovascular issues) and reducing and regulating air pollution.
Code and data availability. The ground-based air quality monitoring observations are from the network established by the China Ministry of Environmental Protection and accessible via https://quotsoft.net/air/ (last access: June 2022). The measurements used in this study also are archived on Zenodo (https://doi.org/10.5281/zenodo.6551820, Fang, 2022). The RF-SML algorithm is in the Python environment and is archived on Zenodo (https://doi.org/10.5281/zenodo.6551850, , Fang, 2022).
Author contributions. JJ conceived the study and designed the RF-SML system. LF wrote the code of RFSML and carried out the prediction and evaluation. HXL, AS, CX, TD, and HL provided useful comments on the paper. LF prepared the manuscript with contributions from JJ and all other co-authors.
Competing interests. The contact author has declared that none of the authors has any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Financial support. This work is supported by the National Natural Science Foundation of China (grant nos. 42105109 and 42021004) and the Natural Science Foundation of Jiangsu Province (grant nos. BK20210664 and BK20220031).
Review statement. This paper was edited by Augustin Colette and reviewed by two anonymous referees.