Deep-Learning Spatial Principles from Deterministic Chemical Transport Model for Chemical Reanalysis: An Application in China for PM 2.5



Introduction
Pollutant concentration fields with high accuracy are important for evaluating health effects, climate change, and agricultural impacts (Bell et al., 2007; Donkelaar et al., 2015; Gao et al., 2017). Long-term and reliable air quality datasets can also be used to assess pollutant emission control measures (Wang et al., 2010). Data fusion methods have been widely used to obtain accurate and spatially complete datasets, for example by fusing air quality model simulations with station air pollutant observations to estimate fine-scale air pollutant concentration fields (Berrocal et al., 2012; Rundel et al., 2015).
Previous studies have followed a general paradigm to develop well-estimated air pollutant concentration fields. In this paradigm, complex statistical models are trained to depict non-linear relationships between observations, proxy data, and other supporting variables at the locations of observation sites (Berrocal et al., 2012; Lyu et al., 2019; Chu et al., 2016). The widely used proxy data are Aerosol Optical Depth (Lv et al., 2016), chemical transport model (CTM) simulations (Lyu et al., 2019), and other geophysical variables. Popular statistical models include linear mixed effect models (Hao et al., 2015), random forests (Brokamp et al., 2018; Huang et al., 2021), deep neural networks (Qi et al., 2018), and ensemble models (Xiao et al., 2018). The fitted model is then used to predict the concentration field of the target variable over the whole area, either directly or through other spatial spreading techniques such as Bayesian estimation (Xu et al., 2016), partial linear regression (Wang et al., 2016), and distance-constrained interpolation (Chang et al., 2014; Friberg et al., 2016).

https://doi.org/10.5194/gmd-2021-253 Preprint. Discussion started: 10 August 2021. © Author(s) 2021. CC BY 4.0 License.
Even though many high-quality datasets have been developed through deliberately designed statistical models and abundant explanatory variables, this paradigm leaves scientific gaps in developing air pollutant fields. First, these models usually rely on long-term and large-scale station observations for training, especially the complex time- and space-resolved models (Feng et al., 2020; Huang et al., 2021). For newly set-up or temporally mobile observation networks, there would be limited data for training an effective model. Second, most of the previous methods cannot readily fuse multi-variable observations from different monitoring networks. For example, stations in air quality and meteorology observation networks are usually not spatially aligned, so the observations from the two networks cannot be fused directly in current models. Instead, meteorological reanalysis data were often used as important explanatory variables in previous fusion models (Geng et al., 2015; Ma et al., 2015; Wei et al., 2021). However, in real-time operational data fusion applications, these reanalysis data would be unavailable or would require intensive computation. Last but not least, most of the previous methods that fuse CTM simulations rely on relatively accurate and stable simulations to achieve good fusion performance (Tong and Mauzerall, 2006). Consistency in CTM parameters, configurations, and inputs is also strictly required to achieve good data fusion performance. Especially in near-real-time operational data fusion applications, adjoint models are often required to run simultaneously (Friberg et al., 2016).

To address these scientific gaps, this study developed a new deep-learning-based model framework to estimate reanalysis fields from station observations by learning spatio-temporal correlations from deterministic models. Distinct from existing data fusion models, we do not use CTM simulations directly in a regression as a proxy for the real pollutant concentrations. Instead, the deep learning network was trained with only CTM simulations to learn the dynamic correlations, which are backed by the CTM's first principles, between the simulations at randomly selected grid points that mimic monitoring locations and the simulations at all grid cells. The data fusion/reanalysis is then achieved by applying the learned dynamic correlations to real observational data in the prediction procedure. The model framework is fundamentally an alternative way of generating chemical/meteorological reanalysis fields without rerunning CTMs with data assimilation.

In this study, our data fusion model was trained to learn spatial correlations of multiple variables from CTM simulations. The simulated PM2.5 and meteorological variables in 2016~2020 were produced using a modeling system that consists of three major components: the meteorology component (WRF v3.4.1) provides meteorological fields, the emission component provides gridded estimates of hourly emission rates of primary pollutants matched to model species, and the CTM component (CMAQ v5.0.2; Byun and Schere, 2006) solves the governing physical and chemical equations to obtain 3-D pollutant concentration fields at a horizontal resolution of 12 km. We used the simulated daily mean surface-layer predictions of PM2.5 concentrations, RH, and WS. The data covered the whole of China with a size of 372×426 grid cells. Simulation data covering the 2016~2019 period were used as the training dataset, while the 2020 simulation data were used for evaluation.
Daily mean relative humidity (RH) and wind speed (WS) for the same period at national meteorological observing stations were obtained from the China Meteorology Agency (CMA) network (Figure 1). The raw PM2.5 and meteorology data were hourly and were averaged to daily means, in local time at each monitor, if there were more than 18 valid hourly observations in a day. Each of these data items was assigned to a grid defined in the same way as in the aforementioned CTM simulations. For sites co-located in the same grid cell, their averages were used. It should be noted that grid cells without valid observations were filled with zero.
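The preprocessing described above (daily averaging with a validity threshold, averaging of co-located sites, zero filling of empty cells) can be illustrated with a minimal sketch. The function names, the use of None for missing hours, and the (row, column, value) station tuples are assumptions made for illustration; they are not taken from the original processing code.

```python
import math

def daily_mean(hourly, min_valid=19):
    """Mean of the valid hourly values, or None if too few hours are valid.

    `hourly` holds up to 24 values; None or NaN marks a missing hour.
    "More than 18 valid observations" is read here as at least 19.
    """
    valid = [v for v in hourly if v is not None and not math.isnan(v)]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)

def grid_station_means(stations, n_rows, n_cols):
    """Average daily means of stations sharing a grid cell; empty cells stay 0.

    `stations` is a list of (row, col, daily_value) tuples (hypothetical layout).
    """
    sums = [[0.0] * n_cols for _ in range(n_rows)]
    counts = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in stations:
        if value is None:  # the day was rejected by the validity threshold
            continue
        sums[row][col] += value
        counts[row][col] += 1
    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]
```

For example, two monitors in the same cell with daily means of 10 and 20 would yield a cell value of 15, while a cell without any valid monitor keeps the fill value of zero.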

Ground Observations
Geographical variables such as the surface height from a Digital Elevation Model (DEM) and land use and land cover (LULC) (Zhang et al., 2020) were also used in this study for fusion. These variables were also resampled to the aforementioned grid.

Deep Learning Data Fusion Framework
The objective of obtaining a spatially complete air pollutant field from point observations can be regarded as a downscaling problem, in which values in the gap areas among stations need to be optimally estimated from known sparse measurements based on physical or statistical constraints. Most previous studies use statistical methods to relate observations to other supporting variables at stations (Di et al., 2016; Beloconi et al., 2016). In this study, we built a point-to-grid model that learns from CTM simulations to generate gridded data fusion fields from station observations. A new deep learning model framework (Figure 2) was designed to fulfill the task of point-to-grid data fusion and downscaling.
This model includes two successive point convolutional (PointConv) operations and a deep learning backbone fusion module.
The PointConv is designed for handling spatially isolated and irregular station observations. In traditional convolutional operations, 3×3 moving-sum kernels are often used, which lose effectiveness when handling station observations. For example, when a convolutional kernel coincides with grid cells without observations, the result will be zero. However, if the kernel coincides with grid cells with dense observations, the result will become significantly larger (Qi et al., 2018). To solve this problem, we proposed a novel and interpretable operation, PointConv, to handle isolated station observations of multiple variables. The successive PointConv operation is defined in Eqs. (1) and (2), where $k_n$ refers to a convolutional kernel of size n. The $\mathrm{Conv}(k_n, x)$ in Eq. (1) refers to the traditional convolution on x, which is the station observations assigned to pre-defined grid cells. The x_one was binarized from x by setting grid cells with valid observation data in x to 1. The PointConv was conducted a second time, mimicking successive analysis procedures, as in Eq. (2). The PointConv kernel sizes in the two steps were set to 21 and 11 for $k_1$ and $k_2$, respectively. This model framework has the following features and advantages compared to conventional convolutions.
1) The weighted average of isolated data is implemented rather than the weighted sum,
2) Large-size kernels are used to reflect spatial correlations over a large area,

The operation in Eq. (4) refers to appending different data variables as one multi-layer data item. In Eq. (5), $\widehat{PM}_{2.5,i}$ refers to the estimated PM2.5 concentrations and $PM_{2.5,i}$ refers to the original CTM simulations of PM2.5, with N equal to the total number of grid cells.
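Because Eqs. (1) and (2) did not survive in this copy of the text, the following is only a sketch of one reading of PointConv that is consistent with the description above: the convolution of the gridded observations x is divided by the convolution of the binary mask x_one, which turns a weighted sum into a weighted average of the observed cells. The function names, the zero-padding choice, and the treatment of zero-valued cells as "no observation" are assumptions for illustration.

```python
def conv2d_same(field, kernel):
    """Zero-padded 'same' 2-D correlation of a grid with an odd-sized square kernel."""
    n_rows, n_cols = len(field), len(field[0])
    half = len(kernel) // 2
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            total = 0.0
            for i, kernel_row in enumerate(kernel):
                rr = r + i - half
                if not 0 <= rr < n_rows:
                    continue  # kernel row falls outside the grid
                for j, weight in enumerate(kernel_row):
                    cc = c + j - half
                    if 0 <= cc < n_cols:
                        total += weight * field[rr][cc]
            out[r][c] = total
    return out

def point_conv(x, kernel, eps=1e-12):
    """Assumed form Conv(k, x) / Conv(k, x_one): a weighted average of observed cells."""
    # Assumption: cells without observations hold 0, so non-zero means "observed".
    x_one = [[1.0 if value != 0 else 0.0 for value in row] for row in x]
    numerator = conv2d_same(x, kernel)
    denominator = conv2d_same(x_one, kernel)
    return [
        [n / d if d > eps else 0.0 for n, d in zip(num_row, den_row)]
        for num_row, den_row in zip(numerator, denominator)
    ]
```

With a uniform 3×3 kernel and two observations of 10 and 30 in the window, the plain convolution returns their sum, whereas this normalized form returns their average of 20, which illustrates feature 1) in the list above. In the trained model the kernel weights are learned and the kernels are much larger (21 and 11 cells).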

The fusion module can be any grid-to-grid deep learning model that estimates the fused PM2.5 concentrations $\widehat{PM}_{2.5}$. Here we used a regression Unet++ (RegrUnetPP) model (Eq. 4), which was revised from the original Unet++ model (Zhou et al., 2018). The Unet++ model was designed as an encoding-decoding type network developed from Unet (Ronneberger et al., 2015). Many skip-connection modules (Yamanaka et al., 2017) were added in Unet++ to fully explore spatial correlations at different scales while keeping abundant details in the output. RegrUnetPP was constructed by replacing the SoftMax activation layers with ReLU layers and adopting a mean absolute error (MAE) loss function (Eq. 5) instead of the original MaxEntropy function.
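The two changes that turn the segmentation network into a regression network, a ReLU output and an MAE loss, can be written out in a few lines. This is a plain-Python sketch of the two functions only, not the RegrUnetPP implementation; the function names and the flat list layout of grid cells are assumptions.

```python
def relu(values):
    """ReLU output activation: keeps predicted concentrations non-negative."""
    return [max(0.0, v) for v in values]

def mae_loss(target, predicted):
    """Mean absolute error over N grid cells, following the description of Eq. (5)."""
    if not target or len(target) != len(predicted):
        raise ValueError("target and predicted must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)
```

In a deep learning library the same loss is usually available as a built-in L1 loss; the explicit form here only shows what is averaged over the N grid cells.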

Model Training
The model was trained with the WRF-CMAQ simulations of PM2.5, RH, and WS, together with the geophysical covariates DEM and LULC. In the training data, nominal point-wise "station" data were constructed by randomly sampling 1500~2500 data points from the gridded simulation data, separately for each variable at each time, while the raw spatially complete PM2.5 simulation data were used as the target gridded "truth" data. The spatial correlations of CTM simulations are backed by the physical and chemical principles comprehensively represented in the WRF-CMAQ model. The fusion model was trained with the WRF-CMAQ simulations within China from 2016 to 2019 for 20000 iterations with a batch size of 10, by which point the loss function had become stable, running on an NVIDIA GeForce RTX 2080Ti GPU card. It should be emphasized that observation data were not involved in the model training procedure at all. In the model prediction procedure, actual station observations are used as input to generate fused PM2.5 concentration fields.
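The construction of nominal "station" inputs from a gridded simulation can be sketched as follows. The function name, the uniform random choice of cells, and the zero fill for unsampled cells (mirroring the treatment of cells without observations) are assumptions for illustration; the original sampling code is not part of this text.

```python
import random

def sample_pseudo_stations(field, rng, n_min=1500, n_max=2500):
    """Keep a random subset of grid cells as pseudo-stations; all other cells become 0.

    Returns the sparse grid and the number of sampled cells.
    """
    n_rows, n_cols = len(field), len(field[0])
    total = n_rows * n_cols
    n_max = min(n_max, total)  # guard for grids smaller than the sampling range
    n_min = min(n_min, n_max)
    n_points = rng.randint(n_min, n_max)
    sparse = [[0.0] * n_cols for _ in range(n_rows)]
    for index in rng.sample(range(total), n_points):
        r, c = divmod(index, n_cols)
        sparse[r][c] = field[r][c]
    return sparse, n_points
```

Each training pair would then consist of such sparse grids (one per variable, sampled independently) as input and the full simulated PM2.5 field as the target.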

Model Evaluation
The evaluation was conducted for 2020, which is independent of the training period of 2016~2019. Specifically, the fitted fusion model was evaluated in two respects. First, its capability to predict the fully gridded model simulations from isolated sampled grid cells was assessed with the CTM simulation data in 2020. For this purpose, the station-wise CTM PM2.5 simulations were constructed by sampling the grid cells that contain observation stations from the raw gridded simulations. By feeding these point-wise simulations and the supporting static variables into the fusion model, spatially complete gridded data are obtained. The fused simulation data are then compared against the corresponding raw CTM PM2.5 simulations. The comparison was performed for each day, since there are sufficient data items in the daily simulations. It should be noted that only grid cells located in mainland China were compared. The coefficient of determination (R²), root mean square error (RMSE), and normalized mean absolute error (NME) were calculated for performance evaluation.
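For reference, the three metrics can be computed as in the sketch below. The exact formulas used in the study are not given in this text, so two assumptions are made and should be checked against the original code: R² is taken as 1 minus the ratio of residual to total sum of squares (a squared Pearson correlation is another common choice), and NME is taken as the sum of absolute errors divided by the sum of the reference values.

```python
import math

def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(reference, predicted)) / n)

def nme(reference, predicted):
    """Normalized mean absolute error: sum of |error| over sum of reference (assumed form)."""
    return sum(abs(p - o) for o, p in zip(reference, predicted)) / sum(reference)

def r_squared(reference, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((o - p) ** 2 for o, p in zip(reference, predicted))
    ss_tot = sum((o - mean_ref) ** 2 for o in reference)
    return 1.0 - ss_res / ss_tot
```

In the first evaluation the reference values are the raw CTM simulations in mainland China grid cells for one day; in the cross-validation they are the withheld station observations.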
Second, the data fusion model performance was evaluated with station observations using two cross-validation methods: the Leave-Stations-Out cross-validation method (LSCV) (Lv et al., 2016) and the more stringent ten-fold Leave-Cities-Out cross-validation method (LCCV). In the LCCV method, all cities with PM2.5 stations were randomly split into ten groups, while in the LSCV method all stations were randomly split into ten groups. PM2.5 observations in one group of stations were used as independent evaluation data, while the data in the remaining nine groups were used for data fusion. This process was iteratively performed ten times. Considering that air quality stations are mostly clustered in urban areas, the LCCV method better reflects the model's performance in predicting PM2.5 concentrations in remote rural areas than the station-based LSCV method. R², RMSE, and NME were again used as the statistical measures.

(Shepard, 1968). The PointConv kernel values in the central area are around 1.5, 1.1, and 1.4 for PM2.5, RH, and WS, respectively, in both steps. The kernels' distribution also revealed that the influencing distance for PM2.5 is around 6 grid cells, which is equivalent to 72 kilometers at the 12 km resolution.
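The difference between the two cross-validation schemes described above lies only in the unit that is shuffled into ten groups: individual stations for LSCV, whole cities for LCCV. A minimal sketch, assuming stations are given as hypothetical (station_id, city) pairs and that groups are assigned round-robin after shuffling:

```python
import random

def split_folds(stations, rng, by_city, n_folds=10):
    """Map each station id to a fold index.

    by_city=True groups whole cities (LCCV); by_city=False groups stations (LSCV).
    """
    units = sorted({city if by_city else station_id for station_id, city in stations})
    rng.shuffle(units)
    unit_fold = {unit: i % n_folds for i, unit in enumerate(units)}
    return {
        station_id: unit_fold[city if by_city else station_id]
        for station_id, city in stations
    }
```

For each fold index, stations mapped to that fold are withheld for evaluation and the remaining stations are fed to the fusion model. Under the city-based split, every station of a withheld city is removed together, so the evaluated locations have no same-city neighbours among the inputs.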

For RH, the spatial correlations are weak, considering that the kernels were more spatially uniform, as exhibited in Figure 3. For wind speed, the kernel exhibited stronger locality, indicated by the smaller hot spot with a radius of around 4 grid cells (~48 kilometers).
The kernels were generally isotropic, with slightly larger values in the northeast-southwest direction than in other directions, which could be caused by topographic and climatic patterns in China.

We applied the data fusion model to the observations in 2020 to generate the fused fields for evaluation. The evaluation results exhibited good performance, with R²=0.77 for the LCCV method and R²=0.83 for the LSCV method (Figure 6), and RMSE values of 16.32 and 14.25 μg/m³, respectively. Considering that most grid cells were located within urban areas, the actual model performance should lie between the metrics evaluated by LCCV and LSCV, that is, 0.77~0.83 for R², 14.25~16.32 μg/m³ for RMSE, and 0.27~0.32 for NME. Previous studies tend to underestimate PM2.5 concentrations in high-pollution scenarios (Di et al., 2016; Senthilkumar et al., 2019). Our data fusion method predicted high PM2.5 concentrations very well, with the NME for PM2.5 concentrations higher than 150 μg/m³ being as small as 0.19 and 0.14, respectively

Discussions
In this study, PM2.5 fields are fused from multiple observational variables using a novel deep learning data fusion model framework. As we have demonstrated, the method can accurately reproduce the whole-domain CTM simulations from a small portion of simulations selected at sparse locations. By fully utilizing the learned spatial correlations simulated by the CTM, it can also accurately generate spatio-temporally complete fused fields from observations at sparse locations.
In previous studies, all variables need to be spatially paired at stations first to train regression models (Lyu et al., 2019; Ma et al., 2016; Xue et al., 2019). To use data from different networks, interpolation and analysis/reanalysis need to be carried out

(and any other model species/variables as well) and its supporting variables, by retaining the physical and chemical principles in the WRF-CMAQ model. Hence, the method can be readily applied to other CTM-simulated species for which measurements are available. Second, it does not need any observation data sets to train the model. This is quite beneficial for data fusion applications, especially when station networks are newly set up or observations come from mobile or portable sensors. The data fusion models used in previous studies are very complex and require long-term observations (Wei et al., 2021; Huang et al., 2021; Xue et al., 2019), which makes them difficult to reuse reproducibly in new applications. Conversely, our method is straightforward to use and can be easily examined by inter-comparison with other methods. It provides a pre-trained deep learning model for application in other studies. To run this model, users only need air quality observations, meteorological observations, and static variables.
Considering that the purpose of model training is to learn the general spatial correlations among different variables, the CTM simulations theoretically do not need to be very accurate in model inputs but do need to be consistent in the model configurations of physical and chemical processes. For example, changes in the pollutant emission and meteorological inputs to the CTM simulations are allowed and should be encouraged, in order to cover a wider range of emission and meteorological scenarios for better training. By comparison, in other observation-simulation regression methods, both model configurations and inputs are usually required to remain unchanged in the training data sets (Xue et al., 2017; Hao et al., 2015). In fact, in our method, larger variations of input emissions and meteorological conditions can be even more beneficial, helping to improve the robustness of the trained model in applications with dramatic changes in emissions or extreme climate conditions. However, substantial changes to the atmospheric physical and chemical principles in the models can degrade the model performance because they modify the simulated correlations between different variables. The fused output data sets have high accuracy while following the known spatio-temporal principles represented in state-of-the-art air quality models, which will benefit further studies of downscaling, nowcasting, and model forecast post-processing based on the fused data set.
The model framework in this study has very high computational efficiency, with the computing time for a single fusion far less than 1 s on a consumer NVIDIA 2080Ti GPU card. The calculation time will not increase much when processing near-real-time data over a large area.

Figure 1: The map of the study area with elevation in color. Dark red dots represent the national PM2.5 monitors and orange dots refer to national meteorological stations.

The air quality and meteorology observations were only used in predicting fused data fields. PM2.5 observations in 2020 from the China National Environmental Monitoring Center (CNEMC) (http://106.37.208.233:20035/) were used, with the monitoring network as exhibited in Figure 1. Meteorological variables of daily mean relative humidity (RH) and wind speed

Figure 2: Data fusion framework using station observations of multiple variables to obtain gridded fields of PM2.5.

3) Successive PointConv operations are implemented to reflect local variations,
4) Multi-variable observations from different networks are handled simultaneously.

The PointConv kernels in well-trained models are expected to have larger values in the center area and lower values in the outer area. With the PointConv module, the spatially complete gridded data sets are constructed, denoted as PC in Eq. (3). By binding the results of PointConv with other static supplementary data such as DEM and LULC, the input data to the data fusion module RegrUnetPP is built as exhibited in Eq. (4).

$$\widehat{PM}_{2.5} = \mathrm{RegrUnetPP}\left(\oplus\left(PC_{PM_{2.5}},\ PC_{RH},\ PC_{WS},\ DEM,\ LULC\right)\right) \qquad (4)$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|PM_{2.5,i} - \widehat{PM}_{2.5,i}\right| \qquad (5)$$

Figure 3: PointConv kernels for PM2.5, RH and WS.

The PointConv was interpretable due to its dedicated design to implement a process like an interpolation from station

Figure 4: Daily a) R², b) RMSE and c) NME values evaluated between predicted CTM PM2.5 simulations and raw gridded simulations.

The data fusion model has very high accuracy in predicting/reproducing fully gridded CTM PM2.5 simulations, as exhibited in Figure 4, even though data items in only the ~800 grid cells containing observation stations were used to estimate

Figure 5: The seasonal average PM2.5 concentrations of the raw CTM simulations (a to d, first row), and the reproduced simulations using data fusion models (e to h, second row).

By comparing the raw CTM simulations of daily average PM2.5 and the reproduced PM2.5 fields from station-wise simulations,

Figure 6: Scatter plots of predictions versus observations evaluated respectively by the method of a) LCCV and b) LSCV.

for LCCV and LSCV. It is worth noting that the errors increase from reproducing CTM simulations (R²=0.93) to generating reanalysis fields (R²=0.77~0.83). The difference of 0.10~0.16 should be mainly attributed to the uncertainties of the CTM-simulated PM2.5 spatial correlations compared to the actually observed correlations. Our model performs well compared to previous studies that used spatial cross-validation methods. For example, Lyu et al. (2019) used an ensemble deep learning model to build relations between CTM simulations and observations of PM2.5 in China, with a performance of R²=0.64 and RMSE=24.8 μg/m³ using a station-level evaluation method. Xue et al. (2020) fused AOD, CTM simulations and ground observations with a complex multi-stage model and achieved a good performance of R²=0.81 with the LSCV method (2017). Xiao et al. (2018) built an ensemble machine learning model to predict PM2.5 at 0.1° resolution with an accuracy of R²=0.76. Huang et al. (2021) used a multi-stage random-forest-based model to predict a very high-resolution data set and achieved R²=0.92 with the LSCV method.

Figure 7: Reanalysis PM2.5 concentration fields in four seasons in 2020. Circles with filled colors represent monitoring sites and corresponding observations.

Daily fused PM2.5 fields for 2020 were obtained with the model framework. Because the fused/reanalysis fields are complete in space, the high pollution levels in winter are revealed in detail in the North China Plain (NCP) and in the long, narrow basin areas of Shanxi and Shaanxi provinces (Figure 7). Besides, unlike most previous studies (Huang et al.,

Figure 8: Comparison between MODIS AOD and PM2.5 fusion data on October 26 and November 13 in 2020.

To further evaluate the spatial distributions of the fused PM2.5 fields, we compare them with the MODIS AOD distributions at

which is disconnected from the data fusion model. Here, with the two PointConv modules, the model can fuse station data variables from different observation networks, even when they are not spatially collocated. The successive PointConv modules were able to process station data for each variable independently before the fusion. The PointConv modules were trainable as part of the whole deep learning data fusion model. Without a data pairing procedure, the model training and prediction procedures became straightforward, requiring only the same spatial grid setting for all input variables. This model was fitted with model simulation data by learning daily spatial patterns from long-term CTM simulations. This has two benefits. First, the trained deep learning data fusion model can represent and reflect spatial correlations between