Evaluation of the offline-coupled GFSv15–FV3–CMAQv5.0.2 in support of the next-generation National Air Quality Forecast Capability over the contiguous United States

As a candidate for the next-generation National Air Quality Forecast Capability (NAQFC), the meteorological forecast from the Global Forecast System with the new Finite Volume Cube-Sphere dynamical core (GFS–FV3) will be applied to drive the chemical evolution of gases and particles described by the Community Multiscale Air Quality modeling system. CMAQv5.0.2, a historical version of CMAQ, has been coupled with the North American Mesoscale Forecast System (NAM) model in the current operational NAQFC. An experimental version of the NAQFC based on the offline-coupled GFS–FV3 version 15 with CMAQv5.0.2 modeling system (GFSv15–CMAQv5.0.2) has been developed by the National Oceanic and Atmospheric Administration (NOAA) to provide real-time air quality forecasts over the contiguous United States (CONUS) since 2018. In this work, comprehensive region-specific, time-specific, and categorical evaluations are conducted for meteorological and chemical forecasts from the offline-coupled GFSv15–CMAQv5.0.2 for the year 2019. The forecast system shows good overall performance in forecasting meteorological variables with the annual mean biases of −0.2 °C for temperature at 2 m, 0.4% for relative humidity at 2 m, and 0.4 m s−1 for wind speed at 10 m compared to the METeorological Aerodrome Reports (METAR) dataset. Larger biases occur in seasonal and monthly mean forecasts, particularly in spring. Although the monthly accumulated precipitation forecasts show generally consistent spatial distributions with those from the remote-sensing and ensemble datasets, moderate-to-large biases exist in hourly precipitation forecasts compared to the Clean Air Status and Trends Network (CASTNET) and METAR. While the forecast system performs well in forecasting ozone (O3) throughout the year and fine particles with a diameter of 2.5 μm or less (PM2.5) for warm months (May–September), it significantly overpredicts annual mean concentrations of PM2.5. 
This is due mainly to the high predicted concentrations of fine fugitive and coarse-mode particle components. Underpredictions in the southeastern US and California during summer are attributed to missing sources and mechanisms of secondary organic aerosol formation from biogenic volatile organic compounds (VOCs) and semivolatile or intermediate-volatility organic compounds. This work demonstrates the ability of FV3-based GFS in driving the air quality forecasting. It identifies possible underlying causes for systematic region- and time-specific model biases, which will provide a scientific basis for further development of the next-generation NAQFC.


Introduction
Three-dimensional air quality models (3-D AQMs) have been widely applied in real-time air quality forecasting (RT-AQF) in the US since the 1990s (Stein et al., 2000; McHenry et al., 2004; Zhang et al., 2012a). National air quality forecasting systems based on 3-D AQMs were developed and applied in the 2000s (Kang et al., 2005; Otte et al., 2005; McKeen et al., 2005, 2007, 2009). Since then, RT-AQF has progressed significantly through the further development of AQMs and the use of advanced techniques. For example, forecast products now cover more air pollutants, gas-phase chemical mechanisms and aerosol chemistry have become more detailed, and chemical data assimilation has been implemented (Zhang et al., 2012b; Lee et al., 2017). Various AQMs, coupled with meteorological models in either an online or offline manner, were developed and applied in RT-AQF (e.g., Chuang et al., 2011; Lee et al., 2011; Žabkar et al., 2015; Ryan, 2016). The early version of the National Air Quality Forecast Capability (NAQFC) was jointly developed by the US National Oceanic and Atmospheric Administration (NOAA) and the U.S. Environmental Protection Agency (EPA) to provide forecasts of ozone (O3) over the northeastern US (Eder et al., 2006). Since the first operational version over the contiguous United States (CONUS) (Eder et al., 2009), the NAQFC has been continuously updated and developed to provide more forecasting products (including O3, smoke, dust, and particulate matter with a diameter of 2.5 μm or less (PM2.5)) with increasing accuracy (Stajner et al., 2011; Lee et al., 2017).
Kang et al. (2010a) evaluated the forecast skill of a historical NAQFC, which was based on the North American Mesoscale Forecast System (NAM) model (Black, 1994) and the Community Multiscale Air Quality Modeling System version 4.6 (CMAQv4.6), for operational O3 and experimental PM2.5 products over CONUS during the year 2008. Overall, maximum 8 h O3 was slightly overpredicted over CONUS during the summer, with a mean bias (MB), normalized mean bias (NMB), and correlation coefficient (Corr) of 3.2 ppb, 6.8%, and 0.65, respectively. The performance of predicted daily mean PM2.5 varied by season: it was underpredicted during the warm season and overpredicted during the cool season. The MBs and NMBs during warm/cool seasons were −2.3/4.5 μg m−3 and −19.6%/45.1%, respectively. The current version of the US NOAA's operational NAQFC has provided air quality forecasts to the public for O3 and PM2.5 at a horizontal grid resolution of 12 km over CONUS since 2015. It is currently based on CMAQv5.0.2 (released May 2014) (U.S. EPA, 2014) coupled offline with the NAM model. Daily mean PM2.5 was underpredicted during warm months (May and July 2014) and overpredicted during a cool month (January 2015) over CONUS.
Efforts have been made to reduce the seasonal and region-specific biases in the historical and current NAQFC. An analog ensemble bias correction approach was developed and applied to the operational NAQFC to improve PM2.5 predictions. Kang et al. (2008, 2010b) investigated the Kalman filter (KF) bias-adjustment technique for operational use in the NAQFC system. The KF bias-adjusted forecasts showed significant improvement in both O3 and PM2.5 for discrete and categorical evaluations. However, the underlying models and the bias correction or adjustment approaches have limitations and need further improvement. Characterizing the current NAQFC forecasting skill and identifying the underlying causes for region- and time-specific biases can guide further development of the NAQFC system and improve pollutant predictions.
As the NOAA Environmental Modeling Center (EMC) has transitioned to devote its full resources to the development of an ensemble model based on the Finite Volume Cube-Sphere Dynamical Core (FV3), NAM has not been updated since March 2017. The FV3 dynamical core will eventually replace all current NOAA National Centers for Environmental Prediction (NCEP) mesoscale models used for forecasting. The FV3 dynamical core was implemented in the operational Global Forecast System as version 15 (GFSv15) in June 2019.
The NOAA National Weather Service (NWS) is currently coordinating an effort to couple, inline, a regional-scale meteorological model based on the same FV3 dynamical core as that in GFSv15 with an atmospheric chemistry model partially based on CMAQ. The inline system is expected to be the next generation of NAQFC and to be implemented a few years into the future. An interim system that couples a recent CMAQ offline with the FV3-based GFS is regarded as a candidate NAQFC to replace the current NAM-CMAQ system before the inline system is applied in operational air quality forecasting. To support this new development of the interim NAQFC, a prototype of the offline-coupled GFSv15 with CMAQv5.0.2 (GFSv15-CMAQv5.0.2) has been developed and applied by NOAA for RT-AQF over CONUS since 2018 (Huang et al., 2018, 2019). In this work, the meteorological and air quality forecasts from the offline-coupled GFSv15-CMAQv5.0.2 system are comprehensively evaluated for the year 2019. The main objectives of this work are to (1) evaluate the forecast skills of the experimental prototype of the GFSv15-CMAQv5.0.2 system, (2) identify the major model biases, in particular, systematic biases and persistent region- and time-specific biases in major species, and (3) investigate underlying causes for the biases to provide a scientific basis for improving the model representations of chemical processes and developing science-based bias correction methods for O3 and PM2.5 forecasts. This work will support NAQFC's further development and improvement through enhancing its forecasting abilities and generating a benchmark for the interim NAQFC that is being developed by NOAA based on the offline-coupled GFS-FV3 v16 with CMAQv5.3 (NACC-CMAQ) (Campbell et al., 2020).
Eventually, the latest version of CMAQ (version 5.3), which has updates in gas-phase chemistry (Yarwood et al., 2010; Emery et al., 2015; Luecken et al., 2019), lightning nitric oxide (LNO) production schemes (Kang et al., 2019a, b), and secondary aerosol formation (in particular, secondary organic aerosol) (e.g., Pye et al., 2013, 2017; Murphy et al., 2017) among other things, will be coupled with GFS-FV3 v16 and be implemented in the interim operational NAQFC.
2 Model system and evaluation protocols

2.1 Description and configuration of offline-coupled GFSv15-CMAQv5.0.2
FV3 is a dynamical core for atmospheric numerical models developed by the Geophysical Fluid Dynamics Laboratory (GFDL) (Putman and Lin, 2007). It is a modern and extended version of the original FV core with a cubed-sphere grid design and more computationally efficient solvers. It was selected for implementation into the GFS as the next-generation dynamical core in 2016. The GFS-FV3 v15 (GFSv15) has been operational since June 2019. The GFSv15 uses the Rapid Radiative Transfer Method for General Circulation Models (RRTMG) scheme for shortwave and longwave radiation (Mlawer et al., 1997; Iacono et al., 2000; Clough et al., 2005), the hybrid eddy-diffusivity mass-flux (EDMF) scheme for the planetary boundary layer (PBL) (National Centers for Environmental Prediction, 2019a), the Noah Land Surface Model (LSM) scheme for the land surface option (Chen et al., 1997), the simplified Arakawa-Schubert (SAS) deep convection for cumulus parameterization (Arakawa and Schubert, 1974; Grell, 1993), and a more advanced GFDL scheme for microphysics (National Centers for Environmental Prediction, 2019b). An interface preprocessor has been developed by NOAA to interpolate data, transfer coordinates, and convert the GFSv15 outputs into the data format required by CMAQv5.0.2 (Huang et al., 2018, 2019). The original outputs from GFSv15, which have a horizontal grid with 13 km resolution and a Lagrangian vertical coordinate with 64 layers in the I/O format of the NOAA Environmental Modeling System (NEMSIO) used by the NCEP models, are processed to a Lambert conformal conic projection by PREMAQ, a preprocessor, to recast the meteorological fields for CMAQ into an Arakawa C-staggering grid (Arakawa and Lamb, 1977) with a 12 km horizontal resolution and 35 vertical layers (Table 1).
The first 72 h in 12:00 UTC forecast cycles from GFSv15 are used to drive the air quality forecast by the offline-coupled GFSv15-CMAQv5.0.2 system. CMAQ has been continuously developed by the U.S. EPA since the 1990s (Byun and Schere, 2006) and has been significantly updated in many atmospheric processes since then. Chemical boundary conditions for the GFSv15-CMAQv5.0.2 system are mainly from the global 3-D model of atmospheric chemistry driven by meteorological input from the Goddard Earth Observing System (GEOS-Chem). The lateral boundary condition for dust is from the outputs of the NOAA Environmental Modeling System GFS aerosol component (NGAC) (Lu et al., 2016). The anthropogenic emissions from area, mobile, and point sources in the National Emissions Inventory of the year 2014 version 2 (NEI 2014v2) are processed by the Sparse Matrix Operator Kernel Emissions (SMOKE) modeling system. The on-road mobile sources include all emissions from motor vehicles that operate on roadways, such as passenger cars, motorcycles, minivans, sport-utility vehicles, light-duty trucks, heavy-duty trucks, and buses. On-road mobile source emissions were processed using emission factors output from the Motor Vehicle Emissions Simulator (MOVES). SMOKE uses a combination of vehicle activity data, emission factors from MOVES, meteorology data, and temporal allocation information to estimate hourly, gridded on-road emissions. The non-road, agriculture, anthropogenic fugitive dust, non-elevated oil-gas, residential wood combustion, and other sectors are included in the area sources. The sectors of airports, commercial marine vessel (CMV), electric generating units (pt_egu), point sources related to oil and gas production (pt_oilgas), point sources that are not electric generating units (EGUs) nor related to oil and gas (pt_nonipm), and point sources outside the US (pt_other) are included in the point sources. 
The sulfur dioxide (SO2) and nitrogen oxides (NOx) from point sources in NEI 2005 are projected to the year 2019 following the methods used in Tang et al. (2015, 2017). The biomass burning emission inventory from the Blended Global Biomass Burning Emissions Product system (GBBEPx) is implemented for the forecast of forest fires. The GBBEPx fire emission is treated as one type of point source. Its heat flux is derived from satellite-retrieved fire radiative power (FRP) to drive fire plume rise. The GBBEPx is a near-real-time fire dataset. The fire emission implemented in the current forecast cycle comes from historical fire observations, typically 1-2 d behind. In this system, we use land use information to classify fires as forest fires or other burning such as agricultural burning. We assume that only forest fires can last longer than 24 h, so forest fire emissions continue on day 2 and beyond, while other types of fires are dropped. The plume rise of the point sources is driven by the meteorology and allocated to the 35 elevated layers in the GFSv15-CMAQv5.0.2 system by the PREMAQ preprocessing system. Biogenic emissions are calculated inline by the Biogenic Emission Inventory System (BEIS) version 3.14 (Schwede et al., 2005). Sea-salt emission is parameterized within CMAQv5.0.2. While the deposition velocities are calculated inline, the fertilizer ammonia bidirectional flux for inline emissions and deposition velocities is turned off. Detailed configurations of photolysis, gas-phase chemistry, aqueous chemistry, and aerosol chemistry for CMAQv5.0.2 are listed in Table 1.
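The fire persistence assumption described above can be illustrated with a minimal sketch. The `Fire` record, its field names, and the `"forest"` label are hypothetical and chosen only for illustration; they are not the data structures used by GBBEPx or PREMAQ.

```python
from dataclasses import dataclass


@dataclass
class Fire:
    fire_id: str
    land_use: str    # hypothetical label, e.g. "forest" or "agriculture"
    emission: float  # daily emission in arbitrary units


def fires_for_forecast_day(fires, day):
    """Return the fires that still emit on a given forecast day (1-based).

    Day 1 keeps every observed fire. From day 2 onward only forest fires
    persist; all other fire types are dropped, following the assumption
    stated in the text.
    """
    if day < 1:
        raise ValueError("forecast day must be >= 1")
    if day == 1:
        return list(fires)
    return [f for f in fires if f.land_use == "forest"]


observed = [
    Fire("A", "forest", 12.0),
    Fire("B", "agriculture", 3.5),
]
day2_ids = [f.fire_id for f in fires_for_forecast_day(observed, 2)]
```

In this example only fire `A` remains active on day 2 of the forecast cycle.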

Datasets and evaluation protocols
A comprehensive evaluation of the GFSv15-CMAQv5.0.2 forecasting system is conducted for both meteorological and chemical variables for the year 2019, including discrete, categorical, and region-specific evaluations. The products in the first 24 h of each 72 h forecast cycle are extracted and combined as a continuous, annual forecast. The evaluation of meteorological variables is carried out for those results from PREMAQ in the GFSv15-CMAQv5.0.2 system. Detailed information for datasets used in this study is listed in Table S1 in the Supplement. Observed hourly temperature at 2 m (T2), relative humidity at 2 m (RH2), precipitation (Precip), wind direction at 10 m (WD10), and wind speed at 10 m (WS10) are obtained from the Clean Air Status and Trends Network (CASTNET) and the METeorological Aerodrome Reports (METAR) datasets. The majority of CASTNET sites are suburban and rural sites. Approximately 1900 METAR sites over CONUS are used in this study (Fig. S1 in the Supplement). For the evaluation of precipitation, a threshold of ≥ 0.1 mm h−1 is used for valid records because CASTNET and METAR have different definitions of 0.0 mm h−1 values. In CASTNET, the records without any precipitation are given as 0.0 mm h−1, the same as those records with negligible precipitation. However, in METAR, the records without any precipitation are left blank, the same as an invalid record, and negligible precipitation is recorded as 0.0 mm h−1.
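The precipitation screening described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in this study: the function name is hypothetical, and blank METAR records are assumed to be represented as `None`.

```python
def valid_precip_pairs(obs, pred, threshold=0.1):
    """Keep (obs, pred) pairs whose observed hourly precipitation
    is at or above the threshold (mm per hour).

    Blank records (None) are skipped because a blank METAR entry can
    mean either no precipitation or an invalid record. Values of 0.0
    are skipped because they mix dry and negligible-precipitation hours.
    """
    pairs = []
    for o, p in zip(obs, pred):
        if o is None:
            continue
        if o >= threshold:
            pairs.append((o, p))
    return pairs


observed = [None, 0.0, 0.1, 2.5]
predicted = [0.3, 0.2, 0.0, 1.0]
pairs = valid_precip_pairs(observed, predicted)
```

Because the screening is based on the observed value only, hours in which the model predicts precipitation that was not observed drop out of the hourly statistics.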
The air quality forecasting products that are evaluated include hourly O3, hourly PM2.5, maximum daily 8 h average O3 (MDA8 O3), and daily average PM2.5 (24 h average PM2.5) for chemical forecast. The AIRNow dataset is used for observed hourly O3 and PM2.5. We utilize the quality assurance/quality control (QA/QC) information from the AIRNow dataset to filter the invalid records. Remote-sensing data from the Global Precipitation Climatology Project (GPCP) and the Climatology-Calibrated Precipitation Analysis (CCPA) (Hou et al., 2014; Zhu and Luo, 2015) datasets are also used for the evaluation of precipitation. GPCP is a global precipitation dataset with a spatial resolution of 0.25° and a monthly temporal resolution. The CCPA uses linear regression and downscaling techniques to generate an analysis product of precipitation from two datasets: the NCEP Climate Prediction Center Unified Global Daily Gauge Analysis and the NCEP EMC Stage IV multi-sensor quantitative precipitation estimations (QPEs). The CCPA product with a spatial resolution of 0.125° and a temporal resolution of 1 h is used in this study. Satellite-based aerosol optical depth (AOD) at 550 nm from the Moderate Resolution Imaging Spectroradiometer (MODIS) Terra platform (Levy and Hsu, 2015) is used for the evaluation of monthly AOD. Statistical measures such as the mean bias, the root mean square error (RMSE), the normalized mean bias, the normalized mean error (NME), and the correlation coefficient are used; more details about evaluation protocols are found in Zhang et al. (2009, 2016). The Taylor diagram (Taylor, 2001), which includes the correlations, NMBs, and the normalized standard deviations (NSDs), is used to present the overall performance (Wang et al., 2015). The NMBs ≤ 15% and NMEs ≤ 30% by Zhang et al. (2006) and NMBs (≤ 15% and ≤ 30%), NMEs (≤ 25% and ≤ 50%), and Corr (> 0.5 and > 0.4) for MDA8 O3 and 24 h PM2.5, respectively, by Emery et al. (2017) are regarded as performance criteria. Monthly, seasonal, and annual statistics and analysis are included. Seasonal analysis for O3 is separated into an O3 season (May-September) and a non-O3 season (January-April and October-December).
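The discrete statistics named above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function and dictionary names are hypothetical, the NSD is taken as the ratio of predicted to observed standard deviation, and the NMB criterion is applied to the absolute value of the NMB.

```python
import math


def discrete_stats(pred, obs):
    """Return MB, RMSE, NMB (%), NME (%), Corr, and NSD for paired data."""
    if len(pred) != len(obs) or len(obs) < 2:
        raise ValueError("need at least two paired values")
    n = len(obs)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    diffs = [p - o for p, o in zip(pred, obs)]
    ss_p = sum((p - mean_p) ** 2 for p in pred)  # sum of squares, predictions
    ss_o = sum((o - mean_o) ** 2 for o in obs)   # sum of squares, observations
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    return {
        "MB": sum(diffs) / n,
        "RMSE": math.sqrt(sum(d * d for d in diffs) / n),
        "NMB": 100.0 * sum(diffs) / sum(obs),
        "NME": 100.0 * sum(abs(d) for d in diffs) / sum(obs),
        "Corr": cov / math.sqrt(ss_p * ss_o),
        "NSD": math.sqrt(ss_p / ss_o),  # assumed: ratio of standard deviations
    }


# Thresholds quoted in the text from Emery et al. (2017):
# (max |NMB| in %, max NME in %, min Corr)
CRITERIA = {"MDA8_O3": (15.0, 25.0, 0.5), "PM25_24h": (30.0, 50.0, 0.4)}


def meets_criteria(stats, product):
    """Check NMB, NME, and Corr against the criteria for one product."""
    nmb_max, nme_max, corr_min = CRITERIA[product]
    return (abs(stats["NMB"]) <= nmb_max
            and stats["NME"] <= nme_max
            and stats["Corr"] > corr_min)


example = discrete_stats([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

The sketch assumes that neither series is constant and that the observed sum is nonzero; a production implementation would need to guard against both cases.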
Analysis for 10 CONUS regions, defined by the U.S. EPA (http://www.epa.gov/aboutepa, last access: 10 August 2020), is included and listed in Fig. S1c in the Supplement.
The metrics of false alarm ratio (FAR) and hit rate (H) are used for categorical evaluation (Kang et al., 2005; Barnes et al., 2009). Observed and forecasted MDA8 O3 and 24 h average PM2.5 values are divided into four classes based on whether the predicted and/or observed data fall above or below the air quality index (AQI) thresholds: (a) observed values ≤ thresholds and predicted values > thresholds, (b) observed and predicted values > thresholds, (c) observed and predicted values ≤ thresholds, and (d) observed values > thresholds and predicted values ≤ thresholds. With a, b, and d denoting the numbers of data pairs in the corresponding classes, the FAR and H are defined in Eqs. (1) and (2):
\[ \mathrm{FAR} = \frac{a}{a + b} \times 100\,\% \tag{1} \]
\[ H = \frac{b}{b + d} \times 100\,\% \tag{2} \]

3 Evaluation of model forecast skills

Evaluation of meteorological forecasts
Discrete performance evaluation is conducted for postprocessed meteorological fields from the GFSv15-CMAQv5.0.2 system (Table 2). The GFSv15 can predict the boundary layer meteorological variables well. It has overall cold biases and wet biases for annual T2 and RH2 in 2019, respectively. It also overpredicts WS10 and underpredicts hourly precipitation. Despite the CASTNET siting being slightly different from that of METAR, the annual and most of the seasonal performance for the model shows a similar pattern in terms of bias for both the CASTNET and METAR networks. The mean biases of T2 are mostly within ±0.5 °C except those in February and March compared to CASTNET (Table S2 in the Supplement). Underprediction is generally larger compared to CASTNET than to METAR. In the spatial distribution of MB for seasonal T2 compared to METAR (Fig. S2 in the Supplement), cold biases are mainly found in the Midwest and western US, where most of the CASTNET sites are located. GFSv15 usually underpredicts T2 on the west coast, the mountain states, and the Midwest. Overpredictions of T2 in the states of Kansas, Oklahoma, the areas near the east coast, and the Gulf Coast offset some underpredictions, resulting in smaller mean biases but a similar RMSE for the model compared to METAR as opposed to that compared to CASTNET. The difference between observed T2 from the two datasets is larger in cooler months than in warmer months. The largest underpredictions occur in the spring (March, April, May; MAM) season. In general, GFSv15 underpredicts T2 for both CASTNET and METAR, consistent with cold biases found in other studies using GFSv15 (e.g., Yang, 2019). Such underpredictions will affect chemical forecasts, especially the forecast of O3. Consistent with the overall underpredictions of T2, GFSv15 overpredicts RH2 in general.
The largest overprediction is found in spring (MBs of 3.4% and 2.7% with CASTNET and METAR, respectively), corresponding to the largest underprediction of T2 in spring (MBs of −0.5 and −0.4 °C with CASTNET and METAR, respectively). GFSv15 shows moderately good performance when predicting wind. The annual MB and NMB of WS10 compared to METAR are 0.4 m s−1 and 10.7%, respectively. A larger overprediction of WS10 has been found with CASTNET than with other datasets (Zhang et al., 2016). GFSv15-CMAQv5.0.2 also gives higher overpredictions for CASTNET compared to METAR. The largest biases in wind speed are found in summer. GFSv15-CMAQv5.0.2 gives the largest cold biases and wet biases in spring, indicating the necessity of improving model performance in such seasons in future GFS-FV3 development.
With the threshold of ≥ 0.1 mm h−1 adopted, performance compared to CASTNET and METAR shows similar results: a large underprediction in hourly precipitation. Predicted monthly accumulated precipitation shows consistency in spatial distribution with observations from CCPA and GPCP (Fig. S3 in the Supplement). The high precipitation in the southeast is captured well in spring, while the high precipitation in the Midwest and south is captured well in other seasons. This indicates that GFSv15-CMAQv5.0.2 has good performance in capturing the spatial distributions of accumulated precipitation but has poor performance in predicting hourly precipitation. The precipitation from the original FV3 outputs is recorded as 6 h accumulated precipitation. Artificial errors were introduced to the forecast by an issue in precipitation preprocessing during the early stage of the development of the GFSv15-CMAQv5.0.2 system: the precipitation at the first hour of the 6 h cycle was occasionally dropped. We corrected this issue, and the hourly precipitation still shows a large underprediction compared to surface monitoring networks (Fig. S4 in the Supplement). This indicates the difficulty for the forecast system in capturing the temporal variation of precipitation, especially during summer. During the summer season, the discrepancy in capturing short-term heavy rainfall worsens the model performance in predicting hourly precipitation. In addition, we use the threshold of 0.1 mm h−1 to filter the valid records. If the model predicts precipitation that did not occur, the record is excluded from the statistics calculation. However, all the predicted precipitation is counted in the spatial evaluation against the ensemble datasets of GPCP and CCPA. Therefore, the spatial performance of monthly accumulated precipitation shows better agreement than the hourly statistics.
An overall comparison of performance with the CASTNET and METAR datasets is performed using a Taylor diagram (Fig. 1). The NSDs, Corrs, and NMBs are considered. The NSDs are ratios of the standard deviation of predicted values to the standard deviation of observed values, following the equations by Wang et al. (2015). The NSDs represent the amplitude of variability. The closer the NSD is to 1, the closer the variability of the predicted values is to that of the observed values. Consistent with other analysis in this section, larger biases and lower correlation in model wind speed and wind direction are found for CASTNET compared to METAR. The amplitude of variability of WS10 compared to CASTNET is overpredicted (with the NSD larger than 1), while it is underpredicted compared to METAR. Because of the postprocessing smearing of hourly precipitation, the variance of predicted precipitation is smaller than the observed one, leading to very small NSDs for precipitation.
(Table 3)
As an important surrogate for fugitive dust, the spatial distribution of large PMC emissions is associated with the regions that have significant overprediction in cooler months. In reality, the meteorological conditions could greatly impact the amount and characteristics of anthropogenic fugitive dust. For example, the snow cover and the soil moisture are important factors in calculating the dust emissions in SMOKE. However, the anthropogenic fugitive dust implemented in this GFSv15-CMAQv5.0.2 system was not adjusted by the precipitation and snow cover. This leads to a significant overestimation of the anthropogenic dust emission. The impact of the meteorological factors on anthropogenic fugitive dust emission and the PM2.5 prediction will be further discussed in Sect. 4. Murphy et al. (2017) found that secondary organic aerosols (SOAs) generated from anthropogenic combustion emissions were important missing PM sources in California prior to CMAQv5.2.
The largest underpredictions of PM2.5 occur in the southeast in summer. Biogenic volatile organic compounds (BVOCs) and biogenic SOA (BSOA) are most active in the southeast region in summer. Many missing sources and mechanisms for SOA formation from BVOCs have been identified in recent years (Pye et al., 2015; Xu et al., 2018) and have resulted in significant improvements in predicting SOA in the southeast using CMAQv5.1 through v5.3. Anthropogenic emissions and aerosol inorganic compounds were found to have impacts on BSOA (Carlton et al., 2018; Pye et al., 2018, 2019). Such interactions and mechanisms are not represented sufficiently in CMAQv5.0.2, further enhancing the biases in predicted PM2.5 in the southeast. The evaluation of predicted AOD compared to observations from MODIS is shown in Fig. 4. High predicted AOD in the Midwest during cooler months shows consistency with MODIS and corresponds to high surface PM2.5 predictions. High predicted AOD is missing in California, corresponding to the underprediction of surface PM2.5 in California. In summer months, AOD is greatly underpredicted in California and the southeast, which may be caused by the previously mentioned missing sources of SOA.

Categorical evaluation
A categorical evaluation is conducted to quantify the accuracy of the GFSv15-CMAQv5.0.2 system in predicting events in which the air pollutants exceed moderate or unhealthy categories for the US AQI (http://www.airnow.gov, last access: 10 August 2020). The scatterplots for predicted and observed MDA8 O3 and 24 h average PM2.5 are shown in Fig. 5a and b, respectively. The numbers of points in the four areas (a) to (d) are those used in Eqs. (1) and (2). Most of the false alarms occur when observed 24 h average PM2.5 is lower than 20 μg m−3 and the predicted values are higher than 20 μg m−3. This shows the poorer performance in correctly capturing the category of Unhealthy for sensitive groups due to the significant overprediction of PM2.5 in cooler months.
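The categorical metrics can be sketched as follows, assuming the standard contingency-table definitions FAR = a/(a + b) and H = b/(b + d), expressed in percent, with classes (a) to (d) as defined in Sect. 2.2. The function names and the threshold in the example are illustrative only and are not taken from this study.

```python
def categorical_counts(obs, pred, threshold):
    """Count data pairs in classes (a) false alarm, (b) hit,
    (c) correct negative, and (d) miss for one AQI threshold."""
    a = b = c = d = 0
    for o, p in zip(obs, pred):
        if o <= threshold and p > threshold:
            a += 1
        elif o > threshold and p > threshold:
            b += 1
        elif o <= threshold and p <= threshold:
            c += 1
        else:  # observed exceedance that the forecast missed
            d += 1
    return a, b, c, d


def far_and_hit_rate(a, b, d):
    """Return (FAR, H) in percent; NaN when a denominator is zero."""
    far = 100.0 * a / (a + b) if (a + b) else float("nan")
    hit = 100.0 * b / (b + d) if (b + d) else float("nan")
    return far, hit


# Illustrative 24 h average PM2.5 values and threshold (not from this study)
observed = [10.0, 40.0, 50.0, 20.0, 30.0]
predicted = [40.0, 45.0, 30.0, 10.0, 36.0]
a, b, c, d = categorical_counts(observed, predicted, threshold=35.0)
far, hit = far_and_hit_rate(a, b, d)
```

In this example two of the three forecast exceedances are false alarms, and one of the two observed exceedances is captured.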
Major RT-AQF systems over the world were comprehensively reviewed in Zhang et al. (2012a, b). Here we include a comparison with more recent air quality forecasting studies. Table S3 summarizes air quality forecasting skills reported in the literature from assessments of other air quality forecasting studies from Canada (Moran et al., 2018; Russell et al., 2019), Europe (Struzewska et al., 2016; D'Allura et al., 2018; Podrascanin, 2019; Spiridonov et al., 2019; Stortini et al., 2020), East Asia (Lyu et al., 2017; Zhou et al., 2017; Peng et al., 2018; Ha et al., 2020), and CONUS (Kang et al., 2010; Zhang et al., 2016; Lee et al., 2017), along with that from this work. For those studies with data assimilation in air quality forecasting, the performance from the raw results without data assimilation is presented. The performance in predicting O3 and PM varies greatly between model systems.
The discrete and categorical performance in O3 prediction is not significantly better than that in PM prediction. O3 tends to be slightly overpredicted on an annual basis or for the warmer months. Kang et al. (2010). Overpredicted PM2.5 was also found when using the historical 2005 NEI in a forecast for January 2015. The performance was improved by updates of the 2011 NEI and real-time dust and wildfire emissions. This indicates the need to improve our emission inventory. As for the categorical performance in regions other than CONUS, the air quality standards vary (Oliveri Conti et al., 2017). For example, the National Ambient Air Quality Standards (NAAQSs), the Ambient Air Quality and Cleaner Air for Europe (CAFE) Directive ( The overall FAR and detection rate for the four categories are 59.0% and 36.1%, respectively. Although the metrics of the FAR and detection rate were defined for the four categories, rather than for every single category as in this study, the categorical performance is comparable with our results. In general, the discrete and categorical performance of O3 forecast in this study is comparable to that of the air quality forecasting systems in many regions of the world. However, the PM forecasts vary greatly between studies. While our GFSv15-CMAQv5.0.2 system shows consistent performance with the systems covering CONUS, the high FAR and low H for the Unhealthy for sensitive groups category with higher thresholds indicate that the categorical performance could be further improved by addressing the significant overprediction during cooler months in this study.

Region-specific evaluation
As discussed in Sect. 3.2, biases in predicted O3 and PM2.5 vary from region to region. To further analyze the region-specific performance of the GFSv15-CMAQv5.0.2 system, an evaluation for 10 regions within CONUS is conducted. By identifying the detailed characteristics of region-specific biases and indicating the underlying causes for such biases, this section aims to help the NAQFC to improve its forecast ability for specific regions. Consistent with the analysis in Sect. 3.2, PM2.5 is significantly overpredicted in most of the regions except in regions 4, 6, and 9 (Fig. 6c). The underprediction during warmer months, likely due to missing sources and mechanisms for BSOA, compensates for the annual biases in regions 4 and 6, leading to smaller MBs or NMBs but low correlations in these regions. The variability in predictions is much larger than in observations, with the NSDs > 1 for all regions (Fig. 6d). The forecast system has the best performance in region 9 with an NSD of 1.2, an NMB of −12.0%, and a Corr of 0.40. Figure S8 in the Supplement shows the time series of 24 h average PM2.5 in the 10 CONUS regions. The gaps between observed and predicted curves are large in cooler months, but the GFSv15-CMAQv5.0.2 system has relatively good performance in warmer months for most of the regions. Less overprediction is found in regions 6 and 9 during cooler months, and those regions generally show the best performance (see Taylor diagram). The different biases across the regions further indicate that multiple factors likely contribute to them.

Meteorology-chemistry relationships
We further quantify the meteorology-chemistry relationships by conducting the region-specific evaluation of the meteorological variables. The regional performance for the major variables is shown in Fig. S9 in the Supplement. The regional biases in T2 predictions are highly correlated with the regional biases in MDA8 O3. This indicates that the cold biases in the Midwest (including region 5) and the warm biases near the Gulf coast (including regions 4 and 6) are important factors for the O3 underprediction and overprediction in those regions, respectively. The O3-temperature relationship has been reported previously (Sillman and Samson, 1995; Sillman, 1999). O3 is expected to increase with increasing temperature within a specific range of temperature (Bloomer et al., 2009; Shen et al., 2016). The surface MDA8 O3-temperature relationship was found to be approximately 3-6 ppb K−1 in the eastern US (Rasmussen et al., 2012). According to such relationships, the biases in T2 predictions could explain a large portion of the O3 biases. Heavy convective precipitation and tropical cyclones have a large impact in the southeastern US, which covers mainly regions 4 and 6. Therefore, the performance in precipitation predictions is lower in those two regions than in other regions, as discussed in Sect. 3.1 regarding the model performance in capturing short-term heavy rains during summer. Meanwhile, the performance in wind predictions in regions 4 and 6 is relatively poor. Such performance in the meteorological predictions is consistent with the mixed performance in PM2.5 prediction in regions 4 and 6. The low temporal agreement, shown by the low correlations of predicted PM2.5 in those two regions, can be attributed to the discrepancy in meteorological inputs, mainly in precipitation and wind.
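As a rough illustration of how a temperature bias translates into an O3 bias, the sketch below scales a T2 bias by the 3-6 ppb K−1 sensitivity range cited above. The linear scaling is a first-order assumption that holds only within the temperature range where the relationship applies, and the T2 bias of −0.5 K used in the example is a hypothetical value, not a result from this study.

```python
# MDA8 O3-temperature sensitivity range for the eastern US (ppb per K),
# as cited in the text from Rasmussen et al. (2012).
SENSITIVITY_RANGE_PPB_PER_K = (3.0, 6.0)


def o3_bias_from_t2_bias(t2_bias_k, sensitivity=SENSITIVITY_RANGE_PPB_PER_K):
    """Return the (lower, upper) O3 bias in ppb implied by a T2 bias in K.

    Assumes a linear response, which is only a first-order approximation.
    """
    low, high = (s * t2_bias_k for s in sensitivity)
    return (min(low, high), max(low, high))


# Hypothetical regional cold bias of -0.5 K.
print(o3_bias_from_t2_bias(-0.5))
```

Under these assumptions, a cold bias of half a kelvin would account for an O3 underprediction of roughly 1.5 to 3 ppb.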

Major biases in O3 predictions
Prediction and simulation of O3 in coastal or marine areas are impacted by halogen chemistry and emissions (Adams and Cox, 2002; Sarwar et al., 2012; Liu et al., 2018), including bromine and iodine chemistry (Foster et al., 2001; Sarwar et al., 2015; Yang et al., 2020) and oceanic halogen emissions (Watanabe, 2005; Tegtmeier et al., 2015; He et al., 2016). CMAQv5.0.2 only has simple chlorine chemistry for CB05 mechanisms, and the reduction of O3 by reaction with bromine and iodine is not included in CMAQv5.0.2. Iodide-mediated O3 deposition over seawater and detailed marine halogen chemistry have been found to reduce O3 by 1-4 ppb near the coast (Gantt et al., 2017), suggesting that the missing halogen chemistry and O3 deposition processes contribute to the overpredicted O3 in coastal and marine areas seen here. Coastal and marine areas are also impacted by air-sea interaction processes, which are simply represented in the current meteorological models without coupling oceanic models (Y. Zhang et al., 2019a, b). For example, coastal O3 mixing ratios are impacted by predicted sea surface temperatures and land-sea breezes through their influence on chemical reaction conditions and diffusion processes. As discussed in Sects. 3.1 and 4.1, the GFSv15-CMAQv5.0.2 system has poorer performance in predicting the meteorological variables in regions 4 and 6, which could contribute to biases in O3 predictions directly or indicate missing land-sea breezes and thus missing transport effects in the GFSv15-CMAQv5.0.2 air quality forecasting system.
In addition to the impact of meteorological biases and missing halogen chemistry on the O3 overprediction near the Gulf coast, overestimated volatile organic compound (VOC) emissions could enhance the O3 biases. Anthropogenic VOC emissions decrease continuously from historical NEIs to the 2016 NEI (http://views.cira.colostate.edu/wiki/wiki/10202/inventory-collaborative-2016v1-emissions-modeling-platform, last access: 10 October 2020). We compare the VOC emissions between the 2016 NEI and the emissions used in this study. The difference in the elevated source of pt_oilgas is shown in Fig. S10 in the Supplement. The Gulf coast is impacted by the oil and gas sector due to the oil and gas fields and the exploration activity near it. By comparing the newer NEI to the NEI currently used in the system, we found that overestimated VOC emissions could be one contributor to the O3 overprediction near the Gulf coast, because only SO2 and NOx are projected from the 2005 NEI to 2019, while the VOCs for the elevated sources are not projected. The monthly VOC emissions from the pt_oilgas sector for July in regions 4 and 6 are 2876.0 t month−1 in the emissions used in this study, while they are 2497.0 t month−1 in the 2016 NEI. The reduction is mainly located along the coastline, where the significant overprediction takes place. This indicates the complicated combined effect of meteorological biases, missing gas-phase chemistry, and the overestimation of emissions on the O3 prediction in these regions.
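The size of the pt_oilgas VOC difference quoted above can be expressed as a relative excess over the 2016 NEI. The short sketch below performs that arithmetic with the two July totals from the text; the function name is hypothetical, and the percentage is a derived illustration rather than a value reported in the study.

```python
def relative_excess_percent(value, reference):
    """Return how much `value` exceeds `reference`, in percent of `reference`."""
    return 100.0 * (value - reference) / reference


# July pt_oilgas VOC emissions in regions 4 and 6 (t month-1), from the text.
used_in_forecast = 2876.0
nei_2016 = 2497.0
excess = relative_excess_percent(used_in_forecast, nei_2016)
print(f"Excess over the 2016 NEI: {excess:.1f}%")
```

The emissions used in the forecast are therefore about 15% higher than the 2016 NEI for this sector and month.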
The O3 concentration is underpredicted for the northeast, mid-Atlantic, Midwest, mountainous states, and northwest (mainly corresponding to regions 1, 3, 5, 8, and 9) during the non-O3 season. A large difference in dry-deposition algorithms between CMAQv5.0.2 and other common parameterizations has been reported (Park et al., 2014; Wu et al., 2018). A large discrepancy between the dry-deposition velocity of O3 modeled by CMAQv5.0.2 and the observations during winter was shown and attributed to the deposition to snow surfaces. Revising the treatment of deposition to snow, vegetation, and bare ground in CMAQv5.0.2 was shown to yield an improvement. Lower deposition to snow was found to improve the consistency between the O3 deposition modeled by CMAQv5.0.2 and the observations. Therefore, the dry-deposition module in v5.0.2 needs to be updated and improved for a more accurate representation of low-to-moderate O3 mixing ratios (Appel et al., 2021). For the cases in this study, the predicted snow cover for January (winter) and April (spring) is shown in Fig. 7a and b. The underpredicted O3 during the non-O3 season may be caused by the overestimated O3 deposition to snow in the northern regions, corresponding to the aforementioned regions 1, 3, 5, 8, and 9. The mixed effects of the temperature-O3 relationship discussed above and the large deposition to snow contribute to the moderate O3 underpredictions.

Major biases in PM2.5 predictions
Major biases in PM2.5 prediction are distinguished for warmer and cooler months in Sect. 3. To further analyze the underlying causes for varied patterns and performance on a season- and region-specific basis, diurnal evaluations for PM2.5 and chemical components of PM2.5 during the O3 season and the non-O3 season are shown in Fig. 8. GFSv15-CMAQv5.0.2 has a large seasonal variation in diurnal PM2.5, inconsistent with the observations. While PM2.5 is underpredicted during daytime in regions 4, 6, 8, and 9 during the O3 season, PM2.5 is overpredicted throughout the day during the non-O3 season except for region 9. Increased organic carbon (OC), particulate nitrates, soil, and unspecified coarse-mode components contribute to most of the increase in predicted total PM2.5. The general cold biases over CONUS, especially in region 5, could make the GFSv15-CMAQv5.0.2 system predict higher nitrate particulates, leading to a larger increase in PM2.5 from the O3 season to the non-O3 season. Emissions vary from month to month in the year (Fig. S11a in the Supplement). There are larger emissions for NH3, NOx, VOC, primary coarse PM, and primary PM2.5 in the O3 season compared to the non-O3 season. Primary organic carbon (POC) emissions are higher in the O3 season. Changes in emissions are not fully consistent with the changes in PM2.5 components, indicating that other biases or uncertainty could also contribute to the significant overprediction during the non-O3 season. For example, the implementation of a bidirectional flux of NH3 and the boundary layer mixing processes under more stable conditions (during the non-O3 season) in the GFSv15-CMAQv5.0.2 system need to be further studied. Pleim et al. (2013, 2019) found that, by implementing the bidirectional flux of NH3, the NH3 fluxes and concentrations could be better simulated and the monthly variations in NH3 concentrations were larger compared to the raw model.
The absolute biases for diurnal PM2.5 are generally larger during nighttime in most of the regions, except for region 9. This is consistent with the analysis by Appel et al. (2013), which suggested that further efforts are needed to improve nighttime mixing in CMAQv5.0, and it further indicates the need to improve CMAQ in predicting dispersion and mixing of air pollutants under stable boundary layer conditions. The forecast system gives the highest PM predictions at two peaks during the day: 06:00 and 19:00 LST in the O3 season and 07:00 and 20:00 LST in the non-O3 season, a 1 h offset corresponding to the shift between daylight saving time and LST. The two diurnal peaks are caused by the diurnal pattern of emissions (Fig. S11b). PM is mostly emitted during the daytime from 06:00 to 18:00. With the development of the boundary layer during the daytime, surface PM2.5 concentrations are reduced by diffusion. During dawn and dusk, the boundary layer transitions between stable and well-mixed conditions. The increased emissions and secondary production of PM2.5 accumulate within the boundary layer, causing the high peaks during dawn and dusk.
The variation in predicted PM2.5 composition between cooler and warmer months indicates that major seasonal biases are caused by multiple factors. We introduce the Air Quality System (AQS) dataset for the evaluation of daily PM2.5 composition to provide additional insight into the specific reasons. Figure 9 shows the biases of the key PM2.5 composition for the cooler month of January and the warmer month of July. While the overall mean biases of  (Appel et al., 2013). During the cooler month, the significant overprediction in PM2.5 is mainly attributed to the overprediction in OC and SOIL. During warmer months, the overprediction of SOIL and sulfate compensates for the overall underprediction in OC in v5.0.2, leading to the moderate PM2.5 underprediction in the southeast but slight overprediction in the Midwest, mid-Atlantic, and the northeast. These high PM2.5 SOIL concentrations are consistent in spatial characteristics with large emissions of anthropogenic primary PM2.5 and primary coarse PM in the Midwest, northeast, and northwest. The underprediction in PM2.5 OC during summer compensates for the overestimation in dust during cooler months, resulting in the overall biases with an annual NMB of 30.0%.
The large emissions of anthropogenic primary coarse PM as well as wind-blown dust are the major sources of the predicted PM2.5 SOIL components. Appel et al. (2013) indicated that CMAQ overpredicted soil components in the eastern United States partially due to the anthropogenic fugitive dust and wind-blown dust emissions. The overprediction in PM2.5 soil components by our forecast system could mainly be attributed to the overestimation of the anthropogenic fugitive dust emission because the meteorological conditions were not included in processing the anthropogenic fugitive dust sector. The dust-related components of aluminum, calcium, iron, titanium, silicon, and coarse-mode particles are overestimated in the regions with snow and precipitation, especially during winter, early spring, and late autumn with snow cover in the north. This contributes to the PM2.5 overprediction, with a more pronounced temporal-spatial pattern in the northern US during cooler months.
An adjustment for precipitation and snow cover was implemented for fugitive dust in the operational NAQFC. The dust-related PM emissions are scaled by a factor of 0.01 when the snow cover is higher than 25% or the hourly precipitation is higher than 0.1 mm h−1 before they are used as input for the CMAQv5.0.2 forecast. We conduct a sensitivity simulation for January 2019 using the GFSv15-CMAQv5.0.2 system with the adjustment implemented in the operational NAQFC. Figure 7c shows that the PM2.5 overprediction in the northern regions 1, 2, 5, and 10 during January is greatly reduced, corresponding to the spatial-temporal characteristics of snow cover. The monthly MB and NMB for January improve from 5.5 μg m−3 and 66.9% to 2.1 μg m−3 and 24.0%, respectively. The improvement is mainly attributed to the decrease in overpredictions in PM2.5 soil components, with MBs decreasing from 3.3 to 1.2 μg m−3 for January (Fig. 7d). The overprediction in the northeast and northwest during spring is expected to be reduced by the suppression of fugitive dust by snow during early spring. This indicates the importance of including the meteorological forecast in processing the emission of anthropogenic fugitive dust. The emission should be calculated inline or be adjusted by the meteorological forecast.
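The adjustment rule described above can be sketched as a simple per-cell mask. This is a minimal illustration under stated assumptions, not the operational NAQFC code: the thresholds (25% snow cover, 0.1 mm h−1 precipitation) and the factor of 0.01 come from the text, whereas the function name, the per-grid-cell scalar interface, and the sample values are hypothetical.

```python
SNOW_COVER_THRESHOLD_PCT = 25.0    # snow cover above this suppresses dust (from the text)
PRECIP_THRESHOLD_MM_PER_H = 0.1    # hourly precipitation above this suppresses dust
SUPPRESSION_FACTOR = 0.01          # scaling applied to dust-related PM emissions


def adjust_fugitive_dust(emission, snow_cover_pct, precip_mm_per_h):
    """Scale a dust-related PM emission for one grid cell and hour.

    Hypothetical helper: the emission is multiplied by 0.01 when either
    meteorological threshold is exceeded and is returned unchanged otherwise.
    """
    suppressed = (
        snow_cover_pct > SNOW_COVER_THRESHOLD_PCT
        or precip_mm_per_h > PRECIP_THRESHOLD_MM_PER_H
    )
    return emission * SUPPRESSION_FACTOR if suppressed else emission


# Hypothetical cells: (emission, snow cover in %, precipitation in mm h-1).
cells = [
    (100.0, 80.0, 0.0),   # snow-covered: suppressed
    (100.0, 0.0, 0.5),    # raining: suppressed
    (100.0, 10.0, 0.0),   # dry with little snow: unchanged
]
adjusted = [adjust_fugitive_dust(*cell) for cell in cells]
print(adjusted)
```

Because the rule depends on forecast snow cover and precipitation, it has to be applied after the meteorological forecast is available, which is why the text recommends an inline or meteorology-adjusted treatment.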
In CMAQv5.0.2, the primary organic aerosol (POA) is treated as non-volatile. The emissions of semivolatile and intermediate-volatility organic compounds (S/IVOCs) and their contributions to the SOA are not accounted for in the aerosol module. In recent versions of CMAQ, two approaches linked to POA sources have been implemented. One introduces semivolatile partitioning and gas-phase oxidation of POA emissions. The other (called pcSOA) accounts for multiple missing sources of anthropogenic SOA formation, including potential missing oxidation pathways and emissions of IVOCs. These two improvements lead to increased organic carbon concentrations in summer but decreased concentrations in winter. The changes vary by season as a result of differences in volatility (as dictated by temperature and boundary layer height) and reaction rate between winter and summer. Therefore, the missing S/IVOCs and related SOA chemistry in v5.0.2 are key reasons for the OC overprediction during cooler months and the OC underprediction during warmer months.

Conclusions
In this work, the air quality forecast for the year 2019 predicted by the offline-coupled GFSv15-CMAQv5.0.2 system is comprehensively evaluated. The GFSv15-CMAQv5.0.2 system is found to perform well in predicting surface meteorological variables (temperature, relative humidity, and wind) and O3 but has mixed performance for PM2.5. Moderate cold biases and wet biases are found in the spring season, especially in March. While the GFSv15-CMAQv5.0.2 system can generally capture the monthly accumulated precipitation compared to remote-sensing and ensemble datasets, temporal distributions of hourly precipitation show less consistency with in situ monitoring data.
MDA8 O3 is slightly overpredicted in the O3 season and slightly underpredicted in the non-O3 season. The significant overprediction near the Gulf coast is associated with the missing halogen chemistry, overestimated emissions of precursors, and the poorer meteorological performance, which could be attributed to the missing model representation of the air-sea interaction processes. In the nationwide metrics for the O3 season, this overprediction compensates for the underprediction in the west and Midwest. A slight underprediction is found during the non-O3 season, indicating the impact of cold biases of T2 and the overestimated dry deposition to the snow surface. GFSv15-CMAQv5.0.2 has poorer performance in predicting PM2.5 than in predicting O3. Significant overpredictions are found in cooler months, especially in winter. The largest overprediction is shown in the Midwest and the states of Washington and Oregon, due mainly to high concentrations of predicted fine fugitive, coarse-mode, and OC components. The lack of snow-cover suppression of anthropogenic fugitive dust emissions and the non-volatile approach for POA emissions contribute a major portion of the overprediction in winter. Meanwhile, the forecasting system may be improved in the next development of the next-generation NAQFC by updating the emissions inventory used (i.e., NEI 2014) to NEI 2016v2 or NEI 2017, which are more representative of the year 2019.
Categorical evaluation indicates that GFSv15-CMAQv5.0.2 can capture well the air quality classification of the Moderate category described by the AQI. However, the categorical performance is poorer for PM2.5 at the Unhealthy for sensitive groups threshold due mainly to the significant overprediction during the cooler months. Region-specific evaluation further discusses the biases and underlying causes in the 10 U.S. EPA defined regions in CONUS. An update from CMAQv5.0.2 to v5.3.1 is expected to alleviate potential errors in missing sources and mechanisms for SOA formation. The variations in performance between the O3 and non-O3 seasons, as well as between daytime and nighttime, indicate that further studies need to be conducted to improve boundary layer mixing processes within GFSv15-CMAQv5.0.2. The varied region-specific performance indicates that improvements, such as bias corrections, should be considered individually from region to region in the subsequent development of the next-generation NAQFC.
We have used bias analyses in this work to identify several areas of weakness in the GFSv15-CMAQv5.0.2 system for further improvement and development of next-generation NAQFC. The ability of FV3-based GFS in driving the real-time air quality forecasting is demonstrated. Further studies are still needed to improve the accuracy in meteorological forecast, the emissions, the aerosol chemistry, and the boundary layer mixing for the future GFS-FV3-CMAQ system.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.