Methods for assessment of models 12 Mar 2019
Methods for assessment of models  12 Mar 2019
A new method (M^{3}Fusion v1) for combining observations and multiple model output for an improved estimate of the global surface ozone distribution
 ^{1}National Research Council Research Associateship Program, David Skaggs Research Center, Boulder, CO, USA
 ^{2}NOAA Earth System Research Laboratory, Boulder, CO, USA
 ^{3}Cooperative Institute for Research in Environmental Sciences, University of Colorado, Boulder, CO, USA
 ^{4}Department of Environmental Sciences & Engineering, University of North Carolina, Chapel Hill, NC, USA
 ^{5}Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, Jülich, Germany
 ^{6}NOAA Geophysical Fluid Dynamics Laboratory, Princeton, NJ, USA
 ^{7}Program in Atmospheric and Oceanic Sciences, Princeton University, Princeton, NJ, USA
 ^{8}MétéoFrance, Centre National de Recherches Météorologiques, Toulouse, France
 ^{9}Meteorological Research Institute (MRI), Tsukuba, Japan
 ^{10}Graduate School of Environmental Studies, Nagoya University, Nagoya, Japan
 ^{11}Japan Agency for MarineEarth Science and Technology (JAMSTEC), Yokosuka, Japan
 ^{12}NASA Goddard Space Flight Center, Greenbelt, MD, USA
 ^{13}Universities Space Research Association, Columbia, MD, USA
 ^{14}John A. Paulson School of Engineering and Applied Science, Harvard University, Cambridge, MA, USA
 ^{1}National Research Council Research Associateship Program, David Skaggs Research Center, Boulder, CO, USA
 ^{2}NOAA Earth System Research Laboratory, Boulder, CO, USA
 ^{3}Cooperative Institute for Research in Environmental Sciences, University of Colorado, Boulder, CO, USA
 ^{4}Department of Environmental Sciences & Engineering, University of North Carolina, Chapel Hill, NC, USA
 ^{5}Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, Jülich, Germany
 ^{6}NOAA Geophysical Fluid Dynamics Laboratory, Princeton, NJ, USA
 ^{7}Program in Atmospheric and Oceanic Sciences, Princeton University, Princeton, NJ, USA
 ^{8}MétéoFrance, Centre National de Recherches Météorologiques, Toulouse, France
 ^{9}Meteorological Research Institute (MRI), Tsukuba, Japan
 ^{10}Graduate School of Environmental Studies, Nagoya University, Nagoya, Japan
 ^{11}Japan Agency for MarineEarth Science and Technology (JAMSTEC), Yokosuka, Japan
 ^{12}NASA Goddard Space Flight Center, Greenbelt, MD, USA
 ^{13}Universities Space Research Association, Columbia, MD, USA
 ^{14}John A. Paulson School of Engineering and Applied Science, Harvard University, Cambridge, MA, USA
Correspondence: KaiLan Chang (kailan.chang@noaa.gov)
Hide author detailsCorrespondence: KaiLan Chang (kailan.chang@noaa.gov)
We have developed a new statistical approach (M^{3}Fusion) for combining surface ozone observations from thousands of monitoring sites around the world with the output from multiple atmospheric chemistry models to produce a global surface ozone distribution with greater accuracy than can be provided by any individual model. The ozone observations from 4766 monitoring sites were provided by the Tropospheric Ozone Assessment Report (TOAR) surface ozone database, which contains the world's largest collection of surface ozone metrics. Output from six models was provided by the participants of the ChemistryClimate Model Initiative (CCMI) and NASA's Global Modeling and Assimilation Office (GMAO). We analyze the 6month maximum of the maximum daily 8 h average ozone value (DMA8) for relevance to ozone health impacts. We interpolate the irregularly spaced observations onto a fineresolution grid by using integrated nested Laplace approximations and compare the ozone field to each model in each world region. This method allows us to produce a global surface ozone field based on TOAR observations, which we then use to select the combination of global models with the greatest skill in each of eight world regions; models with greater skill in a particular region are given higher weight. This blended model product is bias corrected within 2^{∘} of observation locations to produce the final fused surface ozone product. We show that our fused product has an improved mean squared error compared to the simple multimodel ensemble mean, which is biased high in most regions of the world.
 Article
(10617 KB) 
Supplement
(3110 KB)  BibTeX
 EndNote
Tropospheric ozone is a pollutant detrimental to human health and has been associated with a range of adverse cardiovascular and respiratory health effects due to shortterm and longterm exposure (World Health Organization, 2005; Jerrett et al., 2009; US Environmental Protection Agency, 2013; GBD, 2015; Turner et al., 2016; Cohen et al., 2017). Assessing the human health impacts of ozone on the global scale requires accurate exposure estimates at any given inhabited location (Shaddick et al., 2018). Due to the limited availability of surface ozone observations in many regions of the world (Fleming et al., 2018), global atmospheric chemistry models are required to calculate surface ozone exposure. Despite continual development and improvement, global models struggle in their ability to accurately simulate ozone in all regions of the world (Young et al., 2018). The ability to accurately simulate observed ozone at a particular location also varies between models, as demonstrated by several multimodel comparisons (Stevenson et al., 2006; Young et al., 2013; Cooper et al., 2014).
A useful endeavor for producing an accurate representation of the global surface ozone distribution is to combine the output from many models in a way that takes advantage of the strengths of each model and minimizes the weaknesses. Such efforts have already been made for both climate and chemistry–climate models. For example, multimodel output has been combined using a parametric approach, either by assigning an equal or optimum weight to each model (Stevenson et al., 2006; He and Xiu, 2016; Braverman et al., 2017) or by tuning the initial conditions under different scenarios or parameterizations (Cariolle and Teyssèdre, 2007; Wu et al., 2008; Young et al., 2013). These approaches often assume that individual model biases will at least partly cancel by averaging or weighting, according to certain measures of predictive performance. Thus, the combined model product is likely to be more accurate than a single model prediction, as has been shown for multimodel combinations of past or presentday climate (Buser et al., 2009; Knutti et al., 2010; Weigel et al., 2010; Chandler, 2013).
For the case of simply averaging the output from multiple climate models, most studies either explicitly or implicitly assume that every model is independent and is a random sample from a distribution, with the true climate as its unbiased mean. This implies that the average of a set of models converges to the true climate as more and more models are added. This multimodel ensemble often outperforms any single model in terms of the predictive capability. Undeniably, when one has several dozen or hundreds of possible ensemble members, the most straightforward and efficient approach is to simply take the ensemble average, ignoring the impact of potentially erroneous outlier ensemble members. From a statistical point of view, one might argue that ruling out potentially erroneous ensemble members prior to conducting the ensemble mean would yield an even better result, especially if the overall number of ensemble members is small.
Combining model ensembles using a method more sophisticated than the simple average is a challenge because a meaningful model evaluation can rarely be condensed into a single metric, and there is no technique that can explicitly quantify the degree of similarity (i.e., both accuracy and precision) between two different spatial fields (Hyde et al., 2018). Indeed, Stainforth et al. (2007) concluded that any attempt to assign weights is, in principle, inappropriate. With a lack of appropriate criteria, the model weighting approach has not become a standard alternative to the ensemble average. Accordingly, there is presently no objective criterion for combining surface ozone estimates from a model ensemble to produce a surface ozone product with improved accuracy beyond that of any ensemble member or the simple ensemble mean. The absence of such a methodology is the motivation for this paper.
This paper presents a new statistical approach (M^{3}Fusion) for combining surface ozone output from multiple atmospheric chemistry models with all available surface ozone observations to produce a global surface ozone distribution with greater accuracy than the multimodel ensemble mean. As described in greater detail below, this fused surface ozone product is constructed in three steps: (1) ozone observations from all available surface ozone monitoring sites around the world are spatially interpolated to a smooth global field; (2) for each of eight continental regions of the world, six global atmospheric chemistry models are evaluated against the interpolated observed ozone field by a quadratic programming optimization, with the most accurate models receiving the highest weight; a locally confined spline interpolation is used at the regional boundaries to avoid unphysical step changes; (3) finally, the global ozone field derived from the polynomial equation is bias corrected but only within a limited distance from available observations. The final product is based on the annual maximum of the 6month running mean of the monthly average daily maximum 8 h average mixing ratios (DMA8), a metric that can be used to estimate human mortality due to longterm ozone exposure (Turner et al., 2016; Malley et al., 2017; Seltzer et al., 2018; Shindell et al., 2018).
Past estimates of global mortality due to longterm ozone exposure have relied on surface ozone fields produced by global atmospheric chemistry models due to the limited coverage of the global ozone monitoring network (Anenberg et al., 2010; Brauer et al., 2012, 2015; Malley et al., 2017). The fused surface ozone product is a blend of global surface ozone observations and model output that has been adjusted according to the observations. This particular product will be available for future estimates of global human mortality due to longterm ozone exposure (e.g., Global Burden of Disease, Brauer et al., 2012, 2015). Furthermore, the methodology can be applied to a range of ozone metrics for quantifying the impacts of ozone on human health, or vegetation, and it can also be applied to PM_{2.5}, CO_{2} or any other trace gas.
Section 2 provides details of the data sources and fusion process, including the techniques to register all data sources onto a common grid and the statistical model used to minimize the difference between interpolated observations and the multimodel combination. In Sect. 3, the results of employing these techniques are presented, including the mapping accuracy, evaluation of regional model performance and the final multimodel bias correction. The paper concludes with a summary and discussion in Sect. 4.
2.1 Observations and model output

Tropospheric Ozone Assessment Report (TOAR) surface ozone database. In this analysis, surface ozone observations are used to evaluate the performance of six global atmospheric chemistry models and to also bias correct the multimodel surface ozone product. TOAR has produced the world's largest database of surface ozone metrics based on hourly observations at over 9000 sites around the globe (Schultz et al., 2017, ozone metrics available for download at https://doi.org/10.1594/PANGAEA.876108). Spatial coverage is high in North America, Europe, South Korea and Japan, but much lower across the rest of the world with very low data availability across Africa, the Middle East, Russia and India. In addition to data sparseness, other challenges, such as data inhomogeneity in time and the irregular spatial distribution of stations (Chang et al., 2017), make the comparison between model output and observations difficult without serious statistical modeling. While satellite retrievals have been utilized by previous works for quantifying the health impacts of PM_{2.5} (Brauer et al., 2012, 2015), satellite retrievals of tropospheric ozone have limited sensitivity near the surface and are inadequate for this analysis (Gaudel et al., 2018).
TOAR has gathered ozone observations through 2014 at most sites and has chosen 2008–2014 as a “presentday” window for more rigorous analysis. The purposes of the multiyear average are to reduce the effects of ozone interannual variability, which is largely driven by changes in meteorological conditions (Strode et al., 2015), and to increase the number of available sites than if we used a single year. In this analysis, we focus on the annual maximum of the 6month running mean of the maximum daily 8 h average (DMA8) at every site in the TOAR database. Specifically, the metric was calculated from the 6month running mean of the monthly mean DMA8 ozone values at a given site. This metric was selected because it aligns with the ozone metric used by Turner et al. (2016) to quantify the impact of longterm ozone exposure on human mortality. Hereinafter, this quantity is simply referred to as “the ozone metric”.

Atmospheric chemistry model simulations. We use output from models from phase 1 of the ChemistryClimate Model Initiative (CCMI), downloaded from the Centre for Environmental Data Analysis (CEDA) database (http://archive.ceda.ac.uk, CEDA, 2019). We chose four models (CHASER, GEOSCCM, MOCAGE and MRIESM1r1) because they report hourly ozone output (Table 1). These particular simulations were part of CCMI's REFC2 experiment (Morgenstern et al., 2017), which follows the World Meteorological Organization (2011) A1 scenario for ozone depleting substances, and RCP6.0 for tropospheric ozone precursors, and aerosol and aerosol precursor emissions (Morgenstern et al., 2010) for the period 1960–2100. Even though the most appropriate experiment would have been the REFC1SD, in which the models are nudged to the reanalysis meteorology and thus best represent the past in the observations, we use output from the REFC2 simulation in this study, as the last year of the REFC1SD was 2010 and would therefore not cover the most recent period where observations are available. However, the NOAA Geophysical Fluid Dynamics Laboratory (GFDL) AM3 model continued the simulation over the entire study period and was therefore selected for this analysis. In addition, we obtained output from the GEOS5 nature run with chemistry (G5NRChem), provided by the NASA Global Modeling and Assimilation Office (GMAO), which we included in our analysis because of the model's very fine horizontal resolution (Hu et al., 2018), but the output was only available for July 2013 to June 2014.
The output from each individual model is shown in Fig. S1 in the Supplement. Note that NASA G5NRChem has the finest resolution of these models; accordingly, we aim to produce our final product on the same $\mathrm{0.125}{}^{\circ}\times \mathrm{0.125}{}^{\circ}$ grid. However, even at this resolution, the output is not street resolving and thus will not capture urbanscale variability in the regions with the highest population density.
^{a} Meteorological forcing includes coupled ocean–atmosphere (C2) and nudged to observed reanalysis meteorology (C1SD) in CCMI reference simulations (Morgenstern et al., 2017). ^{b} The specification of forcing scenario for this special run is described by Hu et al. (2018).
In order to compare model output to observations, we need to register model output and observations to a common grid. This registration enables us to quantify the differences between the models and observations. Previous attempts have usually relied on a variant from a general statistical interpolation framework to combine incompatible spatial data (Gotway and Young, 2002; Fuentes and Raftery, 2005; Gelfand and Sahu, 2010; Berrocal et al., 2012; Nguyen et al., 2012). Due to the highly irregular locations of ozone monitors around the globe, we use a kriging technique to build a statistical model, interpolate the ozone distribution based on the surrogate and then project the global surface onto a common grid.
2.2 Fusion of observations and models
Following is a description of our method for fusing observations and output from multiple global atmospheric chemistry models to produce a surface ozone product with maximized accuracy. This method is known as Measurement and MultiModel Fusion (version 1), or M^{3}Fusion (v1), and the code accompanies this paper in the Supplement. We consider a general framework of uncertainty quantification consisting of the following components (Kennedy and O'Hagan, 2001; Chang and Guillas, 2019):
Since this equation requires matching components (observations and model output) on a common grid, we use the interpolated observations to estimate an optimized weight for each model by a L^{2} norm (details are given later), which means that we expect the multimodel combination to capture the general pattern of the surface ozone distribution in terms of their joint predictive capability, and the model bias is considered as a model correction term. The difference between observation error and model bias is that the former term is assumed to be a normal noise with zero mean and constant variance, and the latter term is considered as a systematic and structured discrepancy (Williamson et al., 2015), which will be revealed as a spatial cluster across a poorly simulated region.
Due to this study's human health focus, we do not consider ozone above the datasparse oceans. Above land, large observational gaps are present across Africa, the Middle East, South America, and south and southeast Asia, where the spatial interpolation is generally too uncertain to yield a reliable surface ozone approximation. The ozone estimates in these regions must come from either models or distant observations, neither of which is ideal to solve this issue. As a compromise strategy, we fill these gaps with a weighted model product evaluated by the interpolated ozone observations. We propose the following procedure to combine model output and observations for data integration:

Interpolating irregularly located monitoring observations to the model output grid. Kriging is a procedure used to statistically interpolate irregularly spaced and/or sparse observed data onto a regular and dense grid, based on a weighted average of the fitted surrogate model in the neighborhood of the grid. We assume that the global ozone distribution can be approximated by a Gaussian spatial process (GP) with a constant mean and Matérn covariance function (Stein, 2012). The GP fitting typically involves a cubic complexity and thus is computationally expensive for large spatial data sets. Therefore, several alternatives have been developed to address the large n issue by using a reduced set of data (Cressie and Johannesson, 2008; Banerjee et al., 2012; Liang et al., 2013), tapering the covariance between two grid points to zero if their distance is beyond a certain range (Furrer and Sain, 2009; Sang and Huang, 2012) and/or evaluating the covariance only through the specification of a neighborhood system (also known as the Gaussian Markov random field) (Rue et al., 2009; Lindgren et al., 2011).
In this study, we carry out the spatial interpolation by using the combination of the integrated nested Laplacian approximation (INLA) framework (Rue et al., 2009) and the stochastic partial differential equation (SPDE) technique (Lindgren et al., 2011), available as an R package (http://www.rinla.org/) (Lindgren and Rue, 2015). The details of this technique are rather complex and the reader is referred to the original paper (Lindgren et al., 2011); however, we describe the key component of this INLASPDE technique in Appendix A. INLASPDE spatial modeling has proven to be effective in a wide range of applications (Cameletti et al., 2013; Shaddick and Zidek, 2015; Heath et al., 2016; Liu and Guillas, 2017; Rue et al., 2017). We chose this technique because it manages a fairly large and complex spatial field in a relatively efficient way (Rue and Held, 2005) and allows an extension for nonstationarity on the sphere (Bolin and Lindgren, 2011; Chang et al., 2015). Notably, a recent study elaborately compared dozens of spatial modeling approaches, and the results suggest that almost all of these approaches can achieve a similar performance in terms of their predictive accuracy, albeit with very different computation times (Heaton et al., 2018). Therefore, we expect that the choice of spatial modeling approach is not the most crucial component in our data fusion process as long as the analysis is carried out in a rigorous way (i.e., through the statistical model selection and diagnostics). To differentiate this result from the actual observations in the TOAR database, we refer to this interpolated surface as the “spatially interpolated ozone”.
We carry out the statistical interpolation via the following steps: (1) calculate the ozone metric at each TOAR site and for every year in 2008–2014; (2) perform the statistical interpolation using all available sites with their exact coordinates and project the surface onto a $\mathrm{0.125}{}^{\circ}\times \mathrm{0.125}{}^{\circ}$ spherical grid for every year; (3) average these surfaces over the 7 years to yield an observationbased presentday ozone distribution. We expect that this aggregation will smooth out at least some of the potential uncertainties. The kriging can be seen as a nonparametric regression problem; therefore, a statistical assessment of fitted quality must be considered to select the best representation to the data (Hoeting et al., 2006). Further details on the statistical model selection procedure are provided in Appendix A.
We use a bilinear interpolation to smooth model output from coarser resolution to a $\mathrm{0.125}{}^{\circ}\times \mathrm{0.125}{}^{\circ}$ grid (Jun et al., 2008). The ozone metric for each model was calculated for each single grid in each year, then averaged over 2008–2014 (except for NASA G5NRChem, which was already in fine resolution, but only available for 1 year).

Weighting model output against spatially interpolated ozone by region. The next step is to create an intermediate “multimodel composite”. We divide the global land surface into eight regions (see Fig. 1), roughly matching the continental outlines or major population regions. We adopt this regional approach because global models vary in their ability to simulate ozone in different regions of the world. Next, we regress the observations on multimodel output by a constrained least square approach within each of the eight regions. Let s_{g} be the grid cell at resolution $\mathrm{0.125}{}^{\circ}\times \mathrm{0.125}{}^{\circ}$, $\widehat{y}\left({s}_{\mathrm{g}}\right)$ be the interpolated observations and $\mathit{\left\{}{\mathit{\eta}}_{k}\right({s}_{\mathrm{g}});k=\mathrm{1},\phantom{\rule{0.125em}{0ex}}\mathrm{\dots},\mathrm{6}\mathit{\}}$ be the model output registered onto the same grid from the six models considered in this paper (Table 1). The optimization equation is based on a constrained least squares approach:
$$\begin{array}{ll}\text{(1)}& {\displaystyle}& {\displaystyle}\underset{\mathit{\{}{\mathit{\alpha}}_{\mathrm{r}},{\mathit{\beta}}_{r\phantom{\rule{0.125em}{0ex}}k};k=\mathrm{1},\mathrm{\dots},\mathrm{6}\mathit{\}}}{\text{minimize}}\sum _{{s}_{\mathrm{g}}\in \text{Region}\phantom{\rule{0.25em}{0ex}}r}{\left(\widehat{y}\left({s}_{\mathrm{g}}\right){\mathit{\alpha}}_{\mathrm{r}}\sum _{k=\mathrm{1}}^{\mathrm{6}}{\mathit{\beta}}_{r\phantom{\rule{0.125em}{0ex}}k}{\mathit{\eta}}_{k}\left({s}_{\mathrm{g}}\right)\right)}^{\mathrm{2}},{\displaystyle}& {\displaystyle}\text{subjectto}\phantom{\rule{0.25em}{0ex}}\phantom{\rule{0.25em}{0ex}}\sum _{k=\mathrm{1}}^{\mathrm{6}}{\mathit{\beta}}_{r\phantom{\rule{0.125em}{0ex}}k}=\mathrm{1}\phantom{\rule{0.25em}{0ex}}\phantom{\rule{0.25em}{0ex}}\text{and}\phantom{\rule{0.25em}{0ex}}\phantom{\rule{0.25em}{0ex}}{\mathit{\beta}}_{r\phantom{\rule{0.125em}{0ex}}k}\ge \mathrm{0}.,\end{array}$$where α_{r} is a constant that allows adjustment to the overall (regional) underestimation or overestimation; β_{r k} is an optimal weight for the kth model in region r. Note that since the interpolated observations and models use the same ozone metric with the same units, we constrain the weights to be positive and sum to 1 for a better physical interpretability, such that the most accurate models receive the higher weight. The offset term α_{r} is aimed to adjust the overall residuals between the observation field and the multimodel composite into zero mean in each region (regardless of the spatial pattern); therefore, if two spatial fields share a great similarity in terms of their spatial curvatures, but the overall means are different, this term can fill the gap of the overall mean difference between the two fields. The weights are optimized in terms of the squared distance between the interpolated ozone and multimodel output. A different criterion of optimization, such as mean absolute error, can be established accordingly.
Due to the sparsity of stations in many regions, we use a predefined geometric boundary to differentiate regions. A more meaningful physical boundary (i.e., regions with similar chemical regimes or major features such as deserts, mountain ranges or water bodies) might be determined using a cluster analysis technique (Hyde et al., 2018), but such a step is beyond the scope of this paper.
Since we partition the global land surface into eight regions and evaluate the models individually, inevitably there will be disjointed boundaries between regions. The boundaries between North and South America, or between east Asia and Oceania, fall mostly in the oceans, so we do not need to adjust these regions. However, we should make an adjustment to disjointed boundaries that fall across inhabited areas (see Fig. S2 for the illustration). As an example of our method, consider the boundary between east Asia and Russia near 50^{∘} N. We increase the northern boundary of east Asia to 55^{∘} N and decrease the southern boundary of Russia to 45^{∘} N, to create an overlapping intersection, and then fit cubic splines (performed for each grid cell) with knots placed at every 2^{∘} grid cell (Wood et al., 2008). This smoothing is carried out using a lowrank Gaussian process by the default penalized least square from the function “gam” in the R package mgcv (Kammann and Wand, 2003; Wood, 2017), following the examples of Wood and Augustin (2002). The purpose is to merely avoid a sharp and unrealistic (geometric) transition between three regions and to efficiently smooth out the discontinuity, performed in a regular spaced grid only around the geometric boundary. Any region away from the geometric boundary will not be affected by this smoothing, which should be considered as a blending of multiple models without any attempt of bias correction. It should be noted that the INLASPDE technique in step 1 is applied to the observations, while the smoothing spline is only applied to the boundaries between regions of the model composite, not directly involving any observations.
We adopted a regression weighting approach that only accounts for the mean spatial fields of the interpolated ozone and model output, rather than the underlying associated uncertainty. We take this approach due to the prohibitive size of highresolution output (over 1 million output points for each model) but also due to the lack of a thorough investigation regarding the ideal method for combining models based on different sources of uncertainty. For example, the interpolation uncertainty can be quantified easily through the posterior distribution and considered to be related to measurement error (small scale) or sparse sampling across a region (large scale); however, model uncertainty is a different concept altogether that could result from input uncertainty (e.g., air pollution emissions inventories) or limitations of the transport and chemistry mechanisms within the model (Brynjarsdóttir and O'Hagan, 2014). The current interest of this study focuses on a better estimate of mean ozone exposure. Explicit quantification of different sources of model uncertainty and incorporation of this information into the data fusion process presents another level of complexity that cannot be tackled until model uncertainties are better characterized. Young et al. (2018) provide a current overview of chemistry–climate modeling and discuss the challenges of improving models in light of so many uncertainties.

Correcting multimodel bias in areas close to observations. A common practice of studying the model discrepancy in the spatial fields is to fit a statistical model for their differences from observations on the whole spatial domain, to see whether or not these residuals reveal any structured spatial pattern (Jun and Stein, 2004; Sang et al., 2011). If the model adequately simulates the ozone distribution (up to a level shift and a scale factor), then there is no relevant information in these residuals. On the other hand, if the model does not properly represent the local structure, then the residuals should exhibit a signal of the discrepancy in that region (Guillas et al., 2006; Williamson et al., 2015). However, in our case, the regular grid observation field is obtained from spatial kriging, such that in many datasparse regions we do not actually have observed ozone, which prevents us from correcting the model in these regions. Instead, we conduct a limited model bias correction based on the distance to the nearest monitoring station, but we ignore the differences between the multimodel composite and the interpolated observations in the sparsely monitored regions. In our approach, we only correct the output grid where there is at least one observational station within a 2^{∘} radial distance of the grid cell in question (i.e., the distance to the nearest station is less than 2^{∘}). We then end up using
$$}\left\{\begin{array}{l}\widehat{y}\left({s}_{\mathrm{g}}\right),\phantom{\rule{0.25em}{0ex}}\text{if a grid cell}\phantom{\rule{0.25em}{0ex}}{s}_{\mathrm{g}}\phantom{\rule{0.25em}{0ex}}\text{is within a}\phantom{\rule{0.25em}{0ex}}\mathrm{2}{}^{\circ}\phantom{\rule{0.25em}{0ex}}\text{radial distance}\\ \text{of the station};\\ {\mathit{\alpha}}_{\mathrm{r}}+\sum _{k=\mathrm{1}}^{\mathrm{6}}{\mathit{\beta}}_{r\phantom{\rule{0.125em}{0ex}}k}{\mathit{\eta}}_{k}\left({s}_{\mathrm{g}}\right),\phantom{\rule{0.25em}{0ex}}\text{otherwise},\end{array}\right.$$to generate our highresolution global surface ozone estimate. Given the limited availability of observations worldwide, we were only able to apply this final bias correction to 14.4 % of the globe's land area. We refer to the final outcome as the “fused surface ozone product”.
3.1 Mapping and uncertainty
Groundbased measurements were available from 4766 stations reported in the TOAR database (Schultz et al., 2017). To illustrate the spatial coverage of the database, Fig. 1 shows the ozone metric discretized to a $\mathrm{2}{}^{\circ}\times \mathrm{2}{}^{\circ}$ grid (a finer resolution will be too obscure for illustrative purposes), averaged over the period 2008–2014. This figure also shows our regionalized classification, including Africa, North America, South America, east Asia, southeast and central Asia, Europe, Oceania and Russia. Note that dense station networks are found in North America, Europe and east Asia (mostly in Japan and South Korea), while monitoring sites are more widely scattered across the remaining regions. The highest average ozone levels are found at sites in China, South Korea, Japan, Taiwan, India, Greece, California and Mexico City.
Figure 2a shows the spatially interpolated surface in each cell. For each grid cell, there is an underlying (posterior) probability distribution which incorporates information about the interpolation uncertainty. Figure 2b shows the half width of the 95 % posterior credible interval in each cell (Shaddick et al., 2018). From the spatial pattern of uncertainty, we can see that relatively higher uncertainties are expected in Africa, the Middle East, south Asia and Russia, regions with very limited observations; lower uncertainty is associated with regions with a dense station network, such as North America and Europe. Due to the limitations of spatial kriging in a sparsely monitored region, the observations are often interpolated across very great distances, such as in South America, Africa and central Asia. This method is not ideal, and instead information from models can be used to fill in the blanks.
The ozone metric for each model was calculated for each individual grid cell in each year, then averaged over 2008–2014 and registered to the common $\mathrm{0.125}{}^{\circ}\times \mathrm{0.125}{}^{\circ}$ grid (except for NASA G5NRChem, which was already in fine resolution, but only available for 1 year). Figure 3a shows the surface ozone metric which results from the simple ensemble average of the six models. It was generated from bilinear interpolation of the ozone metric on the standard output grid, by calculating the same metric for each grid cell in each year, averaging over 2008–2014 and then averaging over the six models. We refer to this product as the “multimodel mean”, and we use it to validate our final product, which should outperform not only each individual model but also the multimodel mean.
Averaging all six models captures the largescale variations of the ozone distribution; however, many regions in northern midlatitudes and low latitudes are biased high compared to the observations in the TOAR database. A simple approach to address the uncertainty in the multimodel mean is to calculate the standard deviation for each grid cell from the different models, as shown in Fig. 3b. Higher model uncertainties across south Africa and the Middle East match the pattern of the interpolation uncertainty in Fig. 2b, and lower model uncertainties occur in regions with dense station networks. These findings suggest that the multimodel mean uncertainty can also reflect the current limited understanding of surface ozone in regions with limited or no observations.
It should be noted that the spatially interpolated observations are smoother in regions with fewer sites and reveal a more detailed structure in regions with a dense station network. In contrast, the multimodel mean is more noisy. Even though we average across multiple years and multiple models, the resulting ozone metric can still be noisy because it is calculated at each grid cell independently. In order to make maximum use of the skill of each model, we restrict the model evaluation to the regional scale in the next section.
3.2 Regional model evaluation and multimodel composite
To evaluate the performance of each model in a given region, we calculate the mean differences over all grid cells within the region and summarize them with the root mean square error (RMSE). Let $\widehat{y}\left({s}_{\mathrm{g}}\right)$ be the spatially interpolated observations and $\mathit{\left\{}{\mathit{\eta}}_{k}\right({s}_{\mathrm{g}});k=\mathrm{1},\phantom{\rule{0.125em}{0ex}}\mathrm{\dots},\mathrm{6}\mathit{\}}$ be the output corresponding to the six ensemble models considered in this paper; then, the (normalized) RMSE is given by
where n is the number of grid cells in a given region r. The first part of Table 2 shows the RMSE statistics for each model by region. The reliability of such an evaluation is limited by the station density in a given region, with greater reliability in a dense network (e.g., USA) and less reliability in a sparse network (e.g., Africa, South America or Australia). On average, CHASER, GEOSCCM and G5NRChem have the lowest biases in multiple regions; GFDLAM3 and MRIESM1r1 also show low mean biases in certain regions, such as America and Europe. However, larger model biases can be found in Africa, and east and south Asia.
Next, we select three regions with extensive monitoring: North America, Europe and east Asia. Figure 4 shows the differences between the spatially interpolated observations and model output in North America. A consistent underestimation can be found in the Mexico City region for all models. A clear overestimation is also found across much of the eastern US, as well as the western US and Canada, except for CHASER, which shows a mild underestimation in these regions; in Europe (Fig. 5), the models show mild levels of overestimation across most of the region, especially for Italy. In east Asia (Fig. 6), the models show a major bias across east China and a similar bias pattern across the entire region, although the bias amplitude is smaller for GEOSCCM. However, since the observations are relatively sparse in mainland China, the large scale of these estimated biases might be an interpolation artifact.
We argue that the credibility of the model is not entirely decided by the RMSE (i.e., the mean difference): the smoother the difference plots, the easier it is to carry out the model bias correction. Indeed, the observations and model output are not expected to match point by point. We should also expect the model to capture the general pattern of the spatial distribution, rather than a pointwise agreement.
The estimated weights from the constrained least squares (Eq. 1) are given in the second part of Table 2. Due to fixed underlying spatial structures, this approach tends to give greater weight to a single model (i.e., ≥50 %); the one which provides the best match between its spatial structure and the observational field (e.g., G5NRChem in North America). Note that this approach disfavors noisy spatial structure; therefore, the algorithm gives low weights to MOCAGE, for several reasons. First, the MOCAGE ozone field has not been smoothed by interpolation since it is already produced on the MOCAGE model grid, whereas all other models are interpolated. Secondly, MOCAGE uses a more complete tropospheric chemical scheme with a larger range of species (77 tropospheric species) and has generally a higher reactivity compared to most chemistry–climate models (CCMs) (Voulgarakis et al., 2013). Thus, it tends to provide more temporal and spatial variability. Note that our optimization algorithm estimates the weights according to the similarity of the spatial structures between the interpolated surface and each model. In regions with sparse monitoring, the kriged surface can be greatly affected by a few scattered stations; therefore, we cannot use the resulting weights to evaluate the actual model performance in these regions.
The last column of Table 2 shows the averaged and combined RMSEs from the equal weights and the constrained weights. A reduced overall bias can be generally achieved from the constrained weights. This approach suggests that even if a model has a large mean error (e.g., GFDLAM3), it can still be a good simulation if it produces a spatial pattern and curvature similar to the observation field. A constant offset α_{r} in the optimization Eq. (1) is included to remove the overall bias over each region, such that the residuals from the optimization have a zero mean. On the other hand, if we do not include α_{r} in the equation, GFDLAM3 will have a smaller weight in the optimization, and CHASER, GEOSCCM and G5NRChem will dominate most of these regions (not shown).
We combine all models according to the optimum weights from each region for each model. Figure 7a shows a map of the multimodel composite, a weighted blend of the six models, with the weighting calculated separately for each continent. Models with greater simulation skill receive higher weighting. The result reveals a systematic adjustment to the largescale overestimation from the ensemble mean in Fig. 3a. This demonstration of a general high bias among the models argues against using the simple ensemble model mean for estimating surface ozone. However, when compared to the TOAR observations, the multimodel composite still has clear local biases.
3.3 Local bias correction
The last step of producing the final fused surface ozone product is to apply a bias correction to our multimodel composite, limited to just those areas in close proximity to ozone observations. Ideally, we would like to apply a bias correction according to raw observations, but most stations are not exactly located on the model grid coordinates (even at $\mathrm{0.125}{}^{\circ}\times \mathrm{0.125}{}^{\circ}$ resolution). Therefore, to carry out a statistical bias correction on a particular grid, we need to consider the number of nearby stations and the distance to each station. All these considerations aim to deduce a single correction value on a single grid, and thus we are still faced with implementing statistical interpolation. To avoid adding another level of complexity, we set the final fused product to be exactly equal to the spatially interpolated ozone field within 2^{∘} of an observation, as the spatially interpolated ozone field has already accounted for all observations. Due to the global sparseness of observations, about 85 % of model grid cells over land were not affected by this bias correction. After bias correcting the multimodel composite grid cells within 2^{∘} of a TOAR observation site, an immediate benefit is seen for the US, Mexico City, Italy and South Korea (see Fig. 7b).
The choice of the correction range, in this case 2^{∘}, is a ad hoc decision; we also present results with different correction ranges in Figs. S3 and S4. When the radius of influence of the TOAR observations is increased to 5^{∘} or more, the greatest impact is seen for the Mexico City region and eastern China. An increase of correction range is not ideal because it extrapolates the Mexico City ozone values into the less populated regions of Mexico. Increasing the radius to 5^{∘} or more does not improve upon the RMSE associated with 2^{∘}. Therefore, accepting the 2^{∘} bias correction over other ranges is subjective.
The fused product can be evaluated in terms of spatial correlation using the variogram which assumes that spatial correlation is not a function of absolute location but only a function of distance (i.e., stationarity). Since spatial variability and continuity from the models are the result of geophysical processes represented by mathematical equations, the variogram must be customized for each field. In addition, the extremely large size of the model output prohibits us from carrying out a standard empirical variogram analysis, which requires calculating the variance of the difference between all pairwise grid cells.
Nevertheless, we provide examples of omnidirectional variograms for the spatial field in North America from each model and product in Fig. S5. The standard variogram analysis focuses on the following three parameters: (1) the nugget (variance at zero distance, which represents a subgrid variation), which is similar for all cases; (2) the sill (total variance of a field), where the variogram value reaches a maximum and levels off; (3) the range (a distance where the sill is reached, and beyond that there is no longer spatial correlation). Note that a continuously increasing variogram indicates the evidence of nonstationarity in the field, which is the case for SPDE, an issue that we have accounted for. The variogram peak is about 35–40^{∘} for the models. The result is very similar for G5NRChem, GEOSCCM and GFDLAM3, while CHASER and MRIESM show a larger variance in the spatial field. The reason is that the latter two models produce low ozone in the highlatitude region over Canada (see Fig. S1), but the former three models simulate relatively higher ozone in the same region, and this difference is reflected by the total variance. Even though North America has one of the most extensive monitoring networks in the world, some of the remote areas (mostly in Canada) are mainly described by the model output in the final fused product. Therefore, the variogram of the fused product is likely adjusted toward the remote areas of Canada as simulated by G5NRChem, which provided the largest weighting in North America).
3.4 Validation of the results
Since the raw observations are the only reliable source for validating our results, we align each model grid to observed locations for evaluating the predictive performance. The RMSEs of the residuals from all observations in 2008–2014 are displayed in Table 3. Note that, since the global network of monitoring stations is heavily weighted by North America, Europe, South Korea and Japan, these numbers are not representative of the sparsely monitored regions. We compare the fused surface ozone results to the simple multimodel mean from all six models. Our interim product, i.e., the multimodel composite, is also compared in Table 3.
Our multimodel composite outperforms the multimodel mean in terms of lowest mean predicted error. Based on the spatially interpolated observations, the resulting multimodel composite takes advantage of the strengths of each model and achieves a better accuracy. This result proves that our approach is effective, since our interim product has already improved upon the simple multimodel mean. The bias correction further reduces the residuals: this is expected because the spatial kriging algorithm is designed to minimize the difference to observations; thus, it has the lowest RMSE (this value is the same for the kriging result and the fused product since we apply the correction based on observed locations). The RMSE of approximately 5 ppb may represent the interannually varying meteorological influence during the years 2008–2014. If this is the case, then 5 ppb may approximate the minimal RMSE that can be achieved in a multiyear analysis.
In summary, the simple multimodel mean method may perform fairly well at the continental or regional scale but does not provide an accurate representation of the subregional structure; this is of course a limitation on the use of coarse model resolutions. The weighting applied during the construction of the multimodel composite improved the accuracy but the effect could be limited, because many smallscale processes are not (yet) resolved by the models. To alleviate the discrepancy further, a statistical method based on local observations is applied to correct the bias. The advantage of our fused surface ozone product over the simple multimodel mean can be clearly seen in Fig. 8. When interpreting the fused product, the reader should consider the following: (1) for a region with an extensive monitoring network, such as the US, a detailed bias correction can be achieved. We can utilize the observations to accurately reflect many local features (i.e., subgrid variations) as shown in the ozone pollution hotspots of southern California and Mexico City. However, it should be noted that this improvement is due to local bias correction instead of model weighting. (2) For regions with large observational gaps, such as South America, Africa or Russia, the spatial difference between the fused product and the multimodel mean is rather featureless, because the model weighting can only adjust the overall regional mean according to a few monitoring sites and cannot address the local variations. Filling large data gaps with the intermediate multimodel composite can indeed avoid the influence of preferential sampling (Diggle et al., 2010; Shaddick and Zidek, 2014), but it is still subject to a high uncertainty due to lack of data.
In this article, we present a flexible framework to incorporate observations and multiple models for providing an improved estimate of the global surface ozone distribution. Combining multivariate spatial fields in the estimation of ozone distribution is an extension of both the conventional multimodel ensemble approach (i.e., simple average) and a statistical bias correction approach, and was found to improve the prediction of surface ozone. In summary, our approach has the following properties:

The multiyear average enables us to reduce the meteorological influence on surface ozone. An extension of this method to timeresolved multiannual fields can be expected to capture the interannual variability (Shaddick and Zidek, 2015); however, such an endeavor would be highly computationally demanding in such a fineresolution setting.

The INLASPDE interpolation framework allows for modeling of potential nonstationarity in the spatial processes.

Regional model evaluation facilitates a feature selection for multiple competing atmospheric models.

Local bias correction of the multimodel composite only at a limited range of grid cells avoids using the spatially interpolated ozone field in regions associated with higher levels of uncertainty.

For the regions with dense monitoring networks (such as North American, Europe, South Korea and Japan), the final fused product was obtained mainly from the interpolation of observations; elsewhere, the final product relied on the multimodel composite through an optimized weight from each model.
Human health studies typically adopt a fine grid resolution, such as a $\mathrm{0.1}{}^{\circ}\times \mathrm{0.1}{}^{\circ}$ grid product, for matching to the gridded world population database. Even though the spatial kriging surrogate can produce the predicted value at any resolution, the accuracy of the fused surface ozone product is still limited by the density of observations around that point and by the resolution of the global model output. Regarding future improvements, two key developments can be expected to yield a better estimation of the global surface ozone distribution: firstly, we can include more simulators for increased leverage. Another way to increase the estimation accuracy is to expand ozone monitoring networks across sparsely sampled regions (Sofen et al., 2016; Schultz et al., 2017; Weatherhead et al., 2017).
The application of our methodology focuses on, but is not limited to, a particular ozone metric relevant for quantifying the impact of longterm ozone exposure on human health. We expect that this framework could also be applied to other ozone metrics relevant to crop production or natural vegetation (Lefohn et al., 2018; Mills et al., 2018), or any other trace gas, provided adequate in situ observations are available for model evaluation.
In general, atmospheric chemistry model estimates of surface ozone levels are biased high, as demonstrated by a comparison of the annual mean surface ozone produced by the ACCMIP (Atmospheric Chemistry and Climate Model Intercomparison Project) multimodel ensemble to the TOAR surface ozone database (see Fig. 6 of Young et al., 2018). This analysis has shown that the high bias is also prevalent among models when employing an ozone metric that focuses on the high end of the ozone distribution (Fig. 8). Similarly, Shindell et al. (2018) compared the NASA GISSE2 model to observed values of annual mean DMA8 and concluded that the model was biased high by 25 %. Given the common tendency for models to overestimate surface ozone, the methodology developed by this paper can be used to improve the accuracy of model output, either for individual models or for multimodel ensembles.
The sources of the TOAR data and the output from four CCMI models are listed in Sect. 2.1; the output from the GFDLAM3 model is archived at GFDL and is available to the public upon request to Meiyun Lin; G5NRChem model output is available for download at https://portal.nccs.nasa.gov/datashare/G5NRChem/Heracles/12.5km/DATA (NCCS, 2019) or can be accessed through the OpenDAP framework at the portal https://opendap.nccs.nasa.gov/dods/OSSE/G5NRChem/Heracles/12.5km (last assess: 28 February 2019). All computations in our methodology are implemented in R (R Core Team, 2013). The relevant code can be found in R packages for statistical interpolation (RINLA; Lindgren and Rue, 2015), quadratic programming (limSolve) and spline smoothing (mgcv; Wood, 2017). The R code accompanies this paper on its associated GMD web page.
In this paper, the aim of spatial interpolation is to use (discretized) monitoring observations to build a statistical surrogate model for estimating the ozone distribution over the whole domain on a sphere. We assume that this ozone distribution follows a Gaussian process (GP). A GP is a collection of random variables such that any subset of the observations has a joint Gaussian distribution. It has been widely used in many applications as a machine learning algorithm (Rasmussen and Williams, 2006). In this section, we briefly introduce the GP model with a focus on spatial kriging. The GP is a popular choice in spatial statistics because it allows modeling of fairly complicated functional forms, and it also provides a prediction and associated uncertainty at any new location. A common limitation of this interpolation is that the resulting distribution of estimated uncertainty will be lower around individual stations or within dense monitoring networks, and higher in sparsely monitored regions.
Let Y denote an n vector of ozone observations measured at monitoring sites s; then a statistical model for the spatial field can be expressed as $Y=f\left(\mathit{s}\right)+\mathit{\epsilon}$; i.e., the model comprises a smooth GP spatial process f(s), capturing spatial association, and an independent normal error ε, which follows a normal error N(0,σ^{2}). This error term can accommodate potential measurement error; on the other hand, kriging without measurement error is usually used for the surrogate of a deterministic model (i.e., the same input always produces the same output), also known as an emulator (e.g., Conti and O'Hagan, 2010).
The specification of a GP is through its mean function and covariance function, denoted by $f\left(\mathit{s}\right)\sim \text{GP}\left(m\left(\mathit{s}\right),c(\mathit{s},{\mathit{s}}^{\prime})\right)$. To reduce computational intensity, the mean function can be assumed to be a constant m(s)=μ; thus, the resulting spatial distribution is completely defined by the covariance function. A covariance function characterizes correlations between different locations in the spatial process; it is the crucial component in a GP, as it represents our assumptions about the latent field from which we wish to build a surrogate. Specifically, we use the Matérn covariance function, which is a flexible covariance structure and widely used in spatial statistics (Hoeting et al., 2006; Jun and Stein, 2007, 2008). With the shape parameter ν>0, the scale parameter κ>0 and the marginal precision τ^{2}>0, the covariance structure can be written as
where h denotes the distance between any two locations: $\mathit{h}=\mathit{s}{\mathit{s}}^{\prime}$, Γ is a gamma function, and K_{ν} is the modified Bessel function of the second kind of order ν>0. The scale parameter κ controls the rate of decay of the correlation between two locations as distance increases. Smaller values of κ allow for longer ranges over which two sites can be correlated. The smoothness parameter ν can be seen as the determining behavior of the autocorrelation for observations that are separated by a small distance.
The major disadvantage of using a GP is the computational complexity, which typically involves a cubic complexity in the number of data points, usually denoted as O(n^{3}). Several attempts have been made to reduce the computational burden: e.g., Cressie and Johannesson (2008), Rue et al. (2009), Banerjee et al. (2012) and Gramacy and Apley (2015). Lindgren et al. (2011) introduced a popular approach in which the Matérn covariance can be approximated by the solution of certain stochastic partial differential equations (SPDEs). According to Lindgren et al. (2011), a GP process f(s) with Matérn covariance on a sphere is the solution of the following stationary SPDE:
where Δ is the Laplace operator and 𝒲 is the Gaussian white noise. The core implication of this mathematical relationship is that an efficient algorithm for solving this SPDE can be applied to approximate the GP (Lindgren et al., 2011).
This INLASPDE technique also enables us to quantify the level of nonstationarity in a spatial field by employing basis function representations for both κ and τ (i.e., these quantities are constants in a stationary field). To obtain basic identifiability, κ(s) and τ(s) are taken to be positive, and their logarithm can be represented as
where {ψ_{k}(s)} is a set of spherical harmonics. The coefficients $\mathit{\left\{}{\mathit{\theta}}_{k}^{\mathit{\kappa}}\mathit{\right\}}$ and $\mathit{\left\{}{\mathit{\theta}}_{k}^{\mathit{\tau}}\mathit{\right\}}$ represent local variances and correlation ranges (Bolin and Lindgren, 2011; Lindgren et al., 2011). A larger number of basis functions permits the representation of smaller local features.
We now illustrate a series of statistical model fits to select the best predictive ability of the SPDE model. To choose the maximum number of basis functions for the parameters κ and τ in Eq. (A1), model selection techniques must be used. We perform the model selection based on the following criteria:

RMSE is the measure of the overall mean difference between predicted values and the observed values.

DIC (deviance information criterion) is a measure to compare performance of statistical models by using a criterion based on a tradeoff between the goodness of fit and the corresponding complexity of the model. Smaller values of the DIC indicate a better balance between complexity and a good fit.

GCV (generalized cross validation) calculates the mean residuals in a leaveoneout test. The model that minimizes the average predicted residuals over all the data is selected as the best model (Schneider, 2001).
We estimate nine statistical models with different numbers of basis functions, presented in Table A1. The simplest model is a stationary Matérn model (we use basis number 0 to represent κ and τ as constants). The best fit of all criteria occurs when the orders of the basis functions are increased from four to five. We therefore conclude that a model with five spatially varying basis functions is most appropriate for the TOAR observations.
The supplement related to this article is available online at: https://doi.org/10.5194/gmd129552019supplement.
KLC, ORC, JJW and MLS contributed to conception and design. ORC and MGS contributed to the acquisition of data. All authors contributed to the analysis and interpretation of data. KLC and ORC drafted the article, while all authors helped with the revision. All authors approved the submitted and revised versions for publication.
The authors declare that they have no conflict of interest.
This work was funded by the NASA Health and Air Quality Applied Sciences Team
(grant no. NNX16AQ80G).
Edited by: Tim Butler
Reviewed by: two anonymous referees
Adachi, Y., Yukimoto, S., Deushi, M., Obata, A., andTaichu. Y. Tanaka, H. N., Hosaka, M., Sakami, T., Yoshimura, H., Hirabara, M., Shindo, E., Tsujino, H., Mizuta, R., Yabu, S., Koshiro, T., Ose, T., and Kitoh, A.: Basic performance of a new earth system model of the Meteorological Research Institute (MRIESM1), Pap. Meteorol. Geophys, 64, 1–18, https://doi.org/10.2467/mripapers.64.1, 2013. a
Anenberg, S. C., Horowitz, L. W., Tong, D. Q., and West, J. J.: An estimate of the global burden of anthropogenic ozone and fine particulate matter on premature human mortality using atmospheric modeling, Environ. Health Persp., 118, 1189–1195, https://doi.org/10.1289/ehp.0901220, 2010. a
Banerjee, A., Dunson, D. B., and Tokdar, S. T.: Efficient Gaussian process regression for large datasets, Biometrika, 100, 75–89, https://doi.org/10.1093/biomet/ass068, 2012. a, b
Berrocal, V. J., Gelfand, A. E., and Holland, D. M.: Spacetime data fusion under error in computer model output: An application to modeling air quality, Biometrics, 68, 837–848, https://doi.org/10.1111/j.15410420.2011.01725.x, 2012. a
Bolin, D. and Lindgren, F.: Spatial models generated by nested stochastic partial differential equations, with an application to global ozone mapping, Ann. Appl. Stat., 5, 523–550, https://doi.org/10.1214/10AOAS383, 2011. a, b
Brauer, M., Amann, M., Burnett, R. T., Cohen, A., Dentener, F., Ezzati, M., Henderson, S. B., Krzyzanowski, M., Martin, R. V., Dingenen, R. V., van Donkelaar, A., and Thurston, G. D.: Exposure assessment for estimation of the global burden of disease attributable to outdoor air pollution, Environ. Sci. Technol., 46, 652–660, https://doi.org/10.1021/es2025752, 2012. a, b, c
Brauer, M., Freedman, G., Frostad, J., van Donkelaar, A., Martin, R. V., Dentener, F., van Dingenen, R., Estep, K., Amini, H., Apte, J. S., Balakrishnan, K., Barregardh, L., Broday, D., Feigin, V., Ghosh, S., Hopke, P. K., Knibbs, L. D., Kokubo, Y., Liu, Y., Ma, S., Morawska, L., Sangrador, J. L. T., Shaddick, G., Anderson, H. R., Vos, T., Forouzanfar, M. H., Burnett, R. T., and Cohen, A.: Ambient air pollution exposure estimation for the global burden of disease 2013, Environ. Sci. Technol., 50, 79–88, https://doi.org/10.1021/acs.est.5b03709, 2015. a, b, c
Braverman, A., Chatterjee, S., Heyman, M., and Cressie, N.: Probabilistic evaluation of competing climate models, Adv. Stat. Clim. Meteorol. Oceanogr., 3, 93–105, https://doi.org/10.5194/ascmo3932017, 2017. a
Brynjarsdóttir, J. and O'Hagan, A.: Learning about physical parameters: The importance of model discrepancy, Inverse Probl., 30, 114007, https://doi.org/10.1088/02665611/30/11/114007, 2014. a
Buser, C. M., Künsch, H. R., Lüthi, D., Wild, M., and Schär, C.: Bayesian multimodel projection of climate: bias assumptions and interannual variability, Clim. Dynam., 33, 849–868, https://doi.org/10.1007/s0038200905886, 2009. a
Cameletti, M., Lindgren, F., Simpson, D., and Rue, H.: Spatiotemporal modeling of particulate matter concentration through the SPDE approach, AStA Adv. Stat. Anal., 97, 109–131, https://doi.org/10.1007/s1018201201963, 2013. a
Cariolle, D. and Teyssèdre, H.: A revised linear ozone photochemistry parameterization for use in transport and general circulation models: multiannual simulations, Atmos. Chem. Phys., 7, 2183–2196, https://doi.org/10.5194/acp721832007, 2007. a
CEDA: Centre for Environmental Data Analysis: CCMI archive, available at: http://data.ceda.ac.uk/badc/wcrpccmi/data/CCMI1/output/, last access: 28 February 2019. a
Chandler, R. E.: Exploiting strength, discounting weakness: combining information from multiple climate simulators, Philos. T. R. Soc. A, 371, 20120388, https://doi.org/10.1098/rsta.2012.0388, 2013. a
Chang, K.L. and Guillas, S.: Computer model calibration with large nonstationary spatial outputs: application to the calibration of a climate model, J. Roy. Stat. Soc. CAppl., 68, 51–78, https://doi.org/10.1111/rssc.12309, 2019. a
Chang, K.L., Guillas, S., and Fioletov, V. E.: Spatial mapping of groundbased observations of total ozone, Atmos. Meas. Tech., 8, 4487–4505, https://doi.org/10.5194/amt844872015, 2015. a
Chang, K.L., Petropavlovskikh, I., Cooper, O. R., Schultz, M. G., and Wang, T.: Regional trend analysis of surface ozone observations from monitoring networks in eastern North America, Europe and East Asia, Elementa, 5, p. 50, https://doi.org/10.1525/elementa.243, 2017. a
Cohen, A. J., Brauer, M., Burnett, R., Anderson, H. R., Frostad, J., Estep, K., Balakrishnan, K., Brunekreef, B., Dandona, L., Dandona, R., Feigin, V., Freedman, G., Hubbell, B., Jobling, A., Kan, H., Knibbs, L., Liu, Y., Martin, R., Morawska, L., III, C. A. P., Shin, H., Straif, K., Shaddick, G., Thomas, M., van Dingenen, R., van Donkelaar, A., Vos, T., Murray, C. J. L., and Forouzanfar, M. H.: Estimates and 25year trends of the global burden of disease attributable to ambient air pollution: an analysis of data from the Global Burden of Diseases Study 2015, The Lancet, 389, 1907–1918, https://doi.org/10.1016/S01406736(17)305056, 2017. a
Conti, S. and O'Hagan, A.: Bayesian emulation of complex multioutput and dynamic computer models, J. Stat. Plan. Infer., 140, 640–651, https://doi.org/10.1016/j.jspi.2009.08.006, 2010. a
Cooper, O. R., Parrish, D. D., Ziemke, J. R., Balashov, N. V., Cupeiro, M., Galbally, I. E., Gilge, S., Horowitz, L., Jensen, N. R., Lamarque, J.F., Naik, V., Oltmans, S. J., Schwab, J., Shindell, D. T., Thompson, A. M., Thouret, V., Wang, Y., and Zbinden, R. M.: Global distribution and trends of tropospheric ozone: An observationbased review, Elementa, 2, p. 000029, https://doi.org/10.12952/journal.elementa.000029, 2014. a
Cressie, N. and Johannesson, G.: Fixed rank kriging for very large spatial data sets, J. Roy. Stat. Soc. B, 70, 209–226, https://doi.org/10.1111/j.14679868.2007.00633.x, 2008. a, b
Diggle, P. J., Menezes, R., and Su, T.l.: Geostatistical inference under preferential sampling, J. Roy. Stat. Soc. CApp., 59, 191–232, 2010. a
Fleming, Z. L., Doherty, R. M., von Schneidemesser, E., Malley, C. S., Cooper, O. R., Pinto, J. P., Colette, A., Xu, X., Simpson, D., Schultz, M. G., Lefohn, A. S., Hamad, S., Moolla, R., Solberg, S., and Feng, Z.: Tropospheric Ozone Assessment Report: Presentday ozone distribution and trends relevant to human health, Elementa, 6, p. 12, https://doi.org/10.1525/elementa.273, 2018. a
Fuentes, M. and Raftery, A. E.: Model evaluation and spatial interpolation by Bayesian combination of observations with outputs from numerical models, Biometrics, 61, 36–45, https://doi.org/10.1111/j.0006341X.2005.030821.x, 2005. a
Furrer, R. and Sain, S. R.: Spatial model fitting for large datasets with applications to climate and microarray problems, Stat. Comput., 19, 113–128, https://doi.org/10.1007/s112220089075x, 2009. a
Gaudel, A., Cooper, O. R., Ancellet, G., Barret, B., Boynard, A., Burrows, J. P., Clerbaux, C., Coheur, P. F., Cuesta, J., Cuevas, E., Doniki, S., Dufour, G., Ebojie, F., Foret, G., Garcia, O., Muños, M. J. G., Hannigan, J. W., Hase, F., Huang, G., Hassler, B., Hurtmans, D., Jaffe, D., Jones, N., Kalabokas, P., Kerridge, B., Kulawik, S. S., Latter, B., Leblanc, T., Flochmoën, E. L., Lin, W., Liu, J., Liu, X., Mahieu, E., McClureBegley, A., Neu, J. L., Osman, M., Palm, M., Petetin, H., Petropavlovskikh, I., Querel, R., Rahpoe, N., Rozanov, A., Schultz, M. G., Schwab, J., Siddans, R., Smale, D., Steinbacher, M., Tanimoto, H., Tarasick, D. W., Thouret, V., Thompson, A. M., Trickl, T., Weatherhead, E. C., Wespes, C., Worden, H. M., Vigouroux, C., Xu, X., Zeng, G., and Ziemke, J. R.: Tropospheric Ozone Assessment Report: Presentday distribution and trends of tropospheric ozone relevant to climate and global atmospheric chemistry model evaluation, Elementa, 6, p. 39, https://doi.org/10.1525/elementa.291, 2018. a
GBD: Global, regional, and national comparative risk assessment of 79 behavioural, environmental and occupational, and metabolic risks or clusters of risks in 188 countries, 1990–2013: a systematic analysis for the Global Burden of Disease Study 2013, The Lancet, 386, 2287–2323, https://doi.org/10.1016/S01406736(15)001282, 2015. a
Gelfand, A. E. and Sahu, S. K.: Combining monitoring data and computer model output in assessing environmental exposure, in: Handbook of Applied Bayesian Analysis, Oxford University Press, Oxford, UK, 482–510, 2010. a
Gotway, C. A. and Young, L. J.: Combining incompatible spatial data, J. Am. Stat. Assoc., 97, 632–648, https://doi.org/10.1198/016214502760047140, 2002. a
Gramacy, R. B. and Apley, D. W.: Local Gaussian process approximation for large computer experiments, J. Comput. Graph. Stat., 24, 561–578, https://doi.org/10.1080/10618600.2014.914442, 2015. a
Guillas, S., Tiao, G. C., Wuebbles, D. J., and Zubrow, A.: Statistical diagnostic and correction of a chemistrytransport model for the prediction of total column ozone, Atmos. Chem. Phys., 6, 525–537, https://doi.org/10.5194/acp65252006, 2006. a
He, Y. and Xiu, D.: Numerical strategy for model correction using physical constraints, J. Comput. Phys., 313, 617–634, https://doi.org/10.1016/j.jcp.2016.02.054, 2016. a
Heath, A., Manolopoulou, I., and Baio, G.: Estimating the expected value of partial perfect information in health economic evaluations using integrated nested Laplace approximation, Stat. Med., 35, 4264–4280, https://doi.org/10.1002/sim.6983, 2016. a
Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and ZammitMangion, A.: A case study competition among methods for analyzing large spatial data, J. Agric. Biol. Envir. S., 1–28, https://doi.org/10.1007/s1325301800348w, online first, 2018. a
Hoeting, J. A., Davis, R. A., Merton, A. A., and Thompson, S. E.: Model selection for geostatistical models, Ecol. Appl., 16, 87–98, https://doi.org/10.1890/040576, 2006. a, b
Hu, L., Keller, C. A., Long, M. S., Sherwen, T., Auer, B., Da Silva, A., Nielsen, J. E., Pawson, S., Thompson, M. A., Trayanov, A. L., Travis, K. R., Grange, S. K., Evans, M. J., and Jacob, D. J.: Global simulation of tropospheric chemistry at 12.5 km resolution: performance and evaluation of the GEOSChem chemical module (v101) within the NASA GEOS Earth system model (GEOS5 ESM), Geosci. Model Dev., 11, 4603–4620, https://doi.org/10.5194/gmd1146032018, 2018. a, b, c
Hyde, R., Hossaini, R., and Leeson, A. A.: Clusterbased analysis of multimodel climate ensembles, Geosci. Model Dev., 11, 2033–2048, https://doi.org/10.5194/gmd1120332018, 2018. a, b
Jerrett, M., Burnett, R. T., Pope III, C. A., Ito, K., Thurston, G., Krewski, D., Shi, Y., Calle, E., and Thun, M.: Longterm ozone exposure and mortality, N. Engl. J. Med., 360, 1085–1095, https://doi.org/10.1056/NEJMoa0803894, 2009. a
Josse, B., Simon, P., and Peuch, V.H.: Radon global simulations with the multiscale chemistry and transport model MOCAGE, Tellus B, 56, 339–356, https://doi.org/10.1111/j.16000889.2004.00112.x, 2004. a
Jun, M. and Stein, M. L.: Statistical comparison of observed and CMAQ modeled daily sulfate levels, Atmos. Environ., 38, 4427–4436, https://doi.org/10.1016/j.atmosenv.2004.05.019, 2004. a
Jun, M. and Stein, M. L.: An approach to producing space–time covariance functions on spheres, Technometrics, 49, 468–479, https://doi.org/10.1198/004017007000000155, 2007. a
Jun, M. and Stein, M. L.: Nonstationary covariance models for global data, Ann. Appl. Stat., 2, 1271–1289, https://doi.org/10.1214/08AOAS183, 2008. a
Jun, M., Knutti, R., and Nychka, D. W.: Spatial analysis to quantify numerical model bias and dependence: how many climate models are there?, J. Am. Stat. Assoc., 103, 934–947, https://doi.org/10.1198/016214507000001265, 2008. a
Kammann, E. and Wand, M. P.: Geoadditive models, J. Roy. Stat. Soc. CApp., 52, 1–18, https://doi.org/10.1111/14679876.00385, 2003. a
Kennedy, M. C. and O'Hagan, A.: Bayesian calibration of computer models, J. Roy. Stat. Soc. B, 63, 425–464, https://doi.org/10.1111/14679868.00294, 2001. a
Knutti, R., Furrer, R., Tebaldi, C., Cermak, J., and Meehl, G. A.: Challenges in combining projections from multiple climate models, J. Climate, 23, 2739–2758, https://doi.org/10.1175/2009JCLI3361.1, 2010. a
Lefohn, A. S., Malley, C. S., Smith, L., Wells, B., Hazucha, M., Simon, H., Naik, V., Mills, G., Schultz, M. G., Paoletti, E., De Marco, A., Xu, X., Zhang, L., Wang, T., Neufeld, H. S., Musselman, R. C., Tarasick, D., Brauer, M., Feng, Z., Tang, H., Kobayashi, K., Sicard, P., Solberg, S., and Gerosa, G.: Tropospheric ozone assessment report: Global ozone metrics for climate change, human health, and crop/ecosystem research, Elementa, 6, p. 28, https://doi.org/10.1525/elementa.279, 2018. a
Liang, F., Cheng, Y., Song, Q., Park, J., and Yang, P.: A resamplingbased stochastic approximation method for analysis of large geostatistical data, J. Am. Stat. Assoc., 108, 325–339, 2013. a
Lin, M., Fiore, A. M., Horowitz, L. W., Cooper, O. R., Naik, V., Holloway, J., Johnson, B. J., Middlebrook, A. M., Oltmans, S. J., Pollack, I. B., Ryerson, T. B., Warner, J. X., Wiedinmyer, C., Wilson, J., and Wyman, B.: Transport of Asian ozone pollution into surface air over the western United States in spring, J. Geophys. Res., 117, D00V07, https://doi.org/10.1029/2011JD016961, 2012. a
Lin, M., Horowitz, L. W., Oltmans, S. J., Fiore, A. M., and Fan, S.: Tropospheric ozone trends at Mauna Loa Observatory tied to decadal climate variability, Nat. Geosci., 7, 136–143, https://doi.org/10.1038/NGEO2066, 2014. a
Lin, M., Horowitz, L. W., Payton, R., Fiore, A. M., and Tonnesen, G.: US surface ozone trends and extremes from 1980 to 2014: quantifying the roles of rising Asian emissions, domestic controls, wildfires, and climate, Atmos. Chem. Phys., 17, 2943–2970, https://doi.org/10.5194/acp1729432017, 2017. a
Lindgren, F. and Rue, H.: Bayesian spatial and spatiotemporal modelling with RINLA, J. Stat. Softw., 63, 1–25, https://doi.org/10.18637/jss.v063.i19, 2015. a, b
Lindgren, F., Rue, H., and Lindström, J.: An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach, J. Roy. Stat. Soc. B, 73, 423–498, https://doi.org/10.1111/j.14679868.2011.00777.x, 2011. a, b, c, d, e, f, g
Liu, X. and Guillas, S.: Dimension reduction for Gaussian process emulation: an application to the influence of bathymetry on tsunami heights, SIAM/ASA J. Uncertain. Quantif., 5, 787–812, https://doi.org/10.1137/16M1090648, 2017. a
Malley, C. S., Henze, D. K., Kuylenstierna, J. C., Vallack, H. W., Davila, Y., Anenberg, S. C., Turner, M. C., and Ashmore, M. R.: Updated global estimates of respiratory mortality in adults ≥30 years of age attributable to longterm ozone exposure, Environ. Health Persp., 125, 9 pp., https://doi.org/10.1289/EHP1390, 2017. a, b
Mills, G., Pleijel, H., Malley, C. S., Sinha, B., Cooper, O. R., Schultz, M. G., Neufeld, H. S., Simpson, D., Sharps, K., Feng, Z., Gerosa, G., Harmens, H., Kobayashi, K., Saxena, P., Paoletti, E., Sinha, V., and Xu, X.: Tropospheric Ozone Assessment Report: Presentday tropospheric ozone distribution and trends relevant to vegetation, Elementa, 6, p. 47, https://doi.org/10.1525/elementa.302, 2018. a
Morgenstern, O., Giorgetta, M. A., Shibata, K., Eyring, V., Waugh, D. W., Shepherd, T. G., Akiyoshi, H., Austin, J., Baumgaertner, A. J. G., Bekki, S., Braesicke, P., Brühl, C., Chipperfield, M., Cugnet, D., Dameris, M., Dhomse, S., Frith, S. M., Garny, H., Gettelman, A., Hardiman, S. C., Hegglin, M. I., Jöckel, P., Kinnison, D. E., Lamarque, J.F., Mancini, E., Manzini, E., Marchand, M., Michou, M., Nakamura, T., Nielsen, J. E., Olivié, D., Pitari, G., Plummer, D. A., Rozanov, E., Scinocca, J. F., Smale, D., Teyssèdre, H., Toohey, M., Tian, W., and Yamashita, Y.: Review of the formulation of presentgeneration stratospheric chemistryclimate models and associated external forcings, J. Geophys. Res., 115, D00M02, https://doi.org/10.1029/2009JD013728, 2010. a
Morgenstern, O., Hegglin, M. I., Rozanov, E., O'Connor, F. M., Abraham, N. L., Akiyoshi, H., Archibald, A. T., Bekki, S., Butchart, N., Chipperfield, M. P., Deushi, M., Dhomse, S. S., Garcia, R. R., Hardiman, S. C., Horowitz, L. W., Jöckel, P., Josse, B., Kinnison, D., Lin, M., Mancini, E., Manyin, M. E., Marchand, M., Marécal, V., Michou, M., Oman, L. D., Pitari, G., Plummer, D. A., Revell, L. E., SaintMartin, D., Schofield, R., Stenke, A., Stone, K., Sudo, K., Tanaka, T. Y., Tilmes, S., Yamashita, Y., Yoshida, K., and Zeng, G.: Review of the global models used within phase 1 of the Chemistry–Climate Model Initiative (CCMI), Geosci. Model Dev., 10, 639–671, https://doi.org/10.5194/gmd106392017, 2017. a, b
NASA Center for Climate Simulation (NCCS) Dataportal: https://portal.nccs.nasa.gov/datashare/G5NRChem/Heracles/12.5km/DATA/, Curator: Bill McHale, last access: 28 February 2019. a
Nguyen, H., Cressie, N., and Braverman, A.: Spatial statistical data fusion for remote sensing applications, J. Am. Stat. Assoc., 107, 1004–1018, https://doi.org/10.1080/01621459.2012.694717, 2012. a
Oman, L. D., Ziemke, J. R., Douglass, A. R., Waugh, D. W., Lang, C., Rodriguez, J. M., and Nielsen, J. E.: The response of tropical tropospheric ozone to ENSO, Geophys. Res. Lett., 38, L13706, https://doi.org/10.1029/2011GL047865, 2011. a
R Core Team: R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2013. a
Rasmussen, C. E. and Williams, C. K. I.: Gaussian processes for machine learning, The MIT Press, Cambridge, MA, USA, 2006. a
Rue, H. and Held, L.: Gaussian Markov random fields: theory and applications, CRC Press, New York, USA, 2005. a
Rue, H., Martino, S., and Chopin, N.: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations, J. Roy. Stat. Soc. B, 71, 319–392, https://doi.org/10.1111/j.14679868.2008.00700.x, 2009. a, b, c
Rue, H., Riebler, A., Sørbye, S. H., Illian, J. B., Simpson, D. P., and Lindgren, F. K.: Bayesian computing with INLA: a review, Annu. Rev. Stat. Appl., 4, 395–421, https://doi.org/10.1146/annurevstatistics060116054045, 2017. a
Sang, H. and Huang, J. Z.: A full scale approximation of covariance functions for large spatial data sets, J. Roy. Stat. Soc. B, 74, 111–132, https://doi.org/10.1111/j.14679868.2011.01007.x, 2012. a
Sang, H., Jun, M., and Huang, J. Z.: Covariance approximation for large multivariate spatial data sets with an application to multiple climate model errors, Ann. Appl. Stat., 5, 2519–2548, https://doi.org/10.1214/11AOAS478, 2011. a
Schneider, T.: Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values, J. Climate, 14, 853–871, 2001. a
Schultz, M. G., Schröder, S., Lyapina, O., Cooper, O. R., Galbally, I., Petropavlovskikh, I., von Schneidemesser, E., Tanimoto, H., Elshorbany, Y., Naja, M., Seguel, R., Dauert, U., Eckhardt, P., Feigenspahn, S., Fiebig, M., Hjellbrekke, A.G., Hong, Y.D., Kjeld, P. C., Koide, H., Lear, G., Tarasick, D., Ueno, M., Wallasch, M., Baumgardner, D., Chuang, M.T., Gillett, R., Lee, M., Molloy, S., Moolla, R., Wang, T., Sharps, K., Adame, J. A., Ancellet, G., Apadula, F., Artaxo, P., Barlasina, M., Bogucka, M., Bonasoni, P., Chang, L., Colomb, A., Cuevas, E., Cupeiro, M., Degorska, A., Ding, A., Fröhlich, M., Frolova, M., Gadhavi, H., Gheusi, F., Gilge, S., Gonzalez, M. Y., Gros, V., Hamad, S. H., Helmig, D., Henriques, D., Hermansen, O., Holla, R., Huber, J., Im, U., Jaffe, D. A., Komala, N., Kubistin, D., Lam, K.S., Laurila, T., Lee, H., Levy, I., Mazzoleni, C., Mazzoleni, L., McClureBegley, A., Mohamad, M., Murovic, M., NavarroComas, M., Nicodim, F., Parrish, D., Read, K. A., Reid, N., Ries, L., Saxena, P., Schwab, J. J., Scorgie, Y., Senik, I., Simmonds, P., Sinha, V., Skorokhod, A., Spain, G., Spangl, W., Spoor, R., Springston, S. R., Steer, K., Steinbacher, M., Suharguniyawan, E., Torre, P., Trickl, T., Weili, L., Weller, R., Xu, X., Xue, L., and Zhiqiang, M.: Tropospheric Ozone Assessment Report: Database and metrics data of global surface ozone observations, Elementa, 5, p. 58, https://doi.org/10.1525/elementa.244, 2017. a, b
Seltzer, K. M., Shindell, D. T., and Malley, C. S.: Measurementbased assessment of health burdens from longterm ozone exposure in the United States, Europe, and China, Environ. Res. Lett., 13, 104018, https://doi.org/10.1088/17489326/aae29d, 2018. a
Shaddick, G. and Zidek, J. V.: A case study in preferential sampling: Long term monitoring of air pollution in the UK, Spatial Statistics, 9, 51–65, 2014. a
Shaddick, G. and Zidek, J. V.: Spatiotemporal methods in environmental epidemiology, CRC Press, New York, USA, 2015. a, b
Shaddick, G., Thomas, M. L., Green, A., Brauer, M., Donkelaar, A., Burnett, R., Chang, H. H., Cohen, A., Dingenen, R. V., Dora, C., Gumy, S., Liu, Y., Martin, R., Waller, L. A., West, J. J., Zidek, J. V., and PrüssUstün, A.: Data integration model for air quality: a hierarchical approach to the global estimation of exposures to ambient air pollution, J. R. Stat. Soc. CAppl., 67, 231–253, https://doi.org/10.1111/rssc.12227, 2018. a, b
Shindell, D., Faluvegi, G., Seltzer, K., and Shindell, C.: Quantified, localized health benefits of accelerated carbon dioxide emissions reductions, Nature Climate Change, 8, 291–295, 2018. a, b
Sofen, E. D., Bowdalo, D., and Evans, M. J.: How to most effectively expand the global surface ozone observing network, Atmos. Chem. Phys., 16, 1445–1457, https://doi.org/10.5194/acp1614452016, 2016. a
Stainforth, D. A., Allen, M. R., Tredger, E. R., and Smith, L. A.: Confidence, uncertainty and decisionsupport relevance in climate predictions, Philos. T. Roy. Soc. A, 365, 2145–2161, https://doi.org/10.1098/rsta.2007.2074, 2007. a
Stein, M. L.: Interpolation of spatial data: in: Some theory for kriging, Springer Science and Business Media, New York, USA, 2012. a
Stevenson, D. S., Dentener, F. J., Schultz, M. G., Ellingsen, K., Noije, T. P. C. V., Wild, O., Zeng, G., Amann, M., Atherton, C. S., Bell, N., Bergmann, D. J., Bey, I., Butler, T., Cofala, J., Collins, W. J., Derwent, R. G., Doherty, R. M., Drevet, J., Eskes, H. J., Fiore, A. M., Gauss, M., Hauglustaine, D. A., Horowitz, L. W., Isaksen, I. S. A., Krol, M. C., Lamarque, J.F., Lawrence, M. G., Montanaro, V., Müller, J.F., Pitari, G., Prather, M. J., Pyle, J. A., Rast, S., Rodriguez, J. M., Sanderson, M. G., Savage, N. H., Shindell, D. T., Strahan, S. E., Sudo, K., and Szopa, S.: Multimodel ensemble simulations of presentday and nearfuture tropospheric ozone, J. Geophys. Res., 111, D08301, https://doi.org/10.1029/2005JD006338, 2006. a, b
Strode, S. A., Rodriguez, J. M., Logan, J. A., Cooper, O. R., Witte, J. C., Lamsal, L. N., Damon, M., Van Aartsen, B., Steenrod, S. D., and Strahan, S. E.: Trends and variability in surface ozone over the United States, J. Geophys. Res.Atmos., 120, 9020–9042, https://doi.org/10.1002/2014JD022784, 2015. a
Sudo, K., Takahashi, M., and Akimoto, H.: CHASER: A global chemical model of the troposphere, 2. Model results and evaluation, J. Geophys. Res., 107, 4586, https://doi.org/10.1029/2001JD001114, 2002a. a
Sudo, K., Takahashi, M., Kurokawa, J., and Akimoto, H.: CHASER: A global chemical model of the troposphere, 1. Model description, J. Geophys. Res., 107, 4339, https://doi.org/10.1029/2001JD001113, 2002b. a
Teyssèdre, H., Michou, M., Clark, H. L., Josse, B., Karcher, F., Olivié, D., Peuch, V.H., SaintMartin, D., Cariolle, D., Attié, J.L., Nédélec, P., Ricaud, P., Thouret, V., van der A, R. J., VolzThomas, A., and Chéroux, F.: A new tropospheric and stratospheric Chemistry and Transport Model MOCAGEClimat for multiyear studies: evaluation of the presentday climatology and sensitivity to surface processes, Atmos. Chem. Phys., 7, 5815–5860, https://doi.org/10.5194/acp758152007, 2007. a
Turner, M. C., Jerrett, M., Pope III, C. A., Krewski, D., Gapstur, S. M., Diver, W. R., Beckerman, B. S., Marshall, J. D., Su, J., Crouse, D. L., and Burnett, R. T.: Longterm ozone exposure and mortality in a large prospective study, Am. J. Resp. Crit. Care, 193, 1134–1142, https://doi.org/10.1164/rccm.2015081633OC, 2016. a, b, c
US Environmental Protection Agency: Integrated Science Assessment (ISA) for Ozone and Related Photochemical Oxidants, Office of Research and Development, Research Triangle Park, NC, EPA/600/R10/076F, 2013. a
Voulgarakis, A., Naik, V., Lamarque, J.F., Shindell, D. T., Young, P. J., Prather, M. J., Wild, O., Field, R. D., Bergmann, D., CameronSmith, P., Cionni, I., Collins, W. J., Dalsøren, S. B., Doherty, R. M., Eyring, V., Faluvegi, G., Folberth, G. A., Horowitz, L. W., Josse, B., MacKenzie, I. A., Nagashima, T., Plummer, D. A., Righi, M., Rumbold, S. T., Stevenson, D. S., Strode, S. A., Sudo, K., Szopa, S., and Zeng, G.: Analysis of present day and future OH and methane lifetime in the ACCMIP simulations, Atmos. Chem. Phys., 13, 2563–2587, https://doi.org/10.5194/acp1325632013, 2013. a
Watanabe, S., Hajima, T., Sudo, K., Nagashima, T., Takemura, T., Okajima, H., Nozawa, T., Kawase, H., Abe, M., Yokohata, T., Ise, T., Sato, H., Kato, E., Takata, K., Emori, S., and Kawamiya, M.: MIROCESM 2010: model description and basic results of CMIP520c3m experiments, Geosci. Model Dev., 4, 845–872, https://doi.org/10.5194/gmd48452011, 2011. a
Weatherhead, E. C., Bodeker, G. E., Fassò, A., Chang, K.L., Lazo, J. K., Clack, C. T. M., Hurst, D. F., Hassler, B., English, J. M., and Yorgun, S.: Spatial coverage of monitoring networks: A climate observing system simulation experiment, J. Appl. Meteorol. Clim., 56, 3211–3228, https://doi.org/10.1175/JAMCD170040.1, 2017. a
Weigel, A. P., Knutti, R., Liniger, M. A., and Appenzeller, C.: Risks of model weighting in multimodel climate projections, J. Climate, 23, 4175–4191, https://doi.org/10.1175/2010JCLI3594.1, 2010. a
Williamson, D., Blaker, A. T., Hampton, C., and Salter, J.: Identifying and removing structural biases in climate models with history matching, Clim. Dynam., 45, 1299–1324, https://doi.org/10.1007/s003820142378z, 2015. a, b
Wood, S. N.: Generalized additive models: an introduction with R, 2nd edn., CRC Press, New York, USA, 2017. a, b
Wood, S. N. and Augustin, N. H.: GAMs with integrated model selection using penalized regression splines and applications to environmental modelling, Ecol. Model., 157, 157–177, 2002. a
Wood, S. N., Bravington, M. V., and Hedley, S. L.: Soap film smoothing, J. R. Stat. Soc. BMet., 70, 931–955, https://doi.org/10.1111/j.14679868.2008.00665.x, 2008. a
World Health Organization: Air quality guidelines global update 2005: Particulate matter, ozone, nitrogen dioxide, and sulfur dioxide, WHO, Regional Office for Europe, Copenhagen, http://www.euro.who.int/__data/assets/pdf_file/0005/78638/E90038.pdf (last access: 28 February 2019), 2005. a
World Meteorological Organization: Scientific Assessment of Ozone Depletion: 2010: Pursuant to Article 6 of the Montreal Protocol on Substances that Deplete the Ozone Layer, World Meterological Organization, Geneva, Switzerland, 2011. a
Wu, S., Mickley, L. J., Jacob, D. J., Rind, D., and Streets, D. G.: Effects of 2000–2050 changes in climate and emissions on global tropospheric ozone and the policyrelevant background surface ozone in the United States, J. Geophys. Res., 113, D18312, https://doi.org/10.1029/2007JD009639, 2008. a
Young, P. J., Archibald, A. T., Bowman, K. W., Lamarque, J.F., Naik, V., Stevenson, D. S., Tilmes, S., Voulgarakis, A., Wild, O., Bergmann, D., CameronSmith, P., Cionni, I., Collins, W. J., Dalsøren, S. B., Doherty, R. M., Eyring, V., Faluvegi, G., Horowitz, L. W., Josse, B., Lee, Y. H., MacKenzie, I. A., Nagashima, T., Plummer, D. A., Righi, M., Rumbold, S. T., Skeie, R. B., Shindell, D. T., Strode, S. A., Sudo, K., Szopa, S., and Zeng, G.: Preindustrial to end 21st century projections of tropospheric ozone from the Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP), Atmos. Chem. Phys., 13, 2063–2090, https://doi.org/10.5194/acp1320632013, 2013. a, b
Young, P. J., Naik, V., Fiore, A. M., Gaudel, A., Guo, J., Lin, M., Neu, J. L., Parrish, D. D., Rieder, H. E., Schnell, J. L., Tilmes, S., Wild, O., Zhang, L., Ziemke, J. R., Brandt, J., Delcloo, A., Doherty, R. M., Geels, C., Hegglin, M. I., Hu, L., Im, U., Kumar, R., Luhar, A., Murray, L., Plummer, D., Rodriguez, J., SaizLopez, A., Schultz, M. G., Woodhouse, M. T., and Zeng, G.: Tropospheric Ozone Assessment Report: Assessment of globalscale model performance for global and regional ozone distributions, variability, and trends, Elementa, 6, p. 10, https://doi.org/10.1525/elementa.265, 2018. a, b, c