We have developed a new statistical approach (M

Tropospheric ozone is a pollutant detrimental to human health
and has been associated with a range of adverse cardiovascular and
respiratory health effects due to short-term and long-term exposure

A useful endeavor for producing an accurate representation of the global
surface ozone distribution is to combine the output from many models in a way
that takes advantage of the strengths of each model and minimizes the
weaknesses. Such efforts have already been made for both climate and
chemistry–climate models. For example, multi-model output has been combined
using a parametric approach, either by assigning an equal or optimum weight
to each model

When simply averaging the output from multiple climate models, most studies either explicitly or implicitly assume that every model is independent and is a random sample from a distribution with the true climate as its unbiased mean. This implies that the average of a set of models converges to the true climate as more models are added. Such a multi-model ensemble often outperforms any single model in predictive capability. When several dozen or hundreds of ensemble members are available, the most straightforward and efficient approach is to take the ensemble average, ignoring the impact of potentially erroneous outlier members. From a statistical point of view, however, one might argue that ruling out potentially erroneous ensemble members before computing the ensemble mean would yield an even better result, especially if the overall number of ensemble members is small.
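The independence assumption described above can be illustrated with a small simulation. This is a minimal sketch, not part of the analysis in this paper: the "true" value, the error spread and the function names are hypothetical, and the model errors are drawn as independent zero-mean Gaussian noise, which is exactly the idealized assumption in question.

```python
import random
import statistics

TRUTH = 40.0    # hypothetical "true" ozone value (ppb)
ERROR_SD = 5.0  # hypothetical spread of independent model errors (ppb)

def simulate_ensemble(n_models, seed):
    """Draw n_models values around TRUTH with independent, zero-mean errors."""
    rng = random.Random(seed)
    return [TRUTH + rng.gauss(0.0, ERROR_SD) for _ in range(n_models)]

def mean_abs_error(n_models, trials=200):
    """Average |ensemble mean - TRUTH| over many simulated ensembles."""
    errors = []
    for seed in range(trials):
        members = simulate_ensemble(n_models, seed)
        errors.append(abs(statistics.fmean(members) - TRUTH))
    return statistics.fmean(errors)

for n in (5, 50, 500):
    print(f"{n:4d} members: mean |error| = {mean_abs_error(n):.2f} ppb")
```

Under these idealized assumptions the error of the ensemble mean shrinks roughly as one over the square root of the number of members. Real models share components and biases, so this convergence should not be taken for granted.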

Combining model ensembles using a method more sophisticated than the simple
average is a challenge because a meaningful model evaluation can rarely be
condensed into a single metric, and there is no technique that can explicitly
quantify the degree of similarity (i.e., both accuracy and precision) between
two different spatial fields

This paper presents a new statistical approach (M

Past estimates of global mortality due to long-term ozone exposure have
relied on surface ozone fields produced by global atmospheric chemistry
models due to the limited coverage of the global ozone monitoring network

Section 2 provides details of the data sources and fusion process, including the techniques to register all data sources onto a common grid and the statistical model used to minimize the difference between interpolated observations and the multi-model combination. In Sect. 3, the results of employing these techniques are presented, including the mapping accuracy, evaluation of regional model performance and the final multi-model bias correction. The paper concludes with a summary and discussion in Sect. 4.

TOAR has gathered ozone observations through 2014 at most sites and has
chosen 2008–2014 as a “present-day” window for more rigorous analysis. The
purposes of the multi-year average are to reduce the effects of ozone
interannual variability, which is largely driven by changes in meteorological
conditions

The output from each individual model is shown in Fig. S1 in the Supplement.
Note that NASA G5NR-Chem has the finest resolution of these models;
accordingly, we aim to produce our final product on the same

List of the ensemble members used in this paper.

In order to compare model output to observations, we need to register model
output and observations to a common grid. This registration enables us to
quantify the differences between the models and observations. Previous
attempts have usually relied on a variant of a general statistical
interpolation framework to combine incompatible spatial data

Following is a description of our method for fusing observations and output
from multiple global atmospheric chemistry models to produce a surface ozone
product with maximized accuracy. This method is known as Measurement and
Multi-Model Fusion (version 1), or M

Due to this study's human health focus, we do not consider ozone above the
data-sparse oceans. Above land, large observational gaps are present across
Africa, the Middle East, South America, and south and southeast Asia, where
the spatial interpolation is generally too uncertain to yield a reliable
surface ozone approximation. The ozone estimates in these regions must come
from either models or distant observations, neither of which is ideal.
As a compromise strategy, we fill these gaps with a weighted
model product evaluated by the interpolated ozone observations. We propose
the following procedure to combine model output and observations for data
integration:

In this study, we carry out the spatial interpolation by using the combination
of the integrated nested Laplace approximation (INLA) framework

We carry out the statistical interpolation via the following steps:
(1) calculate the ozone metric at each TOAR site and for every year in
2008–2014; (2) perform the statistical interpolation using all available
sites with their exact coordinates and project the surface onto a

We use a bilinear interpolation to smooth model output from coarser
resolution to a

Due to the sparsity of stations in many regions, we use a predefined
geometric boundary to differentiate regions. A more meaningful physical
boundary (i.e., regions with similar chemical regimes or major features such
as deserts, mountain ranges or water bodies) might be determined using a
cluster analysis technique

Since we partition the global land surface into eight regions and evaluate
the models individually, inevitably there will be disjointed boundaries
between regions. The boundaries between North and South America, or between
east Asia and Oceania, fall mostly in the oceans, so we do not need to adjust
these regions. However, we should make an adjustment to disjointed boundaries
that fall across inhabited areas (see Fig. S2 for the illustration). As an
example of our method, consider the boundary between east Asia and Russia
near 50

We adopted a regression weighting approach that only accounts for the mean
spatial fields of the interpolated ozone and model output, rather than the
underlying associated uncertainty. We take this approach due to the
prohibitive size of high-resolution output (over 1 million output points for
each model) but also due to the lack of a thorough investigation regarding
the ideal method for combining models based on different sources of
uncertainty. For example, the interpolation uncertainty can be quantified
easily through the posterior distribution and considered to be related to
measurement error (small scale) or sparse sampling across a region (large
scale); however, model uncertainty is a different concept altogether that
could result from input uncertainty (e.g., air pollution emissions
inventories) or limitations of the transport and chemistry mechanisms within
the model
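A regression weighting on the mean fields alone can be sketched as a small constrained least-squares problem. The following is a minimal illustration, not the implementation used in this study. It assumes that the weights are non-negative and sum to one (these constraints are our assumption for the sketch), in which case the weighted model combination minus the interpolated observations equals the weighted sum of the model-minus-observation difference fields. The function names and the toy numbers are hypothetical, and projected gradient descent is just one simple solver.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def fit_simplex_weights(differences, steps=2000):
    """Minimize the mean squared value of sum_j w_j * differences[j][i].

    differences[j][i] is model j minus interpolated observation at cell i.
    Weights are constrained to be non-negative and to sum to one.
    """
    k, n = len(differences), len(differences[0])
    w = [1.0 / k] * k
    # Safe step size: the trace bounds the largest eigenvalue of the Hessian.
    lipschitz = sum(x * x for d in differences for x in d) / n
    step = 1.0 / lipschitz
    for _ in range(steps):
        residual = [sum(w[j] * differences[j][i] for j in range(k))
                    for i in range(n)]
        grad = [sum(differences[j][i] * residual[i] for i in range(n)) / n
                for j in range(k)]
        w = project_to_simplex([w[j] - step * grad[j] for j in range(k)])
    return w

# Toy region with six cells: one model biased high, one biased low, one noisy.
toy_differences = [
    [2.0] * 6,
    [-2.0] * 6,
    [5.0, -5.0, 5.0, -5.0, 5.0, -5.0],
]
print([round(x, 3) for x in fit_simplex_weights(toy_differences)])
```

In this toy case the two oppositely biased models cancel each other and the noisy model receives essentially no weight, which is the behavior one would hope for from a regional weighting.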

Ground-based measurements were available from 4766 stations reported in the
TOAR database

TOAR observations where
the monitoring locations are discretized to a

Figure

Estimates of spatially interpolated surface ozone distribution and associated uncertainty (half width of the 95 % credible interval from each cell).

The ozone metric for each model was calculated for each individual grid cell
in each year, then averaged over 2008–2014 and registered to the common

Multi-model mean and standard deviation (SD) in each grid cell from six ensemble members.

Averaging all six models captures the large-scale variations of the ozone
distribution; however, many regions in northern midlatitudes and low latitudes are
biased high compared to the observations in the TOAR database. A simple
approach to address the uncertainty in the multi-model mean is to calculate
the standard deviation for each grid cell from the different models, as shown
in Fig.

Note that the spatially interpolated observations are smoother in regions with fewer sites and reveal a more detailed structure in regions with a dense station network. In contrast, the multi-model mean is noisier: even though we average across multiple years and multiple models, the resulting ozone metric can still be noisy because it is calculated at each grid cell independently. To make maximum use of the skill of each model, we restrict the model evaluation to the regional scale in the next section.

To evaluate the performance of each model in a given region, we calculate the
mean differences over all grid cells within the region and summarize them
with the root mean square error (RMSE). Let

Next, we select three regions with extensive monitoring: North America, Europe
and east Asia. Figure

Spatial distributions of the ozone metric in North America from each model minus spatially interpolated observations.

Spatial distributions of the ozone metric in Europe from each model minus spatially interpolated observations.

Spatial distributions of the ozone metric in east Asia from each model minus spatially interpolated observations.

We argue that the credibility of a model is not entirely determined by the RMSE (i.e., the mean difference): the smoother the difference plots, the easier it is to carry out the model bias correction. Indeed, the observations and model output are not expected to match point by point; rather, we expect the model to capture the general pattern of the spatial distribution.

The estimated weights from the constrained least squares (Eq.

The last column of Table

We combine all models according to the optimum weights from each region for
each model. Figure

RMSEs (averaged errors in a given region) between spatially
interpolated observations and each model, along with regionally optimized
weights

Multi-model composite and bias-corrected surface.

The last step of producing the final fused surface ozone product is to apply
a bias correction to our multi-model composite, limited to just those areas
in close proximity to ozone observations. Ideally, we would like to apply a
bias correction according to raw observations, but most stations are not
exactly located on the model grid coordinates (even at

The choice of the correction range, in this case 2

The fused product can be evaluated in terms of spatial correlation using the variogram, which assumes that spatial correlation is a function of distance only and not of absolute location (i.e., stationarity). Since spatial variability and continuity from the models are the result of geophysical processes represented by mathematical equations, the variogram must be customized for each field. In addition, the extremely large size of the model output prohibits us from carrying out a standard empirical variogram analysis, which requires calculating the variance of the difference between all pairs of grid cells.
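For reference, the empirical semivariogram at lag h is half the mean squared difference between values separated by a distance of about h. The sketch below estimates it from a random subsample of pairs, which is one generic way to sidestep the all-pairs cost. It is an illustration under our own assumptions and not necessarily how the variograms in this paper were produced: the function name, the binning and the planar (rather than great-circle) distance are all hypothetical choices.

```python
import math
import random

def subsampled_variogram(coords, values, bin_width, n_bins, n_pairs, seed=0):
    """Estimate the omnidirectional semivariogram from random pairs.

    Returns one value per distance bin [b*bin_width, (b+1)*bin_width),
    or None for bins that received no pairs.
    """
    rng = random.Random(seed)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    n = len(values)
    for _ in range(n_pairs):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue
        h = math.dist(coords[i], coords[j])  # planar distance (assumption)
        b = int(h // bin_width)
        if b < n_bins:
            sums[b] += 0.5 * (values[i] - values[j]) ** 2
            counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy 20 x 20 field with a linear trend in x (a nonstationary mean).
coords = [(float(x), float(y)) for x in range(20) for y in range(20)]
values = [x for x, _ in coords]
gamma = subsampled_variogram(coords, values, bin_width=2.0, n_bins=8,
                             n_pairs=20000)
print(gamma)
```

For the trending toy field the estimate keeps growing with distance instead of levelling off at a sill.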

Nevertheless, we provide examples of omnidirectional variograms for the
spatial field in North America from each model and product in Fig. S5. The
standard variogram analysis focuses on the following three parameters:
(1) the nugget (variance at zero distance, which represents a subgrid
variation), which is similar for all cases; (2) the sill (total variance of a
field), where the variogram value reaches a maximum and levels off; (3) the
range (a distance where the sill is reached, and beyond that there is no
longer spatial correlation). Note that a continuously increasing variogram
is evidence of nonstationarity in the field, which is the case
for SPDE, an issue that we have accounted for. The variogram peak is about
35–40

Since the raw observations are the only reliable source for validating our
results, we align each model grid to observed locations for evaluating the
predictive performance. The RMSEs of the residuals from all observations in
2008–2014 are displayed in Table

RMSE against TOAR observations (i.e., not interpolated ozone) from the multi-model mean (MMM), multi-model composite (from fusion step 2) and the final fused product (from fusion step 3).

Our multi-model composite outperforms the multi-model mean, achieving a lower mean prediction error. Guided by the spatially interpolated observations, the multi-model composite takes advantage of the strengths of each model and achieves better accuracy. This result shows that our approach is effective, since our interim product already improves upon the simple multi-model mean. The bias correction further reduces the residuals. This is expected because the spatial kriging algorithm is designed to minimize the difference from observations and thus has the lowest RMSE (this value is the same for the kriging result and the fused product since we apply the correction based on observed locations). The RMSE of approximately 5 ppb may represent the interannually varying meteorological influence during the years 2008–2014. If this is the case, then 5 ppb may approximate the minimal RMSE that can be achieved in a multi-year analysis.
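The station-based evaluation described above can be sketched as follows. This is a minimal illustration only: it assumes a regular latitude–longitude grid and matches each station to the nearest grid-cell center, which is our simplification of the alignment step. The function names, grid layout and numbers are hypothetical.

```python
import math

def nearest_cell_value(grid, lat0, lon0, res, lat, lon):
    """Value of the cell whose center is closest to (lat, lon).

    grid[i][j] is the cell centered at (lat0 + i * res, lon0 + j * res).
    Indices are clamped so stations just outside the grid use the edge cell.
    """
    i = round((lat - lat0) / res)
    j = round((lon - lon0) / res)
    i = min(max(i, 0), len(grid) - 1)
    j = min(max(j, 0), len(grid[0]) - 1)
    return grid[i][j]

def rmse_at_stations(grid, lat0, lon0, res, stations):
    """RMSE between gridded values and (lat, lon, observed) station tuples."""
    squared = [
        (nearest_cell_value(grid, lat0, lon0, res, lat, lon) - obs) ** 2
        for lat, lon, obs in stations
    ]
    return math.sqrt(sum(squared) / len(squared))

# Toy 2 x 2 grid (ppb) and two hypothetical stations.
toy_grid = [[40.0, 42.0],
            [44.0, 46.0]]
toy_stations = [(0.1, 0.2, 41.0), (0.9, 0.8, 45.0)]
print(rmse_at_stations(toy_grid, 0.0, 0.0, 1.0, toy_stations))
```

Applying the same function to the multi-model mean, the multi-model composite and the fused product would give directly comparable scores, since all three are evaluated at the same station locations.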

In summary, the simple multi-model mean method may perform fairly well at the
continental or regional scale but does not provide an accurate
representation of the subregional structure; this is of course a limitation
of coarse model resolutions. The weighting applied during the
construction of the multi-model composite improved the accuracy but the
effect could be limited, because many small-scale processes are not (yet)
resolved by the models. To alleviate the discrepancy further, a statistical
method based on local observations is applied to correct the bias. The
advantage of our fused surface ozone product over the simple multi-model mean
can be clearly seen in Fig.

Map of the multi-model mean minus the fused surface ozone product.

In this article, we present a
flexible framework to incorporate observations and multiple models for
providing an improved estimate of the global surface ozone distribution.
Combining multivariate spatial fields in the estimation of ozone distribution
is an extension of both the conventional multi-model ensemble approach (i.e.,
simple average) and a statistical bias correction approach, and was found to
improve the prediction of surface ozone. In summary, our approach has the
following properties:

The multi-year average enables us to reduce the meteorological influence
on surface ozone. An extension of this method to time-resolved multi-annual
fields can be expected to capture the interannual variability

The INLA-SPDE interpolation framework allows for modeling of potential nonstationarity in the spatial processes.

Regional model evaluation facilitates a feature selection for multiple competing atmospheric models.

Local bias correction of the multi-model composite only at a limited range of grid cells avoids using the spatially interpolated ozone field in regions associated with higher levels of uncertainty.

For the regions with dense monitoring networks (such as North America, Europe, South Korea and Japan), the final fused product was obtained mainly from the interpolation of observations; elsewhere, the final product relied on the multi-model composite through an optimized weight from each model.
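The idea of restricting the bias correction to grid cells near observations can be sketched as follows. This is a simplified illustration and not the exact procedure of this study: it assumes a hard switch from the multi-model composite to the interpolated observations within a square window around each station cell, with no tapering at the edge. The function name and the `radius_cells` parameter are hypothetical.

```python
def fuse_near_stations(composite, interpolated, station_cells, radius_cells):
    """Use interpolated observations near stations, the composite elsewhere.

    composite, interpolated: 2-D lists on the same grid.
    station_cells: (row, col) indices of cells that contain a station.
    radius_cells: half-width of the square correction window (assumption).
    """
    n_rows, n_cols = len(composite), len(composite[0])
    fused = [row[:] for row in composite]  # leave the input untouched
    for si, sj in station_cells:
        row_lo, row_hi = max(0, si - radius_cells), min(n_rows, si + radius_cells + 1)
        col_lo, col_hi = max(0, sj - radius_cells), min(n_cols, sj + radius_cells + 1)
        for i in range(row_lo, row_hi):
            for j in range(col_lo, col_hi):
                fused[i][j] = interpolated[i][j]
    return fused

# Toy 5 x 5 grid: composite is 50 ppb everywhere, interpolation is 40 ppb.
toy_composite = [[50.0] * 5 for _ in range(5)]
toy_interpolated = [[40.0] * 5 for _ in range(5)]
for row in fuse_near_stations(toy_composite, toy_interpolated, [(0, 0)], 1):
    print(row)
```

With many stations the windows overlap, so densely monitored regions end up dominated by the interpolated field while data-sparse regions keep the composite.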

Human health studies typically adopt a fine grid resolution, such as a

The application of our methodology focuses on, but is not limited to, a
particular ozone metric relevant for quantifying the impact of long-term
ozone exposure on human health. We expect that this framework could also be
applied to other ozone metrics relevant to crop production or natural
vegetation

In general, atmospheric chemistry model estimates of surface ozone levels are
biased high, as demonstrated by a comparison of the annual mean surface ozone
produced by the ACCMIP (Atmospheric Chemistry and Climate Model
Intercomparison Project) multi-model ensemble to the TOAR surface ozone
database (see Fig. 6 of

The sources of the TOAR data and the output from four
CCMI models are listed in Sect. 2.1; the output from the GFDL-AM3 model is
archived at GFDL and is available to the public upon request to Meiyun Lin;
G5NR-Chem model output is
available for download at

In this paper, the aim of spatial interpolation is to use (discretized)
monitoring observations to build a statistical surrogate model for estimating
the ozone distribution over the whole domain on a sphere. We assume that this
ozone distribution follows a Gaussian process (GP). A GP is a collection of
random variables such that any finite subset of them has a joint
Gaussian distribution. It has been widely used in many applications as a
machine learning algorithm

Let

The specification of a GP is through its mean function and covariance
function, denoted by

The major disadvantage of using a GP is its computational cost, which
typically scales cubically with the number of data points, usually
denoted as

This INLA-SPDE technique also enables us to quantify the level of
nonstationarity in a spatial field by employing basis function
representations for both

We now illustrate a series of statistical model fits to select the SPDE model
with the best predictive ability. To choose the maximum number of basis
functions for the parameters

RMSE measures the overall difference between the predicted and observed values.

DIC (deviance information criterion) is a measure to compare performance of statistical models by using a criterion based on a trade-off between the goodness of fit and the corresponding complexity of the model. Smaller values of the DIC indicate a better balance between complexity and a good fit.

GCV (generalized cross validation) calculates the mean residuals in a leave-one-out
test. The model that minimizes the average predicted residuals over all the
data is selected as the best model
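The leave-one-out idea behind the last criterion can be sketched in a few lines. This is a generic illustration only: a simple inverse-distance-weighted interpolator stands in for the SPDE model, the candidate "models" are just different distance exponents, and an explicit leave-one-out loop replaces the GCV computation. All names and numbers are hypothetical.

```python
import math

def idw_predict(train, x, y, power):
    """Inverse-distance-weighted prediction at (x, y) from (x, y, z) tuples."""
    num = den = 0.0
    for xi, yi, zi in train:
        d = math.hypot(x - xi, y - yi)
        if d == 0.0:
            return zi
        w = d ** -power
        num += w * zi
        den += w
    return num / den

def leave_one_out_rmse(sites, power):
    """Predict each site from all others and summarize the residuals."""
    squared = []
    for k, (x, y, z) in enumerate(sites):
        train = sites[:k] + sites[k + 1:]
        squared.append((idw_predict(train, x, y, power) - z) ** 2)
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical sites: (x, y, ozone metric in ppb).
toy_sites = [(0, 0, 38.0), (1, 0, 40.0), (0, 1, 41.0),
             (1, 1, 43.0), (2, 1, 45.0), (2, 2, 47.0)]
scores = {p: leave_one_out_rmse(toy_sites, p) for p in (1.0, 2.0, 4.0)}
best_power = min(scores, key=scores.get)
print(scores, best_power)
```

The candidate with the smallest leave-one-out error is kept, mirroring how the criterion is used to rank the candidate statistical models.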

We estimate nine statistical models with different numbers of basis functions,
presented in Table

Summary of results from fitting nine candidate statistical models (annual average over 2008–2014).

The supplement related to this article is available online at:

KLC, ORC, JJW and MLS contributed to conception and design. ORC and MGS contributed to the acquisition of data. All authors contributed to the analysis and interpretation of data. KLC and ORC drafted the article, while all authors helped with the revision. All authors approved the submitted and revised versions for publication.

The authors declare that they have no conflict of interest.

This work was funded by the NASA Health and Air Quality Applied Sciences Team (grant no. NNX16AQ80G).

Edited by: Tim Butler
Reviewed by: two anonymous referees