Cluster-based ensemble means for climate model intercomparison

Clustering, the automated grouping of similar data, can provide powerful and unique insight into large and complex data sets, in a fast and computationally efficient manner. While clustering has been used in a variety of fields (from medical image processing to economics), its application within atmospheric science has been fairly limited to date, and the potential benefits of the application of advanced clustering techniques to climate data (both model output and observations) may yet be fully realised. In this paper, we explore the specific application of clustering to the calculation of multi-model means from climate model output. A standard rudimentary approach to multi-model mean (MMM) calculation simply involves taking the arithmetic mean of all models in a given ensemble, over a particular space/time domain (a one model one vote approach). We hypothesise that clustering can provide a useful data-driven method of (a) excluding 'poor' model data from MMM calculations, on a grid-cell basis, thus (b) maximising retention of 'good' data, and avoiding the blanket exclusion of models, where appropriate. We focus our analysis on chemistry-climate model (CCM) output of tropospheric ozone, an important greenhouse gas, from the recent Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP). Cluster-based MMM fields of tropospheric column ozone were generated from the ACCMIP ensemble using the Data Density based Clustering (DDC) algorithm. The cluster-based MMM was compared to the simple arithmetic MMM (one model one vote approach) and each MMM was evaluated against an observed satellite-based tropospheric ozone climatology, as used in the original ACCMIP study. As a proof of concept, we show that the proposed clustering technique can offer improvement in terms of reducing the absolute bias between the MMMs and observations. For example, the global mean absolute bias from the cluster-based MMM is reduced in all months, by up to ∼15%, compared to the simple arithmetic MMM. On a grid-cell basis, the bias is reduced at more than 60% of all locations. Some locations are found to be unaffected by the clustering process, while in others the bias increases, albeit slightly. This and other caveats of the clustering techniques are discussed. Finally, while we have focused on tropospheric ozone, the principles underlying the cluster-based MMMs are applicable to other prognostic variables from climate models. We further demonstrate that clustering can provide a viable and useful framework in which to assess and visualise model spread, offering insight into geographical areas of agreement between models and a qualitative measure of diversity across an ensemble.
Copyright statement. © R Hyde, R Hossaini, A. Leeson. The article is distributed under the Creative Commons Attribution 4.0 License, https://creativecommons.org/licenses/by/4.0/
Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-317. Manuscript under review for journal Geosci. Model Dev. Discussion started: 15 January 2018. © Author(s) 2018. CC BY 4.0 License.

Introduction
Clustering is a flexible and unsupervised numerical technique that involves the segregation of data into statistically similar groups (or "clusters"). These groups can either be determined entirely by the properties of the data itself or guided by user constraints. Numerous clustering algorithms have been developed, each with varying degrees of complexity. The k-means clustering algorithm, for example, is a relatively simple and popular technique used in several atmospheric science problems (e.g., Mace et al., 2011; Qin et al., 2012; Austin et al., 2013; Arroyo et al., 2017). Specifically related to climate science, clustering has also been used for automated classification of various remote sensing data (e.g., Viovy, 2000), the interpretation of ocean-climate indices and climate patterns (Zscheischler et al., 2012; Yuan and Wood, 2012; Bador et al., 2015), in describing spatiotemporal patterns of rainfall (Muñoz Díaz and Rodrigo, 2004), and to classify surface ozone measurements from a large network of sites (Lyapina et al., 2016), among several other applications. An area where the applicability of clustering has yet to be fully explored is in the analysis of model ensembles, i.e. collections of comparable output from either multiple models, or multiple realisations of the same model with perturbed physics or variations in forcing data. One example of a model ensemble is that generated during multi-model inter-comparison initiatives, involving chemical transport models (CTMs), climate models, or chemistry-climate models (CCMs). Such initiatives are now common and form an integral part of scientific assessment of atmospheric composition, particularly in international policy-facing research concerning climate change. For example, recent model inter-comparison studies have considered stratospheric ozone layer recovery (Eyring et al., 2010), the climate impacts of long-term tropospheric ozone trends (Young et al., 2013; Stevenson et al., 2013), and paleoclimatology (Braconnot et al., 2012), among others.
Multi-model ensembles are used to identify the most likely value for a given variable at a particular place/time, and a range of possible values for that variable, under the assumption that all model predictions are equally valid. In virtually all intercomparisons related to atmospheric composition, the multi-model mean (MMM) of a given prognostic variable is computed for a given space/time domain, and is commonly reported along with the model spread, or MMM standard deviation (σ).
In most instances, the MMM is computed from a simple arithmetic mean of all models (i.e. a one model one vote approach), such as during the recent Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP) studies of tropospheric ozone and the hydroxyl radical, OH (Young et al., 2013; Voulgarakis et al., 2013). For chemical species such as these, which exhibit large space/time inhomogeneity in their tropospheric abundance, a single model will rarely be the best performing universally (i.e. at all locations/times). In this regard, an MMM is a useful quantity and is often considered a best estimate that includes robust features (that are still apparent after averaging) from the ensemble of models.
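The one model one vote MMM and the accompanying model spread can be sketched in a few lines. This is a minimal illustration only: the function name `simple_mmm`, the toy values, and the use of the population standard deviation for σ are assumptions made here, not details taken from ACCMIP.

```python
from statistics import mean, pstdev

def simple_mmm(ensemble):
    """Return (mmm, sigma) per grid cell for a one model one vote ensemble.

    ensemble: list of per-model lists, one value per grid cell (hypothetical layout).
    sigma is the population standard deviation across models (an assumption).
    """
    cells = list(zip(*ensemble))          # regroup values by grid cell
    mmm = [mean(c) for c in cells]        # arithmetic mean over all models
    sigma = [pstdev(c) for c in cells]    # model spread at each cell
    return mmm, sigma

# Three hypothetical models, three grid cells (column ozone, DU)
ensemble = [
    [30.0, 28.0, 35.0],
    [32.0, 29.0, 33.0],
    [34.0, 27.0, 52.0],   # third cell holds an outlying value
]
mmm, sigma = simple_mmm(ensemble)
```

In the third cell the single high value pulls the mean well above the other two models and inflates σ, which is the situation the more involved approaches discussed next try to address.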
In these circumstances, however, it is also of interest to consider how estimates differ between models (model spread), which is often characterised by the standard deviation of values from all models, for example in the studies referenced above. Model spread may be used to identify areas where the best estimate values may be more, or less, uncertain. For example, if all models agree at a given place/time then we can have confidence in the all-model MMM at that location. If the models do not agree, then more involved MMM approaches may be taken. For example, individual model contributions might be weighted (e.g., DelSole et al., 2013; Haughton et al., 2015; Wanders and Wood, 2016), for instance based on their performance against a set of observations, thus potentially diluting spurious features from individual models. However, such approaches have been rarely implemented in recent CCM inter-comparisons and can only really be used for assessing past states, for which observations are available. Furthermore, it is not uncommon for individual models to be excluded entirely from an MMM if deemed particularly poor on the basis of an evaluation against a set of observations (e.g., Hossaini et al., 2016), or if deemed a clear/substantial outlier with respect to the majority of other models (e.g., Eyring et al., 2010).
In this study, we hypothesise that clustering techniques can provide (a) a flexible, data-driven method of testing model-observation agreement and (b) a mechanism with which to identify model development priorities. In terms of the former, clustering provides a data-driven method of grouping the model output at each place and time by how well each modelled value agrees with the ensemble as a whole. This potentially enables refinement of the ensemble by objectively identifying outlier data at a given place and time on a case-by-case basis, thus potentially removing the need to perform blanket model exclusions. In terms of the latter, clustering provides potential insight into model development needs through exploring the membership of the clusters, for example why a specific model may always be excluded from the most populous cluster at a particular location. We focus our analysis on tropospheric column ozone data from 14 atmospheric models (mostly CCMs) that took part in the ACCMIP inter-comparison (Young et al., 2013). Our specific objectives are to (i) use clustering to subsample the tropospheric column ozone estimates produced by the ensemble, (ii) generate a cluster-based MMM using this subsample and evaluate its performance against more rudimentary MMM approaches by comparison to observations, and (iii) explore the use of clustering as a tool to identify and visualise diversity across a model ensemble, and assess the potential of this method to inform model development. We demonstrate that, as a consequence of ensemble refinement through clustering, the overall bias between modelled (i.e. MMM) and observed tropospheric column ozone is reduced, while retention of data from individual models is maximised. We also show that by using clustering to characterise model spread, we can highlight regions of time or space where our process-level understanding is presumably robust (i.e. the models are in close agreement) and where more work is needed to (a) understand why models disagree, and (b) improve our understanding of underlying physical processes driving these differences. Advantages of the clustering approach over more traditional weighting methods are discussed, as are limitations of the techniques and areas of future development.
The paper is structured as follows. Section 2 provides a brief overview of cluster-based classification. Section 3 describes the principles of the proposed clustering technique, exemplified using an idealised synthetic data set. Section 4 describes the specific application of the clustering techniques to multi-model output from the ACCMIP inter-comparison. Results from the ACCMIP clustering and discussion are presented in Sect. 5. Recommendations for future research are given in Sect. 6 and our conclusions are given in Sect. 7.

A brief overview of cluster-based classification
Clustering is a well-established technique for the unsupervised grouping (classification) of similar data. The unsupervised nature of clustering overcomes many of the traditional shortcomings of classification techniques, e.g. no a-priori information is required, and classes (clusters) are data-driven and may adapt to underlying changes in the data relationships. Many offline clustering algorithms are available, and no single algorithm can be considered the 'best' for all situations. Several in-depth reviews of clustering techniques have recently been published (Aggarwal and Reddy, 2014; Nisha and Kaur, 2015; Xu and Tian, 2015); therefore, here we outline only briefly the features of some common techniques, in the context of this work.
Perhaps the most popular method employed within atmospheric science is the k-means clustering algorithm (MacQueen, 1967). K-means generates hyper-elliptical (i.e. elliptical over > 2 dimensions), unconstrained clusters, offering the benefit of fast processing and a constrained number of clusters. However, the method requires that the number of clusters is specified beforehand, which limits its usefulness in data mining and often means that the technique results in clusters that fit the "required answer". Other algorithms that do not require prior knowledge of the data clusters, and are therefore considered to be more data-driven, include subtractive clustering. This generates the required number of clusters, though is limited by a maximum cluster radius, thereby potentially dividing natural groups of data. This technique can also be prohibitively slow where large data sets are involved, as calculations are repeated for all remaining data samples after each cluster is formed. Recently, purely data-driven techniques have been developed, including grid-based algorithms and density-based algorithms. Many of these recent developments can match, or exceed, the older techniques for speed, consistency and "accuracy", and have the added ability to be data-driven with minimal user intervention. As such, these techniques have the potential to provide powerful semi-automated insight into large data sets, such as output generated from individual atmospheric models, or a large ensemble of multiple models. In this study, we use the Data Density based Clustering (DDC) algorithm (Hyde and Angelov, 2014). The underlying principle is that data classified into a DDC-generated cluster is more similar to other data within said cluster than it is to data within other clusters. The DDC algorithm has the advantage that the scope of each cluster is well defined. For example, maximum distances can be set, in the physical world as well as in the data space, which define the spatial regions covered by clusters and the range of data values to be considered similar. DDC matches simple techniques such as k-means for speed but requires no prior information on the number of clusters. It is also robust to using larger cluster radii, as the algorithm adjusts the radii to match the data contained within the cluster. A simple application of the algorithm is described in Sect. 3 below.

The principles of cluster-based multi-model means
In this section we explain the principles behind the proposed technique for generating cluster-based MMMs, using a simple synthetic data set as an example. Application of the technique to real data is more complex and is detailed in later sections.
Chemistry-climate models attempt to simulate the atmospheric distribution of numerous chemical compounds including, for example, tropospheric ozone. Model skill/performance is typically assessed by comparison to atmospheric observations made at discrete times and locations. For a given comparison, a model may exhibit a phase offset in time or space, resulting in a large model-measurement bias and suggesting an inaccurate model, perhaps due to a process-level deficiency. However, in some cases phase offsets in space, for example, could be related to a sampling or 'mismatch' error, particularly when comparing output from coarse resolution models to point source observational data. Such errors are commonly encountered in inverse modelling studies, for example, that aim to derive top-down emissions of a given compound based on atmospheric observations (e.g., Chen and Prinn, 2006). To account for such errors, a flexible technique is required that looks beyond a specific space/time and that can identify similar data in the surrounding data space. To illustrate this, we use a simple 2D synthetic data set as shown in Figure 1.
The data shown in Figure 1 includes synthetic 'observations' (panel a) generated using a sine function. The values on the x and y axes are arbitrary and the data is intended to mimic a generic observation that is spatially non-uniform. We also consider 4 different sets of synthetic 'model' data (panel b) which, with respect to the observations, exhibit (1) a small consistent positive bias (red), (2) a small consistent negative bias (dark blue), (3) a large bias (green), and (4) a slight phase offset (cyan); clearly model 3 would be considered a poor/outlier model. Taking the 4 models to be an ensemble, a simple MMM is generated by taking the arithmetic mean of the 4 model data sets at each location (i.e. no clustering involved). We also apply the DDC algorithm to the data, as shown in panel (c), to generate a cluster-based MMM. The ellipses represent the different clusters that are formed which, as noted, can extend to nearby surrounding data space.
The DDC-based MMM is calculated by taking the mean of the data in the most populous cluster at each location (hereafter the primary cluster), i.e. the cluster that contains the most data samples. We therefore assume that this is the most likely region to contain the observed value. For example, with reference to panel (c), a cluster is formed at ∼x=0.4, ∼y=-0.8. Data within this cluster is not included in the MMM at this location, as a more populous cluster at the same location (∼0.4, ∼0.6) is present. Panel (d) of Figure 1 compares each MMM to the observed data; the simple arithmetic MMM (one model one vote approach) provides poorer agreement compared to the cluster-based MMM, largely due to 'model 3' being included in the mean calculation for the former. Note, each MMM is independent of the observations and in this regard the process is analogous to a multi-model prediction of a future variable (i.e. with no observational constraint).
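The synthetic experiment can be mimicked with a short sketch. Note that this does not implement DDC: as a stand-in, values at each location are grouped by a simple one-dimensional gap rule (a new group starts whenever two neighbouring sorted values differ by more than a fixed radius), and no data from surrounding locations is considered. The function names, the radius of 0.3, and the model offsets are all assumptions chosen for illustration.

```python
import math
from statistics import mean

def largest_group_mean(values, radius):
    """Mean of the most populous group at one location (stand-in for DDC)."""
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for v in ordered[1:]:
        if v - groups[-1][-1] <= radius:
            groups[-1].append(v)      # close enough: same group
        else:
            groups.append([v])        # gap exceeds radius: start a new group
    primary = max(groups, key=len)    # most populous group (first one on ties)
    return mean(primary)

xs = [i / 20 for i in range(21)]
obs = [math.sin(2 * math.pi * x) for x in xs]            # synthetic 'observations'
models = [
    [o + 0.1 for o in obs],                              # small positive bias
    [o - 0.1 for o in obs],                              # small negative bias
    [o + 1.5 for o in obs],                              # large bias (outlier model)
    [math.sin(2 * math.pi * (x - 0.03)) for x in xs],    # slight phase offset
]

simple = [mean(col) for col in zip(*models)]
clustered = [largest_group_mean(col, radius=0.3) for col in zip(*models)]

def mean_abs_bias(field):
    """Mean absolute difference from the synthetic observations."""
    return mean(abs(f - o) for f, o in zip(field, obs))

print(mean_abs_bias(simple), mean_abs_bias(clustered))
```

With these settings the outlier model always falls in its own group, so the grouped mean tracks the observations more closely than the all-model mean, without the observations being used in either calculation.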

Overview of ACCMIP datasets
The Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP) was a multi-model initiative conducted to investigate the atmospheric abundance of key climate forcing agents, including tropospheric ozone, and their change over time (e.g., Young et al., 2013; Stevenson et al., 2013; Lamarque et al., 2013). For our purposes, we use the ACCMIP climate model data as an example of a typical multi-model ensemble on which to perform the clustering. A benefit of using ACCMIP output is that the data has been extensively handled and analysed by various groups, allowing direct comparison of our findings with published work, and the data is publicly available. We focus our analysis on modelled tropospheric column ozone data (Dobson Units, DU) generated by 14 of the ACCMIP models (see Table A1). A detailed description of the models and their underlying processes can be found in the above ACCMIP publications. For each model, we analyse output from the historical simulation corresponding to the year 2000 (Young et al., 2013). Within ACCMIP, evaluation of models and the MMM was performed by comparison to a tropospheric ozone column climatology based on Ozone Monitoring Instrument (OMI) and Microwave Limb Sounder (MLS) satellite measurements (Ziemke et al., 2011). The monthly climatology extends from 60° N to 60° S. Following Young et al. (2013), we compare MMMs (generated either with clustering or without) to the observed climatology within this latitude range.

Cluster-based ensemble means
Here we illustrate the procedure by which cluster-based MMMs are generated using the DDC algorithm. The cluster-based methods are independent of observations but require a predicted truth. The predicted truth is a fundamental concept as it influences the decision of the algorithm on whether to include/exclude data from a given model, at a given location, into the MMM. The predicted truth can be generated in one of two ways. The first is a simple arithmetic mean of all the model data at a given location. The second is the average of the model data within 1σ of the arithmetic mean, referred to as the "sigma-mean" predicted truth. Note, clearly both of these techniques can be used without clustering, with the former retaining 100% of the data in an MMM and the latter being essentially a data reduction technique. The schematic given in Figure 2 illustrates how the predicted truth is used by the DDC clustering algorithm. Again, arbitrary synthetic data is first used to exemplify the key principles. The synthetic data represents output from 8 different models that together form an ensemble (i.e. the 8 different connected lines). The ellipses represent different clusters that are labelled A-E. Again, we note the flexibility of the clusters in looking at the surrounding data space for similar data. For DDC-generated clusters, at a given location the MMM of the ensemble is calculated as an average of all data in the cluster that contains the predicted truth (red diamonds). If no suitable cluster is nominated, then the predicted truth value for that location is used. For the example data shown in Figure 2, the data used for the DDC MMM at locations 1, 2 and 3 will be the arithmetic mean of the data from clusters C, C and C. This is because at each of these locations, the predicted truth lies in cluster C. Thus, the models included at locations 1, 2 and 3 are those denoted by the following colours: (green, cyan), (green, cyan), (green, cyan, purple, yellow). In this way, data farthest from the mean (predicted truth), i.e. not in agreement with other models, is removed locally.
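The two predicted-truth variants and the selection rule described above can be sketched as follows. This is an illustrative reading rather than the authors' code: clusters are passed in as plain lists of values at a single location, a cluster is taken to "contain" the predicted truth when the truth lies between its smallest and largest member, and σ is the population standard deviation. All three choices, and the function names, are assumptions.

```python
from statistics import mean, pstdev

def sigma_mean(values):
    """Average of the values lying within 1 sigma of the arithmetic mean."""
    mu = mean(values)
    sd = pstdev(values)                         # population sigma (assumption)
    kept = [v for v in values if abs(v - mu) <= sd]
    return mean(kept) if kept else mu

def mmm_from_truth_cluster(clusters, truth):
    """Mean of the cluster containing the predicted truth, else the truth itself.

    clusters: list of lists of model values at one location (hypothetical input;
    in practice these would come from the clustering step).
    """
    for members in clusters:
        if min(members) <= truth <= max(members):   # containment by value range
            return mean(members)
    return truth                                    # no suitable cluster nominated

values = [30.0, 31.0, 32.0, 33.0, 50.0]             # one outlying model
clusters = [[30.0, 31.0, 32.0, 33.0], [50.0]]

truth_simple = mean(values)        # simple-mean predicted truth
truth_sigma = sigma_mean(values)   # sigma-mean predicted truth
print(mmm_from_truth_cluster(clusters, truth_sigma))
print(mmm_from_truth_cluster(clusters, truth_simple))
```

In this toy case the sigma-mean truth falls inside the four-member cluster, whereas the simple-mean truth is pulled by the outlier into the gap between clusters, so the fallback branch returns the predicted truth unchanged.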

Initialisation of clustering algorithms
Initialisation of the clustering algorithm involves selecting suitable initial cluster radii for each of the data dimensions, in this case: longitude, latitude and column ozone. In this work, we operate the clustering on a spatial basis only, to account for spatial mismatches as discussed in Sect. 3. When selecting these radii, it should be noted that the clustering algorithms perform best with data on a similar scale in each axis. To this end we scale the data to approximately 0-1 in each dimension.

Ozone radius selection
Modelled ozone values are scaled to approximately 0-1 using the average minimum value and average range of the data in each month, as given by Eq. (1):
$$\hat{O}_3(m,l,t) = \frac{O_3(m,l,t) - \overline{\min\left(O_3\right)}_t}{\overline{\operatorname{range}\left(O_3\right)}_t} \qquad (1)$$
where $O_3(m,l,t)$ and $\hat{O}_3(m,l,t)$ are the modelled and scaled ozone values, respectively, at location $l$, as estimated by model $m$, at time $t$, and the overbars denote the average minimum value and the average range of the data in month $t$.
The initial ozone cluster radius is taken to be the average of twice the standard deviation on the model spread, Eq. (2):
$$r_{O_3} = \frac{1}{N}\sum_{l=1}^{N} 2\,\sigma\left(O_3(*,l,t)\right) \qquad (2)$$
where $\sigma\left(O_3(*,l,t)\right)$ is the standard deviation of the ozone values of the ensemble at time $t$ at location $l$, and $N$ is the number of grid spaces. This corresponds to an initial radius of 8.3 DU (0.1523 when scaled as in Eq. 1). Note, the cluster radii evolve in a data-driven manner, excluding outliers and extreme values from the clusters. In consequence, final cluster radii using DDC range from 0.1-8.3 DU, with 70% of the primary clusters actually used in model selection for the MMM calculation having a radius <7 DU (Figure A1). This radius is indicative of the range of O3 data at each grid location, after outliers have been removed by the clustering process.
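The scaling and the initial ozone radius described above can be illustrated with a small sketch for a single month. Two readings are assumed here and may differ from the authors' implementation: the "average minimum" and "average range" are taken as averages over models of each model's own minimum and range, and the standard deviation of the model spread is the population value. The function names and toy numbers are hypothetical.

```python
from statistics import mean, pstdev

def scale_ozone(ensemble):
    """Scale one month of column ozone to approximately 0-1.

    ensemble[m][l]: value from model m at grid location l (hypothetical layout).
    Returns (scaled ensemble, average minimum, average range).
    """
    avg_min = mean(min(model) for model in ensemble)                 # mean of per-model minima
    avg_range = mean(max(model) - min(model) for model in ensemble)  # mean of per-model ranges
    scaled = [[(v - avg_min) / avg_range for v in model] for model in ensemble]
    return scaled, avg_min, avg_range

def initial_radius(ensemble):
    """Average over grid locations of twice the across-model standard deviation."""
    cells = zip(*ensemble)                       # regroup values by location
    return mean(2 * pstdev(c) for c in cells)

ensemble = [
    [20.0, 30.0, 40.0],   # model 1 (DU)
    [22.0, 34.0, 46.0],   # model 2 (DU)
]
scaled, avg_min, avg_range = scale_ozone(ensemble)
radius_du = initial_radius(ensemble)       # radius in DU
radius_scaled = radius_du / avg_range      # same radius on the scaled axis
```

Because the scaling uses average rather than absolute extremes, some scaled values fall slightly outside 0-1, which is consistent with the "approximately 0-1" wording in the text.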

Spatial radii selection
In later sections we show that our cluster-based MMM column ozone field exhibits a lower global mean absolute bias with respect to observations, compared to the simple arithmetic MMM. This reduction in bias, due to the cluster-based subsampling, exhibits some sensitivity to the choice of initial radii in the spatial dimensions. In the latitude dimension, the reduction in bias exhibits a negative correlation with radius (r = -0.88); i.e. bias is reduced to a lesser degree with larger radii. Results are presented from here on for initial cluster radii of 1.5 grid-cells (0.0683 when normalized to 0-1) and 2.5 grid-cells (0.0352) in the latitude and longitude directions respectively, as this combination was found to give the greatest reduction in model-observation bias overall. As in Sect. 4.2.1, the cluster radii evolve in a data-driven manner and final cluster radii range from 1-1.6 grid-cells (0.0455-0.0728) in the latitude direction, and 1-2.6 grid-cells (0.0141-0.0367) in the longitude direction. Note, 92% and 99% of primary clusters used in model selection for the DDC MMM have a radius of less than or equal to 1.1 grid-cells in the latitude and longitude directions, respectively. A radius of 1.1 grid-cells means that at each location, the primary cluster used in model selection potentially contains data from that cell and from cells with which it shares a border. While data from nearby grid-cells may affect the location of a cluster, this data is not included in the calculation of the MMM. Rather, the cluster-based MMM at each location is the mean of the data in the primary cluster at that location only.

Scenarios and Metrics
Using the principles described above, the DDC algorithm was applied to the ACCMIP model ensemble of tropospheric column ozone on a monthly basis, and an MMM value was calculated as an average of model values in the primary cluster at each location. As previously noted, the clustering algorithms require a predicted truth that can be calculated on a simple mean or a sigma-mean basis, thus 2 different permutations are possible for our cluster-based MMM. We also calculated MMMs of the same data using a simple arithmetic mean (all models included, equally weighted) and a sigma-mean, without clustering involved in either. The sigma-mean is essentially the average of all model data within 1σ of the simple arithmetic mean, i.e. a very simple data reduction technique. In the subsequent Results sections, we compare each of these MMMs and evaluate their performance by comparison to the satellite-based tropospheric ozone climatology described in Sect. 4.1. In particular, we focus on whether or not the cluster-based MMMs reduce model-observation bias with respect to the most rudimentary approach, the simple arithmetic mean, that omits no model data. In summary, 3

Assessment of cluster-based MMM on a global basis
We first evaluate the relative performance of the cluster-based MMM with respect to the simple MMM on a global monthly mean basis. The observed column ozone data (DU) is presented in Table 1, along with equivalent MMM estimates, rows 2 and 3, obtained using a simple arithmetic mean approach, as in Table 3 of Young et al. (2013), and a sigma-mean approach. These are followed by the cluster-based MMM obtained using the DDC clustering method outlined in Sect. 3. For each MMM, the mean bias (Eq. 3) is given in Table 2. Note, the focus of this work is not to evaluate the skill of individual ACCMIP models, or the ensemble as a whole, with regard to underlying chemical processes. For that, an in-depth discussion should be sought from Young et al. (2013). Rather, our focus is to assess the fidelity of the cluster-based MMMs relative to MMMs based on simpler approaches. Based on Tables 1 and 2 it is clear that the ACCMIP ensemble provides a reasonably good simulation of tropospheric column ozone with respect to the observations, in a global mean sense. For example, the annual mean bias for each of the various MMMs is <1 DU. The cluster-based MMMs exhibit a bias (-0.7 DU) that is marginally greater in magnitude than that obtained from the simple arithmetic MMM (-0.4 DU). However, note that the global mean biases reflect an amalgamation of positive and negative biases, masking important regional/hemispheric differences as outlined below.
Table 3 is similar to Table 2 but presents the absolute biases, again on a global mean basis. The cluster-based MMM exhibits lower global mean absolute biases in all months relative to those obtained from the simple arithmetic mean approach (Figure 3). Both cluster-based MMM variants lead to improvements, reducing the MMM global bias by ∼3-16%, depending on the month. While we do not over-interpret our findings from a model process standpoint, a distinct monthly variability is apparent in the bias reduction, with the lowest overall bias reduction in the months June-August. This is also the case for the (non-clustered) sigma-mean MMM, also shown in Figure 3, which exhibits a negative bias reduction (i.e. actually performs 'worse') with respect to the simple MMM during these months, despite offering a slight bias reduction overall. From Tables 1 and 2, both the observed annual mean ozone column and the absolute (model-observation) biases are lowest in these months. Based on the latter, it is perhaps unsurprising, therefore, that the impact of sub-sampling through clustering in these months is relatively modest. Recall, the clustering techniques exclude selective data from the MMM at a given location, from a given model, if there is poor agreement with other models in the ensemble. Thus, if all models agree well, regardless of whether their values are accurate or not, few (or no) model data may be excluded. In this case, the cluster-based MMM will not vary substantially from the simple arithmetic MMM and relatively little (or no) bias reduction will be achieved through cluster-based sub-sampling.
A similar situation also arises if the models have a wide spread of values at a given location; data excluded from the primary cluster, and thus not included in the cluster-based MMM, may be equally divided above and below the simple MMM. In such a case, removing these data will have little effect and the cluster-based MMM will vary little from the simple MMM.
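The bias metrics used in this comparison can be sketched as below. The exact form of the paper's bias equation is not reproduced here; this sketch assumes a plain unweighted mean over grid cells (a real global mean would typically be area-weighted), and the function names and toy fields are hypothetical.

```python
from statistics import mean

def mean_bias(mmm, obs):
    """Mean of (model - observation); positive and negative biases can cancel."""
    return mean(m - o for m, o in zip(mmm, obs))

def mean_abs_bias(mmm, obs):
    """Mean of |model - observation|; no cancellation between signs."""
    return mean(abs(m - o) for m, o in zip(mmm, obs))

def bias_reduction_percent(simple, clustered, obs):
    """Percentage reduction in mean absolute bias relative to the simple MMM."""
    base = mean_abs_bias(simple, obs)
    return 100.0 * (base - mean_abs_bias(clustered, obs)) / base

obs = [30.0, 30.0, 30.0, 30.0]          # observed column ozone (DU)
simple = [34.0, 32.0, 27.0, 27.0]       # high bias in two cells, low in two
clustered = [33.0, 31.0, 28.0, 28.0]    # each cell 1 DU closer to the observations

print(mean_bias(simple, obs), mean_abs_bias(simple, obs))
print(bias_reduction_percent(simple, clustered, obs))
```

The toy fields are built so that the signed biases cancel exactly in both cases, while the absolute bias still falls by a third; this is why the absolute bias is the more informative measure when positive and negative regional biases offset each other.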

Assessment of cluster-based MMM: spatial variability
We extend the above discussion to evaluate spatial variability in the biases between the various MMMs and the observations. Spatial variability of the monthly mean bias (model - observations, DU) for the simple MMM case is shown in Figure 3. A similar figure but for the cluster-based MMM is shown in Figure 4. As was shown in Young et al. (2013), the ACCMIP ensemble tends to exhibit a high bias with respect to the observations in the Northern Hemisphere (NH), and a low bias in the Southern Hemisphere (SH, Figure 3). The positive and negative biases largely cancel, yielding an overall small negative bias when expressed as a global mean (see Table 2). Based on Figures 3 and 4, differences between the simple MMM and the cluster-based MMM are difficult to fully discern by eye. The differences are more apparent when viewed as absolute biases, as given in Figures 5 and 6. However, most striking is Figure 7, which compares the reduction in model-observation absolute bias for the cluster-based MMM, relative to the simple arithmetic MMM. Geographically, cluster-based ensemble sub-sampling reduces the model-observation bias at all latitudes, though particularly in the NH, including over central Asia, Europe and the USA where ozone precursor emissions are generally elevated due to anthropogenic processes. Note, the ACCMIP ensemble overestimates the ozone column climatology in the NH (e.g. see Figures 3 and 5, and see also Young et al. 2013). As such, the NH bias reduction seen in the cluster-based MMM effectively reflects some removal of data at the upper end of the model range (i.e. those models with relatively high ozone). Typical bias reduction is of the order of 1-5 DU, though larger reductions of >5 DU are found in both hemispheres in some grid-boxes.
Also apparent from Figure 7 are regions, particularly in the SH, where the bias reduction from clustering is negative; that is, the cluster-based MMM agrees less well with the observations than the simple arithmetic MMM. To understand this, one must consider that the clustering approach relies on the density of model data points within the ensemble data space. If data from a given model are in poorer agreement with the other models within the ensemble, but closer to the observed value, data from said model will not be included in the cluster-based MMM. While this is a limitation of the approach, it is also this feature of the clustering process that allows for the model spread of an ensemble to be readily investigated, and this is discussed in the following sections. For example, the clustering algorithm provides information regarding which models are included where and when in the MMM values (see below). In general, however, we note that the majority of the grid cells see a positive bias reduction through cluster-based sub-sampling. For example, Figure 8 shows a binary plot of areas where the bias reduction is positive (i.e. improved, red), negative (worse, blue) and where there is no change (white). On an annual mean basis, ∼62% of grid-cells exhibit a positive bias reduction and a further ∼9% exhibit no change in the bias. Additionally, 29% of grid-cells exhibit a negative bias reduction (i.e. the agreement becomes 'worse'). Importantly, the magnitude of the positive bias reductions greatly exceeds those of the negative changes, as can be seen from the histogram given in Figure A2. This suggests that the outliers removed from the ensemble tend to be those in relatively strong disagreement with the observations.
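The bias reduction diagnostic and the binary classification of grid-cells discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: grid-cells are held in flat lists, no area weighting is applied, and the function names, tolerance and ozone values are hypothetical rather than taken from the analysis itself.

```python
def bias_reduction(simple, cluster, obs):
    """Per-cell |simple MMM - obs| minus |cluster MMM - obs| (DU).
    Positive values mean the cluster-based MMM is closer to the observations."""
    return [abs(s - o) - abs(c - o) for s, c, o in zip(simple, cluster, obs)]


def change_fractions(reduction, tol=1e-9):
    """Fraction of cells where the bias improved, worsened or did not change."""
    n = len(reduction)
    improved = sum(r > tol for r in reduction) / n
    worse = sum(r < -tol for r in reduction) / n
    return {"improved": improved, "worse": worse,
            "unchanged": 1.0 - improved - worse}


# Hypothetical tropospheric column ozone (DU) at four grid-cells.
obs = [30.0, 28.0, 35.0, 25.0]
simple = [34.0, 30.0, 35.5, 23.0]
cluster = [32.0, 30.0, 36.5, 24.0]

reduction = bias_reduction(simple, cluster, obs)
fractions = change_fractions(reduction)
```

In this toy case two of the four cells improve, one is unchanged and one worsens, mirroring the three categories shown in the binary plot.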

Insights from cluster population into model spread
Figure 9 shows a histogram of the ratio between the number of members in the second most populous cluster (cluster 2 hereafter) and the number of members in the most populous cluster (primary cluster, cluster 1 hereafter) at all points in space/time. A small ratio indicates that there is a significant difference, i.e. that cluster 1 has many more members than cluster 2. This suggests that the model spread is sufficiently small for most models to be included in cluster 1, and thus the models that are excluded from this cluster can be considered outliers. Conversely, a large ratio suggests that model spread is larger at these locations/times. As such, both cluster 1 and cluster 2 can probably be considered comparably representative of the ensemble. As can be seen from Figure 9, in the majority of cases we consider, cluster 1 has significantly more members than cluster 2. This confirms that, in the majority of cases, sub-sampling the ensemble based on the membership of cluster 1 can be considered to be robust. It is important to note, however, that there is a tail of data points with ratio values >0.5 for which sub-sampling based on cluster 1 is less reasonable.
We assess the degree to which the ratio between the number of members in cluster 2 and cluster 1 varies in space and time (Figure 10). Higher ratio values tend to occur in the mid-latitudes (suggesting greater model spread), with tropical locations exhibiting lower ratios in general. There also appears to be some seasonality to the signal; higher ratios (thus greater model spread) are more likely to occur during the summer months. It is interesting to note that regions where the ratio is >0.5 seem, by eye, to coincide with regions where the model-observation bias is increased when the ensemble is sub-sampled to the membership of cluster 1. This suggests that by excluding data here we are in fact removing data points which are in closer agreement with the observations. However, in general we calculate no statistically significant correlation between the ratio values and the change (if any) in bias.
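The population ratio used in this section can be computed from cluster labels as in the sketch below. The label convention (one label per model at a given point, with None marking data not assigned to any cluster) and the function name are assumptions for illustration; the labels themselves would come from the clustering step, which is not reproduced here.

```python
from collections import Counter


def cluster_ratio(labels):
    """Ratio of the second most populous to the most populous cluster
    at one point in space/time. None labels (unclustered outliers) are ignored.
    Returns None if no model belongs to any cluster."""
    counts = sorted(Counter(l for l in labels if l is not None).values(),
                    reverse=True)
    if not counts:
        return None
    second = counts[1] if len(counts) > 1 else 0
    return second / counts[0]


# Hypothetical labels for a 14-member ensemble at two locations.
tight = [0] * 11 + [1] * 2 + [None]  # most models fall in cluster 1
wide = [0] * 7 + [1] * 6 + [None]    # two clusters of similar size

tight_ratio = cluster_ratio(tight)
wide_ratio = cluster_ratio(wide)
```

The first location gives a small ratio (2/11), so sub-sampling to cluster 1 is well supported, while the second gives 6/7 and falls in the tail with ratio >0.5, where the choice of cluster 1 is less clear-cut.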

Insights from cluster membership into model agreement and spread
We investigate the degree to which individual models are typically included in/excluded from the primary cluster by counting the number of months where that model is included, at each location, as shown in Figure 11. This offers a simple mechanism to visualise model spread more generally; outlier models are more often excluded, while models which fall in the pack are more often included. This information can be used together with Figure 6 as a means to identify which models are potentially driving model-observation biases in terms of MMM values, to highlight areas warranting further investigation, and so to identify potential priorities for model development. We outline some examples here but do not intend this to be exhaustive, more indicative of how this reasoning/approach potentially provides a useful framework to guide further investigation.
Model G, for example, differs significantly from the ensemble pack in the mid-latitude NH, over both land and ocean, as evidenced by the fact that it is virtually always excluded in this region. Similarly, model N is consistently different over South America (potentially pointing towards a spurious model feature concerning e.g. regional precursor emissions here). Model K is often not included in the primary cluster at SH locations, suggesting that it differs substantially from the other models in this region. However, it should be stressed that this does not necessarily suggest that the model is in disagreement with observations in the SH as a whole, merely that model K differs from the others. In fact, as was noted earlier, the cluster-based MMM agrees less well with observations in the SH, compared to the simple MMM, meaning that model K could be closer to reality (observations) in this region, relative to the other models. We note that all models are included at some locations, i.e. there is no blanket exclusion of certain models from the primary cluster. In fact, some models, e.g. models C, I and J, are almost always included in the primary cluster at each location. This suggests that these models produce modelled ozone fields that are somewhat typical and in broad agreement with the ensemble mean.
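The per-model inclusion count described in this section amounts to summing a membership flag over months at each location. A minimal sketch follows, assuming the primary-cluster membership is available as nested boolean lists indexed by model, month and grid-cell; this data layout, the function name and the example values are hypothetical.

```python
def inclusion_counts(in_primary):
    """Count, for each model and grid-cell, the number of months in which
    the model belongs to the primary cluster.

    in_primary maps model name -> list of months -> list of bools per cell."""
    counts = {}
    for model, months in in_primary.items():
        n_cells = len(months[0])
        counts[model] = [sum(month[cell] for month in months)
                         for cell in range(n_cells)]
    return counts


# Hypothetical membership flags for two models, three months, two grid-cells.
in_primary = {
    "G": [[False, True], [False, True], [False, False]],
    "C": [[True, True], [True, True], [True, True]],
}

counts = inclusion_counts(in_primary)
```

A model that is never counted at a location (model "G" in the first cell here) is a persistent outlier there, whereas a model counted in every month sits consistently within the pack.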

Future Work
While the principles presented here are robust and proven to be beneficial, some areas of methodological development/refinement have been identified. For example, we currently assign all model data from the ensemble a cluster membership and then use this information to include/exclude model data in an MMM. We have yet to consider the impact of weighting data within a cluster by (a) distance from the cluster centre and (b) distance from the location of the simple MMM (as opposed to a simple include/exclude rule). Similarly, in future work we will look at the possibility of using clustering to generate a weighted all-model MMM, where ensemble members are weighted according to their cluster membership, i.e. members of the most populous cluster contributing more to the MMM than the less populous clusters and clear outliers. We also intend to explore the application of clustering in time, in addition to the mainly spatial methods presented here. Further, at present clusters are allowed to form in three dimensions: latitude, longitude and the predicted column ozone. In this way we allow for a degree of uncertainty in the model output. Future work will build on this by developing methods to incorporate estimates of standard deviation and range associated with the modelled mean values into our techniques, thus enabling a more sophisticated treatment of uncertainty. Finally, forthcoming model inter-comparison initiatives, e.g. CMIP6, will provide an excellent opportunity to apply our methods to consider parameters other than ozone that are of atmospheric interest (e.g. other short-lived climate forcing agents).
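One possible form of the weighted all-model MMM mentioned above is sketched here. The weighting scheme is purely an assumption for illustration, not a method developed in this work: each model value is weighted by the population of the cluster it belongs to, and unclustered outliers (label None) receive a weight of 1.

```python
from collections import Counter


def population_weighted_mmm(values, labels):
    """All-model mean in which each value is weighted by the population
    of its cluster; unclustered outliers (label None) get weight 1.
    This weighting is a hypothetical choice, one of many possible."""
    populations = Counter(l for l in labels if l is not None)
    weights = [populations[l] if l is not None else 1 for l in labels]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical column ozone values (DU): three clustered models, one outlier.
values = [30.0, 31.0, 29.0, 45.0]
labels = [0, 0, 0, None]

weighted = population_weighted_mmm(values, labels)
unweighted = sum(values) / len(values)
```

Here the outlier still contributes, but the weighted mean (31.5 DU) lies closer to the populous cluster than the simple mean (33.75 DU), which sits between a strict include/exclude rule and a one model one vote approach.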

Concluding remarks
In this paper, we have investigated the applicability of an advanced data clustering method as an analytical/diagnostic tool with which to examine multi-model climate output. Relative to more rudimentary approaches, clustering offers a flexible method to evaluate inter-model differences. The technique operates by grouping data at a given location based on the density of data points. The flexibility arises as the clustering method examines surrounding data space (e.g. spatially) to account for small spatial/mismatch errors (e.g. arising due to differing coarse model grids), thus offering an advantage over more traditional inter-comparison methods. The clustering technique was applied to simulated fields of tropospheric column ozone from the 14 CCMs that took part in the ACCMIP model inter-comparison. We demonstrate that a cluster-based MMM tropospheric column ozone field, calculated using those data which are members of the most populous cluster at each location, exhibits a lower absolute bias with respect to observations, compared to a simple arithmetic MMM approach. On a global mean basis this reduction is observed in all months and, in some months, is as high as ∼16%. Additionally, we show that clustering offers a useful framework in which to readily identify and visualise model spread and outliers. We suggest that such techniques could prove valuable in the identification of model development areas, provide insight surrounding regional strengths/deficiencies of specific models (or an ensemble as a whole), and help characterise uncertainty. Finally, while we have focused on tropospheric ozone, we note that there is broad scope to develop the application of these techniques within the atmospheric sciences to examine other compounds of climate-relevance.

Observations are a satellite-based climatology (Ziemke et al., 2011). Model data is from the historical (year 2000) ACCMIP simulation. The simple MMM is the arithmetic mean of all models, while the sigma mean MMM excludes data outside of 1 standard deviation from the simple MMM, and the DDC MMM was generated through cluster-based subsampling.

Figure 1: Principles of the cluster-based multi model mean (MMM) method illustrated using a synthetic data set. (a) A synthetic spatially-varying observation (X). (b) Predictions of X from 4 idealised models (see main text). (c) Cluster analysis of the model data sets using the DDC clustering algorithm. Ellipses represent the different clusters that are formed, and the black crosses are outliers not included in the clusters. (d) Comparison of the MMM of X derived from either a simple arithmetic mean of all model data (red) or one based on clusters (green). Observation data from panel (a) is again shown in black.

Figure 2. Synthetic data used to illustrate the different ensemble methods. Model data is represented by the coloured lines with markers. The red diamonds are the predicted truth and the asterisks are cluster centres.

Figure 4. As Figure 3 but for the cluster-based MMM.

Figure 5. Monthly absolute bias (DU) between the simple arithmetic multi-model mean (MMM) tropospheric ozone column and the observed climatology.

Figure 6. As Figure 5 but for the cluster-based MMM.

Figure 7. Monthly bias reduction (DU), defined as the difference in the absolute bias between the cluster-based MMM ozone column and observations, and the simple arithmetic MMM and observations. A positive bias reduction (red) indicates areas where the cluster-based MMM agrees better with the observations than the simple arithmetic MMM. In the title of each panel, the global mean absolute bias reduction and the absolute bias reduction summed over all grid-cells are shown.

Figure 8. As Figure 7 but showing a binary map of grid-cells in which the model-observation bias has reduced (red), increased (blue) or not changed (white), as a result of the cluster-based ensemble sub-sampling.

Figure 9: Histogram of ratio of number of members in second most populous cluster (cluster 2) to most populous cluster (cluster 1).

Table 1. Observed and multi-model mean (MMM) global tropospheric ozone column (DU) between 60°N and 60°S latitude.

Table 2. Global monthly mean bias (DU) in tropospheric ozone column, see Eq. (1), between the various MMMs and observations presented in Table 1.

Table A1. Summary and citations for the ACCMIP models/data sets used in this work.