We present a weighting strategy for use with the CMIP5 multi-model archive in the fourth National Climate Assessment that considers both the climatological skill of models over North America and the inter-dependency of models arising from common parameterizations or tuning practices. The method exploits information on the climatological mean state of a number of projection-relevant variables, as well as metrics representing long-term statistics of weather extremes. The weights, once computed, can be used to calculate weighted means and significance information from an ensemble containing multiple initial-condition members from potentially co-dependent models of varying skill. Two parameters in the algorithm determine the degree to which model climatological skill and model uniqueness are rewarded; these parameters are explored and final values are defended for the assessment. The influence of model weighting on projected temperature and precipitation changes is found to be moderate, partly due to a compensating effect between model skill and uniqueness. However, more aggressive skill weighting and weighting by targeted metrics are found to have a more significant effect on inferred ensemble confidence in future patterns of change for a given projection.

The CMIP5 archive

These underlying assumptions have been challenged by a number
of studies over recent years. Various studies

In addition, the models that are present in the archive are not equally
skillful in representing the present-day or past climate

Some studies have suggested methodologies that might be able to address some
of these complexities;

While this “replicate Earth” approach produces a product that significantly reduces
the mean bias of the combined model product (a 30 % reduction in root mean square error
(RMSE)
compared to a simple multi-model mean;

In this study, we present a weighting scheme for use in the Climate Science
Special Report (CSSR), which informs the fourth National Climate Assessment for
the United States (NCA4). The requirements for this application are unusual,
in that a method from the literature cannot simply be taken “out
of the box” from an existing study. Traceability and simplicity are
paramount for this application, where the derived weights are defined in this
paper, but then form the basis of a number of varied analyses performed by
the author team for the CSSR. Hence, the use of statistical meta-models as in

Our methodology is based on the concepts outlined by

Ideally, the method should have two fundamental characteristics. First, if a duplicate of one ensemble member is added to the archive, the resulting mean and significance estimates for future change computed from the ensemble should change as little as possible. Second, if a relatively poor model (for the metrics considered) is added to the archive, the resulting mean and significance estimates should also change as little as possible.

Observational datasets used as observational references in this study.

Our analysis differs in a number of ways from that originally proposed by

The analysis region contains the conterminous United States (CONUS) and most of Canada, constrained by available high-resolution observations of daily surface air temperature and precipitation.

Inter-model distances are computed as simple RMSE here, in contrast to the multi-variate PCA used by

The weights for skill and independence are the final product
in this analysis, whereas they only inform the subset choice in the study by

We utilize data for a number of mean-state fields, as well as fields that
represent extreme behavior; these are listed in Table

All
observations and model data are first linearly interpolated to a common 1

Distances are evaluated as the area-weighted RMSE over
the domain. Each matrix corresponding to each variable is then normalized by
the mean pairwise inter-model distance, such that for each field in
Table

These normalized matrices are then linearly combined, with each line in
Table
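The distance computation and normalization described above can be sketched as follows (a minimal Python/NumPy illustration rather than the MATLAB of the released analysis code; the array shapes and the equal-weight combination across variables are assumptions for illustration):

```python
import numpy as np

def area_weighted_rmse(a, b, lat):
    """Area-weighted RMSE between two gridded (lat x lon) fields,
    using cos(latitude) as the area weight."""
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(a)
    return np.sqrt(np.sum(w * (a - b) ** 2) / np.sum(w))

def normalized_distance_matrix(fields, lat):
    """Pairwise inter-model distances for one variable, normalized so
    that the mean pairwise inter-model distance equals 1."""
    n = len(fields)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d[i, j] = area_weighted_rmse(fields[i], fields[j], lat)
    off_diagonal = d[~np.eye(n, dtype=bool)]
    return d / off_diagonal.mean()

def combined_distance_matrix(per_variable_fields, lat):
    """Linear (here: equal-weight) combination of the normalized
    per-variable distance matrices."""
    return np.mean([normalized_distance_matrix(f, lat)
                    for f in per_variable_fields], axis=0)
```

Because each per-variable matrix is scaled by its own mean pairwise distance before combination, no single variable dominates the combined matrix simply by having larger native units.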

Submodel components for the 38 CMIP5 models considered in this study.


A graphical representation of the inter-model distance matrix for CMIP5 and a set of observed values. Each row and column represents a single climate model (or observation). All scores are aggregated over seasons (individual seasons are not shown). Each box represents a pairwise distance, where warm colors indicate a greater distance. Distances are measured as a fraction of the mean inter-model distance in the CMIP5 ensemble; smaller distances indicate closer agreement between datasets.

The RMSE between observations and each model can be used to produce an
overall ranking for model simulations of the CONUS/Canada climate (which is
illustrated by the overall model-observation distance in Fig.

A graphical representation of the model-observation distance matrix
for a number of variables, illustrating how different biases combine to
produce the overall model-observation distance in Fig.

The independence weights can be computed from the inter-model distance matrix

In the limit, two identical models will produce a value of
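A minimal sketch of an independence weighting with this kind of limiting behavior is shown below (Python for illustration; the Gaussian similarity kernel and the `radius` parameter name are assumptions based on the description above):

```python
import numpy as np

def independence_weights(d, radius):
    """Independence (uniqueness) weight for each model from the
    inter-model distance matrix d: each model is down-weighted by its
    effective number of near neighbors,
        w_u(i) = 1 / (1 + sum_{j != i} exp(-(d_ij / radius)**2)).
    """
    similarity = np.exp(-(np.asarray(d, dtype=float) / radius) ** 2)
    np.fill_diagonal(similarity, 0.0)  # a model is not its own neighbor
    return 1.0 / (1.0 + similarity.sum(axis=1))
```

Under this form, a pair of duplicated models (zero distance apart and far from all others) each receive a weight of 1/2, so together they count as one unique model, while a model with no close neighbors retains a weight near unity.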

Figure

As points of reference, we consider some models from the archive known to
have no obvious duplicates (HadCM3 and INMCM), which should not be
significantly down-weighted by the method. We also consider some models where
there are numerous known closely related variants submitted from MIROC, MPI
and GISS. It is desirable to choose a value of

Model independence weights (

Hence, by inspection of Fig.

The methodology described above assumes each model has submitted only one
simulation to the archive, but the method is robust to the inclusion of
multiple initial condition members from each model. If

The RMSE distances between each model and the observations are used to
calculate skill weights for the ensemble. The skill weights represent the
climatological skill of each model in simulating the CONUS/Canada climate,
both in terms of mean climatology and extreme statistics. The skill weighting

Subplots are functions of

An overall weight is then computed as the product of the skill weight and the
independence weight.
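A sketch of how the skill and overall weights might be combined (Python for illustration; the Gaussian form of the skill function and the parameter names are assumptions based on the description above):

```python
import numpy as np

def skill_weights(dist_to_obs, skill_radius):
    """Gaussian skill weight: models whose climatology lies far from the
    observations (relative to the skill radius) receive exponentially
    less weight."""
    return np.exp(-(np.asarray(dist_to_obs) / skill_radius) ** 2)

def overall_weights(dist_to_obs, independence, skill_radius):
    """Overall weight: product of skill and independence weights,
    normalized to sum to one across the ensemble."""
    w = skill_weights(dist_to_obs, skill_radius) * np.asarray(independence)
    return w / w.sum()
```

A smaller skill radius sharpens the contrast between well- and poorly-performing models; a very large radius recovers weights governed by independence alone.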

A more skillful representation of the present-day state does not necessarily translate to a more skillful projection in the future. In order to assess whether our metrics improve the skill of future projections at all, we consider a perfect model test where a single model is withheld from the ensemble and then treated as truth.

However, such a test can be overconfident because when some models are
treated as truth, there remain close relatives of that model in the archive,
which would be given a high skill weight and would inflate the apparent skill
of the metric in predicting future climate evolution. To partly address this,
we conduct our perfect model study with a subset of the CMIP5 archive, which
excludes obvious near relatives of the chosen “truth” model. We achieve
this by excluding any model that lies closer to the “truth” model than the
distance between the best-performing model and the observations in the
inter-model distance matrix
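This exclusion rule can be sketched as follows (illustrative Python; the function and variable names are hypothetical):

```python
import numpy as np

def perfect_model_subset(d, dist_to_obs, truth):
    """Models retained when model index `truth` is treated as the
    observations: drop `truth` itself and any model that lies closer to
    it than the best-performing model lies to the real observations."""
    threshold = np.min(dist_to_obs)
    return [j for j in range(d.shape[0])
            if j != truth and d[truth, j] >= threshold]
```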

A graphical representation of the models that are excluded from the remaining ensemble in the perfect model test when each model in turn is treated as truth. Cells in black represent pairs of models that are closer to each other than the best-performing model in the archive is to the observations.

Once the obvious duplicates have been removed for a given “perfect” model

Finally, we test whether skill weighting the ensemble increases the chances
of the truth lying outside of the distribution of projections suggested by
the archive. For Fig.

We allow the weighted model projected changes in 2080–2100 temperature or precipitation at each grid cell to define a likelihood distribution for expected future change in the removed model. We then calculate the fraction of grid cells where the chosen perfect model's actual projected value for temperature or precipitation change lies above the 90th or below the 10th percentile of the inferred likelihood distribution. If the likelihood distribution is representative of expected change for the removed “perfect” model, one would expect a 20 % chance that the perfect model lies outside this range. If this value increases, however, it indicates that the weighting is too strong and is producing an under-dispersive distribution.
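One way to sketch this under-dispersion check is shown below (illustrative Python; the interpolation convention for the weighted quantile is one of several possible choices):

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Quantile q of a weighted empirical distribution."""
    order = np.argsort(values)
    v = np.asarray(values)[order]
    w = np.asarray(weights)[order]
    cdf = np.cumsum(w) / np.sum(w)
    return np.interp(q, cdf, v)

def outside_fraction(truth_change, model_changes, weights, lo=0.1, hi=0.9):
    """Fraction of grid cells where the withheld model's projected change
    falls outside the [lo, hi] weighted quantile range of the remaining
    ensemble. model_changes has shape (n_models, n_cells)."""
    n_cells = model_changes.shape[1]
    count = 0
    for k in range(n_cells):
        q_lo = weighted_quantile(model_changes[:, k], weights, lo)
        q_hi = weighted_quantile(model_changes[:, k], weights, hi)
        if truth_change[k] < q_lo or truth_change[k] > q_hi:
            count += 1
    return count / n_cells
```

A well-calibrated weighting keeps this fraction close to the nominal 20 %; values substantially above that flag an under-dispersive weighted ensemble.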

Figure

Using the values of

Model skill and independence weights for the CMIP5 archive evaluated over the CONUS/Canada domain. Contours show the overall weighting, which is the product of the two individual weights.

Uniqueness, skill and combined weights for CMIP5 for the CONUS/Canada domain

Once derived, the skill and independence weights can be used to produce
weighted mean estimates of future change, as well as confidence estimates for
those projections. To illustrate this, we modify the significance methodology
from the fifth Assessment Report of the

large changes where the weighted multi-model average change is greater than double the standard deviation of the 20-year means from control runs and 90 % of the weight corresponds to changes of the same sign;

no significant change where the weighted multi-model average change is less than the standard deviation of the 20-year means from control runs;

inconclusive where the weighted multi-model average change is greater than double the standard deviation of the 20-year means from control runs and less than 90 % of the weight corresponds to changes of the same sign.

The standard deviation of the 20-year means from control simulations is
derived using the “piControl” simulations in CMIP5. We consider all
simulations with a length of 500 years or longer and discard the first
100 years. The remaining time period is broken into consecutive 20-year
periods, and the estimate of control variability for each model is taken as
the standard deviation of those 20-year means. This process is repeated for
all models with an appropriate simulation. Finally, the standard deviations
are averaged over all models to produce the final estimate for the standard
deviation of the 20-year means from the control simulations (note this differs
slightly from
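The control-variability estimate for a single model can be sketched as follows (illustrative Python, assuming a series of annual means; the spin-up and window lengths follow the values stated above):

```python
import numpy as np

def control_std_20yr(annual_series, spinup=100, window=20):
    """Standard deviation of consecutive 20-year means from a piControl
    series of annual values, after discarding the first `spinup` years."""
    x = np.asarray(annual_series, dtype=float)[spinup:]
    n = len(x) // window                       # number of complete windows
    means = x[: n * window].reshape(n, window).mean(axis=1)
    return means.std(ddof=1)
```

The final estimate averages this quantity over all models with a suitable piControl run, e.g. `np.mean([control_std_20yr(s) for s in series_by_model])`.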

In order to adapt this methodology to a weighted ensemble, we need to apply the weights both to the mean estimate and the significance estimates.

To calculate the weighted average, each model is associated with a weight
(e.g., from Table

Therefore, the significance test is very similar to the IPCC case: if the weighted average exceeds double the control standard deviation, the change is significant, and if it is less than the standard deviation, it is not.

Sign agreement is slightly modified from the IPCC case – rather than
assessing the number of models exhibiting the same sign of change, we
consider the fraction of the weight exhibiting the same sign of change,

We illustrate the application of this method to future projections of
temperature and precipitation change under RCP8.5 in Figs.

Projections of mean temperature change over CONUS/Canada in
2080–2100, relative to 1980–2000 under RCP8.5. Panels

As for Fig.

The parameter choices for

In the case of NCA4, the strategy was to produce multi-variate metrics which were specific to CONUS/Canada. However, there is an argument that there are aspects of non-local climatology which would ultimately impact the domain of interest (through their influence on global climate sensitivity, for example).

In Fig.

A series of plots showing root mean square errors evaluated over the
CONUS/Canada domain as a function of errors assessed over the global domain.
Each point corresponds to a single model in the CMIP5 archive. Plots are
shown for some individual fields

The strength of the skill-weighting corresponds to the parameter

However, here we consider the impact on temperature projections if a more
aggressive weighting strategy were used. In Fig.

As

Hence, we find that although the skill weighting as used in NCA4 has only a
subtle effect on projected temperatures compared to the unweighted case,
there is a demonstrable effect when stronger weights are utilized, but there
is an increased risk of the weighted ensemble being under-dispersive
(Fig.

A plot showing the effect of skill-weighting strength on global
temperature projections. Panel

The requirements for NCA4 were such that a single set of weights should be
used for the entire report. However, for some applications it might be
desirable to tailor a set of weights to optimally represent a particular
process or projection. Here, we consider how using weights assessed on
precipitation climatology alone could change the result of the projection.
The precipitation-weighted case is formulated identically to the multi-variate
case, but distances are computed using RMSE over the mean
precipitation field (over the CONUS/Canada domain) only; the selection of

Figure

Distribution of changes in annual mean grid-level
precipitation for the late 21st century under RCP8.5. Panel

We can illustrate this behavior by considering the spatial pattern of
precipitation change in the three cases, using unweighted
(Fig.

A precipitation-based metric, however, seems to make a noticeable difference to the confidence associated with the weighted projection. There are now clear and significant increases in precipitation in the northern part of the USA and in the northeast. There is also more clearly defined drying along the west coast, and significant drying over the northern Amazon that was not evident in the unweighted or multi-variate cases.

Hence, it seems that there is potential to constrain the spatial patterns of
fields that show significant spatial heterogeneity across the multi-model
archive by considering targeted metrics that might be more directly
informative about the processes relevant to a particular projection. One must be
cautious, as noted in Sect.

This study has discussed a potential framework for weighting models in a structurally diverse ensemble of climate model projections, accounting for both model skill and independence. The parameters of the weighting in this case were optimized for using the CMIP5 ensemble for the Climate Science Special Report (CSSR) to inform the fourth National Climate Assessment for the United States (NCA4), an application which required a weighting strategy targeted towards a particular region (CONUS/Canada), with a single set of weights that could be applied to a diverse range of projections.

The solution proposed in this study adapted the idea first discussed in the
context of model sub-selection in

It should be noted that although our likelihood-weighting function is empirical, its functional form satisfies the requirements of the weighting scheme in a simple way. Though the structure of this functional form is not fundamental, it can be shown to have some useful features. The technique is presented in this paper in a form that maximizes clarity and reproducibility, but its effect can be described in Bayesian language: the total model weight is the posterior likelihood of a given model representing truth. Each model's prior probability of representing truth is given by its independence weighting, and the likelihood function is defined for the multi-variate dataset using an assumed Gaussian likelihood profile in a space defined by the sum of the normalized RMSE differences over all variables between each model and the observations. However, the application in this paper is a simple weighting scheme only, and it is left to further study to formally implement such concepts in a Bayesian framework.

The method provides a single set of weights constructed for NCA4, using a multi-variate climatological skill metric and a limited domain size. Two parameters must be determined for the weighting algorithm: a radius of model skill and a radius of model similarity. The former was calibrated using a perfect model test in which a single model is treated as truth and its historical simulation output is treated as observations; immediate neighbors of the test model are removed from the archive, and the remaining models are used to conduct tests that assess skill in reconstructing past and future model performance, as well as the risk of producing an under-dispersive ensemble that fails to encompass the perfect future projection at a given grid point. Using these three tests, we take a conservative choice for model weighting that minimizes the risk of under-dispersion (i.e., the risk that the real world might lie outside the entire weighted distribution of projections at a given grid point).

The similarity parameter is chosen in a more qualitative fashion, by
considering cases where models are known to be relatively unique, or where
there is a known set of closely related models. The parameter is adjusted
such that the known unique models are given a weight of near unity, and the
models with

The requirements of a large assessment place constraints on the choice of parameters for this analysis. Logistical considerations imply that only one set of weights can be constructed, and the broad readership and high stakes of the assessment mean that any risk of under-dispersion of projected future climate is unacceptable for this application. These constraints dictate that only a moderate weighting of model skill is used, in which 90 % of the weight is allocated to 80 % of the models. This, unsurprisingly, creates only a modest change in mean projected results and only a small reduction in uncertainty. A stronger skill weighting is shown to have a more significant effect on projected changes, but with an increased risk of under-dispersion.

In addition, there exists a weak trade-off between model skill and model uniqueness in the CMIP5 ensemble; models which are demonstrably high performing also tend to be the ones with the most near replicates in the archive. Therefore, there is a compensating effect of the skill and uniqueness components of the weighting algorithm, which tends to mute the effect of the overall weighting when compared to the unweighted case. In other words, the unweighted CMIP5 ensemble is in fact already a skill-weighted ensemble to some degree.

However, although this tradeoff is evident in the CMIP5 archive, there is no guarantee that such a tradeoff is a justification for using an unweighted average in future versions of the CMIP archive. A single, highly replicated but climatologically poor model present in a future version of the archive could significantly bias the simple multi-model mean of a climatological projection. Therefore, it is desirable to have a known and tested weighting algorithm in place to produce robust projections in the case of highly replicated, or very poor models.

Beyond the single set of weights produced for NCA4, the basic structure
outlined in this study can be used to produce a more targeted weighting for a
particular projection (as was conducted for sea ice projections in

With this in mind, we propose that future studies should further investigate how the selection of physically relevant variables and domains can be used to optimally weight projections of future climate change, and that individual projections will need careful consideration of relevant processes in order to formulate such metrics. Confidence in such weighting approaches is highest when well-understood underlying processes explain why the chosen metric constrains the projection. Until then, we have presented a provisional and conservative framework that allows for a comprehensive assessment of model skill and uniqueness from the output of a multi-model archive when constructing combined projections from that archive. In so doing, we come to the reassuring conclusion that, for this particular application (i.e., domain and variables), the results that would be inferred from treating each member of CMIP5 as an independent realization of a possible future are not significantly altered by our weighting approach, although the localized details of confidence in the magnitude of precipitation changes may be affected. However, by establishing a framework, we take the first tentative steps away from simple model democracy in a climate projection assessment, leaving behind a strategy that is not robust to highly unphysical or highly replicated models of our future climate.

Complete MATLAB code for the analysis conducted in this
manuscript is provided. All CMIP5 data used in this analysis are downloadable
from the Earth System Grid
(

The authors declare that they have no conflict of interest.

The authors acknowledge the support of the Regional and Global Climate Modeling Program (RGCM) of the US Department of Energy's Office of Science (BER), Cooperative Agreement DE-FC02-97ER62402. Edited by: Steve Easterbrook. Reviewed by: Craig H. Bishop and one anonymous referee.