Earth System Model Evaluation Tool (ESMValTool) v2.0 – diagnostics for emergent constraints and future projections from Earth system models in CMIP

The Earth System Model Evaluation Tool (ESMValTool), a community diagnostics and performance metrics tool for evaluation and analysis of Earth system models (ESMs), is designed to facilitate a more comprehensive and rapid comparison of single or multiple models participating in the Coupled Model Intercomparison Project (CMIP). The ESM results can be compared against observations or reanalysis data as well as against other models including predecessor versions of the same model. The updated and extended version (v2.0) of the ESMValTool includes several new analysis scripts such as large-scale diagnostics for evaluation of ESMs as well as diagnostics for extreme events, regional model and impact evaluation. In this paper, the newly implemented climate metrics such as effective climate sensitivity (ECS) and transient climate response (TCR) as well as emergent constraints for various climate-relevant feedbacks and diagnostics for future projections from ESMs are described and illustrated with examples using results from the well-established model ensemble CMIP5. The emergent constraints implemented include constraints on ECS, snowalbedo effect, climate–carbon cycle feedback, hydrologic cycle intensification, future Indian summer monsoon precipitation and year of disappearance of summer Arctic sea ice. The diagnostics included in ESMValTool v2.0 to analyze future climate projections from ESMs further include analysis scripts to reproduce selected figures of chapter 12 of the Intergovernmental Panel on Climate Change’s (IPCC) Fifth Assessment Report (AR5) and various multi-model statistics.


Introduction
Climate models are important tools not only to improve our understanding of the key processes in present-day climate but also to project future climate change under different plausible scenarios. Climate models have been continuously improved and extended over the last decades from relatively simple atmosphere-only models to the complex state-of-theart Earth system models (ESMs) participating in the latest (sixth) phase of the Coupled Model Intercomparison Project (CMIP6; Eyring et al., 2016a). The increasing complexity of the models is needed to represent key feedbacks that affect climate change but is also likely to increase the spread of climate projections across the multi-model ensemble (Eyring et al., 2019). This poses a challenge for evaluation and analysis of the model results that requires efficient tools able to handle the increasing number of variables, processes and also the increasing data volume.
The Earth System Model Evaluation Tool (ESMValTool), released in a first version in 2016 , has been developed with the aim of taking model evaluation to the next level by facilitating analysis of many dif-A. Lauer et al.: Earth System Model Evaluation Tool (ESMValTool) v2.0 ferent ESM components, providing well-documented source code and scientific background of implemented diagnostics and metrics and allowing for traceability and reproducibility of results (provenance). This has been made possible by a lively and growing development community continuously improving the tool supported by multiple national and European projects. Version 2.0 (v2.0) of the ESMValTool has been developed as a large community effort to specifically target the increased data volume of CMIP6 and the related challenges posed by analysis and evaluation of output from multiple high-resolution and complex ESMs.
For this, the core functionalities have been completely rewritten in order to take advantage of state-of-the-art computational libraries and methods to allow for faster, more efficient and user-friendly data processing . Besides many technical improvements, ESMValTool v2.0 includes new large-scale diagnostics for evaluation of Earth system models  and diagnostics for extreme events, regional model and impact evaluation and analysis of ESM results (Weigel et al., 2020). As part of a series of four articles describing the new features and diagnostics of the Earth System Model Evaluation Tool v2.0, this paper focuses on the newly included diagnostics for emergent constraints and for analysis of future projections from ESMs as well as multi-model products (Sect. 3.1) and the two new climate metrics: effective climate sensitivity (ECS) and transient climate response (TCR) (Sect. 3.2).
An emergent constraint is a relationship across an ensemble of models between some aspect of the Earth system sensitivity and an observable trend or variation in the current climate, which offers the possibility to reduce uncertainties in climate projections. Furthermore, emergent constraints can help guide model development onto processes crucial to the magnitude and spread of future climate change projections and to point out future observational priorities (Eyring et al., 2019). Emergent constraints implemented in ESMValTool v2.0 (Sect. 3.3) include seven different approaches to constrain ECS as well as constraints for the hydrological cycle intensification, snow-albedo effect, year of disappearance of summer Arctic sea ice, future Indian summer monsoon precipitation and climate-carbon cycle feedback.
For the analysis of ESM projections, ESMValTool v2.0 now includes diagnostics to reproduce selected figures from chapter 12 (Long-term Climate Change: Projections, Commitments and Irreversibility) of the IPCC AR5 . These include figures showing the change in a variable between historical and future periods, e.g., maps (2-D variables), zonal means (3-D variables), time series showing the change in certain variables from historical to future periods for multiple scenarios and maps visualizing change in variables normalized by global mean temperature change (pattern scaling) and the possibility to show statistical significance of changes when compared to natural variability and the degree of agreement between the models using the stippling and hatching methods as in . Furthermore, diagnostics tailored to analyze projections of sea ice such as calculation of the year of disappearance (sea ice extent below 1 million km 2 ) from a multimodel ensemble and to constrain the future austral jet position have been added. A newly implemented "toy model" can be used to generate synthetic members of a single dataset. When providing an estimate for the standard error of observations, e.g., from differences between different observational datasets, this toy model can be used to investigate and take into account the effect of observational uncertainty in model evaluation (Sect. 3.4). A summary is given in Sect. 4. The aim of this paper is to document and illustrate how these newly added ESMValTool "recipes", i.e., configuration files defining input, preprocessing, diagnostics and run-time options of the ESMValTool, can be used for model evaluation and analysis.

Models and observations
The open-source release of ESMValTool (v2.0) that accompanies this paper is intended to work with CMIP5 and CMIP6 model output (and partly also with CMIP3 if the required output has been generated), but the tool is compatible with any arbitrary model output, provided that it is in CFcompliant (CF: Climate and Forecast; http://cfconventions. org/, last access: 18 June 2020) NetCDF format and that the variables and metadata are following the CMOR (Climate Model Output Rewriter; https://pcmdi.github. io/cmor-site/media/pdf/cmor_users_guide.pdf, last access: 18 June 2020) tables and definitions (e.g., https://github. com/PCMDI/cmip6-cmor-tables/tree/master/Tables, last access: 7 November 2019, for CMIP6). These tables read in by the ESMValTool contain the definition of all variables, together with their metadata such as units and standard and long names. Observations used in the evaluation are detailed in the various sections of the paper (see also Sect. 6) and summarized in Tables 1 and 2 but should also be seen as examples as they can be easily replaced by other observational datasets provided they follow the CMOR convention. For selected observational datasets, CMORizing scripts are provided with the ESMValTool that contain detailed downloading and processing instructions to convert the datasets into a CMOR-like format that can be processed by the ESMValTool. These reformat scripts serve as examples for writing similar scripts for other observational datasets that do not follow the CMOR standard. Such other datasets that are not available via the obs4mips (https://esgf-node.llnl.gov/projects/obs4mips/, last access: 26 February 2020) or ana4mips (https://esgf. nccs.nasa.gov/projects/ana4mips/, last access: 30 June 2019) projects and for which no CMORizing scripts are provided can be used with the ESMValTool in two ways. The first is to write a new CMORizing script using an available one as a template to generate a local copy of CMORized data that can readily be used with the ESMValTool. This typically in-  (Brient and Schneider, 2016;Lipat et al., 2017;Sherwood et al., 2014;Tian, 2015)  3 Overview of recipes included in ESMValTool v2.0 for emergent constraints and future projections In this section, all diagnostics and metrics newly added to ESMValTool v2.0 for analysis of future projections from ESMs as well as the emergent constraints implemented are tails on the technical infrastructure of the tool including accepted data formats, data reference syntax (DRS) used for directory and file name conventions, available preprocessor functions, etc., we refer again to Righi et al. (2020). Further information can be found in the ESMValTool user guide, which documents all technical aspects of the tool as well as all available diagnostics; see https://docs.esmvaltool.org/ (last access: 1 September 2020).

Calculations of multi-model products
Multi-model means are commonly used to project climate change (IPCC, 2013(IPCC, , 2007 and are thus a useful quantity to The recipe recipe_multimodel_products.yml computes the multi-model ensemble mean for a set of models selected by the user for individual variables and different temporal resolutions (annual, seasonal, monthly). For this, all data are regridded to the same horizontal grid. In the example shown in Fig. 1, all models are regridded to the grid of BCC-CSM1-1 using a linear interpolation scheme. This task is done by the ESMValTool's preprocessor and defined in the recipe depending on the application and user requirements. The userdefinable configuration options include definition of the target grid (e.g., 2.5 • × 2.5 • ) and regridding scheme (e.g., linear, nearest, area weighted). Regridding/interpolation of the input data in time is currently not supported. For further details, we refer to the ESMValTool user guide (https://docs. esmvaltool.org/, last access: 1 September 2020). After selecting the region (rectangular region defined by the lowermost and uppermost longitudes and latitudes), the mean for the selected reference period is subtracted from the time series in order to obtain the anomalies for the desired period. In addition, the recipe computes the percentage of models agreeing on the sign of this anomaly, thus providing some information on the robustness of the climate change signal.
The output of the recipe consists of a contour map showing the time average of the multi-model mean anomalies and stippling to indicate locations where the percentage of models agreeing on the sign of the multi-model mean anomaly exceeds a threshold selected by the user (Fig. 1). The example in Fig. 1 shows a warming over the continents in the range of 1-2 K which is more pronounced than the warming over the ocean which is mostly in the range of 0.5-1.5 K in this scenario. The example also shows that the models largely agree on the sign of the temperature change with the most prominent exceptions found in parts of the Southern Ocean, Greenland and the North Atlantic. Furthermore, a time series of the area-weighted mean anomalies is plotted. For the plots, the user can select the length of the running window for temporal smoothing and choose to display either the ensemble mean with a light shading to represent the spread of the ensemble or each individual model separately (Fig. 2). The example in Fig. 2 shows an increase in global average June temperatures up to about 2060 when temperatures start to level off. By 2100, the four CMIP5 example models (MPI-ESM-MR, CNRM-CM5, BCC-CSM1-1 and IPSL-CM5A-LR) show a spread in temperature increase for the RCP2.6 scenario ranging from 0.7 to about 1.8 K.

ECS and TCR
The ECS is an important metric to assess the future warming of the climate system. It is defined as the change in global mean near-surface air temperature as a result of a doubling of the atmospheric CO 2 concentration compared to preindustrial conditions after the climate system has reached a new equilibrium (Gregory et al., 2004). Climate models of the CMIP5 model ensemble simulated an ECS ranging between 2.1 and 4.7 K (Flato et al., 2013). Using all available evidence of that time, IPCC AR5 assessed a "likely" range of ECS between 1.5 and 4.5 K in 2013 (IPCC, 2013). recipe_ecs.yml uses a regression method proposed by Gregory et al. (2004) to calculate ECS. Using the total radiative forcing F caused by the doubling of atmospheric CO 2 concentration and the climate feedback parameter λ, ECS is defined as ECS = F /λ. Both of these variables can be assessed by linear regression of the equation for radiative balance N = F −λ T , where N is the net radiation flux at the top of the atmosphere (TOA) and T the global mean near-surface air temperature change. N and T are both given as global and annual mean differences between the abrupt quadrupled CO 2 simulation and the linear regression of the pre-industrial control run. Figure 3 illustrates this regression for the CMIP5 multi-model mean. Moreover, it shows that the assumption of a linear climate feedback parameter is only an approximation. Using only the first 20 years (last 130 years) instead of all 150 years of the abrupt quadrupled CO 2 simulations results in a stronger (weaker) feedback, which again leads to a lower (higher) ECS. This demonstrates the different response of the climate system at different timescales, i.e., non-linear feedback processes. This diagnostic requires the input variables near-surface air temperature (tas), TOA incoming shortwave radiation (rsdt), TOA outgoing shortwave radiation (rsut) and TOA outgoing longwave radiation (rlut) from abrupt4xCO2 (quadrupling of CO 2 compared to pre-industrial conditions) and piControl (pre-industrial control) simulations. Figure 9.42a of Flato et al. (2013) shows the globally averaged mean near-surface air temperature (GMSAT) for the historical period of 1961-1990 plotted vs. ECS of several CMIP5 models. The latter quantity can be calculated by a regression method based on Gregory et al. (2004) as outlined above. A similar figure produced with recipe_flato13ipcc.yml implemented in ESMValTool v2.0 shows that there are no distinctive correlations between the historical surface temperatures and the ECS, which suggests that the ECS is not very sensitive to errors in the current climate in contrast to other sources of uncertainty (Fig. 4).
The TCR is defined as the global and annual mean nearsurface air temperature anomaly in the 1pctCO2 simulation (1 % increase in CO 2 per year) for a 20-year period centered at the time of CO 2 doubling, i.e., using the years 61 to 80 after the start of the simulation. The temperature anomalies are calculated by subtracting a linear fit to the piControl run for all 140 years from the 1pctCO2 experiment prior to the TCR calculation (Gregory and Forster, 2008). Figure 5 shows (a) a time series of the 1pctCO2 near-surface temperature anomalies from MIROC-ESM used to obtain TCR and (b) TCR values for different CMIP5 models calculated with recipe_tcr.yml.  1961-1990 (colors). Crosses indicate that the 80 % of models agree with the sign of the multi-model mean anomaly. The models used in this example are BCC-CSM1-1, MPI-ESM-MR and MIROC5 (r1i1p1 ensembles) for the RCP2.6 scenario. All models have been regridded to the BCC-CSM1-1 grid using a linear interpolation scheme. See Sect. 3.1 for details on recipe_multimodel_products.yml.   (Gregory et al., 2004). Shown is the relationship between the differences in global and annual mean top-of-the-atmosphere net downward radiative flux N (W m −2 ) and global and annual mean near-surface air temperature anomalies T (K) for the CMIP5 multi-model mean. Anomalies are calculated as difference between the abrupt4xCO2 experiment (quadrupling of CO 2 ) and the pre-industrial control run  (1 % increase in CO 2 per year) compared to the piControl simulation. (b) Transient climate response (in K) for CMIP5 models calculated with the method by Gregory and Forster (2008). For details on recipe_tcr.yml, see Sect. 3.2.

Emergent constraints
An emergent constraint utilizes an ensemble of ESMs together with observational data to constrain a simulated future Earth system feedback. A prerequisite for an emergent constraint is a robust relationship between, for example, changes occurring on seasonal or interannual timescales and changes found in ESM simulations of anthropogenically forced climate change (Eyring et al., 2019). If such a relationship can be explained by a plausible physical mechanism, an observational constraint of multi-model projections of quantities that cannot be observed directly might be possible. Such a non-observable quantity is, for instance, ECS. The technique of emergent constraints offers the possibility to reduce un-certainties in climate projections and can help guide model development by highlighting processes that are crucial to explaining the magnitude and spread of the modeled future climate change. Emergent constraints can also help point out the need for more and/or more reliable observations.
We would like to note that a limitation of the emergent constraints as currently implemented into the ESMValTool is that model interdependency, as in the original studies, is not explicitly taken into account. As some modeling groups share model components or code, the models are not all independent. Duplicated code, as well as identical forcing and validation data in multiple models, is expected to lead to an overestimation of the sample size of a model ensemble and may result in spurious correlations (Sanderson et al., 2015). As a possible approach, future implementations of these emergent constraints could, for example, apply a model weighting based on a model's interdependence (e.g., Knutti et al., 2017) or simply reduce the ensemble size by taking into account only models that are above a given yet-to-bedefined interdependence score. Table 2 summarizes the emergent constraints that have been implemented in ESMValTool (v2.0) including the observational datasets used and are described in the following.

Emergent constraints on effective climate sensitivity
recipe_ecs_scatter.yml calculates five emergent constraints for ECS (see Table 2). These are briefly described in Sect. 3.3.1 ("Covariance of shortwave cloud reflection" to "Tropical mid-tropospheric humidity asymmetry index").
The ECS values from the models are pre-calculated with recipe_ecs.yml (see Sect. 3.2) or can be taken from literature. The diagnostic calculates ECS vs. selected constraining parameters, such as the climatological Hadley cell extent from models, and fits a linear regression line to the data. If available, the observational uncertainty of a given observational dataset can be estimated. For this, the standard error of the observations is subtracted or added from or to the means before calculating the observational value (estimated minimum or maximum, respectively). In addition to the scatter plots of ECS vs. constraining parameter calculated by the diagnostic, the diagnostic also outputs the 25 %/75 % confidence intervals of the regression (i.e., uncertainty of the fit) and the 25 %/75 % prediction intervals of the regression (i.e., measure for the quality of the linear fit). By definition, 50 % of all model data points are within the 25 %/75 % prediction interval of the regression line. Examples of the different scatter plots that can be created by recipe_ecs_scatter.yml are shown in Fig. 6. It should be noted that because a different set of CMIP5 models might be used in the figures compared to the originally published emergent constraints, the figures could show some deviations to the ones published in literature. While the emergent constraints shown in Fig. 6a, c, d, e suggest ECS values in the upper range of the values given in IPCC AR5 (IPCC, 2007, 1.5 to 4.5 K), the emergent constraint shown in Fig. 6b suggests an ECS value in the lower range of the IPCC AR5 values. In addition to these five emergent constraints, recipe_cox18nature.yml implements an emergent constraint for ECS based on global temperature variability (Sect. 3.3.1, "Global temperature variability"), recipe_ecs_constraints.yml an emergent constraint based on the difference between tropical and midlatitude cloud fraction (Sect. 3.3.1, "Difference between tropical and midlatitude cloud fraction").

Covariance of shortwave cloud reflection
This emergent constraint uses the models' correlation of tropical low-level cloud (TLC) reflection with the underlying SST to constrain ECS (Brient and Schneider, 2016). The definition and calculation of the individual terms follows Brient and Schneider (2016): TLC regions are defined as the 25 % ocean areas between 30 • S and 30 • N with the lowest 500 hPa relative humidity. TLC reflection is calculated as the ratio of top-of-the-atmosphere shortwave cloud radiative forcing and insolation, both averaged over the TLC region. This is then used to calculate the regression coefficients of deseasonalized variations of TLC shortwave reflection and sea surface temperature in % per K used as an emergent constraint. In the example shown in Fig. 6a, data from the CMIP5 historical simulations between 1980 and 2005 are used for the models, observational/reanalysis data used in Fig. 6 are ERA-Interim (Dee et al., 2011) for relative humidity, HadISST (Rayner et al., 2003) for sea surface temperatures and CERES-EBAF (Ed2.7) (Loeb et al., 2012) for top-of-the-atmosphere radiative fluxes.
Climatological Hadley cell extent Lipat et al. (2017) found that the climatological mean Hadley cell (HC) edge latitude from CMIP5 models correlates with ECS. The HC edge latitude is calculated from first two grid cells from the Equator going south where the zonal average 500 hPa mass stream function changes sign from negative to positive (downward branch of the HC). The mass stream function is calculated from climatological December-January-February (DJF) means of the meridional wind fields. The correlation of the climatological HC extent with ECS found in CMIP5 models is explained by observations that show a correlation of variability in midlatitude clouds and cloud radiative effects with poleward HC expansion (Lipat et al., 2017). For the example shown in Fig. 6b, CMIP5 data from historical simulations and ERA-Interim (Dee et al., 2011) are used as a reference dataset for the years 1980-2005.

Lower tropospheric mixing index
Following Sherwood et al. (2014), the lower tropospheric mixing index (LTMI) can be used to constrain ECS and is calculated as the sum of small-scale mixing S and the largescale component of mixing D. S is calculated from relative humidity (RH) and temperature (T ) differences between 700 and 850 hPa and averaged over a tropical region between 30 • S and 30 • N defined by the upper quartile of the annual mean 500 hPa ascent rate within ascending regions: S = ( RH 700−850 /100 % − T 700−850 /9 K)/2. The large-scale component of mixing is the ratio of shallow to deep overturning: D = H ( )H (−ω 1 ) / −ω 2 H (−ω 2 ) with ω 1 the average of the vertical velocity at 850 and 700 hPa, ω 2 the av-  (Lipat et al., 2017), (c) lower tropospheric mixing index (LTMI) (Sherwood et al., 2014), (d) southern Intertropical Convergence Zone (ITCZ) index (Tian, 2015) and (e) tropical mid-tropospheric humidity asymmetry index (Tian, 2015) for CMIP5 models ( Sherwood et al. (2014) explain the correlation between LTMI and ECS in CMIP3 and CMIP5 models by convective mixing between the lower and middle tropical troposphere dehydrating lowlevel cloud layers at an increasing rate as climate warms. They argue that this rate of increase depends on initial mixing strength, which links the mixing to clouds feedbacks and thus ECS. Figure 6c shows an example of this emergent constraint applied to CMIP5 historical simulations using ERA-Interim data (Dee et al., 2011) as reference data. All datasets in this example cover the time period of 1980-2005.

Southern ITCZ index
The southern Intertropical Convergence Zone (ITCZ) index (Bellucci et al., 2010;Hirota et al., 2011) is defined as the climatological annual mean precipitation bias averaged over the south-eastern Pacific (30 • S-0 • , 150-100 • W) given in mm d −1 . The southern ITCZ index is used to quantify the double-ITCZ bias in CMIP3 and CMIP5 models and has been found to correlate with ECS (Tian, 2015). In the example shown in Fig. 6d, the ITCZ index has been calculated from CMIP5 historical model simulations averaged over the years 1980-2005. Tropical Rainfall Measuring Mission (TRMM) (Huffman et al., 2007) satellite data (v7) averaged over the years 1998-2013 have been used as observational reference.

Tropical mid-tropospheric humidity asymmetry index
The strong link found in CMIP3 and CMIP5 models between the double-ITCZ bias and simulated moisture, precipitation, clouds and large-scale circulation allows the double-ITCZ bias and thus ECS to also be related to mid-tropospheric humidity over the tropical Pacific (Tian, 2015). As shown by Tian (2015), spatial patterns of mid-tropospheric humidity and precipitation are similar as both are related to the ITCZ. This allows defining a tropical mid-tropospheric humidity asymmetry index to quantify the double-ITCZ bias in models and consequently constrain ECS. This index is defined as relative bias in simulated annual mean 500 hPa specific humidity compared with observations ((model − observation)/observation·100 %) averaged over the Southern Hemisphere (SH) tropical Pacific (30 • S-0 • , 120 • E-80 • W) minus the bias averaged over the Northern Hemisphere (NH) tropical Pacific (20 • N-0 • , 120 • E-80 • W) (Tian, 2015). The example for the tropical mid-tropospheric humidity asymmetry index shown in Fig. 6e is calculated from CMIP5 historical runs averaged over the years 1980-2005 and AIRS (v5) satellite data (Susskind et al., 2006) averaged over the years 2003-2010 as observational reference data.
Global temperature variability Cox et al. (2018) propose an emergent constraint for the ECS using global temperature variability. The latter is defined by a metric ψ which can be calculated from the global temperature variance (in time) σ T and the 1-year-lag autocorrelation of the global temperature α 1T by . (1) Using the simple "Hasselmann model" (Hasselmann, 1976), Cox et al. (2018) showed that ψ is linearly correlated with ECS in CMIP5 data. Since calculation of ψ only depends on the temporal evolution of the global surface temperature, there are many observational datasets available. In the original publication, data from HadCRUT4 (Morice et al., 2012) are used to construct the emergent relationship. In the ESMValTool, this is reproduced by recipe_cox18nature.yml, which only needs two variables: historical near-surface air temperature (tas) and ECS (see Sect. 3.2). The emergent relationship between ECS and ψ is shown in Fig. 7 including means and confidence intervals. The constrained range of ECS based on this plot is 2.2 to 3.4 K with a 66 % confidence interval, similar to Cox et al. (2018).

Difference between tropical and midlatitude cloud fraction
Volodin (2008) proposes an emergent constraint for ECS based on the distribution of clouds in global climate models. The study finds that models with high climate sensitivity show a higher total cloud cover over the southern midlatitudes and a lower total cloud cover over the tropics than the multi-model average. Thus, the difference in tropical total cloud cover (between 28 • S and 28 • N) and the SH midlatitude total cloud cover (between 56 and 36 • S) is negatively correlated with ECS. The original publication uses the CMIP3 ensemble and the International Satellite Cloud Climatology Project (ISCCP)-D2 dataset (Rossow and Schiffer, 1991) as observational reference, but the relationship also holds when using CMIP5 models. In the ESM-ValTool, this emergent constraint for ECS can be produced with recipe_ecs_constraints.yml, which uses CMIP5 historical runs averaged between 1980 and 2000 (Fig. 8). The observed values are based on ISCCP-D2 data and are taken from Volodin (2008).

Emergent constraints on the carbon cycle
Uncertainties in projections of future temperature using ESMs are high, in a large part due to uncertainties of emissions and feedbacks. Within the carbon cycle, feedbacks are usually split into the carbon cycle -climate feedback γ , which quantifies carbon to climate change, and the carbon cycle -CO 2 concentration feedback β, which is the carbon sensitivity to atmospheric CO 2 (Friedlingstein et al., 2006). γ is a positive feedback as climate warming reduces the efficiency of CO 2 absorption by the land and ocean, leading to more of the emitted carbon staying in the atmosphere, which in turn leads to additional warming. In contrast, β is a negative feedback because of the so-called CO 2 fertilization effect, where plants take up a higher amount of CO 2 for photosynthesis with increasing atmospheric CO 2 concentrations. Efforts have been made to reduce the uncertainties of these two carbon cycle feedback parameters. Wenzel et al. (2014) employed the emergent constraint described by Cox et al. (2013) for the long-term sensitivity of tropical land carbon storage to climate warming (γ LT ) to the interannual sensitivity of atmospheric CO 2 to interan- Figure 8. ECS vs. difference in total cloud cover between the tropics (28 • S-28 • N) and southern midlatitudes (56-36 • S) for CMIP5 models (orange dots). The orange line and shaded area show the linear regression line and its 95 % uncertainty range (estimated via bootstrapping). Together with the observational estimate (vertical blue line and shaded area), this can be used as an emergent constraint for ECS (Volodin, 2008). The observational range is based on ISCCP-D2 data (Rossow and Schiffer, 1991) and taken from Volodin (2008). Similar to Fig. 3a of Volodin (2008) and produced with recipe_ecs_constraints.yml (see details in Sect. 3.3.1, "Difference between tropical and midlatitude cloud fraction"). nual tropical temperature variability (γ IAV ) in CMIP5 models. The analysis from this paper can be reproduced using recipe_wenzel14jgr.yml with the emergent relationship being able to reduce the range of projected γ LT (Fig. 9). Input variables include net primary productivity (nbp), surface temperature (tas), gas exchange flux of CO 2 into the ocean (fgco2) from the experiment 1pctCO2, nbp, fgco2, tas from the emission-driven historical simulations (esmHistorical), as well as nbp from the esmFixClim1 (carbon cycle sees CO 2 concentration increase but radiation does not) simulations. The different simulations are included in γ IAV , which is estimated from both the 1pctCO2 experiment and the esmHistorical simulation, and then compared in the paper. The default observational datasets are NCEP reanalysis (Kalnay et al., 1996) for the surface temperature and the Global Carbon Project (GCP; Le Quere et al., 2015) for the carbon fluxes. Wenzel et al. (2016a) developed an emergent constraint for β on land in the extratropics and northern midlatitudes constraining the projected land photosynthesis with changes in the seasonal cycle of atmospheric CO 2 . The figures from this paper can be reproduced with recipe_wenzel16nat.yml, with Fig. 10 showing the emergent constraint reproduced with the ESMValTool. The unconstrained CO 2 fertilization effect lies at 40 ± 20 %, which can be narrowed down to 37 ± 9 % in high latitudes and 32±9 % in the extratropics with this emergent constraint. Input variables from the models needed to Figure 9. Relationship between long-term sensitivity of tropical land carbon storage to climate warming (γ LT ) and short-term sensitivity of atmospheric CO 2 to interannual temperature variability (γ IAV ) for CMIP5 models (markers with horizontal and vertical error bars) using the historical simulation. The red line shows the linear regression through the CMIP5 models; the vertical gray area shows the range of observed γ IAV . Produced with recipe_wenzel14jgr.yml, similar to

Emergent constraints on the year of disappearance of September Arctic sea ice
This sea ice diagnostic produces scatter plots of (a) mean of and (b) trend in historical September Arctic sea ice extent (SSIE) vs. the first year of disappearance (YOD). Here, YOD is defined as the first of five consecutive years in which the Arctic SSIE drops below 1 million km 2 (Wang and Overland, 2009). Sea ice extent is defined in the diagnostic as the total area of all grid cells in which the sea ice concentration is 15 % or larger, Arctic is defined as the region north of 60 • N. The annual minimum Arctic sea ice extent typically occurs in September. For this reason, September mean sea ice quantities are commonly used in literature for analyses of the timing of an ice-free Arctic (e.g., Massonnet et al., 2012;Sigmond et al., 2018). The two scatter plots in Fig. 11a and b are similar to those in Fig. 12.31a/c of Collins et al. (2013), respectively. In addition, the diagnostic produces a scatter plot

Emergent constraints on the snow-albedo effect
The recipe recipe_snowalbedo.yml computes springtime snow-albedo feedback values in climate change vs. springtime values in the seasonal cycle in transient climate change experiments following Hall and Qu (2006). The strength of the snow-albedo effect is quantified by the variation in net incoming shortwave radiation (Q) with surface air temperature (T s ) due to changes in surface albedo α s : Here, I t is the constant incoming solar radiation at the top of the atmosphere, α p the planetary albedo. The diagnostic produces scatter plots of simulated springtime α s / T s values in climate change (ordinate) vs. simulated springtime α s / T s values in the seasonal cycle (abscissa). These values are calculated as follows: -(ordinate values) the change in April α s (future projection -historical) averaged over NH land masses poleward of 30 • N is divided by the change in April T s (future projection -historical) averaged over the same region. The change in α s (or T s ) is defined as the difference between 22nd century mean α s (T s ) and 20thcentury-mean α s . Values of α s are weighted by April incoming insolation (I t ) prior to averaging.
-The seasonal cycle α s / T s values (abscissa values), based on 20th century climatological means, are calculated by dividing the difference between April and May α s (averaged over NH continents poleward of 30 • N) by the difference between April and May T s averaged over the same area. Values of α s are weighted by April incoming insolation prior to averaging. Figure 12 shows an example calculated from CMIP5 historical  and Representative Concentration Path- al. (2013) suggests that the CMIP5 models under-and overestimate springtime snow-albedo effect almost equally.

Emergent constraints on the hydrological cycle
The recipes recipe_deangelis2015nat.yml and recipe_li2017natcc.yml, newly developed for v2.0, reproduce the analysis from DeAngelis et al. (2015) and Li et al. (2017), respectively. DeAngelis et al. (2015 constrain the hydrologic cycle intensification with observed radiative fluxes and water vapor data. The recipe recipe_deangelis2015nat.yml reproduces their Figs. 1b (Fig. 13a) to 4 (Fig. 13b) as well as their extended data Figs. 1 and 2. Here, the analysis is shown for 17 CMIP5 models and includes monthly mean total precipitable water on a 1 • × 1 • grid from RSS (Remote Sensing System) version-7 microwave radiometer data (Wentz et al., 2007) and ERA-Interim reanalysis (Dee et al., 2011), as well as radiative fluxes from the Clouds and the Earth's Radiant Energy System Energy Balance and Filled (CERES-EBAF; Kato et al., 2013;Loeb et al., 2009) dataset. Figure 13a shows that energy sources and sinks readjust in reply to an increase in greenhouse gases, leading to a decrease in the sensible heat flux and an increase in the other fluxes; Fig. 13b shows that results from parameterization schemes using pseudo-k distributions with more than 20 exponential terms representing water vapor absorption and correlated-k distributions agree better with the observations than the other schemes. Li et al. (2017) relate the future Indian summer monsoon projections to the present-day precipitation over the tropi-  Young et al., 2018) and near-surface air temperature from ERA-Interim (Dee et al., 2011) for the years 1984-2000. Models with higher surface albedos over NH continents poleward of 30 • N typically have a larger contrast between snowcovered and snow-free areas, and hence a stronger snow-albedo feedback. Similar to Fig. 9.45a of Flato et al. (2013); for details on recipe_snowalbedo.yml, see Sect. 3.3.4. The 95 % confidence interval for the slope of the regression of clear-sky rsnst normalized by the incoming shortwave flux at TOA with the water vapor path (prw) over the tropical ocean (30 • S-30 • N), regridded to a 2.5 • latitude × 2 kg m −2 prw grid for different CMIP5 models (horizontal bars) and for data from CERES-EBAF (Kato et al., 2013;Loeb et al., 2009, rsnst) and RSS version-7 microwave radiometer data (Wentz et al., 2007, prw) together with ERA-Interim (Dee et al., 2011, prw)  cal western Pacific. With this relationship, they can correct the projected rainfall for models with too strong negative cloud-radiation feedback on sea surface temperature. The corrected values (see Fig. 14) do not show an increase in rainfall over the whole Indian summer monsoon (ISM) region under greenhouse warming and are expected to be more robust than the uncorrected projection (Li et al., 2017). The recipe_li2017natcc.yml reproduces their Figs. 1 and 2 for an ensemble of 22 CMIP5 models (Fig. 14) and their Fig. 1a for each of the individual models and the multi-model mean.

Climate model projections
In addition to the emergent constraints described in the previous section, ESMValTool v2.0 also includes new diagnostics specifically designed to analyze future climate projections from ESMs. This includes diagnostics using the multiple diagnostic ensemble regression used to constrain the future position of the austral jet, a "toy model" to allow for investigating the effect of observational uncertainty on model evaluation, diagnostics for reproducing selected figures from the climate projection chapter in IPCC AR5  and for analyzing future sea ice quantities. All of these new diagnostics in ESMValTool v2.0 are briefly described in the following sections.

MDER to constrain future austral jet position
The position of the austral jet stream is poorly modeled by CMIP5 models with a latitude range of 10 • within the ensemble and a mean bias towards the Equator. The recipe_wenzel16jclim.yml reproduces the study of  who used a process-oriented multiple diagnostic ensemble regression (MDER) to constrain the future jet position in the RCP4.5 scenario. MDER uses a stepwise regression scheme to identify the most relevant present-day diagnostics from a list of diagnostics provided as an input and links those to future projections via a multivariate linear regression scheme. With the diagnostics selected by MDER, the future quantity (in this case, the austral jet position) can be constrained with suitable observationally based data (here: ERA-Interim; Dee et al., 2011), following the same basic idea as emergent constraints (see also Sect. 3.3). Using this approach, the future jet position from CMIP5 models is bias corrected about 1.5 • southwards compared to the unweighted multi-model mean (Fig. 15).   and the ISM rainfall change between historical and RCP8.5 for the years 2070-2099 for different CMIP5 models. The red line indicates the present-day value for the western Pacific precipitation from observations as used in Li et al. (2017) estimated from the Global Precipitation Climatology Project (GPCP) dataset for 1980-1999(Adler et al., 2003. (b) Uncorrected ISM rainfall change ratio (% per • C) vs. the corrected ratio from CMIP5 models and the multi-model mean with the standard deviations shown as error bars. The rain data are normalized by the global mean near-surface temperature change. (c) Projected multi-model mean rainfall change errors and (d) corrected multi-model mean rainfall change over the Indian Ocean. Similar to Fig. 2 of Li et al. (2017) and produced with recipe_li2017natcc.yml (see details in Sect. 3.3.5).

Toy model
Synthetic datasets generated from "toy models" have been used in the literature for assessing the effectiveness of multimodel combination strategies and for estimating the effect of observational uncertainties on the correlation between forecasts and observational datasets (Massonnet et al., 2016). The toy model recipe implemented into ESMValTool v2.0 is based on the approach presented in Weigel et al. (2008) for simulating single-model ensembles from a Gaussian distribution, where the number of members and the standard deviation of the error are defined by the user. Following Weigel et al. (2008), the recipe takes as input a set of observations, y 1 , y 2 , . . . , y N and for each observation y i , M synthetic mem-bers x are generated from where y ∼ N (µ, 1), β ∼ N (0, β) and 1 , . . . , M ∼ N (0, √ (1 − α 2 − β 2 )) with the notation N (µ, σ ) referring to a random number drawn from a normal distribution with mean µ and standard deviation σ . The simulated value x m is obtained by multiplying y by α, the predictability of the observation, which is set to 1 in this instance, and by adding a vector of perturbations m and the scalar perturbation β . The simulated values have the following properties: 1. The simulated values x i,1 have the same climatology as the observations. 2. The mean correlation between the simulations x i,1 , . . . , x i,M and observation y i is determined by α (see toy model properties described in Weigel et al., 2008).
3. The parameter β describes the model underdispersion, where β = 0 corresponds to the case where the synthetic ensemble is well dispersed and covers the full range of uncertainties for a given correlation α. The underdispersion increases with β being limited to the range 0 ≤ β ≤ √ (1 − α 2 ).
Parameter β is introduced to control the dispersion. For welldispersed ensembles, skill is independent of the number of simulations involved, while for overconfident model ensembles, skill grows with the ensemble size. Given that β accounts for the dispersion, this approach leads α to represent a measure of predictability (Weigel et al., 2008). This toy model is based on very simplifying assumptions: (1) normality and stationarity: the climatology and the ensemble distributions are assumed to be stationary and normally distributed; (2) well-calibrated model climatology: each ensemble member has the same climatology as the observations; (3) stationary skill: spread and correlation do not vary from sample to sample; (4) predictable signal and observational errors: requires the signal to be given by αx, and therefore it is determined by the verifying observation (Weigel et al., 2008).
The predictability α is 1 since we are only interested in generating synthetic observations. Thus, the user only needs to define the standard deviation of the error. This term can be based on the observational uncertainty when available (e.g., as provided with the European Space Agency's Climate Change Initiative (ESA CCI) SST dataset; Merchant et al., 2014a, b) or estimated by the user, e.g., by calculating the standard deviation between different observational reference datasets (Bellprat et al., 2017).
We would like to note that in addition to the observational uncertainty itself, also spatiotemporal representativeness of observations plays an important role when evaluating models. Schutgens et al. (2017) showed that such representation errors remain even after spatial and temporal averaging and may be larger than typical measurement errors. In addition, also the calculation method of a quantity to be compared with observations can play an important role. This is, for example, the case when comparing satellite retrievals with model quantities that are not derived the same way. Application of satellite simulators such as the Cloud Feedback Model Intercomparison Project (CFMIP) Observation Simulator Package (COSP; Bodas-Salcedo et al., 2011) can help to reduce such uncertainties in model evaluation. Both of these aspects are not covered by the toy model, which only provides an estimate for the observational uncertainty itself.
For further discussion of this synthetic value generator, its general application to forecasts and its limitations, see Weigel et al. (2008). The recipe recipe_toymodel.yml writes a NetCDF file containing the synthetic observations. Due to the sampling of the perturbations from a Gaussian distribution, running the recipe multiple times, with the same observation dataset and input parameters, will result in different outputs (Fig. 16).

Climate projection chapter of IPCC WGI AR5
The recipe_collins13ipcc.yml reproduces a subset of the figures from the long-term climate change projections chapter of the IPCC AR5 (chapter 12, Collins et al., 2013). This new recipe in version 2.0 allows for reproduction of selected figures from AR5 to show changes between historical and future projections over the available CMIP models. It will also allow a faster analysis of the CMIP6 climate projections that are part of the Scenario Model Intercomparison Project (Sce-narioMIP;O'Neill et al., 2016). The recipe includes figures such as time series from historical periods to projections (including spread among models; see Fig. 17), horizontal maps for individual models as well as multi-model means (including stippling to indicate large changes with high model agreement and hatching to indicate areas with a small signal or low agreement of models; see Fig. 18) and vertical zonal mean plots (also including stippling and hatching to indicate significant changes). The example shown in Fig. 18 shows where the CMIP5 models project an increase in precipitation and where they project a decrease. This example  (Dee et al., 2011). With the user providing an estimate for the standard error, e.g., from differences between different observational datasets, this diagnostic can be used to investigate the effect of observational uncertainty. For details, see Sect. 3.4.2. also shows quite large regions where the projections are still uncertain; i.e., the multi-model mean signal is smaller than 1 standard deviation of the natural variability estimated from pre-industrial control simulations (hatching).
Most diagnostics scripts are set up in a generic way, so that in principle they can be used for any variable from the CMIP archive. The scripts have been tested for the variables indicated in Table 1. To be able to determine if a change signal is larger than natural variability the natural variability is calculated from the piControl runs, other than that the recipe uses historical and RCP runs. All diagnostics in this recipe with the exception of the emergent constraints on the year of disappearance of September Arctic sea ice (Sect. 3.3.3) do not use observations.

Sea ice
The sea ice diagnostics included in the ESMValTool (recipe_seaice.yml) have been extended with three new diagnostics. The first new diagnostic (seaice_trends.ncl) calculates the trend in sea ice extent or sea ice area from each model and reference observation(s) or reanalysis data that are given in the recipe. The diagnostic produces histogram plots of the trend distributions from all models and adds the reference datasets (here: HadISST; Rayner et al., 2003) as colored vertical lines. The user can specify the region (Arctic or Antarctic) and the month of the year for which sea ice area/extent is calculated. The trends are calculated over the full period specified in the recipe and the resulting plots are similar to those of Fig. 9.24c/d in Flato et al. (2013) . The example plot (Fig. 19) shows that the majority of CMIP5 models slightly underestimate the observed trend in summer sea ice extent over the time period of 1960-2005.
The second new diagnostic (seaice_yod.ncl) calculates the year of near disappearance of Arctic sea ice. The diagnostic creates a time series plot of September Arctic sea ice extent for each model given in the recipe and adds three multi-model statistics: mean, standard deviation and YOD. It optionally reads a list of pre-calculated model weights and adds the weighted multi-model mean time series including weighted multi-model standard deviation to the plot (see, for example, Fig. 7 of Senftleben et al., 2020). The example in Fig. 20 shows that there is a large spread in simulated sea ice extent among the CMIP5 models with individual models simulating a summer sea ice extent below 1 million km 2 already around the year 2025, while other models are still well above this threshold in 2100.
The third new diagnostic (seaice_ecs.ncl) calculates emergent constraints for YOD using mean or trend in sea ice extent. The diagnostic produces scatter plots of different historical and future sea ice metrics, similar to those in

Summary
In this article, diagnostics and metrics, newly implemented into the Earth System Model Evaluation Tool v2.0 to analyze projections from ESMs and emergent constraints for climaterelevant parameters including effective climate sensitivity, snow-albedo effect, climate-carbon cycle feedback, hydrologic cycle intensification, future Indian summer monsoon precipitation, land photosynthesis and year of disappearance of summer Arctic sea ice, are described and illustrated with examples using CMIP5 data.
The implemented multi-model products allow for an easy and quick overview of the multi-model ensemble mean and the inter-model agreement in the sign of the multi-model mean anomaly for a given variable, geographical region, season and time period. In addition to maps showing the anomalies and their inter-model agreement, the results are also given as anomaly time series showing each individual model and the multi-model ensemble mean, which can be used to estimate the inter-model spread.
ECS and TCR are climate metrics that can be used to estimate and compare the sensitivity of simulated nearsurface temperature from individual models to increased atmospheric CO 2 concentrations. With these metrics included in the ESMValTool, it is easily possible to group the models in high-and low-sensitivity models for further analysis.
Emergent constraints offer the possibility to use an ensemble of ESMs together with observations in order to constrain non-observable parameters such as simulated future Earth system feedbacks. Overall, seven emergent constraints are available in ESMValTool v2.0 for ECS: (1) covariance of shortwave cloud reflection using the models' correlation of the covariance of tropical low-level cloud reflection with the underlying SST (Brient and Schneider, 2016); (2) latitude of the climatological mean Hadley cell edge (Lipat et al., 2017); (3) atmospheric convective mixing calculated as sum of small-and large-scale component, the lower tropospheric mixing index (Sherwood et al., 2014); (4) bias in climatological annual mean precipitation over the southeastern Pacific, the southern ITCZ index (Tian, 2015); (5) midtropospheric humidity over the tropical Pacific, the tropical mid-tropospheric humidity asymmetry index (Tian, 2015); (6) global temperature variability (Cox et al., 2018); and (7) difference between tropical and midlatitude cloud fraction (Volodin, 2008). Two emergent constraints on the hy-  Flato et al., 2013, Fig. 9.24c). An observational estimate of the trend in summer sea ice extent from HadISST (Rayner et al., 2003) over the same time period is shown by the vertical red line. Produced with recipe_seaice.yml; for details, see Sect. 3.4.4. Figure 20. Time series of September Arctic sea ice extent for individual CMIP5 models (dashed gray lines), multi-model mean (thick red line) and multi-model standard deviation (area shaded between thin red lines) for scenario RCP8.5. The year of disappearance (sea ice extent below 1 million km 2 ) obtained from the CMIP5 multimodel mean is indicated by the vertical red line (similar to Collins et al., 2013, Fig. 12.31e). Produced with recipe_seaice.yml; for details, see Sect. 3.4.4. drological cycle are implemented: (1) a constraint on the hydrological cycle intensification that uses observations of radiative fluxes and water vapor (DeAngelis et al., 2015); and (2) a constraint on the future Indian summer monsoon using present-day precipitation data over the tropical western Pa-cific (Li et al., 2017). Additionally, emergent constraints are available for the carbon cycle: (1) future tropical land carbon storage (Wenzel et al., 2014); (2) projected land photosynthesis (Wenzel et al., 2016a). Also implemented are emergent constraints for the year of disappearance of September Arctic sea ice (Massonnet et al., 2012) and for the snow-albedo effect (Hall and Qu, 2006).
Various new diagnostics are available specifically for analysis of climate model projections. The MDER method has been implemented to constrain the projected position of the austral jet following Wenzel et al. (2016b). The method uses a stepwise regression to identify the most relevant diagnostics (calculated with present-day data) that are linked to projections of a quantity via a multivariate linear regression scheme. Observational data can then be used to constrain the projected quantity such as the future austral jet position.
A number of newly implemented diagnostics resembling selected figures from IPCC AR5 chapter 12  for analysis of climate model projections are grouped in one recipe. The diagnostics include time series and horizontal maps and vertical zonal maps including stippling and hatching to show significant changes between a climate projection scenario and a historical simulation. For the stippling and hatching, results from pre-industrial control runs are used to estimate internal variability of a variable, which is then used to assess whether simulated changes are significant or not. Diagnostics to analyze sea ice in climate model simulations are also grouped in one recipe. The new diagnostics include calculation of trends in sea ice area and extent, multi-model estimates for the year of disappearance of sea ice in climate projections and scatter plots of different historical and future sea ice metrics such as historical trend in sea ice extent vs. YOD. In addition, a "toy model" has been implemented into ESMValTool v2.0 that allows generating synthetic ensemble members from a single dataset (Weigel et al., 2008). When applied to observational data, this can be used to take into account observational uncertainty when comparing the observations with model results. For this, the user needs to specify the standard error of the observations that is provided with some observational datasets or estimated from differences between different observational datasets for the same quantity.
ESMValTool v2.0 is an open-source software tool that has been specifically developed to facilitate evaluation and analysis of Earth system models participating in CMIP. As such, it can process and analyze CMOR compliant model output and observational datasets with the particular aim to provide traceable and reproducible results, well-documented diagnostics and metrics and an efficient workflow allowing to evaluate models in more depth and more rapidly than it was typically possible in previous CMIP phases. The CMOR standard is, however, quite detailed and implemented in a relatively strict way in the ESMValTool in order to ensure data consistency and to minimize the probability of errors in the data processing. Increasing the flexibility of the CMOR check and automatic fixes of small inconsistencies is a currently ongoing activity and should make the data processing smoother, especially for datasets which are not part of CMIP or any CMIP-endorsed model intercomparison project (MIP). This means that a certain familiarity with these data standards is required in order to use the ESMValTool. Another limitation is that for license issues, observations cannot be distributed together with the software package. New users are required to download and process observational datasets before being able to use the tool or to have access to a computing center where observational data for the ESM-ValTool (i.e., CMORized) are already available. We are currently working on automating this process to facilitate the data retrieval and CMORization process.
The new ESMValTool is now available to the community for evaluation and scientific analyses of CMIP6 data. Thanks to a strong community involvement, the ESMValTool is constantly extended and improved in an effort to make the tool more user friendly, more efficient and a better tool for climate analyses. The ongoing ESMValTool development and discussions regarding new features can be followed on GitHub at https://github.com/ESMValGroup (last access: 1 September 2020). Feedback, bug reports and contributions by the scientific community are very welcome at any time.
Data availability. CMIP5 data are available freely and publicly from the Earth System Grid Federation (ESGF). Observations used in the evaluation are detailed in the various sections of the paper. The observational datasets are not distributed with the ES-MValTool that is restricted to the code as open-source software. Observational datasets that are available through the Observations for Model Intercomparisons Project (obs4MIPs, https://esgf-node. llnl.gov/projects/obs4mips/, last access: 26 February 2020) can be downloaded freely from the ESGF and directly used in the ESMVal-Tool. For all other observational datasets, the ESMValTool provides a collection of scripts (NCL and Python) with exact downloading and processing instructions to recreate the datasets used in this publication.
Author contributions. AL and VE coordinated the ESMValTool v2.0 diagnostic effort and led the writing of the paper. MR helped coordinate the diagnostic implementation and testing in ESMVal-Tool v2.0. All other authors contributed individual diagnostics to this release. All authors contributed to the text.