Reduced complexity model intercomparison project phase 1: Protocol, results and initial observations

Here we present results from the first phase of the Reduced Complexity Model Intercomparison Project (RCMIP). RCMIP is a systematic examination of reduced complexity climate models (RCMs), which are used to complement and extend the insights from more complex Earth System Models (ESMs), in particular those participating in the Sixth Coupled Model Intercomparison Project (CMIP6). In Phase 1 of RCMIP, with 14 participating models namely ACC2, AR5IR (2 and 3 box versions), 5 CICERO-SCM, ESCIMO, FaIR, GIR, GREB, Hector, Held et al. two layer model, MAGICC, MCE, OSCAR and WASP, we highlight the structural differences across various RCMs and show that RCMs are capable of reproducing global-mean surface air temperature (GSAT) changes of ESMs and historical observations. We find that some RCMs are capable of emulating the GSAT response of CMIP6 models to within a root-mean square error of 0.2 C (of the same order of magnitude as ESM internal variability) over a range of scenarios. Running the same model configurations for both RCP and SSP scenarios, we see that 10 the SSPs exhibit higher effective radiative forcing throughout the second half of the 21st Century. Comparing our results to the difference between CMIP5 and CMIP6 output, we find that the change in scenario explains approximately 46% of the increase in higher end projected warming between CMIP5 and CMIP6. This suggests that changes in ESMs from CMIP5 to CMIP6 1 https://doi.org/10.5194/gmd-2019-375 Preprint. Discussion started: 21 January 2020 c © Author(s) 2020. CC BY 4.0 License.


Introduction
In an ideal world, sufficient computing power would enable running our most comprehensive, physically complete models for 20 every application of interest. However, computational limits do exist so we must sometimes turn to other approaches. One common approach is the use of reduced complexity climate models (RCMs).
RCMs typically exchange limited spatial and temporal resolution for computational efficiency. Specifically, they usually focus on global-mean, annual-mean quantities. As a result, they are usually on the order of a million times faster than more complex models (in terms of simulated model years per unit CPU time). 25 The computational efficiency of RCMs means that they can be used where computational constraints would otherwise be limiting. Such cases include performing climate assessment for a large number of scenarios, exploring interacting uncertainties from multiple parts of the climate system or combining multiple lines of evidence in an internally consistent setup. In the IPCC context, a prominent example is the climate assessment of the WGIII socioeconomic scenarios. Given there are hundreds of emission scenarios submitted by Integrated Assessment Models in IPCC AR5 and AR6 (available at https://secure.iiasa.ac.at/ 30 web-apps/ene/AR5DB and https://tntcat.iiasa.ac.at/SspDb, hosted by the IIASA Energy Program) it is unfeasible to perform climate assessment with the world's most comprehensive models. Instead, reduced complexity models are used.
Beyond their computational efficiency, RCMs also offer conceptual simplicity. This gives them a second use: aiding in interpreting results from higher complexity models or observations. While we think this second use is also valuable (see e.g. Held et al. (2010) and Gregory (2004)), it is beyond the scope of this paper. 35 When RCMs are used to overcome computational constraints, they are typically used in one of two ways. 'Emulation' mode, where the RCMs are run in a setup which has been tuned to reproduce the behaviour of a Coupled Model Intercomparison Project (CMIP) (Eyring et al., 2016;Taylor et al., 2012) model as closely as possible over a range of scenarios. 'Probabilistic' mode, where the RCMs are run with a parameter ensemble which captures the uncertainty in historical observations. The resulting projections provide a plausible, observationally consistent range of projections which is large enough to provide 40 useful statistics. In some cases they are also used in a combination of the two. For example, the MAGICC6 probabilistic setup used in the IPCC's Fifth Assessment Report (IPCC, 2013) used randomly drawn emulations for the carbon cycle response whilst using a probabilistic parameter ensemble for the climate response to radiative forcing (Meinshausen et al., 2009(Meinshausen et al., ). 1997) investigated simple climate models used in the IPCC Second Assessment Report and compared their performance against idealised AOGCM results. Van Vuuren et al. (2011) compared RCMs used in integrated assessment models (IAMs). They focussed on five CO 2 -only experiments to quantify the differences in the behaviour of each IAM's climate component (each of which is an RCM due to computational constraints). Harmsen et al. (2015) extended the work of van Vuuren et al. (2011) to consider the impact of non-CO 2 climate drivers in the RCPs. Recently, Schwarber et al. (2019) proposed a series of impulse 70 tests for simple climate models in order to isolate differences in model behaviour under idealised conditions. RCMIP expands on the scope of previous work to include a broader range of scenarios and to make the first systematic comparison with both observations and CMIP model output.

Science questions
Over the course of its lifetime, RCMIP intends to answer the following scientific questions. This Phase 1 paper sets out the 75 protocols required to do so. However, given the scope of the questions, they cannot all be answered in a single paper and so some aspects will receive less attention here than others.
1. How do existing RCMs vary in their response to changes in well-mixed greenhouse gas (WMGHG) emissions, WMGHG concentrations, anthropogenic aerosol precursor emissions and natural changes in effective radiative forcing?
2. To what extent can RCMs emulate the response of more complex models from CMIP5 and CMIP6?  Table S3). The first class of these are the 'esm-X-allGHG' experiments. These runs are driven by emissions of all greenhouse gases (CO 2 , CH 4 , HFCs, PFCs etc.), rather than only CO 2 as is typical for ESMs in CMIP6.
These experiments are particularly useful to the WGIII community as they perform climate assessment based on emissions scenarios, hence need models which can run from emissions and do not require exogenous concentrations of greenhouse gases. 90 We also add one extra experiment, ssp370-lowNTCF-gidden, onto the ssp370-lowNTCF experiment from AerChemMIP.
The ssp370-lowNTCF experiment explicitly excludes a reduction in methane concentrations. However, the ssp370-lowNTCF emissions dataset as described in Gidden et al. (2019) and calculated in Meinshausen et al. (2019) does include reduced methane emissions and hence atmospheric concentrations. We include a 'ssp370-lowNTCF-gidden' scenario to complement ssp370-lowNTCF and examine the consequences of a strong reduction in methane emissions.

95
The standard set of inputs from CMIP5 and CMIP6 was translated into the RCMIP experiment protocol, using the WGIII format (Gidden and Huppmann, 2019), so it could be used by the modelling groups.
For CMIP6 historical emissions, we have used data sources which match the harmonisation used for the CMIP6 emissions projections. This ensures consistency with CMIP6, albeit at the expense of using the latest data sources in some cases. CMIP6 historical anthropogenic emissions for CO 2 , CH 4 , BC, CO, NH 3 , NOx, OC, SO 2 and volatile organic compounds (VOCs) 105 come from CEDS (Hoesly et al., 2018). Biomass burning emissions data for CH 4 , BC, CO, NH 3 , NOx, OC, SO 2 and volatile organic compounds (VOCs) come from UVA (van Marle et al., 2017). The biomass burning emissions are a blend of both anthropogenic and natural emissions, which could lead to some inconsistency between RCMs as they make different assumptions about the particular anthropogenic/natural emissions split. CO 2 global land-use emissions are taken from the Global further research. To facilitate such research, we have put the entire database under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Participating models
The models participating in RCMIP Phase 1 cover a broad spectrum of approaches (Table 1). In the climate response to radiative forcing, they range from two-box impulse response models to hemispherically resolved upwelling-diffusion ocean models and 130 in the carbon cycle they range from no carbon cycle to regionally resolved, internal book-keeping modules. Technical details of the models are summarized in Supplementary Table S1 and described below.

ACC2
The Aggregated Carbon Cycle, Atmospheric Chemistry, and Climate model (ACC2) (Tanaka et al., 2007) is a reducedcomplexity model developed from its predecessors (Hooss et al., 2001;Bruckner et al., 2003). ACC2 describes physical 135 and biogeochemical processes of the earth system at the global-annual-mean level and comprises three modules: i) carbon cycle, ii) atmospheric chemistry and radiative forcing, and iii) climate. The carbon cycle module is a four-box atmosphereocean mixed layer model coupled with a four-box land biosphere model. Each box is characterized by a distinct time constant, capturing carbon cycle processes operating on different time scales. Saturating ocean CO 2 uptake under rising atmospheric CO 2 concentration is dynamically modeled by considering the thermodynamic equilibrium for ocean carbonate species. The 140 5 https://doi.org/10.5194/gmd-2019-375 Preprint. Discussion started: 21 January 2020 c Author(s) 2020. CC BY 4.0 License.
CO 2 fertilization effect of the land biosphere is parameterized by the beta factor and assumed to depend logarithmically on the fractional change in the atmospheric CO 2 concentration relative to preindustrial. Carbon cycle processes are assumed to be insensitive to the change in surface temperature. ACC2 deals with the radiative forcing of CO 2 , CH 4 , N 2 O, O 3 , SF 6 , 29 species of halocarbons, three types of aerosols, and stratospheric H 2 O. The lifetime of CH 4 is related to the OH concentration, which further depends on the CH 4 concentration and NOx, CO, and VOC emissions. The climate module is the Diffusion 145 Ocean Energy Balance Climate Model (DOECLIM) (Kriegler, 2005;Tanaka et al., 2007), a land-ocean energy balance model coupled with a heat diffusion model describing the heat transfer to the deep ocean.
Uncertain parameters such as climate sensitivity and beta factor for CO 2 fertilization are optimized based on an inverse estimation method considering historical observations and associated uncertainties (Tanaka et al., 2009). The model is thus designed to produce a set of 'best' outcomes under different assumptions, but not probabilistic outcomes. In RCMIP, model 150 version 4.2 (Tanaka and O'Neill, 2018) is used.

AR5IR
The Fifth Assessment Report Impulse Response Model (AR5IR) was used for metric calculations in Chapter 8 of the IPCC's Fifth Assessment Report (Myhre et al., 2013b). Here we focus only on its temperature response to (effective) radiative forcing and do not incorporate any gas cycles. AR5IR is an impulse response model with two timescales by default. We also provide a 155 three timescale version for comparison.
AR5IR is provided in a default setup which broadly reproduces historical GSAT observations with an equilibrium climate sensitivity of 3 C. AR5IR has also been calibrated to the idealised CO 2 -only experiments (abrupt-2xCO2, abrupt-4xCO2, abrupt-0p5xCO2, 1pctCO2) of a number of CMIP6 models. Only these calibrations are reported as emulating CMIP6 models for experiments including non-CO 2 climate drivers requires model specific estimates of non-CO 2 radiative forcings, which 160 were not available during the calibration process. Throughout the calibrations, a default forcing due to a doubling of atmospheric CO 2 of 3.74 Wm 2 was chosen for simplicity. Using CMIP6 model specific values would not change results as it would simply cause all the calibrated parameters to be rescaled by a constant factor.

CICERO-SCM
The CICERO Simple climate model (CICERO-SCM) consists of an energy balance/upwelling diffusion model, a carbon cycle 165 model (Joos et al., 1996) and simplified expressions relating emissions of 38 components to forcing, either directly or via concentrations (Skeie et al., 2017). The energy balance/upwelling diffusion model (Schlesinger et al., 1992) calculates warming separately for the two hemispheres and includes 40 vertical layers in the ocean. Heat is transported in the ocean by down welling of polar water and by diffusion. A posteriori distribution of the key parameters describing the mixing processes as well as the climate sensitivity and aerosol forcing has been estimated using observations of atmospheric temperature change and ocean 170 heat content (Skeie et al., 2018).
The default model version used in this study has a climate sensitivity of 2.0 C and the rest of the parameters and the aerosol forcing is set to the mean value of the ensemble members with climate sensitivity in the range of 1.95 to 2.05 C from Skeie et al. (2018). For the model version with 3.0 C, the same approach is done for climate sensitivity in the range of 2.9 to 3.1 C.
A detailed description of the CICERO-SCM is presented by Skeie et al. (2017). Here, new expressions of CO 2 , CH 4 and 175 N 2 O radiative forcing (Etminan et al., 2016) are implemented in the SCM and indirect aerosol forcing is linearly scaled with SO 2 emissions rather than logarithmic as previously specified in the model. For the emission driven simulations, the time development of the natural emissions of N 2 O and CH 4 are adjusted to match the observed historical concentration. The mean value of the natural emissions for the last 10 years is used for the future period. recreating the broad outline of the global climate history from 1850 to 2015. ESCIMO is also able to recreate the projections of global mean temperature GMST and other variables such as ice cover, ocean acidification, heat flow to the deep ocean, and carbon uptake in biomass, which are generated by the more complex climate models over the commonly used RCP scenarios.

ESCIMO
One feature of ESCIMO is that it generates run-away warming if global-mean land-ocean blended surface temperature (GMST) is allowed to rise more than 1 C relative to preindustrial times. However, the ensuing warming is slow and takes hundreds of 190 years to lift the GMST by 3 C.

FaIR
The Finite-amplitude Impulse Response (FaIR) model is an emissions-driven simple climate model written in Python. FaIR v1.0 was originally designed to model the state-dependent carbon cycle with accumulated atmospheric carbon and temperature to CO 2 experiments (Millar et al., 2017) based on the impulse response relationship in IPCC AR5 (Myhre et al., 2013b), which 195 is in turn built upon pulse-response experiments from Earth system models of full and intermediate complexity (Joos et al., 2013). The radiative forcing is coupled with a two time-constant global mean temperature response (Geoffroy et al., 2013).
FaIR was extended to include emissions of non-CO 2 greenhouse gases and short-lived climate forcers in v1.3 (Smith et al., 2018a), reporting 13 categories of (effective) radiative forcing using simple emissions to forcing relationships available from AR5 and later (Etminan et al., 2016;Stevens, 2015;Ghan et al., 2013;Myhre et al., 2013b, a;Stevenson et al., 2013). FaIR 200 v1.5 used in this analysis includes modules for inverting CO 2 concentrations to emissions and additional functionality required to run many of the experiments in RCMIP.
For the default case, a representative TCR of 1.7 C was chosen. Owing to tropospheric rapid adjustments (Smith et al., 2018b), an effective radiative forcing from a doubling of CO 2 of 4.01 Wm 2 is used, higher than the 3.71 Wm 2 used in (Myhre et al., 2013b). Furthermore, all experiments in this paper were run using the older GHG concentrations to radiative formula. It then extends the carbon cycle model out to other gases through parameter selection. The thermal response model is identical to the thermal response of both FaIR v1.0 and v1.5; though we allow the user to specify the number of thermal response timescales, rather than constraining the model to the usual two boxes. The model is therefore 6 equations in its entirety. The identical treatment of all greenhouse gases and aerosols (we find that aerosols can also be adequately represented 220 through appropriate parameter selection) allows the model to take advantage of Python's array calculation infrastructure. It is therefore extremely rapid to run hundreds of thousands of scenario and parameter ensembles with GIR.

GREB
The Globally Resolved Energy Balance (GREB) model, also referred to as the Monash Simple Climate Model (MSCM) (Dommenget et al., 2019), is a 3.75 x 3.75 global model on three vertical layers: surface, atmosphere and subsurface ocean. It was 225 originally developed to simulate the globally resolved surface temperature response to a change in CO 2 forcing, but is also used to study other elements of the mean climate state. The main physical processes that control the surface temperature tendencies are simulated: solar (short-wave) and thermal (long-wave) radiation, the hydrological cycle (including evaporation, moisture transport and precipitation), horizontal transport of heat and heat uptake in the subsurface ocean. Atmospheric circulation and cloud cover are a seasonally prescribed boundary condition, and state-independent flux corrections are used to keep the GREB 230 model close to the observed mean climate. Thus, GREB does not simulate the atmospheric or ocean circulation and is therefore conceptually very different from GCM simulations.
The model does simulate important climate feedbacks such as water vapour and ice-albedo feedback, but an important limitation of the GREB model is that the response to external forcings does not involve circulation or cloud feedbacks (Dommenget and Flöter, 2011). The latest version (GREB v1.0.1) includes a new hydrological cycle model, which rep-235 resents a simple and fast tool for the study of precipitation from a large-scale climate perspective, as well as to assess its response to climate variability (e.g. El Niño or climate change) and to external forcings. Three separate parameterisations are improved in the model: precipitation, evaporation and the circulation of water vapour (Stassen et al., 2019). A series of GREB experiment results and more introduction can be simply accessed on the official model website at http: //monash.edu/research/simple-climate-model/mscm/overview_i18n.html?locale=EN. The last stable version code is available on GitHub at https://github.com/christianstassen/greb-official/releases. For RCMIP scenarios, the model is run with all processes on. The climatology used for flux correction is taken from ERA-Interim Reanalysis IFS model (ECMWF_IFS) within a time window from 1979 to 2015. Other parameters and input variables (not initial conditions), are taken from either ERA-Interim or NCEP Reanalysis data.

245
Hector is an open-source, object-oriented, reduced-form global climate carbon-cycle model that includes simplified but physicallybased representations of the most critical global-scale Earth system processes (Hartin et al., 2015;Vega-Westhoff et al., 2019).
At each annual time step, Hector calculates a global change in radiative forcing from changes in atmospheric composition, and then passes this forcing to its climate core. The carbon cycle consists of three parts: a one-pool atmosphere, three-pool land, and four-pool ocean. The three terrestrial carbon cycle pools -vegetation, detritus, and soil -exchange carbon with the 250 atmosphere through primary production (which increases with atmospheric CO 2 concentration) and heterotrophic respiration (which increases with temperature). Hector actively solves the inorganic carbon system in the surface ocean, directly calculating air-sea fluxes of carbon and ocean pH. A basic representation of thermohaline circulation drives heat and carbon exchange between low-and high-latitude surface ocean pools, an intermediate ocean pool, and a deep ocean pool. Besides CO 2 , Hector also includes a basic representation of the atmospheric chemistry of halocarbons, O 3 , NOx, and CH 4 . The climate core of 255 Hector is based on DOECLIM (Kriegler, 2005;Tanaka et al., 2007) and involves an analytical solution to ordinary differential equations governing heat exchange between the atmosphere and land, as well as a one-dimensional ocean heat diffusion model.
Hector is written in C++ and can be run as a standalone executable, coupled to an integrated assessment model, as well as via a Python (Willner et al., 2017)    The Held two layer model is provided in a default setup which broadly reproduces historical GSAT observations with an equilibrium climate sensitivity of 3 C. It has also been calibrated to the idealised CO 2 -only experiments (abrupt-2xCO2, abrupt-4xCO2, abrupt-0p5xCO2, 1pctCO2) of a number of CMIP6 models. Only these calibrations are reported as emulating CMIP6 models for experiments including non-CO 2 climate drivers requires model specific estimates of non-CO 2 radiative forcings, which were not available during the calibration process. Throughout the calibrations, a default forcing due to a 275 doubling of atmospheric CO 2 of 3.74 Wm 2 was chosen for simplicity. Using CMIP6 model specific values would marginally improve the fits and the correspondence between fitted parameters and ESM properties.

MAGICC
MAGICC, the 'Model for the Assessment of Greenhouse Gas Induced Climate Change' has its origins in early upwellingdiffusion ocean parameterisations by Wigley and Schlesinger (1985), with components for sea level rise (Wigley and Raper,280 1987) and a carbon cycle being added subsequently (Wigley, 1991). MAGICC has been used frequently in all past IPCC reports, often in conjunction with the ISAM (Jain et al., 1994) and Bern-CC (Siegenthaler and Joos, 1992)  For the calibration to individual CMIP6 models, three key advances have been made since MAGICC7.0.0. Firstly, a land 290 heat capacity term was included. Previously, only the ocean provided heat capacity in the energy balance model with ocean heat uptake equalling the parameterised radiative flux imbalance at the top of the atmosphere. Secondly, the parameterisation of the sensitivity of climate sensitivity to the overall forcing magnitude has been adapted. Previously, this sensitivity was implemented as an independent scaling of land and ocean feedback parameters (Meinshausen et al., 2011a). In MAGICC7.1.0, there are two parameterisations that allow for a time-changing effective climate sensitivity -apart from the geometric effect 295 that arises from constant land and ocean feedback parameters but time-changing ratios of land and ocean warming. The first is an option for the calibration routine to allow a linear adjustment of the climate sensitivity with a forcing change, pegged at doubled CO 2 concentration. This immediate adjustment of the climate sensitivity is independent from the second one, which allows for a slow and inert response, such as to parameterise the slow change of albedo with melting ice caps. This second sensitivity adjustment is assumed to be proportional to the cumulative temperature over a window of several hundred years.

300
Given the wide range of different experiments in the CMIP6 archives, ranging from highly idealized scenarios to multi-gas 21 st century results, MAGICC's calibration routines (which calibrate to all available scenarios, both idealised, and multi-gas scenarios at once) can differentiate between immediate and inert climate sensitivity adjustment terms.

MCE
MCE, a minimal CMIP emulator, is a tool for diagnosing and emulating the global behavior of complex AOGCMs with 305 minimally required parameters for both simplicity and accuracy. It consists of a thermal response module and a carbon cycle module. The thermal response, described in Tsutsui (2017), is represented by the sum of three exponentials as an impulse response function to effective radiative forcing, assuming a constant climate feedback parameter. The forcing model uses logarithmic scaling in terms of atmospheric CO 2 concentrations and its amplification from the first to the second doubling.
Non-CO 2 forcing is not explicitly considered. The carbon cycle is based on Hooss et al. (2001) and Joos et al. (1996) and  . The carbon cycle parameters are crudely calibrated to historical carbon balance assessed in AR5 (Ciais et al., 2013), and mean properties of CMIP5 ESMs (Arora et al., 2013), and are fixed across all model configurations. The 320 TCRE-based median calibration is close to the current default configuration, which is defined as a CMIP5 mean.

OSCAR
OSCAR v3.0 is an open-source reduced-form Earth system model, whose modules mimic models of higher complexity in a probabilistic setup (Gasser et al., 2017a). The response of the global surface temperature to radiative forcing follows a two-box model formulation (Geoffroy et al., 2013). OSCAR calculates the radiative forcing caused by greenhouse gases (CO 2 , CH 4 , 325 N 2 O, 37 halogenated compounds), short-lived climate forcers (tropospheric and stratospheric ozone, stratospheric water vapor, nitrates, sulfates, black carbon, primary and secondary organic aerosols) and changes in surface albedo. The ocean carbon cycle is based on the mixed-layer response function of Joos et al. (1996), albeit with an added stratification of the upper ocean derived from CMIP5 (Arora et al., 2013) and with an updated carbonate chemistry. The land carbon cycle is divided into five biomes and five regions, and each of the 25 biome/region combinations follows a three-box model (soil, litter and vegetation). The pre-330 industrial state of the land carbon cycle is calibrated to TRENDYv2 (Sitch et al., 2015) and its transient response to CO 2 and climate is calibrated to CMIP5 (Arora et al., 2013). Land cover change, wood harvest and shifting cultivation are also accounted for, thanks to a dedicated book-keeping module that allows OSCAR to estimate its own CO 2 emissions from land-use change (Gasser et al., 2017b). Permafrost thaw and the consequent emission of CO 2 and CH 4 is also modeled (Gasser et al., 2018).
CH 4 emissions from wetlands are calibrated on WETCHIMP (Melton et al., 2013). In addition, biomass burning emissions 335 are calculated endogenously following the book-keeping module and the wildfire feedback. These emissions were therefore subtracted from the input RCMIP data used to drive OSCAR to avoid double counting. The atmospheric lifetimes of non-CO 2 greenhouse gases are impacted by non-linear tropospheric (Holmes et al., 2013) and stratospheric (Prather et al., 2015) chemistry. Tropospheric ozone follows the formulation by Ehhalt et al. (2001) but recalibrated to ACCMIP (Stevenson et al., 2013). Stratospheric ozone is derived from Newman et al. (2006) and Ravishankara et al. (2009). Aerosol-radiation interactions 340 are based on CMIP5 and AeroCom2 (Myhre et al., 2013a), and aerosol-cloud interactions depend on the hydrophilic fraction of each aerosol and follow a logarithmic formulation (Hansen, 2005). Surface albedo change induced by land-cover change follows Bright and Kvalevåg (2013). The one induced by deposition of black carbon on snow is calibrated on ACCMIP globally (Lee et al., 2013) and then regionalized (Reddy and Boucher, 2007).
In this project, 10000 elements of a Monte-Carlo ensemble were generated, and each simulation was run using all these 2.65 ± 0.46 C. We did not further constrain our ensemble.

WASP
The Warming Acidification and Sea level Projector (WASP) is an efficient Earth system model capable of producing a large ensemble of simulations on a standard desktop computer. WASP comprises an 8-box representation of the coupled atmosphereocean-land system (Goodwin, 2016). The atmosphere comprises a single box connected to the land and ocean systems. The

355
land system comprises a vegetation carbon box with CO 2 and temperature dependent net primary production, and a soil carbon box with temperature dependent carbon respiration. The ocean comprises a surface mixed layer box, exchanging heat and carbon with 4 sub-surface boxes (Goodwin, 2016). The exchange of carbon dioxide between the atmosphere and surface ocean boxes is solved via a buffered carbon inventory approach (Goodwin et al., 2014), avoiding the computational expense of iteratively solving seawater carbonate chemistry. Global mean surface warming is solved via the extended energy balance 360 equation of (Goodwin et al., 2014). The version of WASP used here adopts time-evolving climate feedbacks as described in Goodwin (2018), with separate climate feedback terms representing processes such as cloud feedback, water vapour-lapse rate feedback and surface albedo feedback. The WASP model is coded in C++, with the code for the version used here available from the supplementary material of Goodwin (2018).
Initially, a 10 million member ensemble is generated with the values of 21 input parameters varied independently between 365 simulations. The 21 input parameter distributions are as described in Goodwin (2018), except that here: (i) the surface ocean carbon uptake e-folding timescale is varied from a uniform random distribution between the limits of 0.5 years and 1.0 years, and (ii) the sensitivity of soil carbon residence timescale to temperature is varied from a random-normal distribution with mean -1.36 yr K -1 and standard deviation 0.45 yr K -1 , to match the variation seen in Dynamic Global Vegetation Models used in Pugh et al. (2018). Each simulation is integrated from years 1765 to 2018 using historical forcing. All 10 million initial ensemble members are then assessed against an observational consistency test (Goodwin, 2018) using the methodology described in Goodwin et al. (2018). The observational consistency test comprises assessment of global surface warming, sea surface temperature warming, ocean heat uptake, ocean carbon uptake and land carbon uptake over various historical periods, resulting in observation consistent constraints on the ranges of physical climate feedbacks Goodwin (2018)  with which to extract the final ensemble, the final observation-consistent model ensemble generated from the ssp585 scenario is used. Due to the observational constraints used for surface warming, the WASP model's primary global surface temperature anomaly variable represents a blended land and ocean temperature (GMST). To calculate surface air temperature (GSAT) for the RCMIP results, the GMST is divided by 0.95 (Cowtan et al., 2015). This suggests a clear area for further evaluation of RCMs and has important implications for remaining carbon budget estimates, which are highly sensitive to estimates of recent warming trends (Leach et al., 2018). Discrepancies in the carbon cycle response to emissions also impact remaining carbon budget estimates, and further analysis on the all-greenhouse gas 400 emissions driven runs would highlight further model differences.

Results and Discussion
The discrepancies between RCMs are of a similar order of magnitude to observational uncertainties (Morice et al., 2012).
The spread of RCMs is also much smaller than the spread in available CMIP6 model results (warming of 1.13 ± 0.3 C and 13 https://doi.org/10.5194/gmd-2019-375 Preprint. Discussion started: 21 January 2020 c Author(s) 2020. CC BY 4.0 License. warming rate of 0.24±0.08 C / decade when only the first available ensemble member from each model group is used). Given their simple, tuneable nature, it is no surprise that RCMs tend to agree more closely with observations than CMIP models and 405 exhibit less spread.
In general, RCMs can emulate CMIP6 surface air temperature change relatively well (Figure 2). Given that RCMs do not include internal variability, the lower bound for root-mean square error (RMSE) is of the same order of magnitude as internal variability in CMIP6 models. The best emulators are pushing this limit, with RMSE on the order of 0.2K ( Table 2). As CMIP6 results have only recently become available, we expect further calibration efforts to reduce RMSE even further.

410
Despite their relatively good performance, results for the different emulation setups have generally only been submitted for a limited set of scenarios (see Supplementary Table S2 and Supplementary Figures S7 -S33). Hence it is still not clear whether the good performance in idealised scenarios also carries over to projections, particularly for the SSPs. Having said this, results for MAGICC7.1.0, which has supplied projections for the SSPs for each emulation setup, are promising. MAGICC7.1.0's results suggest that RCMs should be close to the lower limit of RMSE as more CMIP6 results become available and further 415 calibration efforts are carried out.
From the available results, differences emerge between models with constant effective climate sensitivities and models with time or state dependent effective climate sensitivities. Models with constant effective climate sensitivities, such as the AR5IR implementations, struggle to capture the non-linear response of ESMs to abrupt changes in CO 2 concentrations. Firstly, they predict an equally large response to negative radiative forcing as positive radiative forcing which isn't always the case in ESMs 420 (Panels (d) and (e) of Figure 2). Secondly, in order to capture the long-term warming seen in many abrupt-4xCO2 experiments, one of the boxes often has a response timescale on the order of thousands of years. This is problematic because it leads to equilibration times on the order of thousands of years and large equilibrium responses in the abrupt-2xCO2 experiments (i.e. large equilibrium climate sensitivities, see Supplementary Table S2). Models with time or state dependent effective climate sensitivities avoid both those problems. In particular, their temperature response to positive and negative radiative forcing 425 need not be of equal magnitude and they can exhibit the long warming tail seen in ESM abrupt-4xCO2 runs whilst avoiding extremely long equilibration times and large equilibrium temperature perturbations in abrupt-2xCO2 runs.
Probabilistic projections from RCMs illustrate the large range of plausible temperature projections resulting from physical parameter sets which are consistent with observations ( Figure 3). For example, under the very ambitious mitigation scenario ssp119, the models presented here have a best estimate of approximately 1.5 C for end of century warming. However, they 430 also suggest that there is still approximately a 1 in 6 chance that warming would exceed 2 C.
These probabilistic projections extend the results of CMIP6, which do not include such large perturbed parameter ensemble plus constraining exercises. The 66% ranges presented here are, in general, significantly narrower than the CMIP6 intermodel spread. There is no requirement that CMIP6 results lie within some range of historically observed temperature changes but the difference suggests that some caution should be used when inferring projection uncertainty from CMIP6 results alone.

435
Four of the models (MCE, WASP, FaIR and OSCAR) provide remarkably similar median projections. On the other hand, Hector projects significantly smaller surface air temperature increases, likely due to its lower CO 2 radiative forcing estimates ( Figure S2).
On the other hand, there is a surprising amount of variation in probabilistic simulations of the historical period. The variation in ranges, from MCE with relatively large ranges, to WASP and Hector with much smaller ranges, likely reflects differences 440 in constraining techniques. Given that variations in both model structure and calibration technique influence probabilistic projections, an area for future research could be to try and disentangle the impact of these two components. Such an experiment could involve constraining models with the same constraining technique or constraining a single model with two different techniques.
One other area for future research is the impact of the reference period. Within the reference period, all model results and 445 observations will be artificially brought together, narrowing uncertainty and disagreement within this period (Hawkins and Sutton, 2016). This can alter conclusions as the reference period will become less important for any fitting algorithm (because of the artificial agreement), placing more weight on other periods. Developing a method to rebase both the mean and variance of model and observational results onto other reference periods would allow the impact of the reference period choice to be explored in a more systematic fashion.

450
Looking forward, it is clear that making probabilistic projections consistent with CMIP6 results requires structural model flexibility, such that models are shown to be able to reproduce CMIP6 results. Having achieved this, probabilistic parameter ensembles can then be derived while considering uncertainty in both CO 2 and non-CO 2 climate drivers. During RCMIP Phase 1, only Hector has been able to perform both these steps. However, we hope that more models will be able to perform these steps in further phases. Such results would enhance our understanding of the uncertainty in observationally consistent climate 455 change projections and hence be of interest to the climate research community and beyond.
When run with the same model, warming projections are higher in the SSPs than the RCPs (Figure 4). Whilst historical warming estimates are very similar, if not slightly higher in the RCP-compatible historical runs, the scenarios separate over the course of the 21 st Century.
For the RCMIP results, we can see that the increase in warming projections is due to the higher effective radiative forcing 460 in the SSPs throughout the second half of the 21st Century (Supplementary Figure S3). The higher effective radiative forcing results appears to be a result of the SSPs agreeing more closely with their nameplate 2100 radiative forcing level than the RCPs, which were generally too low. The increased forcing is driven largely by increased CO 2 effective radiative forcing (Supplementary Figure S4), which itself is driven by increased CO 2 emissions (Gidden et al., 2019). Even though aerosol effective radiative forcing is also slightly more negative in the SSPs (Supplementary Figure S5), the difference of approximately 465 0.1 Wm 2 is not enough to offset increased CO 2 effective radiative forcing of approximately 0.5 Wm 2 .
At present, effective radiative forcings diagnosed from the CMIP6 models are not available (as such diagnosis is not a trivial task). However, given the monotonic relationship between CO 2 concentrations and effective radiative forcing (Myhre et al., 2013b), it is likely that the same mechanisms are driving at least part of the increase between CMIP5 and CMIP6 projections.
Comparing warming projections between CMIP5 and CMIP6, our results suggest that around 46% of the increase is scenario 470 driven. However, this still leaves 54% which is not explained by the change in scenarios. At this stage, this residual is most likely explained by a change in the models submitting results to CMIP, which appear to be more sensitive to changes in atmospheric GHG concentrations in CMIP6 than in CMIP5 (Wyser et al., 2019;Voldoire et al., 2019;Voosen, 2019). However, CMIP6 analysis is ongoing and should be considered before making strong conclusions about the robustness of these findings.
As discussed previously, the results from RCMIP can provide much more information than has been presented here. A 475 number of experiments have not been discussed here which would shed light on the differences between the RCMs in a number of other components. In addition, RCMs also offer the chance to explore more experiments than is planned in CMIP6 due to their computational efficiency. An experiment which is an example of both these points is the ssp370-lowNTCF scenario as quantified by Gidden et al. (2019), which includes reductions in methane emissions. In contrast, the ssp370-lowNTCF as defined by AerChemMIP explicitly includes methane emissions reductions. RCMs can examine the impact of this difference 480 by running an extra experiment, 'ssp370-lowNTCF-gidden', which follows the emissions quantified by Gidden et al. (2019).
Preliminary results are given in Supplementary Figure S6 and, unsurprisingly, show that surface air temperatures rise in ssp370-lowNTCF (relative to ssp370) whilst they fall in ssp370-lowNTCF-gidden. This fall in temperatures is driven entirely by reductions in methane emissions and users of CMIP6 data should be careful not to confuse the results of the ssp370-lowNTCF scenario with the emissions scenarios presented in Gidden et al. (2019). They are two different experiments.  490 We have found that RCMs are capable of reproducing broad-scale characteristics of observed historical GSAT changes as well as the response of ESMs under various experiments. Further work could focus on why RCMs exhibit relatively high recent warming rates compared to observations and using the ever growing body of CMIP6 results to improve RCM emulation capabilities. Nonetheless, there is clear evidence that the addition of time and state-dependent climate feedbacks in many RCMs has improved their ability to emulate the behaviour of more complex models under a range of forcing conditions.

495
Probabilistic projections from RCMs complement higher complexity model results by providing uncertainties which are by design consistent with historically observed temperature changes. Further evaluating these probabilistic distributions and the impact of different derivation techniques and model structures is a clear next step for the RCM community. Another next step is adding more models which are both calibrated to CMIP6 results and have probabilistic distributions as only the Hector model has managed this to date.

500
Phase 1 paves the way for further phases of RCMIP. Much of the work of defining community standards, data handling practices and communication methods has been established and now only needs refining. Further phases of RCMIP could focus on many different themes, for example, considering a wider range of variables, probabilistic climate projections (something which cannot be done with more complex models due to computational expense), specific components of the earth system (e.g. ocean heat content, representation of aerosols, sea-level rise) or model development (e.g. adding new components to models).
We would welcome requests, suggestions and further involvement from throughout the climate modelling research community.
Code and data availability. RCMIP input timeseries and results data along with processing scripts as used in this submission are available from the RCMIP GitLab repository at https://gitlab.com/rcmip/rcmip and archived by Zenodo (https://doi.org/10.5281/zenodo.3593570).
The ACC2 model code is available upon request.

535
were provided by JT. OSCAR results were provided by TG and YQ. WASP results were provided by PG. ZN wrote, except for the model descriptions, the first manuscript draft, produced all the figures and led the manuscript writing process with support from RG. All authors contributed to writing and revising the manuscript. thank the climate modeling groups for producing and making available their model output. RCMIP could not go ahead without the outputs of CMIP6 nor without the huge effort which is put in by all the researchers involved in CMIP6 (some of whom are also involved in RCMIP).
We also thank the RCMIP Steering Committee, comprised of Jan Fuglestvedt, Maisa Corradi, Malte Meinshausen, Piers Forster, Joeri Rogelj and Steven Smith, for their support and guidance throught Phase 1. We look forward to their ongoing support in further phases.
ZN benefited from support provided by the ARC Centre of Excellence for Climate Extremes (CE170100023). KT is supported by the