Minimal CMIP Emulator (MCE v1.2): A new simplified method for probabilistic climate projections

Climate model emulators have a crucial role in assessing warming levels of many emission scenarios from probabilistic climate projections, based on new insights into Earth system response to CO2 and other forcing factors. This article describes one such tool, MCE, from model formulation to application examples associated with a recent model intercomparison study. The MCE is based on impulse response functions and parameterized physics of effective radiative forcing and carbon uptake over ocean and land. Perturbed model parameters for probabilistic projections are generated from 10 statistical models and constrained with a Metropolis-Hastings independence sampler. A part of the model parameters associated with CO2-induced warming have a covariance structure, as diagnosed from complex climate models of the Coupled Model Intercomparison Project (CMIP). Although perturbed ensembles can cover the diversity of CMIP models effectively, they need to be constrained toward substantially lower climate sensitivity for the resulting historical warming to agree with the observed trends over recent decades. The model’s simplicity and resulting successful calibration imply that a 15 method with less complicated structures and fewer control parameters offers advantages when building reasonable perturbed ensembles in a transparent way. Experimental results for future scenarios show distinct differences between CMIPand observation-consistent ensembles, suggesting that perturbed ensembles for scenario assessment need to be properly constrained with new insights into forced response over historical periods.

The MCE has been used in both phases, and the present article provides details of the version used in Phase 2. The MCE model consists of prediction equations for thermal response and carbon cycle. Although there are many emulators with different complexities, their core modules appear to be based on a few pioneering works and are often shared between 65 different emulators. The thermal response of the MCE is implemented as a pure impulse response model (IRM), which is the most simplified form originated from the one presented in Hasselmann et al. (1993). The carbon cycle of the MCE is based on a part of the nonlinear impulse-response model of the coupled carbon-climate system (NICCS, Hooss et al., 2001), which may be categorized to be of intermediate complexity among RCMIP participants. One of them, ACC2 (Tanaka et al., 2007), also adopts the NICCS-based carbon cycle. 70 Although complex formulations generally are more capable of emulation, they are not necessarily advantageous for emulating individual CMIP models and representing their uncertainty ranges. For thermal response, this has been confirmed by the author's previous studies (Tsutsui, 2017;Tsutsui, 2020), which have demonstrated that a simple IRM can accurately emulate a variety of CMIP models in terms of temperature response to CO2 forcing and provide a basis of parameter sampling that covers model diversity. These findings also imply that less complex emulators are suitable for knowledge 75 transfer in a transparent way. From this perspective, key considerations for emulator design are in its subsidiary components, such as forcing parameterizations, treatment of non-linear processes involving some state-dependent response properties, and parameter constraining for probabilistic projections.
The remainder of this article is structured as follows. Section 2 describes model formulations and parameter sampling methods. Section 3 presents the experimental application of probabilistic climate projections. Section 4 discusses emulator 80 performance and constraining model parameters. Finally, Section 5 presents the study's main conclusions.

Impulse response models
The MCE model is essentially built on impulse response functions for the fraction of the total CO2 emitted that remains in the atmosphere (termed the airborne fraction), the decay of land carbon accumulated by the CO2 fertilization effect, and 85 temperature change to radiative forcing of CO2 and other forcing agents. Under the linear response assumption with regard to input forcing , an impulse response model (IRM) expresses the time change of a response variable by a convolution integral: where is time, and the sum of exponentials is an impulse response function with parameters and denoting the -th 90 component of the response amplitude and time constant, respectively. The time derivative of this equation is given by: https://doi.org/10.5194/gmd-2021-79 Preprint. Discussion started: 12 May 2021 c Author(s) 2021. CC BY 4.0 License.
or an equivalent box model form that is converted into the original IRM through Laplace transform or eigenfunction expansion. The time derivative implemented in the MCE uses an IRM form for land carbon decay and temperature change, and a box model form for the airborne fraction, to address partitioning of excess carbon between the atmosphere and ocean 95 mixed layer.
The IRM for the airborne fraction defines five components, one of which has infinity time constant, paired with an amplitude corresponding to an asymptotic long-term fraction. In the current configuration, the remaining four time constants are fixed at 236.5, 59.52, 12.17, and 1.271 years, adjusted to a specific three-dimensional ocean carbon cycle model in Hooss et al. (2001). The corresponding amplitudes assume perturbations at reference values of 0.24, 0.21, 0.25, and 0.1, respectively, 100 with a reference long-term airborne fraction of 0.20. These reference values and perturbation ranges are set empirically so that resulting carbon budgets-cumulative land and ocean CO2 uptake-agree with those of historical observations and CMIP experiments.
As described below, IRM parameters are converted into a set of parameters for an equivalent box model dealing with carbon exchange between four layers. In this conversion, the response of the shortest timescale component is interpreted as 105 equilibration between the atmosphere and ocean mixed layer, which are combined into a composite layer in the box model. Figure 1 shows response to an idealized instantaneous input of 100 GtC without land carbon uptake and climate feedback. In this case, the airborne fraction decreases from 0.9 to a long-term equilibrium of about 0.2 at a gradually decreasing rate. The IRM for land carbon defines four carbon pools, representing ground vegetation, wood, detritus, and soil organic carbon, with distinct overturning times ( ). The forcing term ( ) is net primary production (NPP) enhanced by the effect of CO2 fertilization, generally expressed by 0 , where is a fertilization factor that depends on the atmospheric CO2 115 https://doi.org/10.5194/gmd-2021-79 Preprint. Discussion started: 12 May 2021 c Author(s) 2021. CC BY 4.0 License. concentration, and 0 is base annual NPP in GtC per year. The response amplitude ( ) is rewritten as ̃, where ̃ denotes a decay flux after an initial carbon input. Based on Joos et al. (1996), the IRM parameters of the four carbon pools are set to 2.9, 20, 2.2, and 100 years for , and 0.70211, 0.013414, −0.71846, and 0.0029323 years −1 for ̃, respectively. In addition, the MCE deals with temperature dependency for the time constants of wood and soil organic carbon, indicating the tendency for warming to accelerate the decomposition of organic matter. This is one of the climate-carbon cycle feedback processes and is implemented with an adjustment coefficient varied along a logistic curve with respect to surface warming, as illustrated in Fig. 3. This scheme has a parameter to control the asymptotic minimum value of the coefficient. 125 The figure shows three curves with different control parameters, corresponding to the 17 th , 50 th , and 83 rd percentiles of the 'Prior' ensemble, described in 3.1, adjusted to be consistent with the variation of CMIP Earth system models (ESMs). In the IRM form, land accumulated carbon is proportional to ∑̃2, expressing the remaining carbon at an equilibrium state under unit continuous input, and the decrease in the time constants affects accumulated carbon quadratically.
130 https://doi.org/10.5194/gmd-2021-79 Preprint. Discussion started: 12 May 2021 c Author(s) 2021. CC BY 4.0 License. Figure 3: Adjustment coefficient as a function of surface temperature change to multiply the time constants for the decay of wood and soil organic matter. The three curves are functions corresponding to the 5 th , 50 th , and 95 th percentiles of the 'Prior' ensemble, described in 3.1, with different asymptotic minimum values, as described in the legend. The temperature at which a curve has the maximum gradient is fixed at 3.5 °C.
The IRM of the temperature change defines three components with typical time constants of approximately 1, 10, and > 100 135 years. Although the temperature response is usually well represented by two separated time constants of approximately 4 and > 100 years (e.g., Held et al., 2010, Geoffroy et al., 2013, dividing fast response is advantageous when considering instantaneous forcing changes, such as volcanic eruptions and geoengineering mitigation. The response amplitude is rewritten by ̃/ ( ), where ̃ is normalized so that the component sum is unity, and is the climate feedback parameter in Wm −2°C−1 , defined as the derivative of the outgoing thermal flux with respect to temperature change. These IRM parameters 140 can be adjusted to emulate individual CMIP models with sufficient accuracy, as demonstrated in Tsutsui (2020), which serves a basis to build a perturbed parameter ensemble.

Carbon uptake over ocean
The box model converted from the IRM for the airborne fraction is as follows: where is the amount of excess carbon in layer , ℎ is the layer depth, is the exchange coefficient between layer − 1 and layer , is anthropogenic emissions, and is natural uptake over land. The parameters ℎ and are set through 150 numerical optimization for the box model to be equivalent to the IRM form. The top layer, indexed with "0," is the composite atmosphere-ocean mixed layer, and the three sub-surface layers are indexed with "1," "2," and "3" in the order of ocean depth. The amount of excess carbon in the top layer ( 0 ) is partitioned into atmospheric and ocean components, denoted by and , subject to chemical equilibrium at the ocean surface. The carbon exchange between the top layer and the first sub-surface is expressed in terms of . 155 For a given time series of CO2 emissions (emission-driven) or atmospheric CO2 concentrations (concentration-driven), time integration is performed. In the latter case, 0 and its partition within the composite layer are diagnostically determined, and the top-layer equation is dropped.  The values of Alk, , and the equilibrium constants of 0 , 1 , 2 , , and are set based on Dickson et al. (2007). The 170 equilibrium constants depend on water temperature, and carbon uptake decreases with temperature, representing a climatecarbon cycle feedback process. This temperature dependency is implemented as a linear regression for empirical expressions, as shown in Fig. 4. The amount of excess carbon that can be accumulated in the ocean is proportional to a change in DIC from its preindustrial value. This carbon uptake potential and its temperature dependency are illustrated in Fig. 5. CO2-induced global warming increases the airborne fraction in two ways-through the buffering mechanism of seawater and through temperature 180 dependency of chemical equilibrium. The former is shown as a decreasing change rate of the DIC with regard to atmospheric CO2 concentration ( Fig. 5(a)), while the latter is shown as reduction in rates of carbon uptake potential with temperature, which also depends on the concentration ( Fig. 5(b)). The reduction rate is approximately proportional to the warming level, typically about 4 % per 1 °C at doubling CO2. 185 Figure 5: DIC in the ocean mixed layer as a function of atmospheric CO2 concentration (a) and changes in ocean carbon uptake potential, measured by increase in DIC from preindustrial levels, due to 1 °C and 2 °C warming (b). The preindustrial CO2 concentration is assumed to be 284.317 ppm and preindustrial DIC is about 2.17 mmol L −1 .

CO2 fertilization
The land carbon uptake term in Eq. (3) is calculated from Eq. (2), rewritten as: 190 where is the -th component of accumulated carbon by CO2 fertilization. The base NPP ( 0 ) is set to 60 GtC/y and the fertilization factor ( ) is formulated with a sigmoid curve with regard to CO2 concentration ( ) , as described in Meinshausen et al. (2011). This implementation is connected to a conventional logarithmic formula: such that the sigmoid and logarithmic curves are equal in terms of an increase ratio at 680 ppm relative to 340 ppm, and the latter factor ̂ is used as a control parameter. Figure 6 illustrates three curves with different control parameters in the MCE model.

Effective radiative forcing
The forcing term in the IRM for temperature change is assumed to be effective radiative forcing (ERF), defined as top-ofatmosphere (TOA) radiative imbalance due to a change in a forcing agent through rapid adjustments in the stratosphere and 205 troposphere prior to a response in surface temperature (Myhre et al., 2013, Sherwood et al., 2015. Forcing, defined as such, serves as a good predictor of surface temperature change.
CO2 forcing is evaluated with the following quadratic formula, in terms of the logarithm of CO2 concentration: where is the ratio of CO2 concentrations to a preindustrial level, C is a scaling parameter in Wm −2 , and C is an amplification factor defined as C (4) = C × � C (4). This scheme was presented in Tsutsui (2017) to emulate the CMIP's core CO2 increase experiments for instantaneous quadrupling and 1%-per-year increase, referred to as abrupt-4xCO2 and 1pctCO2, respectively. The two control parameters are diagnosed consistently with IRM parameters for individual CMIP models (Tsutsui 2020). The current diagnosing procedure solves numerical optimization to approximate the first 150-year 215 and 140-year time series from abrupt-4xCO2 and 1pctCO2 experiments, respectively, in terms of TOA energy imbalance and the surface air temperature anomaly, respectively. The quadratic term is activated when the concentration exceeds a two-times level ( > 2), and C is set to unity in the range ≤ 2 so that C is equivalent to � C . The forcing amplification is expected to be valid in the range ≤ 4 and the quadratic term is dropped beyond a four-time level. Figure 7 illustrates example outputs of the CO2 forcing scheme in a range of 5 th -95 th percentiles of the 'Prior' ensemble for control parameters.

225
The forcing of CH4 and N2O is evaluated with the expressions given in Etminan et al. (2016). The forcing of halogenated gases is simply calculated as changes in concentration from preindustrial levels multiplied by radiative efficiencies assessed in the latest IPCC report (currently AR5, Myhre et al., 2013). The current MCE model does not support non-CO2 gas cycles and ERF schemes for other forcing agents, such as aerosols, tropospheric and stratospheric ozone, solar radiation, and volcanic eruptions. Experiments considering non-CO2 forcing 230 require prescribed concentrations for long-lived greenhouse gases (GHGs) and prescribed ERF for others.

Parameter sampling
Probabilistic runs use an ensemble of perturbed model parameters designed to encompass the variation of multiple CMIP models with additional constraints with regard to assessed ranges of key climate indicators. In general, a series of candidate values of an uncertain parameter is generated from its statistical model and, if necessary, sampled from the series with an 235 acceptance algorithm for given constraints. The latter process is Bayesian updating from a prior probability distribution to a posterior and here uses a Metropolis-Hastings (MH) independence sampler. As mentioned above, uncertain parameters include IRM amplitudes for the airborne fraction, control parameters for land carbon decay timescales and CO2 fertilization, IRM parameters for temperature change, and control parameters for the CO2 forcing scheme.
The carbon cycle parameters are individually generated from a uniform distribution with a given mean and perturbation 240 range. The means and ranges are determined on a trial basis so that ranges of carbon budgets in historical and scenario experiments are consistent with those from multiple CMIP ESMs. Since the sum of IRM amplitudes for the airborne fraction is unity, their perturbed values are normalized as such, subject to a modified distribution with more samples about the mean resulting from the operation. The temperature response and CO2 forcing parameters are synthetically generated from a multivariate normal distribution 245 reflecting the variation of multiple CMIP AOGCMs. The IRM for temperature change has three pairs of time constant ( ) and dimensional amplitude ( ), and the CO2 forcing scheme has two control parameters ( C and C ). A total of eight parameters have been diagnosed for each of the multiple CMIP models, revealing characteristic covariance structures, such as a noticeable negative correlation between feedback strength (1/ ) and a realized warming fraction (typically TCR-to-ECS ratio), and a weakly negative correlation between the forcing scale ( C ) and feedback strength (Tsutsui, 2020). The 250 multivariate normal distribution is built on principal components (PCs) of these diagnosed parameters, as described in Tsutsui (2017).
The eight parameters to be fed into PC analysis can include some derived parameters from the following expressions: where ECS is defined using a diagnosed forcing of CO2 doubling, while ECSG uses CO2 quadrupling with a factor of 0.5 as in Gregory et al. (2004). Eq. (16) is derived from time integration of Eq.
(1) to the 70 th year ( 70 ) along a 1 %-per-year increasing path that defines TCR. One possible set consists of TCR, ̃0 /̃2, ̃1 /̃2, 0 , 1 , 2 , C , and C , applied with 260 logarithmic transformation, except for C . This set was adopted in the experiments described below. The logarithmic transformation is intended to allow fair normality of PC scores, as a basis for fitting a multivariate normal distribution, and to make generated candidates positive.
Probabilistic runs can also use different scaling factors to adjust individual non-CO2 ERF time series. This is a simple implementation to deal with non-CO2 forcing uncertainties, typically assessed as a range at a reference time point. The 265 scaling factor is perturbed with a suitable statistical model fitted to the range.
All uncertain parameters and ERF scaling parameters are not necessarily independent. The current sampling procedure incorporates covariance between the eight parameters relevant to temperature change in response to CO2 forcing. However, the procedure assumes no other correlations, implying that uncertainties of the CO2-induced temperature response are independent from those of the carbon cycle and non-CO2 forcing.

Scenario experiments
To demonstrate a typical application of the MCE model, a number of scenario experiments that mirror those of CMIP6 were conducted, including idealized abrupt-4xCO2 and 1pctCO2, and historical-future scenarios based on the Shared Socioeconomic Pathways (SSPs, Riahi et al., 2017). In the latter experiments, the model was initialized for the year 1850 and 275 driven with GHG concentrations and other prescribed ERF, both provided from the RCMIP (Nicholls et al., 2020).
For each scenario, two sets of 600-member ensemble runs were conducted; one was perturbed to be consistent with a CMIP multi-model ensemble and the other was further constrained according to the RCMIP Phase 2 protocol (Nicholls et al., submitted), here labeled 'Prior' and 'Constrained', respectively. 'Prior' refers to 25 CMIP5 and 38 CMIP6 AOGCMs for the PC analysis input, and to 8 CMIP5 and 11 CMIP6 ESMs diagnosed in Arora et al. (2020) for simulated carbon budgets in the 280 1pctCO2 experiment. Diagnosed forcing/response parameters of the multiple AOGCMs are presented in the MCE's code repository.
The uncertain carbon cycle parameters for 'Prior' were generated from the above-mentioned statistical models, as shown in Figs. 1, 3, and 6, and were processed by the MH sampler to constrain accumulated land carbon at doubling CO2 along the 1%-per-year pathway. In this case, 1pctCO2 scenario runs with a set of proposed parameters were conducted to obtain data 285 fed into the sampler. This single constraint was selected as it works inclusively for other relevant constraints through a tradeoff relationship between ocean and land in terms of accumulated carbon.
RCMIP Phase 2 defines a number of constraints for climate indicators, including ERF levels, carbon budgets, recent warming trends, and climate sensitivity metrics of ECS, TCR, and transient climate response to cumulative CO2 emissions (TCRE). Here, TCRE is defined as the ratio of the TCR to implied cumulative CO2 emissions at the time of CO2 doubling 290 along the same 1 %-per-year trajectory as that for TCR. These constraints use literature-based assessed ranges, referred to as a "proxy assessment" to distinguish these from the formal IPCC assessment. The 'Constrained' uncertain parameters were sampled from those of 'Prior' through a sequence of the MH sampler with a subset of RCMIP constraints, as follows: (1 (2020). In this case, in addition to 1pctCO2 runs, historical scenario runs with a set of proposed parameters were conducted to obtain data fed into the sampler.
The IRM for temperature change is transformed into a three-layer heat exchange model in physical space. When diagnosing the CO2 forcing and response parameters, the top layer temperature was treated as global mean surface air temperature 300 (GSAT). As in HadCRUT GMST was defined as a surface air ocean blended temperature change; here, a factor of 1.04 was used to convert observed GMST change into the MCE's GSAT change. Likewise, as the MCE's three layers cannot be https://doi.org/10.5194/gmd-2021-79 Preprint. Discussion started: 12 May 2021 c Author(s) 2021. CC BY 4.0 License. allocated to specific climate system components, a factor of 1.08 was used to convert observed OHC change into the MCE's total heat content change.
Besides CO2 forcing, the RCMIP constraints include the ranges of non-CO2 forcing over a historical period for CH4, N2O, 305 halocarbons (aggregated into "Montreal gases" (CFCs/HCFCs/halons) and other "F-gases" (HFCs/PFCs/SF6)), aerosols (aggregated), tropospheric ozone, stratospheric ozone, stratospheric water vapor from CH4, and albedo change due to land use and black carbon aerosols on snow and ice. Ranges are based on AR5 (Myhre et al., 2013), except for those for CH4 and aerosols, which consider recent updates (Etminan et al., 2016;Smith et al., 2020). To incorporate these uncertainty ranges in historical-future scenarios, the scaling factors to adjust individual non-CO2 ERF time series were perturbed using normal or 310 skewed normal distributions fitted to the prescribed ranges.
The RCMIP constraints are provided as likely ranges and optionally very likely ranges, corresponding to 17-83% and 5-95% according to the IPCC's likelihood terms in italics. These ranges were applied to generate uncertain parameter proposals and to build the MH sampler requiring probability densities for a target distribution.
Other details of experimental specifications are provided in the MCE's code repository. 315 Figure 8 illustrates relationships between key indicators associated with the carbon budget and climate sensitivity of the two ensembles in comparison with the CMIP models. The carbon budget is measured by the amount of accumulated carbon and its allocation to ocean and land reservoirs. Here, total accumulation and the ocean allocation ratio at doubling and quadrupling CO2 levels are used as key indicators. The CMIP ESMs indicate a clear negative correlation between the two 320 quantities ( Fig. 8 (a) and (b)), reflecting much greater uncertainties relating to land carbon. This feature is well represented by the MCE parameter ensembles. Although there are some model differences between CMIP5 and CMIP6 eras, such as a reduced model spread in the latter associated with nitrogen cycle implementation (Arora et al., 2020), the MCE ensembles currently do not distinguish between the two. The carbon indicators of the 'Constrained' ensemble do not differ significantly from those of 'Prior' but are distributed toward higher total accumulations, attributed to warming differences that affect 325 carbon cycle-climate feedbacks.

b) Same as panel (a), but in the 140 th year. (c) Effective radiative forcing (ERF) of CO2 doubling and climate feedback parameter. (d) Transient climate response (TCR) and equilibrium climate sensitivity diagnosed from abrupt-4xCO2 (ECSG). The dashed line is located where the ratio of TCR to ECSG is 0.6 as a reference.
In contrast, climate sensitivity differences are most prominent and well characterized with key indicators' distributions on 335 two-dimensional domains: the ERF of CO2 doubling derived from C vs. the climate feedback parameter ( ) (Fig. 8 (c)), and TCR vs ECSG (Fig. 8 (d)). While the 'Prior' distributions cover the CMIP AOGCMs effectively, the 'Constrained' are confined to lower sensitivity values-greater and smaller TCR and ECSG, attributed to the observed GMST and OHC constraints. The 'Prior' distribution of the CO2 forcing agrees well with the CMIP distribution, which shows a weakly positive correlation with the climate feedback parameter. In contrast, the 'Constrained' forcing levels are confined to an 340 upper half of the CMIP AOGCMs, attributed to the historical CO2 forcing constraint, and the forcing-feedback correlation becomes weak. Transient sensitivity is not necessarily proportional to equilibrium sensitivity, and greater CMIP6 sensitivity However, considerable uncertainties remain with regard to longer trends and unforced climate variability. In an earlier period, observed GMST was rather close to 'Prior', and the 'Constrained' trend appears underestimated. The OHC trend cannot be validated owing to its limited observation period. Assessment of forced response in the historical period, which is currently not available, would allow more reliable parameter sampling. The greater warming in 'Prior' is not only due to its greater climate sensitivity, but also partly due to non-CO2 forcing 360 differences, as shown in Fig. 10. The scaling factors of the non-CO2 forcing agents are independently perturbed in the 'Prior' ensemble and probabilistically selected through the series of the MH sampler. Although the sampling process does not directly refer to forcing levels of non-CO2 agents, it can modify their distributions to be consistent with other constraints. This modification is found for non-CO2 GHGs and ozone time series (Fig. 10 (b)), and the most dominant contribution is of Montreal-gases (not shown). The ERF of Montreal-gases rapidly increases from the 1960s and levels off from the 1990s, and 365 the sampling results imply that this tendency is not consistent with the recent warming trend. Total ERF fluctuates with https://doi.org/10.5194/gmd-2021-79 Preprint. Discussion started: 12 May 2021 c Author(s) 2021. CC BY 4.0 License. changes in solar irradiance and volcanic eruptions, for which the RCMIP's prescribed forcing was used without their efficacy uncertainties.

370
The ensembles' medians are shown by lines, and the 5-95% range of the 'Constrained' ensemble is shown for the total by shading. Figure 11 displays the ranges of climate indicators from the two ensembles associated with carbon cycle, climate sensitivity, warming trends, and historical ERF changes, in comparison with their proxy assessment ranges. The consistency between modeled and proxy ranges can be most distinctively shown for warming trends by GMST and OHC changes (Fig. 11 (k) and (l)), with 'Prior' ranges substantially wider and higher than assessed ranges but comparable 'Constrained' ranges. The 375 consistency of sensitivity indicators, including TCRE, (Fig. 11 (h)-(j)) is complex because the proxy assessment ranges (black error bars) themselves are not necessarily consistent with each other, as discussed in the next section, and narrowed from the AR5-assessed ranges (grey error bars). Overall, consistency is better for 'Prior' ranges, although 'Constrained' ranges, considerably narrowed and lowered, are still within the AR5-assessed ranges. The ranges of the carbon cycle indicators (Fig. 11 (a)-(g)), including accumulated carbon and implied cumulative emissions in the historical period 1750-380 2011, are not significantly different between the two ensembles and broadly agree with assessed ranges. Ensemble runs for the extended historical period starting from 1750 were conducted for calibration. The ranges of the ERF indicators (Fig. 11 (m)-(t)) are consistent with assessed ranges, except for 'Prior' CO2 and 'Constrained' Montreal gases, as mentioned above.
Other minor changes from 'Prior' to 'Constrained include a reduced range for aerosols and lowered ranges for stratospheric and tropospheric ozone.

395
respectively. The black and grey error bars indicate proxy assessment ranges and AR5-assessed ranges, respectively. The proxy ranges are based on 5-95% ranges of the CMIP Earth system models in (a)-(d), but otherwise taken from the RCMIP Phase 2 protocol that partly includes the AR5-assessed ranges. Figure 12 illustrates temperature response in two SSP scenarios, SSP1-2.6 and SSP2-4.5, where warming is measured by an 400 increase in global mean surface air temperature (GSAT) relative to 1850-1900, and the period up to 2100 is presented. In the scenario labeled 'SSPn-x.x', 'n' (1-5 numbers) identifies different socioeconomic development pathways, and 'x.x' expresses a nominal forcing level in Wm −2 at the end of the 21 st century or later. The shaded areas indicate 33-66% ranges.

Results: projected warming
The upper bound corresponds to the level to which warming is likely (66-100%) to be limited at the time, while the lower bound corresponds to the level which warming is likely to exceed. These thresholds are shown in Table 1 for peak and end-405 of-century (end-21C) warming for eight SSP scenarios, where the end-21C period is set to 2081-2100, in accordance with the AR5. SSP1-1.9 SSP1-2.6 SSP4-3.4 SSP5-3.4* SSP2-4.5 SSP4-6.0 SSP3-7.0 SSP5-8.5 Likely Regarding consistency with target warming levels, such as two degrees above preindustrial levels, the 'Constrained' ensemble agrees relatively well with the AR5 assessment (Collins et al., 2013) for each of the comparable Representative 420 Concentration Pathways (RCPs, van Vuuren et al., 2011) such as RCP2.6 with SSP1-2.6. For example, AR5 states that end-21C temperature change above 2 °C is unlikely (0-33%) under RCP2.6, which implies that temperature is likely limited to 2 °C. This assessment is consistent with the SSP1-2.6 result from 'Constrained' (likely limited to 1.51 °C) but not from 'Prior' (likely limited to 2.27 °C). Some threshold temperatures in 'Constrained' are not consistent with AR5, such that the temperature in SSP2-4.5 likely exceeds 2.06 °C, while in AR5 it is more likely than not (> 50-100%) to exceed 2 °C in 425 RCP4.5. There is a similar difference in the possibility of limiting to 4 °C in SSP5-8.5 and RCP8.5. AR5 assessed these cases with medium confidence rather than high confidence, implying that the reduced likely ranges (as in 'Constrained') can update the AR5 assessment more authentically. However, at present, the 'Constrained' ensemble does not incorporate possible uncertainties, as discussed in the next section. It should also be noted that SSP forcing is not exactly the same as corresponding RCP forcing, leading to noticeable temperature differences between the comparable scenarios (Nicholls et al., 430 2020).
There are also some issues with handling of historical warming. The AR5 refers to a specific level of 0.61 °C from HadCRUT data for the period 1986-2005, which is added to the CMIP5 projected warming. However, HadCRUT warming is defined as an air-ocean blended temperature and is thereby somewhat underestimated for the GSAT definition (Schurer et al., 2018) with which modeled future warming is evaluated. In any case, the AR5 assessment is effectively constrained by 435 observed warming, which may be responsible for its better agreement with the 'Constrained' ensemble.

Performance as an emulator
It has already been confirmed that the MCE reproduces many different CMIP models effectively in terms of thermal response to idealized CO2 forcing changes, as demonstrated in Nicholls et al. (2020). The forcing and response parameters are adjusted to emulate two of the CMIP's basic experiments for step-shaped (abrupt-4xCO2) and ramp-shaped (1pctCO2) 445 forcing increases. The forcing scheme uses different functions depending on concentration levels: a quadratic expression in terms of logarithmic concentrations in the range from two to four times the base level, smoothly connecting to linear expressions outside this range. This flexibility suits the CMIP models' tendency to deviate from logarithmic concentration proportions at higher concentrations, leading to better emulation not only for responses to quadrupling increases commonly used in basic experiments, but also for responses to considerably lower increases in many mitigation scenarios. 450 However, the scheme assumes constancy of the climate feedback parameter; emulation accuracy will therefore be decreased in scenarios where state dependency of feedbacks emerges. A typical example appears in a cooling scenario. The RCMIP Phase 1 results include a case in which the MCE fails to emulate a halving CO2 experiment, while successfully emulating both doubling and quadrupling (See Figure 2 of Nicholls et al., 2020). It is also known that state dependency becomes significant when the time integration of the step response continues over multi-centennial to millennial timescales (Knutti 455 and Rugenstein, 2015;Rohrschneider et al., 2019). As CMIP models tend to deviate from linearity between the TOA energy imbalance and the surface temperature anomaly so that additional warming occurs with time, the MCE would result in underestimated warming in such a case. In practice, this issue is not significant up to the time horizon of 2100, commonly used in mitigation scenarios, in particular for lower than doubling CO2 levels.
For non-CO2 forcing, additivity is assumed across different agents, except for overwrapping effects for CH4 and N2O, as 460 parameterized in Etminan et al. (2016). The forcing amplification for CO2 is not extended to total forcing. These are reasonable assumptions for most mitigation scenarios where non-CO2 components are presumably not extreme.
The carbon cycle module has a mixture of fixed and adjustable parameters, including those for several feedback mechanisms from temperature changes. The current configuration successfully works to represent the CMIP ESMs' ranges in terms of carbon budget in the idealized 1 %-per-year CO2 increase experiment. However, it has not yet been verified that each of the 465 ESMs can be accurately emulated.
Diagnosing the carbon cycle parameters to individual ESMs is a main issue to be addressed in future. Accumulated carbon in response to atmospheric CO2 input has a trade-off relationship between ocean and land, and both components have their own mechanisms of climate-carbon cycle feedbacks, which are also subject to the magnitude of temperature response. This implies that calibrating the MCE parameters for each ESM requires a series of pulse-response experiments designed to allow 470 each of the ocean and land contributions to be isolated, with and without temperature feedback. Besides the standard 1%-peryear increase experiment, the CMIP6 provides idealized ESM experiments, including 1%-per-year increase variants with different configurations and a variety of pathways to zero emissions (Jones et al., 2019;Keller et al., 2018). The extent to https://doi.org/10.5194/gmd-2021-79 Preprint. Discussion started: 12 May 2021 c Author(s) 2021. CC BY 4.0 License. which different ESMs are emulated for these scenarios needs to be verified with calibrated parameters, leading to further insights into carbon-cycle behavior in terms of amount of emissions, hysteresis effects after attaining zero emissions, and 475 state dependency.
While the covariance of MCE parameters is incorporated for the CMIP models' variability of CO2 induced warming, the carbon cycle parameters and the non-CO2 scaling factors are independently sampled. There may exist other covariance between key indicators, such as a correlation between CO2-induced warming and aerosol cooling implied from CMIP6 historical experiments (Meehl et al., 2020). As different types of aerosol schemes constitute a major source of model 480 variations, incorporating covariance associated with aerosol forcing would improve parameter sampling, leading to more appropriate indicator ranges.
The results shown in the previous section are outputs from concentration-driven experiments, where implied emissions are available for CO2 only. Likewise, the emission-driven option is currently limited to CO2. The two types of experiments are equivalent within numerical errors associated with a time integration scheme, for which Runge-Kutta 4 th order is used. 485 However, implied emissions tend to be noisy when pulse-like non-CO2 forcing is given, owing to the temperature dependency implemented in carbon cycle modules. This is the case in historical experiments including volcanic forcing.

Further improvement on constraints
The 'Constrained' ensemble was applied to that compared in the RCMIP Phase 2 exercise, where the MCE is recognized as one of two models that match relatively well the target constraints, among nine participant models with different degrees of 490 complexity (Nicholls et al., personal communication). The MCE is a relatively simple emulator, conceivably cited with a simple thermal response, an intermediate-complexity carbon cycle, simply parameterized non-CO2 GHG forcing, and no other Earth system components. This simplicity and the successful results obtained imply that a method with less complicated structures and fewer control parameters offers advantages when building reasonable parameter ensembles, despite less capacity to emulate detailed Earth system components. 495 Several issues require clarification with regard to the differences between 'Prior' and 'Constrained' ensembles.
It should be emphasized that the constraints used in the RCMIP are preliminary, wherein the formal IPCC Sixth assessment is not yet available (Nicholls et al., personal communication). The current proxy constraints such as the ones for the three climate sensitivity ranges of ECSG, TCR, and TCRE adopted from individual studies are not necessarily consistent with each other. The range of ECSG is based on multiple lines of evidence, including feedback process understanding, historical 500 records, and paleoclimate records (Sherwood et al., 2020). Here, ECSG, rather than ECS, is referred to, assuming that process understanding is largely based on the CMIP's quadrupling CO2 experiments. The range of TCR is based on 30 and 22 AOGCMs from the CMIP5 and CMIP6, both constrained by warming trends during recent decades (Tokarska et al., 2020).
In contrast to these observations and modeling studies, the range of TCRE is based on 11 CMIP6 ESMs (Arora et al., 2020).
Improved ensembles based on comprehensively assessed constraints would increase reliability of probabilistic projections, 505 leading to better insights into future warming. In comparison with the AR5 assessment, the 'Constrained' ensemble has considerably low-biased climate sensitivity, but nevertheless indicates comparable future warming across different scenarios. As stated above, this inconsistency can be partly explained on the basis of the AR5 method for warming levels that adds up observed historical warming to CMIP5modeled future projections. With regard to consistency throughout the whole period, the emulator approach would be more 510 desirable. In any case, it is necessary to impose appropriate weighting on CMIP models to be emulated, in particular, when the model ensemble has a wide spread and some outliers in terms of reproducibility of past and current climates (Cox et al., 2018;Tokarska et al., 2020). The MH sampler with observed warming constraints corresponds to an indirect method for such weighting. As the present results decisively depend on surface temperature and OHC observations during recent decades, their validity as a constraint needs to be discussed from a broad perspective across forcing, response, and sensitivity. 515 The current constraining assumes observed warming as entirely forced response. Recent findings from warming attribution studies may support this, suggesting that human-induced warming is similar to observed warming (Allen et al., 2018).
However, the attribution depends on temporal and spatial patterns of forced response in multiple AOGCMs as well as their diagnosed forcing, leading to a complicated situation in which constraining data and AOGCMs to be constrained are mutually dependent. Besides, there remain substantial uncertainties of response patterns to changes in individual forcing 520 factors owing to the diversity of AOGCMs (Jones et al., 2016). Moreover, the new CMIP6 models appear to have marked differences in the magnitude of internal variability underlying attribution studies (Parsons et al., 2020). The GMST constraint does not consider such uncertainties and may be replaced with broader ranges from new insights into forced response.
Furthermore, it is also necessary to constrain forced response on a centennial timescale. Besides the HadCRUT data 525 available since 1850, the OHC data, limited to the late 20 th century, may include delayed response components on a much longer timescale. In fact, major volcanic eruptions occurred frequently in the 18 th and 19 th centuries. The initial year of 1850 is commonly used as a proxy preindustrial time point to avoid difficulties that arise from limited observations and major eruptions (Allen et al., 2018). However, the pre-1850 volcanic impact on the deep ocean should be examined carefully. In fact, it has been recognized in historical runs with the MCE that the OHC increase during the late 20 th century significantly 530 depends on the initial year, while the surface warming does not. A series of volcanic eruptions in the period 1750-1850 appears to contribute to heating after the period and amplifies heat content increase since 1850. The results suggest that human-induced OHC increase should be distinguished from observed total increase for better constraints.
Aerosol cooling is one of the key factors influencing temperature changes in the latter half of the 20 th century, resulting in different ranges of other climate indicators. The present method relies on the prescribed ERF time series prepared for the 535 CMIP6, which is scaled to the proxy assessment range in 2014 (Fig. 11 (r)). Although this procedure assures that cooling magnitude is constrained to the given range at the specific time point, the base time series is fixed. The constraint, adopted from Smith et al. (2020), is the outcome of the Radiative Forcing Model Intercomparison Project (RFMIP, Pincus et al., 2016), one of the CMIP6-endorsed model intercomparisons. A better insight into aerosol forcing may update its historical time series, thereby leading to improvement of forced response and other climate indicators, including climate sensitivity.
Technical issues exist when sampling from the 'Prior' ensemble with observed warming constraints, associated with their distinct differences. The proxy ranges of the constraints are much more biased and localized compared to the 'Prior' distributions, leading to inefficient sampling. The acceptance rate in the present case reached only about 0.6 %, requiring over 100 k-member calibration runs to obtain a 600-member 'Constrained' ensemble. Besides the efficiency issue, the sampling process should be visually monitored to verify whether the acceptance/rejection is reasonable. As the MH 545 independence sampler compares a probability density ratio of the next state to the current between candidate and target densities, care is to be exercised at the distributions' tail regions where relatively large approximation errors may exist. The present method introduced ad hoc criteria to avoid acceptance with an unexpectedly large density ratio.

Conclusions
A new climate model emulator, MCE, was developed, and its probabilistic climate projections for representative scenarios 550 were demonstrated and thoroughly discussed. The MCE is based on impulse response functions and several parameterized physics including key climate-carbon cycle feedbacks and may be categorized as a relatively low complexity model among recent model intercomparison participants. It has an advantage when building reasonable perturbed ensembles transparently, despite its lower capacity to emulate detailed Earth system components. Perturbed ensembles can cover complex climate models' diversity effectively, reflecting their covariance structure of diagnosed forcing-response parameters associated with 555 CO2-induced warming. Probabilistic projections constrained with several ranges of climate indicators, including CO2 and other forcing factors and observed warming trends over recent decades, suggest that complex climate models generally overestimate climate sensitivity. The sampling procedure implemented for parameter constraining, based on a Metropolis-Hastings algorithm, effectively works as weighting given to complex models.
Results from climate assessments for future scenarios in terms of their compatibility with climate mitigation goals are 560 preliminary, and experiments should be conducted with newly assessed constraining data. There are considerable uncertainties about forced components of historical warming as well as different forcing factors, and consistency of assessed ranges among different climate sensitivity metrics. These are main issues to be clarified in the forthcoming new assessment.
There is some room for improvement in emulator functionality. The carbon cycle module has not been configured to individual complex models, full emissions-driven experiments have not been supported, and perturbed parameter ensembles 565 have not reflected full covariance structures of complex models. These issues are to be addressed in future work.