Overview of the Norwegian Earth System Model (NorESM2) and key climate response of CMIP6 DECK, historical, and scenario simulations

The second version of the coupled Norwegian Earth System Model (NorESM2) is presented and evaluated. NorESM2 is based on the second version of the Community Earth System Model (CESM2) and shares with CESM2 the computer code infrastructure and many Earth system model components. However, NorESM2 employs entirely different ocean and ocean biogeochemistry models. The atmosphere component of NorESM2 (CAM-Nor) includes a different module for aerosol physics and chemistry, including interactions with cloud and radiation; additionally, CAMNor includes improvements in the formulation of local dry and moist energy conservation, in local and global angular momentum conservation, and in the computations for deep convection and air–sea fluxes. The surface components of NorESM2 have minor changes in the albedo calculations and to land and sea-ice models. We present results from simulations with NorESM2 that were carried out for the sixth phase of the Coupled Model Intercomparison Project (CMIP6). Two versions of the model are used: one with lower (∼ 2) atmosphere–land resolution and one with medium (∼ 1) atmosphere–land resolution. The stability of the pre-industrial climate and the sensitivity of the model to abrupt and gradual quadrupling of CO2 are assessed, along with the ability of the model to simulate the historical climate under the CMIP6 forcings. Compared to observations and reanalyses, NorESM2 represents an improvement over previous versions of NorESM in most aspects. NorESM2 appears less sensitive to greenhouse gas forcing than its predecessors, with an estimated equilibrium climate sensitivity of 2.5 K in both resolutions on a 150-year time frame; however, this estimate increases with the time window and the climate sensitivity at equilibration is much higher. We also consider the model response to future scenarios as defined by selected Shared Socioeconomic Pathways (SSPs) from the Scenario Model Intercomparison Project defined under CMIP6. Under the four scenarios (SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5), the warming in the period 2090–2099 compared to 1850–1879 reaches 1.3, 2.2, 3.0, and 3.9 K in NorESM2-LM, and 1.3, 2.1, 3.1, and 3.9 K in NorESM-MM, robustly similar in both resolutions. NorESM2-LM shows a rather satisfactory evolution of recent sea-ice area. In NorESM2-LM, an ice-free Arctic Ocean is only avoided in the SSP1-2.6 scenario. Published by Copernicus Publications on behalf of the European Geosciences Union. 6166 Ø. Seland et al.: NorESM2 – model description

minor changes which are summarised in Sect. 2.5 and 2.6. The river model is the Model for Scale Adaptive River Transport (MOSART; Li et al., 2013) and is identical to the version found in CESM2.1 and hence is not described here. The coupler 85 structure is retained as in CESM2.1 but with changes in flux and albedo calculations summarized below.
The interactive land-ice (the Community Ice Sheet Model; CISM; Lipscomb et al., 2019) and ocean surface wave components included in CESM2 were not activated in NorESM2 for the CMIP6 model integrations. Our tests with an interactive ice-sheet model over Greenland show that the model does not maintain a realistic mass balance, indicating that further development is needed. For CESM, specific tuning was carried out in order to achieve a better Greenland ice-sheet mass balance. 90 Although NorESM2 inherited such tunings, its warmer regional climate would have required additional, dedicated effort. Due to resource limitations we have postponed this until after CMIP6.

Model versions and the coupled model system
In view of the comparatively high computational cost of the model, two different versions of NorESM2 with different computational cost are presented. The two versions differ by the horizontal resolution of the atmosphere and land components. 95 The "medium-resolution" (M) version has a grid spacing of 1.25 • ×0.9375 • in these components, like CESM2 (Gettelman et al., 2019a). The "low-resolution" (L) version uses half that resolution in the atmosphere and land components. The ocean and sea ice components are run with "medium" (M) (1 • ) resolution in both versions. To facilitate distinguishing between the different resolutions when discussing set-up and results, a two letter suffix is added to NorESM2, "LM" for low-resolution atmosphere/land and medium resolution ocean/sea ice and "MM" for medium resolution of both atmosphere/land and ocean/sea 100 ice. NorESM2-LM is used for most of the CMIP6 simulations, while NorESM2-MM is only used for a limited number of experiments.

Atmosphere model, CAM6-Nor
The atmosphere model component of NorESM2 is built on the CAM6 version from CESM2.1, using the hydrostatic finitevolume dynamical core on a regular latitude-longitude grid at the two horizontal resolutions mentioned above. In the vertical, calculated online by the model have been updated. Anthropogenic emissions of black carbon (BC), organic matter (OM), and sulfur dioxide (SO 2 ) are prescribed according to Hoesly et al. (2018), and biomass burning emission strengths follow van Marle et al. (2017) applying a vertical distribution according to Dentener et al. (2006). As in NorESM1, continuous tropospheric outgassing of SO 2 by volcanoes is taken into account, but we have also added the tropospheric contribution of 120 explosive volcanoes (Dentener et al., 2006). As in NorESM1, an OM/OC ratio of 1.4 is taken for fossil fuel emissions and 2.6 for biomass burning emissions, and sulphur emissions are assumed to be 97.5 % SO 2 and 2.5 % SO 4 . Nitrate aerosol is not included. (iv) The impact of stratospheric aerosol in NorESM1 was taken into account by prescribing volcanic aerosol mass concentrations. In NorESM2, prescribed optical properties from CMIP6 are instead used, and are integrated in the calculation of total optical parameters for use in the radiation module together with other aerosols. The monthly distributions of stratospheric 125 sulfate aerosols follow now the CMIP6 recommendations (Thomason et al., 2018). (v) For oxidant concentrations (hydroxyl radical (OH), nitrate radical (NO 3 ), hydroperoxy radical (HO 2 ) and ozone (O 3 )) needed for the description of secondary aerosol formation, we use the same fields as used in CESM2(CAM6) (Danabasoglu et al., 2020), which originate from preindustrial control, historical, and scenario simulations of CESM2(WACCM6) (Gettelman et al., 2019b). The oxidant fields are 3-dimensional monthly varying fields, and are provided at a decadal frequency for the historical and scenario simulations 130 (Danabasoglu et al., 2020). (vi) For ozone concentrations used in the radiative transfer calculations we also use fields from CESM2(WACCM6). They are zonally averaged 5-daily varying fields. (vii) Production rates of H 2 O from CH 4 oxidation (mainly playing a role in the stratosphere) are also prescribed monthly climatologies based on CESM2(WACCM6) simulations, again with a decadal frequency.
In NorESM2, oceanic dimethyl sulfide (DMS) emission is prognostically simulated by the ocean biogeochemistry compo-135 nent (Sect. 2.4), hence allowing for a direct biogeochemical climate feedback in coupled simulations. The DMS air-sea flux is simulated as a function of upper-ocean biological production following the formulation of Six and Maier-Reimer (1996) and was first tested in the NorESM model framework by Schwinger et al. (2017). Currently the atmospheric deposition into the ocean is decoupled. The ocean biogeochemistry uses the monthly climatological aerial dust (iron) deposition of Mahowald et al. (2005). The dust parametrisation has undergone two important changes with respect to NorESM1. First, dust emissions survival probabilities from size distribution measurements and found that at least 80% of the nucleated particles measured at Atlanta, GA and Boulder, CO were lost by coagulation before the nucleation mode reached CCN sizes, even during days with high growth rates.
The equation for sea-salt emissions has been modified by changing their dependence on 10-meter wind speed. NorESM2 155 adopts the value recommended by Salter et al. (2015), 3.74, for the exponential factor, instead of 3.41 in NorESM1. This change was partly justified as an early tuning prior to the start of the spin-up simulations, in order to reduce the large positive top of the model radiative imbalance of the model before temperature equilibration. Even with the lower exponential factor however the model already produced excessive sea-salt aerosol optical depth (Gliß et al., 2020) and surface mass concentrations (Olivié et al., in prep.) compared to in-situ observations. Thus the change results in an even larger overestimate. "The sea-salt emission 160 changes were tested in a predecessor model version, NorESM1.2 (Kirkevåg et al., 2018). Annual and globally averaged, this lead to increases from 99.5 to 228.3 ng −2 s −1 (129%) in sea-salt emissions, from 7.8 to 17.2 mg m −2 (121%) in sea-salt column burdens, with corresponding changes in total clear-sky AOD from 0.086 to 0.119 (38%), and cloud droplet numbers at top of the cloud (using the method of Kirkevåg et al., 2018) changed from 31.3 to 32.7 cm −3 (4.5%). Since the emission flux of oceanic primary organic aerosols is proportional to that of fine sea-salt aerosols (Kirkevåg et al., 2018), this specific change 165 also has an impact on the natural oceanic organic matter emissions.
CAM-Nor computes the effects of hygroscopic growth of aerosols on water uptake and optical properties by means of lookup tables that take relative humidity as an input. In NorESM1, the grid-point average relative humidity was used. In CAM6-Nor we instead use the mean cloud-free relative humidity, in line with CAM6 and a number of other atmospheric models (Textor et al., 2006;Kirkevåg et al., 2018;Gliß et al., 2020). The cloud-free relative humidity (RH) is calculated assuming 100 percent 170 RH in the cloudy volume.
The other differences of CAM6-Nor relative to CAM6 are summarised as follows. A correction to the zonal wind increments due to the Lin and Rood (1997) dynamical core is introduced in order to achieve global conservation of atmospheric angular momentum along the Earth's axis of rotation, as described and discussed in Toniazzo et al. (2020). The local energy update of the model is also modified by including a missing term (the hydrostatic pressure work) related with changes in atmospheric 175 water vapour and thus achieves better local energy conservation. Finally, a set of modifications to the deep convection scheme is introduced which eliminate most of the resolution dependence of the scheme, and mitigate the cold tropospheric bias of CAM6. The energy and convection changes (which are not available in the CAM6 code repository) are described in Toniazzo et al. (in prep.).

Ocean model 180
The ocean component BLOM is based on the version of MICOM used in NorESM1 and shares the use of near-isopycnic interior layers and variable density layers in the surface well-mixed boundary layer. The dynamical core is also very similar but with notable differences in physical parameterisations and coupling. For vertical shear-induced mixing a second-order turbulence closure (Umlauf and Burchard, 2005;Ilicak et al., 2008) using a one equation closure within the family of k − ε models has replaced a parameterisation using the local gradient Richardson number according to Large et al. (1994). Parameterised eddy-185 induced transport is modified to more closely follow the Gent and McWilliams (1990) parameterisation with the main impact of increased upper ocean stratification and reduced mixed layer depths. As for NorESM1-MICOM, the estimation of diffusivity for eddy-induced transport and isopycnic eddy diffusion of tracers is based on the Eden et al. (2009) implementation of Eden and Greatbatch (2008) with their diagnostic equation for the eddy length scale, but modified to give a spatially smoother and generally reduced diffusivity. Hourly exchange of state and flux variables with other components is now used compared to 190 daily ocean coupling in NorESM1. The sub-diurnal coupling allows for the parameterisation of additional upper ocean mixing processes. Representation of mixed layer processes is modified to work well with the higher frequency coupling and in general to mitigate a deep mixed layer bias found in NorESM1 simulations. The penetration profile of shortwave radiation is modified, leading to a shallower absorption in NorESM2 compared to NorESM1. With respect to coupling to the sea ice model, BLOM and CICE now use a consistent salinity dependent seawater freezing temperature (Assur, 1958). Selective damping of external 195 inertia-gravity waves in shallow regions is enabled to mitigate an issue with unphysical oceanic variability in high latitude shelf regions, causing excessive sea ice formation due to breakup and ridging in CMIP5 versions of NorESM1.
For the CMIP6 contribution, BLOM uses identical parameters and configuration in coupled ocean-sea ice OMIP (Ocean Model Intercomparison Project; Griffies et al., 2016) experiments and fully coupled NorESM2-LM and NorESM2-MM experiments, except for sea surface salinity restoring in OMIP experiments. As for NorESM1, 53 model layers are used with 200 two non-isopycnic surface layers and the same layer reference potential densities for the layers below. A tripolar grid is used instead of the bipolar grid in CMIP5 versions of NorESM1, allowing for approximately a doubling of the model time step. At the equator the grid resolution is 1 • zonally and 1/4 • meridionally, gradually approaching more isotropic grid cells at higher latitudes. The model bathymetry is found by averaging the S2004 (Marks and Smith, 2006) data points contained in each model grid cell with additional editing of sills and passages to their actual depths. The metric scale factors are edited to the realistic 205 width of the Strait of Gibraltar so that strong velocity shears can be formed, enabling realistic mixing of Mediterranean water entering the Atlantic Ocean. OMIP provides protocols for two different forcing datasets, OMIP1 (Large and Yeager, 2009) and OMIP2 (Tsujino et al., 2018). Tsujino et al. (2020) is a model intercomparison evaluating OMIP1 and OMIP2 experiments, including BLOM/CICE of NorESM2. Further details on the BLOM model and its performance in OMIP coupled ocean-sea ice simulations can be found 210 in Bentsen et al. (in prep.).

Ocean biogeochemistry
The ocean biogeochemistry component iHAMOCC (isopycnic coordinate HAMburg Ocean Carbon Cycle model) is an updated version of the ocean biogeochemistry module used in NorESM1. The model includes prognostic inorganic carbon chemistry following Dickson et al. (2007). A Nutrient Phytoplankton Zooplankton Detritus (NPZD) type ecosystem model (Six and natural inorganic carbon tracers, which can be used to facilitate a more detailed diagnostic of interior ocean biogeochemical dynamics. Due to the identical ocean component between NorESM2-LM and NorESM2-MM, the performance in ocean biogeochemistry is very similar in both model versions. Compared to NorESM1, the climatological interior concentrations of oxygen, nutrients, and dissolved inorganic carbon have improved considerably in NorESM2. This is mainly due to the improvement in the particulate organic carbon sinking scheme, allowing more efficient transport and remineralization of organic materials in the deep ocean. The seasonal cycle of air-sea gas exchange and biological production at extratropical regions was 225 improved through tuning of the ecosystem parameterizations. The simulated long-term mean of sea-air CO 2 fluxes under the pre-industrial condition in NorESM2-LM is -0.126 ± 0.067 Pg C yr −1 . Under the transient historical simulation, the ocean carbon sink increases to 1.80 and 2.04 Pg C yr −1 in the 1980s and 1990s, which is well within the present day estimates. Details on the updates and improvements of the ocean biogeochemical component of NorESM2 are provided in Tjiputra et al. (2020).

Sea ice
The sea ice model component is based upon version 5.1.2 of the CICE sea ice model of Hunke et al. (2015). A NorESM2specific feature however is to include the effect of wind drift of snow into ocean following Lecomte et al. (2013), as described in Bentsen et al. (in prep).
The CICE model uses a prognostic ice thickness distribution (ITD) with five thickness categories. The standard CICE 235 elastic-viscous-plastic (EVP) rheology is used for ice dynamics . The model uses mushy-layer thermodynamics with prognostic sea ice salinity from Turner and Hunke (2015). Radiation is calculated using the Delta-Eddington scheme of Briegleb and Light (2007), with melt ponds modeled on level, undeformed ice, as in Hunke et al. (2013).
CICE is discretised on the same horizontal grid as the ocean model (Sect. 2.3), and is configured with 8 layers of ice and 3 layers of snow.

Land
The NorESM2 land model is CLM5 (Lawrence et al., 2019) with one minor modification described below. A general description of the model will therefore not be presented here. It should however be noted that CLM5 has a new treatment of nitrogen-carbon limitation, which is very important for the carbon cycle in NorESM2 and has increased the land carbon uptake substantially relative to NorESM1 (Arora et al., 2019). An overview of gross primary productivity (GPP) and soil and vege-245 tation carbon pools are provided in Table 3, showing a substantially better agreement with observations for both resolutions of NorESM2 than NorESM1. There is consistency between observations and model simulations at different resolutions for GPP and vegetation carbon, whereas both NorESM2 versions produce a negative bias in soil carbon. These results broadly agree with results from offline (land only) simulations with CLM described by Lawrence et al. (2019), who also describe the individual model updates from CLM4 (used in NorESM1) to CLM5.

250
In NorESM2, one specific modification was made to the surface water treatment in CLM. The surface water pool is a new feature replacing the wetland land unit in earlier versions of CLM (introduced in CLM4.5). This water pool does not have a frozen state, but is added to the snow-pack when frozen. To avoid water being looped between surface water and snow during alternating cold and warm periods, we remove infiltration excess water as runoff if the temperature of the surface water pool is below freezing. This was done to mitigate a positive snow bias and an artificial snow depth increase found in some Arctic 255 locations during melting conditions.

Coupler
The state and flux exchanges between model components and software infrastructure for configuring, building and execution of model experiments is handled by the CESM2 coupler Common Infrastructure for Modeling the Earth (CIME; Danabasoglu et al., 2020). The coupler computes the turbulent air-sea fluxes of heat and momentum and in NorESM2 this is implemented as 260 a version of the COARE-3 (Fairall et al., 2003) scheme, replacing the calculation based on Large and Yeager (2004) in CESM2.
State and flux exchanges via the coupler between atmosphere, land and sea ice components occur half-hourly, aligned with the atmosphere time step, while the ocean exchanges with the coupler every hour. CIME also provides common utility functions and among these are estimation of solar zenith angle. In NorESM2, this utility function is modified with associated changes in atmosphere, land and sea ice components, ensuring that all albedo calculations use zenith angle averaged over the components 265 time-step instead of instantaneous angles.

NorESM2 initialisation and tuning
Most of the general development of the model as described in Sect. 2 was tested in configurations with reduced number of interactive components. CAM6-Nor was tuned in Atmospheric Model Intercomparison Project (AMIP) configuration with mean climatological radiative forcings and boundary conditions (sea-surface temperatures -hereafter "SST"s -and sea-ice) 270 derived from observations over the period 1990-2010. Similarly, BLOM and iHAMOCC were primarily tuned with prescribed atmosphere and runoff forcing of the OMIP1 protocol. The scope of these separate experiments was to test improved representations of the physical processes in the simulations, with the twin aims of mitigating model systematic biases when compared to the observed climate, and to achieve a net radiative flux imbalance at the top of the model atmosphere (hereafter RESTOM; defined as positive inward, i.e. warming the climate) more in line with satellite-based estimates, given the observed SSTs.

275
The first coupled version of NorESM2 included all changes described in Sect. 2. This version was heavily tested in a preindustrial setting (as defined in Sect. 4).
This initial version of the coupled model was initialized using a hybrid of observational estimates and earlier model simulations. The ocean model was initialised with zero velocities and temperature and salinity fields from the Polar science center Hydrographic Climatology (PHC) 3.0 (updated from Steele et al., 2001). Following the OMIP protocol (Orr et al., 2017), the 280 nutrients (phosphate, nitrate, and silicate) and oxygen fields in NorESM2 were initialized with the gridded climatological fields of the World Ocean Atlas database (Garcia et al., 2014a, b). For dissolved inorganic carbon and total alkalinity, we used the pre-industrial and climatological values from the Global Ocean Data Analysis Project (GLODAPv2) database (Lauvset et al., 2016). Other biogeochemical tracers are initialized using values close to zero. CAM and CLM were initialized using the files included in the CESM2 release. Aerosols and aerosol precursors were initialised to near zero values. As there were no low-285 resolution pre-industrial initial files for the land model available this was replaced by an interpolation of the 1 • initial file from CESM2. At a later stage in the coupled spin-up, the land surface fields were re-initialised from a long (approximately 1400 years) stand-alone CLM spin-up simulation driven by repeating 50 years of coupling exchange fields obtained from the earlier coupled run.
Similar to CESM, NorESM2 adjusted towards its own coupled climatology with an initial phase of strong cooling in the high 290 latitudes of the northern hemisphere, after which an intensification of ocean heat advection stabilised the simulation. After that point, the climatology tended to settle towards a steady-state. During major tuning steps, the coupled model had to be restarted from the initial state several times. In order to save computer resources, minor tuning, especially towards reducing RESTOM, was performed on the best-candidate simulation after this initial, large adjustment. Alongside the final tuning, the CESM components were updated to the versions found in CESM2.1. The changes from CESM2.0 to CESM2.1 are mostly technical 295 but also include minor bug fixes and updated forcing fields (Danabasoglu et al., 2020). The update was done after an initial adjustment, but early in both spin-ups, approximately 1000 model years before the start of the control, at both resolutions. The impact on the global fields are quite small as can be seen in Fig. S1 and Fig. S2 in the supplement. In this second phase of coupled spin-up, it was found that the sensitivity of some aspects of the simulated coupled climatology to small changes in parameters or parametrisations could be different than that found in stand-alone simulations of the individual components with 300 prescribed boundary conditions. The coupled response could be both amplified or damped with respect to single-component simulations. As a result, some of the final parameter tuning of the model had to be performed in coupled mode. No tuning was performed during the pre-industrial control simulation as described in Sect. 4.1 The main goal of the coupled tuning process was to create an energy balanced pre-industrial control simulation with a reasonably stable, adjusted equilibrium state. The simulation can produce a steady climatology only if the time-average radiative 305 imbalance on the top of the model vanishes. In practice, a commonly used target is to bring RESTOM to within ±0.1 W m −2 while maintaining values of mean atmospheric and ocean temperatures close to observations. To achieve this, each change in the coupled model was tested in parallel in atmosphere-only (AMIP) and ocean-only (OMIP) mode. As ocean heat gain and tropospheric air temperature, humidity and cloudiness are strongly associated with the fluxes at the top of the atmosphere, improving the state in the coupled simulation, and reducing RESTOM and drift in AMIP and OMIP simulations, are closely 310 connected goals. On the other hand, fine tuning of the coupled state should not significantly degrade important climatological variables such as temperature, precipitation, clouds, or the main mode of coupled variability, i.e. the El-Niño Southern Oscillation (ENSO). Our parallel testing procedure ensured that the model simulation maintained a degree of consistency both with the present-day, observed climatology, and with a steady pre-industrial climate. Where available, notably in SST and sea-ice, observational estimates of the state of Earth's pre-industrial climate were also considered against the coupled integrations. Each 315 tuning step was performed in isolation, and an effort was made to ensure the greatest possible similarities in the two model configurations LM and MM. No tuning was performed that attempted to target other modes of variability beside ENSO, or a particular climate response to external forcings, e.g. from changes in greenhouse gas concentration, anthropogenic aerosol emissions, or volcanic or solar forcing. Similar to CESM2 (Danabasoglu et al., 2020), NorESM2 tended to develop excessive sea ice cover in the Labrador Sea 320 (LS) region, although the temporal development in NorESM2 differed from CESM2. For any tested combination of parameter choices, NorESM2 developed excessive LS sea ice cover starting around year 60 after model initialisation. This was however only a temporary model state and in all experiments the sea ice returned close to observed state in the LS region after additional 60-80 model years of simulation.
One of the most common methods to tune RESTOM is to change the amount and thickness of low clouds. The main param-325 eter used for tuning the low clouds in the Cloud Layers Unified By Binormals (CLUBB) scheme is the "gamma" parameter, which controls the skewness of the assumed Gaussian probability density function for subgrid vertical velocities. A low gamma implies weaker entrainment at the top of the clouds, in particular for marine stratocumulus. This increases the amount of low clouds and results in a higher short-wave cloud forcing.
Given the same gamma values, the RESTOM was higher in the low resolution version of the model. In addition the sensitivity 330 to the change of the gamma parameter was different in the two model resolutions, so a different choice of gamma was needed for the two resolutions. The final parameter values are well within the gamma range of 0.1-0.5 tested by Zhang et al. (2018), although smaller than the values used in CESM2 at the same resolution. A small gamma pushes up short-wave cloud radiative forcing (SWCF), which led to a high bias in SWCF in NorESM2-LM. This bias was somewhat off-set by regulating the parameter dcs (autoconversion size threshold for cloud ice to snow), with a small impact on the tropospheric temperature bias.

335
While the amount of change in SWCF could be estimated by running the atmosphere and land model in a stand-alone configuration, the change in RESTOM in coupled set-up was small compared to the change in cloud forcing. Further attempts at reducing positive RESTOM by tuning the boundary layer stability were neutralised by SST adjustment, while worsening the tropospheric cold bias. A more effective tuning of low cloud radiative effects was achieved by modifying air-sea fluxes of DMS. Compared to Schwinger et al. (2017), the parameter controlling DMS production by diatoms was doubled in NorESM2, 340 which allowed to maintain high DMS concentration at high latitudes during spring and summer seasons in both hemispheres, as in observations (Lana et al., 2011). This tuning compensates for the reduced primary production simulated in NorESM2 compared to that in NorESM1 .
RESTOM was decisively reduced, both in stand-alone (AMIP) and in coupled simulations (before SST adjustment), by increasing outgoing long-wave radiation. This was achieved in three ways. First, alterations were made to the Zhang and 345 McFarlane (1995) convection scheme, as described in Toniazzo et al. (in prep.), aimed at increasing mid-and high-altitude latent heating of the atmosphere for a given amount of precipitation. Second, positive cloud radiative forcing in the terrestrial radiation spectrum was reduced by intervening on the parameterisation of ice cloud fraction. Finally, higher sea-surface temperatures in coupled simulations were achieved by reducing the value of the parameter controlling background vertical mixing in the ocean back to that used in NorESM1. Initial optimisation in stand-alone configurations had led to increase the value of 350 this parameter by about 50%.
A remarkable sensitivity of the model climatology on the parametrisation of the ice cloud fraction was found. This purely empirical part of the cloud parametrisation of CESM2 is rather ad-hoc and poorly constrained by observations. Several namelistcontrolled options for ice-cloud fraction are provided in CESM. Initial tuning of the parameters of the CESM2 default option appeared promising, but coupled adjustment again tended to neutralise the effect on model radiative imbalance. In NorESM2-355 LM, an effective reduction in the high-and mid-level cloud cover could only be achieved by switching to a different parameterisation option, in which there is no direct functional dependence of ice cloud fraction on environmental relative humidity (this is option number 4 in CESM). By contrast, in NorESM2-MM the CESM default scheme (option number 5, with explicit RH dependence) could be modified by allowing a continuous narrowing of the range of cloud sensitivity to environmental RH. This modification thus constitutes a continuous switch between the two parameterisation options. A target for future development 360 might be to represent ice clouds in a way better rooted in physical processes.
We give a concise summary of the parameters that were used for tuning NorESM2, with their final value and a comparison with CESM2, in Table 2. We consider three sets of experiments that are important for documentation and application of CMIP6 models: the DECK experiments (Eyring et al., 2016), the CMIP6 Historical experiment (Eyring et al., 2016), and the Tier 1 experiments of the ScenarioMIP (O'Neill et al., 2016). A brief description of the set-up of these experiments is given in Sect. 4.1.
The analysis is divided into three parts. Section 4.2 focuses on the stability of the pre-industrial control simulation. In 380 Sect. 4.3, we consider the simulated climate sensitivity to abrupt and gradual quadrupling of CO 2 . A brief analysis of the warming, sea ice, AMOC, and the transport through the Drake Passage in the historical simulations and the scenarios is given in Sect. 4.4.

Experiment set-up
As described by Eyring et al. (2016), a set of common experiments known as DECK (Diagnostic, Evaluation, and Charac-385 terization of Klima) has been defined to better coordinate different model intercomparisons and provide continuity for model development and model progress studies. The DECK consists of the following four baseline experiments: (1) the Historical Atmospheric Model Intercomparison Project (AMIP) experiment; (2) the pre-industrial control (piControl) experiment defined by estimated forcings from 1850, started from initial conditions obtained from a spin-up with the same, constant forcings during which the coupled model climatology stabilises towards a stationary statistics; (3) an experiment otherwise identical to 390 piControl, except that the CO 2 concentrations are set to four times the piControl concentrations, from piControl initial conditions (abrupt-4xCO2); (4) an experiment otherwise identical to piControl, but where the CO 2 concentrations are gradually increased by 1% per year starting from piControl concentrations and initial conditions (1pctCO2). Both abrupt-4xCO2 and 1pctCO2 were started from year 1 of the control.
The DECK was produced with both versions of the model (NorESM2-LM and NorESM2-MM) and here we consider results 395 from the pre-industrial control and the abrupt-4xCO2 and 1pctCO2 (Sect. 4.2-4.3). As this paper focuses on the coupled aspect of NorESM2, the AMIP runs are not included here, but are described in Olivié et al. (in prep.) and Toniazzo et al. (in prep.).
Another qualifying experiment required for entry in CMIP6, and important for model evaluation with respect to observations, is the historical experiment. In this experiment, time-dependent forcings are specified to reflect observational estimates valid for the so-called historical period, viz. 1850-2014. Following CMIP6 guidelines, for this experiment we carried out a small 400 ensemble of integrations, consisting of 3 members. This helps isolate the forced signal from internal climate variability. The three model integrations of the ensemble differ only in their initial conditions, which were obtained from model states late in the spin-up at intervals of 30 model years apart. This is analogous to the historical ensemble of NorESM1 produced for CMIP5.
Beyond the DECK, one of the most important applications for ESMs is to provide estimates of future climate change. This is typically done using scenarios which specify future anthropogenic forcing of the climate that include changes in land-use (such includes solar forcing, prescribed oxidants used for describing secondary aerosol formation, greenhouse gas concentrations, stratospheric H 2 O production from CH 4 oxidation, ozone used in the radiative transfer calculations, and land-use. While the 415 experiments in this paper use prescribed greenhouse gas concentrations, NorESM2 can also be run with CO 2 emissions as described by Tjiputra et al. (2020).
NorESM2 lacks a physical representation of the stratosphere; instead, appropriate upper-boundary conditions need to be specified. Accordingly, stratospheric aerosols and emissions of aerosols and aerosol precursors were prescribed based on the data provided by the input4mips website: https://esgf-node.llnl.gov/projects/input4mips/. In addition, sulphur from tro-420 pospheric volcanoes was included similarly to Kirkevåg et al. (2018), see Sect 2.2.

Simulated control climatology and residual drift
After tuning and an initial spin-up, both NorESM2-LM and NorESM2-MM were integrated for 500 years with steady preindustrial forcings to produce the piControl experiments. Below, we present a basic analysis of the general state and drift of important parameters of the simulated climatology. 425 During the control integration the forcings as well as the parameter choices were kept constant. There should be no long-term drift in the model state variables or their partial tendencies (hence, a fortiori, in radiative fluxes). More precisely, any residual drift of the simulated control climatology should be negligibly small compared with the signal resulting from the response to changes in climate forcings as prescribed in the historical, enhanced-greenhouse gas, and scenario experiments. In practice, a reasonable target is to maintain the RESTOM of piControl within ±0.1 W m −2 in the time mean. Any small imbalance in  As can be seen in the figure the drift is generally small and comparable for the two model versions. The top-of-the-atmosphere radiative imbalance is -0.057 W m −2 for NorESM2-LM and -0.065 W m −2 for NorESM2-MM. The ocean volume temperature 435 change of 0.03 K over 500 years is much smaller than the rate of warming observed during the last 50 years. Similarly, there are positive trends in global mean ocean salinity of 2.6 × 10 −5 g kg −1 and 4.7 × 10 −5 g kg −1 over 500 years for NorESM2-LM and NorESM2-MM, respectively, that we consider small since for NorESM2-MM this is equivalent to an average surface freshwater loss of 2.9 × 10 −5 mm day −1 . The remaining trends are not significantly different from 0 at the 95% confidence level, as estimated from a t-test. We found however a slight decrease in DMS sea-to-air flux of 2% over the 500 year control 440 period, reflecting a residual drift in ocean biogeochemistry. AMOC variations are reasonably small and show no significant trend.

Equilibrium climate sensitivity and transient response
The two enhanced greenhouse gas experiments of the CMIP-DECK were started at the same initial conditions as piControl (and consequently assigned the same notional model year). They are referred to as abrupt-4xCO2 and 1pctCO2. 445 Figure 3 shows the time evolution of near-surface temperature for abrupt-4xCO2, 1pctCO2 and piControl for both model configurations. Three commonly used metrics for the response to CO 2 forcing, based on the evolution of the simulated globalmean temperature, are the Equilibrium Climate Sensitivity (ECS), the Transient Climate Response (TCR), and the Transient Climate Response to cumulative CO 2 Emissions (TCRE). Their values are given in Table 1 for the NorESM2 experiments, and compared to those for NorESM1. The ECS is defined as the change in global near-surface temperature when a new climate 450 equilibrium is obtained with an atmospheric CO 2 concentration that is doubled compared to the pre-industrial amount. In order to reach a new equilibrium, a model simulation of several thousand years is required (Boer and Yu, 2003). There are some examples in the literature of models for which this has been done e.g. Paynter et al. (2018) show results from simulations with GFDL-CM3 and GFDL-ESM2 run for more than 4000 years. Given certain assumptions, ECS may be estimated from method (Gregory et al., 2004). This estimate has become a standard in CMIP6. The figures reported in Table 1 are calculated using years 1-150 from the simulations shown in Fig. 3, and are divided by 2 to get the number for CO 2 doubling instead of quadrupling. The ECS is 2.54 K for NorESM2-LM, which is slightly lower than the equivalent value for NorESM1 of 2.8 K.
Both are significantly lower than the CMIP5 mean value of 3.2 K but well inside the bounds of the likely range of 1.5-4.5 K (Stocker et al., 2013). On the other hand, the ECS in NorESM2 is markedly smaller than the ECS found in CESM2 of 5.3 K 460 by Gettelman et al. (2019a), despite sharing many of the same component models. An extensive analysis of the low ECS value in NorESM2 is given in Gjermundsen et al. (2020). Note that the aerosol forcing is not very different between NorESM2 and CESM2 and can not explain the discrepancy in ECS values. Several sensitivity experiments have been conducted and are reported in Gjermundsen et al. (2020)  This suggests that the actual equilibrium temperature response to a large GHG forcing (the value one finds when the model is 470 run for many hundred years) in NorESM2 and CESM2 is not very different, but that the Gregory et al. (2004) method based on the first 150 years does not give a good estimate of ECS for models.
The transient climate sensivity (TCR) is defined as the global-mean surface temperature change at the time of CO 2 doubling, and accordingly it was calculated from the temperature difference between the 1pctCO2 experiment averaged over years 60-80 after initialisation and piControl. The TCR is 1.48 K and 1.33 K for NorESM2-LM and NorESM2-MM, respectively. As for 475 ECS, these values fall in the lower part of the distribution obtained from the CMIP5 ensemble (Forster et al., 2013), similar to those obtained for NorESM1. The TCR of both NorESM2-LM and NorESM2-MM are lower than the value of 2.0 K found for CESM2 (Danabasoglu et al., 2020). A recent observational estimate for the 90% likelihood range of TCR is 1.2-2.4 K (Schurer et al., 2018).
We also give an estimate of the transient climate response to cumulative carbon emissions (TCRE) calculated from TCR 480 and the corresponding diagnosed carbon emissions. Following Gillett et al. (2013), TCRE is defined as the ratio of TCR to accumulated CO 2 emissions in units of K EgC −1 . As CO 2 fluxes were not calculated in NorESM1-M and NorESM1-Happi, the NorESM1 values are obtained from the carbon cycle version of NorESM1 (NorESM1-ME; Tjiputra et al., 2013). TCRE is reduced from 1.93 K EgC −1 in NorESM1-ME to 1.36 K EgC −1 and 1.21 K EgC −1 in NorESM2-LM and MM, respectively. Since TCR is comparable, the main difference is due to changes in carbon uptake. NorESM1, with CLM4 as the

Climate evolution in historical and scenario experiments
In this section we provide a very brief analysis of the response of the model to historical forcings in the three historical members carried out in both NorESM2-LM and NorESM2-MM. We also consider the model response for the Tier 1 experiments 490 from ScenarioMIP (SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5). The focus here will be on the response in global-mean near-surface temperature, the Atlantic Meridional Overturning Circulation (AMOC), the volume transport through the Drake Passage, and on sea ice area.  Figure S3b shows the time evolution of ERF for the first ensemble member of NorESM2-LM. Given that the ERF is not an observable quantity, we have also included time series of aerosol optical depth 505 which can be related to measurements (Fig. S3a) along with a comparison of aerosol optical depth with observations (Fig. S4).
Detailed analysis of the aerosol properties is done in Olivié et al. (in prep.). Note also that our choice of the reference period for temperature anomaly computation (1850-1880) enhances the NorESM2 negative bias with respect to observations in the last half of the 20th century. NorESM1, under the scenarios RCP2.6, RCP4.5, and RCP8.5, the surface air temperature in the period 2071-2100 was 0.94, 1.65, and 3.07 K higher than in 1976-2005. For the same periods and looking at SSP1-2.6, SSP2-4.5, and SSP5-8.5, we find rather similar (but slightly stronger) warmings of 1.06, 1.81, and 3.22 K in NorESM2-LM, and 1.11, 1.83, and 3.26 K in NorESM2-MM.

520
The simulated Atlantic Meridional Overturning Circulation (AMOC) at 26.5 • N shows a multi-centennial variability that is 15% of the mean in the control simulation (Fig. 2). In the historical simulations the AMOC peaks for both MM and LM in the 1990's at around 24 Sv before starting a rapid decline at around year 2000 (Fig. 6). In both versions the AMOC reaches a quasi-equilibrium by the end of the century at around 15-10 Sv depending on the scenario. Since we only have a few ensemble members, it remains unclear how fast the AMOC declines in response to the greenhouse gas forcing and which part of e.g. the 525 initial decline is due to the multi-decadal variability. In any case, it is noteworthy that the initial AMOC decline begins already during the historical period in both versions, which is also consistent with the NorESM2 and multimodel mean response to the OMIP2 forcing (1958( , Tsujino et al., 2020. In addition to the AMOC, also the Antarctic Circumpolar Current (ACC) strength, as measured in the Drake Passage, shows multi-centennial variability that is about 3% of the mean (Fig. 7). Similar variability in the ACC has been linked to convection 530 within the Weddell and Ross seas in the CMIP5 ensemble (Behrens et al., 2016). Also in our simulations the Weddel Sea convection has similar long term variability as the ACC. Unlike the AMOC, there is no clear trend emerging from the scenario simulations, but rather the multidecadal variability continues throughout the 21st century. Again, a larger number of ensemble members could help identify the forced signal. Both models have a reasonable March sea ice area compared to observations. However, the negative trends in winter sea ice 540 area are small compared to observed trends.
During the scenario period both models show a strong reduction in summer sea ice area. The Arctic Ocean is often considered ice free when the total sea ice area drops below 1 million square km. This threshold is denoted by dotted gray lines in Fig. 8.
NorESM2-LM loses summer ice shortly after year 2050. This occurs first in the SSP5-8.5 scenario, but also the SSP2-4.5 ensemble shows values close to this threshold even before 2050. SSP3-7.0 scenarios become ice free at around 2070. Any 545 prediction of which year the Arctic Ocean first becomes ice free must therefore be considered rather uncertain due to forcing evolution uncertainty and internal variability. This is consistent with the overall assessment of sea ice evolution in CMIP6 assessed by the SIMIP Community (2020). In NorESM2-LM an ice free Arctic Ocean is only avoided in the SSP1-2.6 scenario.
NorESM2-MM loses ice slower and shows the first ice free summer around 2070. In that model, also the SSP2-4.5 scenario keeps the ice area above 1 mill. square km all years before 2100. However, the SSP1-2.6 scenario stabilizes at a sea ice area 550 comparable with present day observations, even with SSP1-2.6 warming levels present. Therefore, the sea ice area simulated by NorESM2-MM for the future Arctic seems to be unrealistically high.
5 Climatological mean state and circulation patterns compared to observations and NorESM1

Ocean state
In the surface ocean, the large-scale climatological biases are similar in the two NorESM2 versions (Fig. 9), but overall the 555 MM version is closer to the observations (smaller global-mean root-mean-square error; RMSE;

√
A 2 in Fig. 9). In general the Southern Ocean is too warm (Fig. 9b-c), the Atlantic (and the Arctic) are too saline, but the Pacific is too fresh (Fig. 9e-f).
The sea level anomaly is lower than observed in the Atlantic basin, but higher in the Indo-Pacific basin and thus the gradient between the two basins is larger than in the observations (Fig. 9h-i). If we remove the global-mean biases, the two versions produce even more similar mean errors, suggesting that some of the regional biases are largely independent of the atmosphere 560 and land resolution.
Indeed, the regional patterns are common to many other models with coarse resolution ocean components . The fresh bias in the Southern Pacific ( Fig. 9) is linked to the co-located positive net precipitation bias as shown in Fig. 19 and extends throughout the surface mixed layer (Fig. 11). The salinity bias also causes a negative density bias (not shown) 575 as it is not fully compensated by temperature, supporting an atmospheric origin. A comparison with the OMIP1 and OMIP2 simulations shows that the net precipitation bias in the LM simulation, 250 mm year −1 in the mean over the region where the salinity bias is larger than 1 g kg −1 , would be large enough to cause the simulated salinity bias (assuming mixed layer depth of 100 m and a residence time of 10 years). Therefore, we suggest that the net precipitation bias leads to accumulation of excess freshwater that is spread throughout the subtropical gyre by the ocean circulation.  Figs. 10-11). These biases are stronger in the LM version and not visible or much less pronounced in the OMIP simulations (not shown), which suggests that they are due to biases in the surface heat and freshwater budgets in these semi-enclosed basins. In addition, there is a strong cold and fresh (warm and saline) bias in the Pacific (Atlantic) centered around 15 • S and 200-400 m depth. These anomalies are likely linked to the biases in the tropical upwelling and the resulting thermocline depth that is too shallow (deep) in the Pacific (Atlantic).

590
Overall, many of such sub-surface ocean biases are similar in the ocean-only simulations and may be linked to coarse ocean resolution and shortcomings in parameterised processes. In some regions, air-sea coupling tends to act to reinforce biases that may be generated in either atmosphere or ocean model components separately. The biases over the upwelling systems for example have generally a complex cause rooted in both local (including mesoscale) and remote (including equatorial) biases in both atmosphere and ocean model components (Toniazzo and Woolnough, 2014;Zuidema et al., 2016;Stammer et al., 2019). For

595
NorESM2 the biases in the coupled simulations have a similar pattern as, but approximately twice the magnitude of the biases in the OMIP simulations (not shown). The cold bias in the northern subtropical Pacific has a contribution from weak oceanic mixing as there is a large warm bias just below the surface (Fig. 10), but may be amplified by increased atmospheric stability and correspondingly enhanced boundary-layer clouds. Excessively negative short-wave cloud forcing is seen in that region, in contrast to AMIP simulations which show no such regional bias. In the central and eastern equatorial Pacific NorESM2 displays 600 a characteristic "cold tongue" bias with cold SSTs and easterly wind stress bias. An equatorial easterly bias is present in the NorESM2 AMIP simulations. Shonk et al. (2017) show that off-equatorial net precipitation biases alone can initiate a feedback leading to an equatorial Pacific cold tongue in coupled simulations, and CAM6-Nor tends to develop such a bias. Finally, the near-surface ocean temperature bias pattern in OMIP1 simulations is cold along the equator, and warm on each side, which may further enhance off-equatorial precipitation. It should be noted that OMIP2 simulations with BLOM/CICE have a warm 605 bias along the equator (Tsujino et al., 2020). The cold equatorial bias can affect ENSO variability and teleconnections. These are discussed further below.

Sea ice
The geographic distribution of sea ice in March and September, compared with observations are shown in Fig. 12 for NorESM2-LM (12e-h), and NorESM2-MM (12i-l). In common for both models for the Northern Hemisphere (Fig. 12e,f,i,j) are too large 610 sea ice extents in the Barents Sea and Greenland Sea and a too small extent in the Labrador Sea, Bering Sea, and Sea of Okhotsk during winter. The total areas are quite close to the observations as shown in Fig. 8. These regional biases are most likely due to persistent biases in the oceanic and atmospheric circulation.
During summer, the distribution of sea ice in NorESM2-LM (Fig. 12f) seems to be more variable. Apart from the persistent, positive bias in the East Greenland Current, the regional biases within the Arctic Ocean are more likely due to inter-annual 615 variability, and the effect that the observations show a larger downward trend than the model.
NorESM2-MM (Fig. 12j) shows too much sea ice in the central Arctic in September. In general, the model is colder in the Arctic than NorESM2-LM (Fig. 14), and it has thicker sea ice in the Arctic Ocean. The Northern Hemisphere sea ice volume in NorESM2-MM is 21% (36%) larger in March (September) compared with the NorESM2-LM (not shown). The smaller seasonal cycle in ice area (Fig. 13) and volume is consistent with a thicker sea ice cover in NorESM2-MM, both due 620 to less winter growth because of increased insulation, and less summer melt due to higher albedo. The situation encountered in NorESM2-MM is similar to the results from NorESM1-M  and NorESM1-Happi .
These models simulate ice cover that is too thick, with the reduction in the Northern Hemisphere summer ice area being too slow.
The winter sea ice area and extent is too low in the Southern Ocean in NorESM2 as seen in Fig. 13 and Fig. 12(g-h,k-l).

625
Winter area in September is around 4 million square km too small. The largest bias is found in the Atlantic-Indian sector.
This bias seems to be associated with the warm bias in the ocean model, and the too warm Antarctic Intermediate Water (AAIW). The exact reason for this problem is not known, but the warm bias in AAIW is also evident in the OMIP simulations (not shown). However, these uncoupled simulations have a reasonable representation of the upper ocean temperature and the winter sea ice extent that are most likely due to the inherent relaxation towards observed atmospheric temperatures in those 630 experiments. With the interactive atmosphere these problems increase.

Atmospheric temperature and winds
In terms of mean surface temperatures, NorESM2 is a warmer model than its preceding versions. The global-mean nearsurface temperature (Fig. 14) in NorESM1-M and NorESM1-Happi is generally too low with global-mean biases of -0.76 K and -1.08 K (see legends above panels in Fig. 14). NorESM2-MM is closer to the reanalysis with a global-mean bias of -0.19 K. performance also in terms of the global-mean RMSE, with 1.35 K compared to 1.62 K for NorESM2-LM, and 1.83 K for NorESM1-Happi, and 1.86 K for NorESM1-M (cf Fig. 14).

640
Temperature biases are reduced in NorESM2 compared to NorESM1, not only near the surface, but also and especially in the mid and upper troposphere (Fig. 15). Tropospheric air temperatures tend to be systematically cold in all versions of both CESM and NorESM. NorESM2 has a reduced cold bias compared to NorESM1 particularly in the tropics and sub-tropics. This is mostly a consequence of the changes made to the cumulus convection scheme (Toniazzo et al., in prep.). The higher tropical SSTs in NorESM2-LM compared to NorESM2-MM lead to a reduced cold tropospheric tropical bias; however persistent 645 cold mid-and high-latitude biases imply an excessive meridional temperature gradient. By contrast, NorESM2-MM shows improvements at all latitudes.
The stronger cool tropospheric and warm surface tropical bias of NorESM2-LM compared with NorESM2-MM is in line with the behaviour of both NorESM1 and CESM2. The systematic difference between the two atmosphere resolutions is also consistent between coupled and AMIP simulations, with CAM-Nor significantly cooler at two degree resolution than at one 650 degree resolution for the same SSTs and the same physics parameters. At the same time, tropospheric specific humidity (and, a fortiori, relative humidity) is higher. Both lead to higher corresponding RESTOM. The ultimate cause of this systematic dependence of the simulated climatology on the resolution of the atmosphere model is not known. There may be a sensitivity of the convection parametrisation to the grid-scale variability of near-surface air parameters and to boundary-layer stability.
column precipitable water appear almost uniformly higher in CAM-Nor at two degree resolution than at one degree resolution.
All four of NorESM2-MM, NorESM2-LM, NorESM1-Happi and NorESM1-M tend to produce tropospheric westerly biases in zonal-mean zonal winds (Fig. 16). At tropical and sub-tropical latitudes, these are more widespread in NorESM2 than NorESM1-M and NorESM1-Happi. Surface wind biases, which by contrast tend to be easterly, are reduced. At higher latitudes, all models tend to have westerly biases on the poleward side of the sub-polar surface jet (between 50 • and 60 • ) in both hemi-660 spheres. The overestimation on the poleward flank is generally more pronounced in NorESM2 than in NorESM1. Comparing NorESM1-M to NorESM1-Happi and NorESM2-LM to NorESM2-MM, the biases in the zonal wind tend to be ameliorated with increased resolution. The differences in the tropics between NorESM2 and its predecessors are in part attributable to the enforcement of conservation of atmospheric global angular or rotational momentum in NorESM2 . In all versions, in common with CAM6/CESM2, there is accumulation of westerly momentum near the model lid, where it is 665 insufficiently damped.

Extratropical storm tracks
Extratropical storm tracks can be defined as regions of storminess associated with cyclogenesis, cyclone development, and cyclolysis which take place in the baroclinic zones between the sub-tropics and polar regions. They are important features at mid-and high latitudes as they are responsible for eddy transport of heat and momentum between low and high latitudes, 670 and associated with potentially high-impact weather such as heavy precipitation and strong winds. Here, we diagnose stormtrack activity by applying a bandpass filter to retain fluctuations in the geopotential height field at 500 hPa with periodicity corresponding to that of baroclinic waves, that is, between 2.5 and 6 days (Blackmon, 1976;Blackmon et al., 1977). The variability of the bandpass-filtered field is dominated by propagating low-pressure and high-pressure systems, and the storm tracks can be defined as geographically localized maxima in bandpass-filtered variability (Blackmon, 1976;Blackmon et al., 675 1977;Chang et al., 2002;Graff and LaCasce, 2012).
The climatological winter storm tracks are shown as the solid black contours in Fig. 17. There are two maxima in the Northern Hemisphere, one over the North Atlantic and one over the North Pacific. The colors show the bias with respect to ERA-Interim (Dee et al., 2011). In NorESM1-M, storm-track activity is underestimated in both storm-track regions. In particular, the North-Atlantic storm track is overly zonal with too little activity on the equatorward side of the climatological 680 maximum as well as over the Norwegian and Barents Sea Graff et al., 2019). The magnitude of the bias is reduced in NorESM1-Happi compared to NorESM1-M in both storm-track regions. This is likely associated with the increased resolution in the atmosphere and land components (1 • in NorESM1-Happi versus 2 • in NorESM1-M).
Similar improvements are seen when comparing NorESM2-LM and NorESM2-MM. Both versions of NorESM2 are, furthermore, better able to simulate the North-Atlantic storm track with the size of the negative bias on its equatorward side being reduced. Overall, NorESM2-MM displays the smallest biases in Northern Hemisphere storm-track activity out of the four models. There remains, however, too little activity over the Norwegian Sea with extension into the Barents Sea.
In the Southern Hemisphere, the climatological winter storm track surrounds Antarctica with the largest variability occurring over the Indian Ocean (Fig. 17). Storm-track activity is generally too weak on the equatorward side, with the largest biases being located over the Indian Ocean, close to the storm-track maximum. As in the Northern Hemisphere, the largest biases are 690 found in NorESM1-M and the smallest biases in NorESM2-MM.
While the bandpass-filter approach yields a measure of storm-track activity, it cannot be used to isolate the individual cyclone centers. To further assess the robustness of the improvements between NorESM2-LM and NorESM2-MM, we therefore also consider results from the cyclone detection algorithm described in Wernli and Schwierz (2006). The method detects cyclones as minima in the sea-level pressure fields and sets the perimeter as the outermost closed sea-level pressure contour. The storm 695 tracks are then seen as maxima in the local frequency of occurrence of surface cyclones, i.e. the fraction of time when cyclones are present in a given point (Fig. 18a-b).
As for the bandpass-filter approach, the cyclone detection shows a clear reduction in the bias between NorESM2-LM and NorESM2-MM, which is likely to be associated with the higher horizontal resolution in the atmosphere and land components.
The cyclone occurrence is underestimated on the equator-ward side of the North Pacific and Southern Hemisphere storm tracks 700 and overestimated on the poleward side. Over the North Atlantic, the cyclone occurrence is underestimated on the equator-ward side of the storm track and over the Norwegian Sea extending into the Barents Sea, and overestimated between The British Isles and Greenland. The magnitude of the bias is clearly reduced in all regions in NorESM2-MM, with the improvement being particularly evident in the regions where the cyclone occurrence is overestimated.
Note that both the climatology and the biases should be expected to differ somewhat between the two approaches considered 705 here because they capture different aspects of the storm tracks. The bandpass-filter approach does not distinguish between cyclones and anti-cyclones, and is dominated by growing and propagating baroclinic waves (Blackmon et al., 1977). The cyclone occurrence reflects the regions where cyclone centers are identified most frequently, and is for instance more sensitive to systems that are slowly moving or too long lived.

Clouds and forcing
710 Table 4 gives an overview of major forcing fluxes in NorESM2 compared to NorESM1 and observational estimates. Despite the large differences in physics and tuning, the overall numbers for top of the atmosphere fluxes and forcings are very similar to the numbers found in NorESM1-Happi and are generally within the observational range. There is however a slightly stronger negative bias in clear-sky LW flux and long wave cloud forcing. The latter is an unfortunate consequence of the tuning of high clouds in the model implemented in order to increase the outgoing long wave radiation. As seen from the upward LW 715 flux estimate the outgoing long wave radiation is still within the estimate from satellite retrievals. SWCF values are very similar to the values of NorESM1-Happi and within the observational range. This number hides, however, a major weakness in NorESM1 stratiform cloud parameterisation which underestimated the cloud cover and compensated this by overestimating the cloud liquid water.
The major updates in cloud physics from CAM4 to CAM6  improved the cloud cover, and the 720 cloud liquid water path is now quite close to the observational estimate. The global cloud cover is still slightly lower than observed (Table 4). This is partly connected to the tuning in NorESM2. Prior to the tuning the modelled cloud cover was higher than 70%. As seen from Fig. S5 in the supplement, the cloud cover underestimate is most pronounced in the tropics and subtropics in both hemispheres, while there is good agreement around the extra-tropical stormtrack regions and an overestimate in the high Arctic. Before the tuning (not shown) there was no bias at the low latitudes.

725
The modelled liquid water path seems to have a systematic bias towards low values at low latitudes and high values in the extra-tropics. Possible connections between cloud cover biases and the hydrological cycle are discussed in the next section.

Precipitation and hydrological cycle
The bias in the annual-mean total precipitation rate is shown in Fig. 19 for the two versions of NorESM1 and NorESM2 relative to data from the Global Precipitation Climatology Project (GPCP; Adler et al., 2003). While the bias of the global-730 mean average is not systematically reduced between NorESM1 and NorESM2, there is a reduced RMSE, indicating that there is less cancellation between positive and negative biases in the global mean.
The reduction of the RMSE is also seen when considering the four seasons separately in Fig. S6 in the supplement along with climatology from the GPCP. The evaluation of the mean bias, RMSE, and correlation included in the bottom left corner of each panel shows that RMSE and correlation have improved in NorESM2 compared to NorESM1 for all seasons. While 735 the overall wet bias has increased slightly, mostly due to strong biases over the Pacific ocean, there are regions with a large reduction in mean bias. This is especially pronounced over Africa and equatorial Atlantic ocean. The largest improvement compared to NorESM1-M is seen for NorESM2-MM during northern hemisphere winter, when all three metrics (bias, RMSE, and correlation) consistently indicate higher skill.
As a measure of interannual variability, the standard deviation of monthly means for each season was calculated. The differ-740 ences compared to GPCP are presented in Fig. S7 in the supplement. While NorESM1 slightly underestimates the precipitation variability, it is somewhat too high in NorESM2, with the magnitude of the bias being larger in all seasons. NorESM2-MM improves RMSE of precipitation variability in all seasons except northern hemisphere autumn. As also seen for the mean climatology in Fig. S6 in the supplement, the correlation has improved for all seasons in both NorESM2-LM and MM.
The hydrological cycle (or cycling of fresh water) is of major importance for the climate system. Global means of precipi-745 tation and evaporation can serve as integrated measures of the properties of many processes in an ESM. Results presented in Table 5 indicate that the intensity of hydrological cycle, as measured by evaporation, in NorESM2 is about 1.4% larger globally (4.9% over oceans) than in NorESM1-M. This is also manifested in the positive precipitation biases in Fig. 19. While the values in Table 5 for NorESM2 are higher than for GPCP, they are closer to results from ERA-Interim calculated by Trenberth et al. (2011). Although NorESM1-Happi has the highest precipitation globally, NorESM2 has the highest precipitation over ocean, 750 suggesting a larger re-cycling of oceanic water vapor and a lower fraction transported from oceans to continents (measured by E-P over oceans). The overestimated evaporation over oceans is likely linked to the underestimated cloudiness in the tropics and subtropics (Fig. S5 in the supplement). Solar radiation over subtropical ocean regions is an important driver of evaporation.
The net moisture transport from oceans to continents is nevertheless smaller in NorESM2 than in NorESM1, consistent with more clouds in the extra-tropics and more marine precipitation in NorESM2. This analysis is only preliminary, however, and 755 needs more in-depth studies which is out of scope of the present paper.
In NorESM2 a closed hydrological cycle is present, with the difference between evaporation and precipitation being close to zero in the long-term average at equilibrium. In NorESM2-MM the discrepancy is slightly improved to 0.023 km 3 /year, whereas it is 0.027 km 3 /year in NorESM1-M and 0.031 km 3 /year in NorESM2-LM (all values are means from members 1-3).

Northern Hemisphere blocking 760
While storm tracks are closely tied to precipitation, atmospheric blocking is associated with persistent anti-cyclones that inhibit precipitation for time scales up to several weeks. To diagnose blocking, we apply the variational Tibaldi and Molteni (vTM) index (Tibaldi and Molteni, 1990;Pelly and Hoskins, 2003;Graff et al., 2019). Blocks are identified when there is persistent reversal of the 500 hPa geopotential height field around a central latitude that last for at least five days and cover at least 7.5 consecutive longitudes. The central longitude varies with the position of the maximum in the Northern
The seasonal blocking frequency is mostly underestimated over the North Atlantic and in Europe in the four versions of NorESM (Fig. 20), particularly during winter (DJF). During spring (MAM), NorESM2-MM is closest to the reanalysis, while during summer (JJA) and autumn (SON), NorESM1-Happi performs best in these regions. While NorESM1 tends to overestimate the blocking frequency over the Pacific, NorESM2 generally lies closer to the reanalysis in that sector. Consider, for 770 instance, the region between 120 • E and 180 • E during summer, or the region between 130 • W and 90 • W during winter. In summary, although the use of 30 years from ERA-Interim for verification may not be fully representative for blocking climatology, the representation of NH blocking continues to be a challenge in NorESM, and in particular over the Atlantic-European sector in winter.

775
In the tropical atmosphere, the Madden-Julian Oscillation (MJO) is the dominant mode of variability on timescales between 30 and 90 days (Madden and Julian, 1971;Zhang, 2005). The MJO is characterized by a meridional dipole of convective precipitation anomalies along the equator, that slowly propagates eastwards and interacts with a number of other circulation features such as El Niño events (Hendon et al., 2007), the Indian summer monsoon (Annamalai and Slingo, 2001), tropical cyclones (Liebmann et al., 1994), and even the North Atlantic Oscillation and extratropical variability (Cassou, 2009).

780
The MJO is characterised by a specific feature in wavenumber-frequency spectrum of equatorial 850 hPa zonal wind (U850) and outgoing longwave radiation (OLR; Waliser et al. (2009)), associated with wavenumbers 1-3, a maximum at wavenumber 1, and periods between 30 to 80 days. NorESM2-LM and NorESM2-MM possess this mode in U850 spectra (Fig. S8 in the supplement), but its spread in wavenumber is too narrow and its spread in frequency too wide. Furthermore, OLR variability is too weak, and the mode appears preferentially as a stationary oscillation in the Indian-ocean sector, with too little zonal 785 and meridional propagation (Fig. S9 in the supplement). The relationship between zonal winds and precipitation anomalies, with the former in quadrature with respect to the latter, are similar in the simulations and in observations. In NorESM2-MM ( Fig. S9(c) in the supplement) however the anomalies appear to be generally too weak. Composite plots of MJO events (not shown) indicate a tendency in both model versions to generate westward-propagating convective anomalies, which may weaken activity in the MJO region of the spectrum. The ability of the simulated MJO mode to propagate eastwards as observed appears 790 to be sensitive to the distribution of tropical SSTs in both CESM2 and NorESM2 (Richard Neale, personal communication; Toniazzo et al., in prep.).

El Niño Southern Oscillation (ENSO)
The coupled model internally generates a self-sustained ENSO mode with spatial and temporal characteristics similar to observations. (The timeseries of NINO3.4 SST anomalies are shown in Fig. S10 in the supplement, alongside the observed one).

795
The ENSO in LM and MM model versions are very similar in magnitude (Fig. 22-23), spatial pattern (Figure 23), and spectral power distribution in frequency space (Fig. 21). ENSO SST anomalies are very large compared to observations (with a NINO3.4 anomaly greater than 2.5 • C in the average El-Niño event, compared with 1.5 • C in observations), and they tend to peak early in the season, i.e. between November and December instead of between December and January as observed ( Fig. 22a). The early peak and termination may be partly attributable to weak zonal wind-stress anomalies over the equatorial 800 region, which also peak early, notwithstanding a robust response in equatorial precipitation (Fig. 22b,c). Such weak surface wind response may be caused by the general displacement, with respect to observations, of the maximum of climatological precipitation north of the Equator along the Pacific ITCZ. Especially in MM, precipitation anomalies also have their maximum north of the equator (Figure 23b), which tends to result in weaker equatorial anomalous westerlies. A second origin of the early simulated El-Niño SST peak may however also be found in the early rapid demise of positive thermocline depth anomalies 805 in the NINO3 region during El-Niño events, which is seen also in OMIP1 and OMIP2 simulations forced with prescribed wind-stress (Fig. 22d). Given the weak coupled wind-stress and thermocline activity, the large SST anomalies may be partly the result of insufficient surface damping by the action of anomalous surface heat fluxes.
Correlation analysis shows that indeed over the eastern equatorial Pacific the model tends to generate positive downward net short-wave radiative flux anomalies when SST anomalies are positive, in contrast to observations. This might also explain 810 the growth of positive SST anomalies in the NINO3.4 region early during El-Niño events even before positive 20 • C isotherm depth anomalies have fully reached the area; and the long persistence of both SST and precipitation anomalies in the later stages of El-Niño events. The model climatological bias of a pronounced double ITCZ, with strong ITCZ precipitation away from the equator and a dry, cold equatorial region dominated by marine stratocumulus, rather than trade-cumulus cloud in the eastern Pacific probably contributes to this behaviour. Toniazzo et al. (in prep.) shows that changes in the convection 815 scheme that were made in order to mitigate the tropospheric cold bias and the positive TOA net residual have contributed to this error. Off-equatorial precipitation tends to couple less effectively with eastward-propagating equatorial modes (cf. also previous section and Fig. S9). Westward propagation is also evident in the model's ENSO during the phase change from El-Niño to La-Niña. In spite of such shortcomings, ENSO-related variability in NorESM2 is generally similar to the observed one. In particular, NINO3.4 spectra in the two model versions and in observations are formally indistinguishable (Fig. 21). The 820 simulated composite El-Niño SST, precipitation, and geopotential height anomalies have good global pattern correlations with the observed composite El-Niño anomalies (Fig. S11 in the supplement), indicating that the simulated ENSO adds important and useful features to the climatology simulated by the model. Particularly prominent and fairly realistic are teleconnections into both hemispheres during and after ENSO peaks, with a PNA pattern that extends into the storm-track entry region of the western Atlantic, as observed. In this respect NorESM2-MM validates better than NorESM2-LM, in spite of its equivalent of 825 slightly worse equatorial ENSO biases, probably due to a better overall sub-tropical and high-latitude atmospheric circulation.

Summary and conclusions
This paper presents and evaluates NorESM2 (

840
The drift over the 500 year long pre-industrial control experiment is generally very small for both versions of the model.
NorESM2 is slightly less sensitive than its predecessors and at the lower end of the CMIP5 and CMIP6 multi-model mean for both model resolutions , with the equilibrium climate sensitivity of 2.5 K estimated using the Gregory method (Gregory et al., 2004).
The historical reconstruction of surface temperatures is similar in both model versions. A significant temperature increase In particular NorESM2-LM shows a satisfactory evolution of recent sea-ice area. In NorESM2-LM, an ice-free Arctic Ocean is only avoided in the SSP1-2.6 scenario. NorESM2-MM simulates larger sea-ice area both at present and in future scenarios.
The pattern of some biases seen in the fully coupled simulations considered here are similar in coupled ocean-sea ice 855 simulations carried out for OMIP and can thus be linked to the ocean model having too coarse resolution and shortcomings in parameterised processes. NorESM2-LM and MM largely share the same biases in the surface ocean, although the MM version is somewhat closer to the observations. Most of the large-scale biases in the surface ocean are also seen in the subsurface.
Like CESM2, NorESM2 is generally a "cold" model, with an initial deficit in atmospheric long-wave cooling that causes a positive RESTOM and leads to heat gain by the ocean and positive SST biases particularly in the tropics. NorESM2 represents 860 an improvement in this respect compared to NorESM1. This is particularly evident in the tropical and sub-tropical troposphere (Fig. 15). In addition, the medium-resolution version of the model has more realistic upper-tropospheric meridional temperature gradients, and reduced near-surface temperature biases.
The extratropical storm tracks are generally better simulated in NorESM2 than in NorESM1, particularly over the North Atlantic. The storm tracks additionally improve with higher resolution, both in the Northern and Southern Hemisphere.

865
Several aspects of the modeled cycling of fresh water are improved in NorESM2 compared to NorESM1, including the RMSE and spatial correlation of the bias in the total precipitation rate. The intensity of the hydrological cycle as compared to the observationally-based findings of Trenberth et al. (2011) is slightly exaggerated in NorESM2, as it was in NorESM1, consistent with the underestimated cloudiness and thus overestimated solar radiation in the tropics and sub-tropics. The transport of oceanic water vapor over the continents is smaller in NorESM2 than NorESM1, indicating a slightly too efficient re-cycling of 870 oceanic water vapor associated with over-estimated oceanic precipitation and higher cloudiness in the extratropics.
The seasonal blocking frequency in the Northern Hemisphere is especially underestimated over the Atlantic -European sector during winter (DJF) by NorESM2. During spring (MAM), NorESM2-MM is closest to the reanalysis, while during summer (JJA) and autumn (SON), NorESM1-Happi performs best in these regions. While NorESM1 tends to overestimate the blocking frequency over the Pacific, NorESM2 generally lies closer to the reanalysis in that sector. Although the use of 30 875 years from ERA-Interim for verification may not be fully representative for blocking climatology, the simulation of Northern Hemisphere blocking continues to be a challenge for NorESM.
The coupled model internally generates a self-sustained ENSO mode with spatial and temporal characteristics similar to observations. ENSO SST anomalies are very large compared to observations (with a NINO3.4 anomaly greater than 2.5 • C in the average El-Niño event, compared with 1.5 • C in observations), and they tend to peak early in the season, i.e. between

880
November and December instead of between December and January as observed. Nevertheless many properties of the ENSO are similar to those observed, and El-Niño teleconnections are quite realistic both in the tropics and at mid-and high latitudes.
Less satisfactory is the performance of the coupled model in terms of the Madden-Julian oscillation. Here the model version with low resolution in the atmosphere appears to produce more intense and more realistic sub-seasonal tropical variability than the medium-resolution version.