A circulation-based performance atlas of the CMIP5 and 6 models for regional climate studies in the northern hemisphere

Global Climate Models are a keystone of modern climate research. In most applications relevant for decision making, and particularly when deriving future projections with the delta-change method, they are assumed to provide a plausible range of possible future climate states. However, these models were not originally developed to reproduce the regional-scale climate, which is where information is needed in practice. To overcome this dilemma, two general efforts have been made since their introduction in the late 1960s. First, the models themselves have been steadily improved in terms of physical and chemical processes, parametrization schemes, resolution and complexity, giving rise to the term “Earth System Model”. Second, the global models’ output has been refined at the regional scale using Limited Area Models or statistical methods in what is known as dynamical or statistical downscaling. For both approaches, however, it is difficult to correct errors resulting from a wrong representation of the large-scale circulation in the global model. Dynamical downscaling also has a high computational demand and thus cannot be applied to all available global models in practice. Against this background, there is an ongoing debate in the downscaling community on whether to move away from the “model democracy” paradigm towards a careful selection strategy based on the global models’ capacity to reproduce key aspects of the observed climate. The present study attempts to be useful for such a selection by providing a performance assessment of the historical global model experiments from CMIP5 and 6 based on recurring regional atmospheric circulation patterns (??).
The latest model generation (CMIP6) is found to perform better on average, which can be partly explained by a moderately strong statistical relationship between performance and horizontal resolution in the atmosphere. A few models rank favourably over almost the entire northern hemisphere extratropics, but the better models tend to be less complex than others. Model selection should therefore rely not solely on model performance but also on model complexity, and a discussion is needed on how to combine these two criteria. Internal model variability only has a small influence on the model ranks. Reanalysis uncertainty is an issue in Greenland and the surrounding seas, the southwestern United States and the Gobi desert, but is generally negligible elsewhere. Finally, a relatively simple approach based on the number of climate system components taken into account by the GCMs is proposed as a starting point to introduce model complexity as an additional model selection criterion.

in the context of bias correction, which can be considered a special case of statistical downscaling (?). It should also be remembered that GCMs by definition were not developed to realistically represent regional-scale climate features (??) and that they have been pressed into this role during the last 3 decades due to the ever increasing demand for climate information on this scale. Hence, finding a GCM capable of reproducing the regional atmospheric circulation in a systematic way, i.e. in many regions of the world, would be anything but expected.
In the present study, a total of 128 historical runs from 56 distinct GCMs (or GCM versions) of the fifth and sixth phases of the Coupled Model Intercomparison Project (CMIP5 and 6) are evaluated in terms of their capability to represent the present-day climatology of the regional atmospheric circulation as represented by the frequency of the 27 circulation types

Lamb Weather Types
The classification scheme used here is based on H.H. Lamb's practical experience when grouping daily instantaneous SLP maps for the British Isles and interpreting their relationships with the regional weather (?). His subjective classification scheme contained 27 classes and was turned into an automated and objective approach by ? in what is known as the "Lamb Circulation Types" or "Lamb Weather Types" (LWTs) approach (??).
Also, since some models do not apply the Gregorian calendar but work with 365 or even 360 days per year, relative instead of absolute LWT frequencies are considered. Further, since HadGEM2-CC and HadGEM2-ES lack SLP data for December 2005, this month is likewise dropped from ERA-Interim or JRA-55 when compared with these models.
As mentioned above, the LWT approach has been successfully applied for many climatic regimes of the NH, including the extremely continental climate of central Asia (?), which confirms the proposal made in ? that the method in principle can be applied in a latitudinal band from 30 to 70° N. Here, a criterion is introduced to explicitly test this assumption. Namely, it is established that the LWT method should not be used at a given grid-box if the relative frequency for any of the 27 types is lower than 0.1% (i.e. 1.5 annual occurrences on average). Note that, already in its original formulation for the British Isles, some LWTs were found to occur with relative frequencies as small as 0.47% (?). This is why the 0.1% threshold seems reasonable in the present study. If at a given grid-box this criterion is not met in the LWT catalogue derived from ERA-Interim or alternatively JRA-55, then this grid-box does not participate in the evaluation.
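The relative frequencies and the 0.1% usage criterion described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the integer coding of the 27 types as 1 to 27 and the use of NumPy are assumptions made here. The comment relating 0.1% to about 1.5 annual occurrences additionally assumes 6-hourly time steps and a 365-day calendar.

```python
import numpy as np

N_TYPES = 27          # number of Lamb Weather Types
MIN_REL_FREQ = 0.001  # 0.1 % threshold; about 1.5 of 1460 six-hourly steps per year (assumption)


def relative_lwt_frequencies(lwt_series):
    """Relative frequency of each LWT in a catalogue at one grid-box.

    lwt_series: 1-D sequence of integer labels in 1..27 (hypothetical coding),
    one label per time step. Using relative frequencies makes catalogues from
    models with 360-, 365-day or Gregorian calendars comparable.
    """
    labels = np.asarray(lwt_series, dtype=int)
    # bincount index 0 is unused because the labels start at 1
    counts = np.bincount(labels, minlength=N_TYPES + 1)[1:N_TYPES + 1]
    return counts / labels.size


def lwt_method_applicable(rel_freq, threshold=MIN_REL_FREQ):
    """True if none of the 27 types is rarer than the threshold."""
    return bool(np.all(np.asarray(rel_freq) >= threshold))


# Toy catalogue in which every type occurs equally often
toy_series = np.tile(np.arange(1, N_TYPES + 1), 100)
toy_freq = relative_lwt_frequencies(toy_series)
print(lwt_method_applicable(toy_freq))
```

In this sketch the criterion would be evaluated on the reanalysis catalogue of each grid-box, and grid-boxes failing it would be masked before any model comparison.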

Applied GCM performance measures
To measure GCM performance, the Mean Absolute Error (MAE) of the n = 27 relative LWT frequencies obtained from a given model (m) w.r.t. those obtained from the reanalysis (o) is calculated at a given grid-box:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |m_i - o_i|$$

The MAE is then used to rank the 56 distinct models at this grid-box. The lower the MAE, the lower the rank and the better the model. After repeating this method for each grid-box of the NH, both the MAE values and ranks are plotted for each individual model on a polar stereographic projection.
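As a hedged illustration of the MAE-based ranking at a single grid-box, the following sketch computes the MAE of each model's 27 relative LWT frequencies against the reference and converts the errors to ranks. The function names and the toy numbers are hypothetical. The text does not say how ties are handled, so breaking them by input order is an assumption made here.

```python
import numpy as np


def mae(model_freq, ref_freq):
    """Mean absolute error over the last axis (the 27 LWT frequencies).

    model_freq may have shape (27,) or (n_models, 27); ref_freq has shape (27,).
    """
    m = np.asarray(model_freq, dtype=float)
    o = np.asarray(ref_freq, dtype=float)
    return np.mean(np.abs(m - o), axis=-1)


def rank_models(mae_values):
    """Rank 1 is assigned to the lowest MAE; ties keep input order (assumption)."""
    mae_values = np.asarray(mae_values, dtype=float)
    order = np.argsort(mae_values, kind="stable")
    ranks = np.empty(order.size, dtype=int)
    ranks[order] = np.arange(1, order.size + 1)
    return ranks


# Toy example: reference with uniform frequencies and three synthetic "models"
ref = np.full(27, 1.0 / 27.0)
rng = np.random.default_rng(0)
models = np.stack([ref + rng.normal(0.0, s, 27) for s in (0.02, 0.005, 0.01)])
errors = mae(models, ref)
print(errors, rank_models(errors))
```

Repeating these two steps for every grid-box would yield the MAE and rank maps described above.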

In addition to the MAE measuring overall performance, the specific model performance for each LWT is also assessed.
This is done because, by definition of the MAE, errors occurring in the more frequent LWTs are penalized more than those occurring in the rare LWTs. Hence, a low MAE might mask errors in the least frequent LWTs. For an LWT-specific evaluation, the simulated frequency map for a given LWT and model is compared with the corresponding map from the reanalysis by means of the Taylor diagram (?). This diagram compares the spatial correspondence of the simulated and observed (or "quasi-observed" since reanalysis data are used) frequency patterns by means of 3 complementary statistics. These are the Pearson correlation coefficient (r), the standard deviation ratio (ratio = σ_m/σ_o), with σ_m and σ_o being the standard deviations of the modelled and observed frequency patterns, and the normalized centered root mean-square error (CRMSE):

$$\mathrm{CRMSE} = \frac{1}{\sigma_o} \sqrt{\frac{1}{n} \sum_{i=1}^{n} (cm_i - co_i)^2}$$

with n = 2016 grid-boxes covering the NH mid-latitudes and cm and co the modelled and observed frequency patterns after subtracting their own mean value (i.e. both the minuend and subtrahend are anomaly fields, "c" refers to centered). Normalization enables comparison with other studies using the same method.
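The three Taylor statistics can be sketched for one LWT and one model as below. This is an illustration under stated assumptions rather than the study's implementation: the function name is hypothetical, the grid-boxes are treated as equally weighted (no area weighting is applied), population standard deviations are used, and the CRMSE is normalized by the reference standard deviation σ_o, which is the usual convention for normalized Taylor diagrams.

```python
import numpy as np


def taylor_statistics(model_map, ref_map):
    """Return (r, ratio, crmse) for two frequency maps of equal shape.

    r     : Pearson correlation between the two spatial patterns
    ratio : sigma_m / sigma_o
    crmse : centered RMSE divided by sigma_o (normalization assumed here)
    """
    m = np.asarray(model_map, dtype=float).ravel()
    o = np.asarray(ref_map, dtype=float).ravel()
    cm = m - m.mean()  # centered model pattern
    co = o - o.mean()  # centered reference pattern
    sigma_m = cm.std()
    sigma_o = co.std()
    r = np.mean(cm * co) / (sigma_m * sigma_o)
    ratio = sigma_m / sigma_o
    crmse = np.sqrt(np.mean((cm - co) ** 2)) / sigma_o
    return r, ratio, crmse


# Synthetic patterns on n = 2016 grid-boxes (random numbers, not real data)
rng = np.random.default_rng(1)
ref_map = rng.random(2016)
model_map = 0.8 * ref_map + 0.1 * rng.random(2016)
r, ratio, crmse = taylor_statistics(model_map, ref_map)
# The normalized statistics satisfy crmse**2 = 1 + ratio**2 - 2 * ratio * r,
# which is the geometric relation the Taylor diagram is built on.
print(r, ratio, crmse)
```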

Applied Python packages
The code for the present study relies on the Python v2.7.13 packages xarray v0.9.1 written by ?

In Figure 2, the MAE of JRA-55 w.r.t. ERA-Interim is mapped (panel a), complemented by the corresponding rank within the multi-model ensemble plus JRA-55. In the ideal case, the MAE for JRA-55 is lower than for any of the 56 CMIP models, which means that the alternative reanalysis ranks first and that a change in the reference reanalysis does not influence the model ranking. This result is indeed obtained for a large fraction of the NH. However, in the Gobi desert, in Greenland and the surrounding seas, and particularly in the southwestern United States of America, substantial differences are found between the two reanalyses. Since different reanalyses from roughly the same generation are in principle equally representative of the "truth" (?), the models are here evaluated twice in order to obtain a robust picture of their performance.
In the present article, the evaluation results w.r.t. ERA-Interim are mapped, and deviations from the evaluation against JRA-55 in the 3 relevant regions are pointed out in the text. In the remaining regions, reanalysis uncertainty plays a minor role.

Nevertheless, for the sake of completeness, the full atlas of the JRA-55-based evaluation was added to the supplementary material to this study. For a quick overview of the results, Table 1 indicates whether a given model agrees more closely with ERA-Interim or JRA-55 in the 3 sensitive regions. In the following, this is referred to as "reanalysis affinity".
Figure 2 also shows that the LWT usage criterion defined in Section 3.1 is met almost everywhere in the domain, except in the high-mountain areas of central Asia (grey areas within the performance maps indicate that the criterion is not met). This region is governed by the monsoon rather than the turnover of dynamic low- and high-pressure systems the LWT approach was developed for. It is thus justified to use the approach over such a large domain.
Within the ACCESS model family, version 1.0 performs best (see Figure 3). The corresponding error and ranking patterns are virtually identical to HadGEM2-ES and HadGEM2-CC, which is due to the same AGCM used in these three models (HadGAM2). The 3 more independent versions of ACCESS (1.3, CM2 and ESM1.5) roughly share the same error pattern, which differs from ACCESS1.0 in some regions. While the 3 independent developments perform worse in the North Atlantic and western North Pacific, they do better in the eastern North Pacific off the coast of Japan and, in case of ACCESS-CM2, also in the high mountain areas of central Asia and over the Mediterranean Sea. In the latter two regions, for showing virtually no sensitivity in the Gobi desert in case of ACCESS-ESM1.5 (compare Figure 3 with the "figs-refjra55/maps/rank" folder in the supplementary material). Figure 4 shows the respective results for the models developed in North America. Each of the four model families is built upon independent and long-standing research lines.
While GFDL-CM3 only ranks well in an area ranging from the Great Plains to the central North Pacific, GFDL-CM4 yields balanced results over the entire NH and is one of the best models considered here. Notably, GFDL-CM4 also performs well over central Asia and in an area ranging from the Black Sea to the Middle East, which is where most of the other models perform less favourably. Note also that GFDL's Modular Ocean Model (MOM) is the standard OGCM in all ACCESS models and is also used in the BCC-CSM model versions (see Table 1 for details).
The two versions are identical except for the ocean component: HYCOM was used in GISS-E2-H and Russell Ocean in GISS-E2-R (?). Russell Ocean was then developed into GISS Ocean v1 for use in GISS-E2.1-G (?), the CMIP6 model version assessed here, which however was run without the aforementioned chemistry and aerosol modules (note that the 6-hourly SLP data for the more complex model versions contributing to CMIP6 were not available from the ESGF data portals). All these versions comprise a relatively modest resolution for the atmosphere and ocean and no refinement was undertaken from CMIP5 to 6.
However, many parametrization schemes were improved. GISS-E2.1-G generally ranks better than its predecessors, except in eastern Siberia and China, where very good ranks are obtained by the two CMIP5 versions (see Figure 4). The small differences between the results for GISS-E2-H and R might stem from internal model variability (see also Section 4.9) and from the use of two distinct OGCMs. Unfortunately, all GISS-E2 model versions considered here are plagued by pronounced performance differences from one region to another, meaning that they are less balanced than e.g. GFDL-CM4.
Results indicate a comparatively poor performance for both CCSM4 and CanESM2. Exceptions are found along the North CanESM2 ranks well (see Figure 4). As for the GISS models, both CCSM4 and CanESM2 are also plagued by large regional performance differences.
Regarding the models' reanalysis affinity, GFDL-CM3 tends towards ERA-Interim in the seas around Greenland and towards JRA-55 in the Gobi desert, while being almost insensitive to reanalysis choice in the southwestern United States (compare Figure 4 with the "figs-refjra55/maps/rank" folder in the supplementary material to this article). GFDL-CM4 has similar reanalysis affinities, but largely improves (by up to 20 ranks) in the southwestern United States when evaluated against JRA-55. Results for GISS-E2-H and GISS-E2-R are slightly closer to ERA-Interim in the southwestern U.S. and otherwise virtually insensitive to reanalysis choice. GISS-E2-1-G is virtually insensitive in all 3 regions. CanESM2 ranks consistently better if compared with JRA-55, with a stunning improvement of up to 30 ranks in the southwestern United States, and CCSM4 slightly tends towards ERA-Interim in all 3 regions.

Within the CNRM model family, CNRM-CM5 is found to perform very well except in the central North Pacific, the southern USA and in a subpolar belt extending from Baffinland in the West to western Russia in the East (see Figure 5). This includes a good performance over the Rocky Mountains and central Asia. From CNRM-CM5 to CNRM-CM6-1, performance gains are obtained in the central North Pacific, the southern USA, Scandinavia and western Russia which, however, are compensated by performance losses in the entire eastern North Atlantic and in an area covering Manchuria, Korea and Japan. A similar picture is obtained for CNRM-ESM2-1, whereas a performance loss is observed for CNRM-CM6-1-HR. This is surprising since, in addition to improved parametrization schemes, the model resolution in the atmosphere and ocean was particularly increased in the latter model version. Under these circumstances, CNRM-CM6-1-HR is actually the only model suffering clear performance losses from CMIP5 to 6. The reasons for this are unknown and should be assessed in future studies.

Pacific basin including western North America, the subtropical North Atlantic to the west of the Strait of Gibraltar, and the regions around Greenland and the Caspian Sea. It is in these "weak" regions where the largest performance gains are obtained from MRI-ESM1 to MRI-ESM2. As a result, in a zonal belt extending from approximately 50° N to 75° N, MRI-ESM2 is one of the best performing models considered here.
For the MIROC family, a heterogeneous picture is obtained. While MIROC5 and MIROC-ESM clearly tend towards ERA-

HadGEM3-GC31-MM links up with EC-Earth3 in what is considered the "best model" if model complexity were not an argument (see also "figs-refjra55/as-figure-10-but-wrt-jra55.pdf" in the supplementary material). For the NorESM family, different reanalysis affinities are obtained for the 3 regions. While NorESM1 is closer to JRA-55 in all of them, NorESM2-LM is closer to ERA-Interim in the southwestern U.S., but closer to JRA-55 in the Gobi desert. NorESM2-MM is generally less sensitive to reanalysis uncertainty, with some affinity to ERA-Interim in the southwestern U.S.

Model contributions from Russia and South Korea
The except for the very few vertical layers used in the atmosphere (see Table 1). As shown by Figure 9, INM-CM4 performs well to very well in the eastern North Atlantic, northern Europe and the Gulf of Alaska, moderately well over northern China and Korea and poorly over the remaining regions of the NH. It is thus marked by large performance differences from one region to another. , which might be due to the ocean model taken from CESM1 (POP is used instead of NEMO or MICOM, see Table 1), or precisely due to the effects of the particular parametrization schemes mentioned above. Although the error magnitude of SAM0-UNICON is similar to CMCC-CM-SR5, SAM0-UNICON exhibits weaker regional performance differences, making it the more balanced model out of the two. In most regions of the NH, SAM0-UNICON yields better results than NorESM2-LM but is outperformed by NorESM2-MM.

While INM-CM4 compares better with JRA-55 in the 3 regions sensitive to reanalysis uncertainty, SAM0-UNICON is in closer agreement with ERA-Interim there (compare Figure 9 with the "figs-refjra55/maps/rank" folder in the supplementary material).
4.9 Summary boxplot, role of model resolution, model complexity and internal variability
For each model version listed in Table 1, the spatial distribution of the pointwise MAE values can also be represented with a boxplot instead of a map, which allows for an overarching performance comparison visible at a glance (see Figure 11 for the evaluation against ERA-Interim). Here, the standard configuration of the boxplot is applied. For a given sample of MAE values corresponding to a specific model, the box refers to the interquartile range (IQR) of that sample and the horizontal bar to the median. Whiskers are drawn at the 75th percentile + 1.5 × IQR and at the 25th percentile − 1.5 × IQR. All values outside this range are considered outliers (indicated by dots). Four additional boxplots are provided for the joint MAE samples of the more and the less complex model versions used in CMIP5 and 6. In these 4 cases, outliers are not plotted for the sake of simplicity. The acronyms of the coupled model configurations, as well as their participation in either CMIP5 or 6 (indicated by the final integer), are shown below the x-axis. Along the x-axis, the names of the coupled models' atmospheric components are also shown since some of them are shared by various research institutions (see also Table 1).
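The boxplot quantities described above can be computed from a sample of pointwise MAE values as in the following sketch. It is only an illustration: the function name and the returned dictionary keys are hypothetical, NumPy's default linear percentile interpolation is assumed, and the whisker limits are taken literally as the 25th and 75th percentiles minus or plus 1.5 × IQR.

```python
import numpy as np


def boxplot_statistics(mae_sample):
    """Median, IQR, whisker limits and outliers of a 1-D MAE sample."""
    x = np.asarray(mae_sample, dtype=float)
    q25, median, q75 = np.percentile(x, [25, 50, 75])
    iqr = q75 - q25
    lower = q25 - 1.5 * iqr
    upper = q75 + 1.5 * iqr
    # Values outside the whisker limits are flagged as outliers
    outliers = x[(x < lower) | (x > upper)]
    return {
        "median": median,
        "iqr": iqr,
        "lower_whisker": lower,
        "upper_whisker": upper,
        "outliers": outliers,
    }


# Toy sample with one obvious outlier
stats = boxplot_statistics([1, 2, 3, 4, 5, 100])
print(stats["median"], stats["iqr"], stats["outliers"])
```

For the joint samples, the MAE values of all model versions in a group would simply be concatenated before calling the function, and the outliers would be left out of the plot.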

A virtual lack of outliers is another remarkable advantage of NorESM2-MM. MRI-ESM2 and GFDL-CM4 are also relatively robust to outliers, but less so than NorESM2-MM. The smallest number of outliers among all models is obtained for EC-Earth, irrespective of the model version.
It is also found that all members of the EC-Earth model family yield the best results for any LWT (observe the proximity of the yellow cluster to the perfect score indicated by the black half circle). Within the group of proximity to the EC-Earth Cluster for most LWTs. The Hadley Centre and ACCESS models (filled with orange and dark blue) form another cluster that generally performs very well for most LWTs. However, the spatial standard deviation of the 3 eastern LWTs (cyclonic, anticyclonic and directional) is overestimated by these "Commonwealth" models, which is indicated by a standard deviation ratio ≈1.25, while values close to unity or below are obtained for the remaining models. It is also worth mentioning that not only ACCESS1.0 but also the other, more independently developed ACCESS versions pertain to this cluster, which indicates the common origin of their atmospheric component (the Met Office Hadley Centre) even at the level of detail of specific weather types. For all other models, the LWT-specific results do not largely deviate from the overall MAE results shown in Section 4, meaning that overall performance is generally also a good indicator of LWT-specific performance.

As an example, MIROC-ESM (the blue-green cross), IPSL-CM5A-LR and IPSL-CM5A-LR (the grey cross and grey plus) are located in the "weak" area of the Taylor diagram for each of the 27 LWTs, which is in line with the likewise weak overall performance obtained for these models in Section 4.
The corresponding results for the model evaluation against JRA-55 are generally in close agreement with those mentioned above, except for the EC-Earth model family performing slightly less favourably (see "figs-refjra55/taylor" folder in the supplementary material to this article).
This study also shows that the models' complexity, here defined as the number of realms simulated online, should be taken into account for a correct interpretation of the results. Namely, comprehensive ESMs such as HadGEM2-ES, MRI-ESM1, MRI-ESM2 and NorESM2 are by construction more sensitive to model uncertainties than traditional AOGCM configurations.
Since ESMs are in principle preferable to AOGCMs, a discussion about how model complexity should influence the choice of driving GCMs in regional climate studies is needed. A separate ranking of the models pertaining to each group would be a simple solution (see "figs-refinterim-aogcm" and "figs-refinterim-esm" folders in the supplementary material to this article).
Code and data availability. Author contributions. All working steps were accomplished by SB.
Competing interests. The author declares no competing interests.