A circulation-based performance atlas of the CMIP5 and 6 models for regional climate studies in the northern hemisphere mid-to-high latitudes

Global Climate Models are a keystone of modern climate research. In most applications relevant for decision making, they are assumed to provide a plausible range of possible future climate states. However, these models were not originally developed to reproduce the regional-scale climate, which is where information is needed in practice. To overcome this dilemma, two general efforts have been made since their introduction in the late 1960s. First, the models themselves have been steadily improved in terms of physical and chemical processes, parametrization schemes, resolution and implemented climate system components, giving rise to the term “Earth System Model”. Second, the global models’ output has been refined at the regional scale using Limited Area Models or statistical methods, in what is known as dynamical or statistical downscaling. For both approaches, however, it is difficult to correct errors resulting from a wrong representation of the large-scale circulation in the global model. Dynamical downscaling also has a high computational demand and thus cannot be applied to all available global models in practice. Against this background, there is an ongoing debate in the downscaling community on whether to move away from the “model democracy” paradigm towards a careful selection strategy based on the global models’ capacity to reproduce key aspects of the observed climate. The present study aims to support such a selection by providing a performance assessment of the historical global model experiments from CMIP5 and 6 based on recurring regional atmospheric circulation patterns, as defined by the Jenkinson-Collison approach. The latest model generation (CMIP6) is found to perform better on average, which can be partly explained by a moderately strong statistical relationship between performance and horizontal resolution in the atmosphere. A few models rank favourably over almost the entire northern hemisphere mid-to-high latitudes.
Internal model variability only has a small influence on the model ranks. Reanalysis uncertainty is an issue in Greenland and the surrounding seas, the southwestern United States and the Gobi desert, but is otherwise generally negligible. In the course of the study, the prescribed and interactively simulated climate system components are identified for each applied coupled model configuration, and a simple codification system is introduced to describe model complexity in this sense.


Introduction
General Circulation Models (GCMs) are numerical models capable of simulating the temporal evolution of the global atmosphere or ocean. This is done by integrating the equations describing the conservation laws of physics along time.

Not only version pairs from CMIP5 to CMIP6 are considered, but also model versions lacking either a predecessor in CMIP5 or a successor in CMIP6. In the most favourable case, two versions of a given model are available for both CMIP5 and 6: a higher-resolution setup considering fewer realms (the AOGCM configuration), complemented by a more complex setup including more component models, usually run at a lower resolution than the AOGCM version. An overview of the 56 applied model versions is provided in Table 1. The table provides information about the component AGCMs and OGCMs, their horizontal and vertical resolution, run specifications and the complexity codes described in Section 3.3.

The 27 classes are then defined following Jones et al. (1993) and Jones et al. (2013).
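The threshold logic behind the 27 classes can be sketched as follows. This is a simplified illustration rather than the study's implementation: the function name and direction-sector convention are assumptions for this sketch, the flow threshold of 6 geostrophic units follows the commonly cited rules of Jones et al. (1993), and the computation of the flow indices W, S and Z from the SLP field (the 16-point scheme) is omitted.

```python
import math

def jenkinson_collison_type(W, S, Z, flow_threshold=6.0):
    """One of the 27 Jenkinson-Collison / Lamb weather types from the
    westerly flow W, southerly flow S and total shear vorticity Z at a
    grid-box (schematic sketch, names and conventions assumed)."""
    F = math.hypot(W, S)  # total flow strength
    # compass sector the flow comes from (convention assumed for this sketch)
    deg = (math.degrees(math.atan2(W, S)) + 180.0) % 360.0
    sectors = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    direction = sectors[int((deg + 22.5) // 45) % 8]
    if F < flow_threshold and abs(Z) < flow_threshold:
        return "U"                        # unclassified low-flow type (1)
    if abs(Z) < F:
        return direction                  # purely directional types (8)
    if abs(Z) > 2.0 * F:
        return "C" if Z > 0 else "A"      # purely (anti)cyclonic types (2)
    return ("C" if Z > 0 else "A") + direction  # hybrid types (16)
```

With this logic, the 27 classes arise as 8 purely directional types, 2 purely rotational types, 16 hybrid types and 1 unclassified low-flow type.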

Applied GCM performance measures
To measure GCM performance, the Mean Absolute Error (MAE) of the n = 27 relative LWT frequencies obtained from a given model (m) w.r.t. those obtained from the reanalysis (o) is calculated at a given grid-box:

MAE = (1/n) Σ_{i=1..n} |f_m,i − f_o,i| ,

with f_m,i and f_o,i the simulated and quasi-observed relative frequencies of LWT i. The MAE is then used to rank the 56 distinct models at this grid-box. The lower the MAE, the lower the rank and the better the model. After repeating this procedure for each grid-box of the NH, both the MAE values and ranks are plotted for each individual model on a polar stereographic projection.
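The grid-box-wise MAE and ranking step can be sketched as follows; this is a minimal illustration with toy data, and the function and variable names are invented for this example, not taken from the study's code:

```python
import numpy as np

def mae_and_ranks(model_freqs, reanalysis_freqs):
    """MAE of the n = 27 relative LWT frequencies per model at one
    grid-box, and the resulting ranks (1 = lowest MAE = best model)."""
    obs = np.asarray(reanalysis_freqs, dtype=float)
    mae = {name: float(np.mean(np.abs(np.asarray(f, dtype=float) - obs)))
           for name, f in model_freqs.items()}
    order = sorted(mae, key=mae.get)             # ascending MAE
    ranks = {name: i + 1 for i, name in enumerate(order)}
    return mae, ranks
```

Repeating this per grid-box yields the MAE and rank maps described above.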
In addition to the MAE measuring overall performance, the specific model performance for each LWT is also assessed.
This is done because, by definition of the MAE, errors occurring in the more frequent LWTs are penalized more than those occurring in the rare LWTs. Hence, a low MAE might mask errors in the least frequent LWTs. For a LWT-specific evaluation, the simulated frequency map for a given LWT and model is compared with the corresponding map from the reanalysis by means of the Taylor diagram (Taylor, 2001). This diagram summarizes the spatial correspondence of the simulated and observed (or "quasi-observed", since reanalysis data are used) frequency patterns by means of 3 complementary statistics. These are the Pearson correlation coefficient (r), the standard deviation ratio (ratio = σ_m/σ_o), with σ_m and σ_o being the standard deviations of the modelled and observed frequency patterns, and the normalized centred root mean-square error:

CRMSE = sqrt( (1/n) Σ_{i=1..n} (cm_i − co_i)² ) / σ_o ,

with n = 2016 grid-boxes covering the NH mid-to-high latitudes and cm and co the modelled and observed frequency patterns after subtracting their own mean value (i.e. both the minuend and subtrahend are anomaly fields; "c" refers to centred).
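The three Taylor statistics can be computed from a pair of frequency patterns as sketched below; this is an illustrative helper with invented names, and the arrays are assumed to be maps flattened over the grid-boxes:

```python
import numpy as np

def taylor_statistics(model_pattern, obs_pattern):
    """Pearson r, standard deviation ratio and normalized centred RMSE
    between a simulated and a quasi-observed frequency pattern."""
    m = np.asarray(model_pattern, dtype=float)
    o = np.asarray(obs_pattern, dtype=float)
    cm, co = m - m.mean(), o - o.mean()          # centred anomaly fields
    sigma_m, sigma_o = m.std(), o.std()
    r = float(np.sum(cm * co) / (m.size * sigma_m * sigma_o))
    ratio = float(sigma_m / sigma_o)
    crmse = float(np.sqrt(np.mean((cm - co) ** 2)) / sigma_o)
    return r, ratio, crmse
```

Note that the three normalized statistics satisfy CRMSE² = ratio² + 1 − 2 · ratio · r, which is the geometric relation the Taylor diagram exploits.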
Normalization enables comparison with other studies using the same method.

Model complexity in terms of considered climate system components
In addition to the model performance assessment, a straightforward approach is followed to describe the complexity of the coupled model configurations in terms of considered climate system components. The following ten components are taken into account: 1. Atmosphere, 2. Land-surface, 3. Ocean, 4. Sea-ice, 5. Vegetation properties, 6. Terrestrial carbon-cycle processes, 7. Aerosols, 8. Atmospheric chemistry, 9. Ocean biogeochemistry and 10. Ice sheet dynamics. An integer is assigned to each of these components depending on whether it is not taken into account at all (0), represented by an interactive model feeding back on at least one other component (2), or anything in between (1), including prescription from external files, semi-interactive approaches or components simulated online but without any feedback on other components.
As an example, MRI-ESM's complexity code is 2222122220, indicating interactive atmosphere, land-surface, ocean and sea-ice models, prescribed vegetation properties, interactive terrestrial carbon-cycle, aerosol, atmospheric chemistry and ocean biogeochemistry models, and no representation of ice sheet dynamics. For each of the 56 participating coupled model configurations, the reference article(s) and the source attributes inside the netCDF files from ESGF were assessed in order to obtain an initial "best-guess" complexity code. This code was then sent by e-mail to the respective modelling group for confirmation or correction (see Acknowledgements). Out of the 19 groups contacted within this survey, 17 confirmed or corrected the code and 2 did not answer. Among the 17 groups providing feedback, a single scientist from one group was not sure whether the proposed method is suitable to measure model complexity, but did not reject it either. In light of the many participating scientists (up to three individuals per group were contacted to enhance the probability of a response), this is considered favourable feedback. The final codes are listed in Table 1, column 7. The sum of the integers is taken as an estimator of the complexity of the coupled model configuration and is referred to as the "complexity score" in the following. In light of the various available definitions of the term "Earth System Model" (Collins et al., 2011; Yukimoto et al., 2011; Jones, 2020), this is a flexible approach intended as a starting point for further specifications in the future.
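The complexity score, i.e. the digit sum of a ten-character code such as 2222122220, can be computed as follows (a hypothetical helper for illustration, not part of the study's code):

```python
def complexity_score(code):
    """Digit sum of a ten-component complexity code (each digit 0, 1 or 2),
    as described in the text for e.g. MRI-ESM ('2222122220')."""
    if len(code) != 10 or not set(code) <= {"0", "1", "2"}:
        raise ValueError("expected a ten-character code over {0,1,2}")
    return sum(int(c) for c in code)
```

For the MRI-ESM example above, the score evaluates to 17.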
Note that the complexity score defined here only measures the number and treatment of the climate system components considered by a given coupled model configuration. It does not measure the comprehensiveness of the individual component models, nor the coupling frequency or the treatment of the forcing datasets, among others. The score should thus be interpreted as an overarching and a priori indicator of climate system representativity, and by no means can it compete with in-depth studies treating model comprehensiveness for single climate system components (Séférian et al., 2020). For further details on the 56 coupled model configurations considered here, the interested reader is referred to the reference articles listed in Table 1.

Overall model performance results
In Figure 2, the MAE of JRA-55 w.r.t. ERA-Interim is mapped (panel a), complemented by the corresponding rank within the multi-model ensemble plus JRA-55 (panel b). In the ideal case, the MAE for JRA-55 is lower than for any of the 56 CMIP models, which means that the alternative reanalysis ranks first and that a change in the reference reanalysis does not influence the model ranking. This result is indeed obtained for a large fraction of the NH. However, in the Gobi desert, in Greenland and the surrounding seas, and particularly in the southwestern United States of America, substantial differences are found between the two reanalyses. Since different reanalyses from roughly the same generation are in principle equally representative of the "truth" (Sterl, 2004), the models are here evaluated twice in order to obtain a robust picture of their performance. In the present article, the evaluation results w.r.t. ERA-Interim are mapped, and deviations from the evaluation against JRA-55 in the 3 relevant regions are pointed out in the text. In the remaining regions, reanalysis uncertainty plays a minor role. Nevertheless, for the sake of completeness, the full atlas of the JRA-55-based evaluation was added to the supplementary material of this study. For a quick overview of the results, Table 1 indicates whether a given model agrees more closely with ERA-Interim or JRA-55 in the 3 sensitive regions. In the following, this is referred to as "reanalysis affinity".
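Determining the reanalysis affinity amounts to comparing a model's MAE w.r.t. each of the two references, which can be sketched as follows (an illustrative helper with invented names):

```python
import numpy as np

def reanalysis_affinity(model_freqs, era_freqs, jra_freqs):
    """Report which reanalysis a model agrees with more closely,
    i.e. which one yields the lower MAE of the LWT frequencies."""
    m = np.asarray(model_freqs, dtype=float)
    mae_era = float(np.mean(np.abs(m - np.asarray(era_freqs, dtype=float))))
    mae_jra = float(np.mean(np.abs(m - np.asarray(jra_freqs, dtype=float))))
    if mae_era < mae_jra:
        return "ERA-Interim"
    if mae_jra < mae_era:
        return "JRA-55"
    return "tie"
```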
Figure 2 also shows that the LWT usage criterion defined in Section 3.1 is met almost everywhere in the domain, except in the high-mountain areas of central Asia (grey areas within the performance maps indicate that the criterion is not met). This region is governed by the monsoon rather than by the turnover of dynamic low- and high-pressure systems the LWT approach was developed for. It is thus justified to use the approach over such a large domain.
Grouped by their geographical origin, Sections 4.1 to 4.8 describe the composition of the 56 participating coupled models in terms of their atmosphere, land-surface, ocean and sea-ice models in order to make clear whether there are shared components between nominally different models that might explain common error structures. The names of all other component models are documented in the Python function get_historical_metadata.py contained in https://doi.org/10.5281/zenodo.4452080. Then, the regional error and ranking details are provided. In Section 4.9, these results are summarized in a single boxplot and put into relation with the resolution setup of the atmosphere and ocean component models. The role of internal model variability is also assessed there. A complete list of all participating component models is provided in the aforementioned Python function.
The first result common to all models is the spatial structure of the absolute error expressed by the MAE. Namely, the models tend to perform better over ocean areas than over land, and perform worst over high-mountain areas, particularly in central Asia. Further regional details are documented in the following sections.

The atmosphere, land-surface and ocean dynamics in the Hadley Centre Global Environment Model version 2 (HadGEM2) are represented by the HadGAM2, MOSES2 and HadGOM2 models, respectively. Both the CC and ES model versions comprise interactive vegetation properties, land carbon and ocean carbon cycle processes, and aerosols. The ES version also includes interactive atmospheric chemistry which, in turn, is prescribed in the CC configuration, making the latter slightly less complex (Collins et al., 2011; Martin et al., 2011). This centre's model contributions to CMIP6 follow the concept of seamless prediction (Palmer et al., 2008), in which lessons learned from short-term numerical weather forecasting are exploited for the improvement of longer-term predictions/projections up to climatic time-scales, using a "unified" or "joint" model for all purposes (Roberts et al., 2019). For atmosphere and land-surface processes, these are the Unified Model Global Atmosphere 7 (UM-GA7) AGCM and the Joint UK Land Environment Simulator (JULES) (Walters et al., 2019). However, the specific CMIP6 model version considered here (HadGEM3-GC31-MM) is a very high-resolution AOGCM configuration comprising only one further interactive component (aerosols). In comparison with HadGEM2-ES and CC, HadGEM3-GC31-MM is therefore less complex.
With nearly identical error and ranking patterns resulting from their almost identical configurations, the two model versions used in CMIP5 (HadGEM2-CC and ES) already yield a good to very good performance which, for the European sector, is in line with Perez et al. (2014) and Stryhal and Huth (2018). Only a close look reveals slightly lower errors for the ES version, particularly in a region extending from western France to the Ural mountains (see Figure 3).

Model contributions from North America
The Geophysical Fluid Dynamics Laboratory Climate Models 3 and 4 (GFDL-CM3 and CM4) are composed of in-house atmosphere, land-surface, ocean and sea-ice models and comprise interactive vegetation properties, aerosols and atmospheric chemistry (Griffies et al., 2011; Held et al., 2019). GFDL-CM4 also includes simple land and ocean carbon cycle representations which, however, do not feed back on other climate system components. From CM3 to CM4, a considerable resolution increase was undertaken, except for a reduction in the AGCM's vertical levels, and this actually pays off in terms of model performance (see Figure 4). While GFDL-CM3 only ranks well in an area ranging from the Great Plains to the central North Pacific, GFDL-CM4 yields balanced results over the entire NH mid-to-high latitudes and is one of the best models considered here. Notably, GFDL-CM4 also performs well over central Asia and in an area ranging from the Black Sea to the Middle East, which is where most of the other models perform less favourably. Note also that GFDL's Modular Ocean Model (MOM) is the standard OGCM in all ACCESS models and is also used in the BCC-CSM model versions (see Table 1 for details).
GISS-E2.1-G is the CMIP6 model version assessed here (note that the 6-hourly SLP data for the more complex model versions contributing to CMIP6 were not available from the ESGF data portals). All these versions comprise a relatively modest resolution for the atmosphere and ocean, and no refinement was undertaken from CMIP5 to 6. However, many parametrization schemes were improved. GISS-E2.1-G generally ranks better than its predecessors, except in eastern Siberia and China, where very good ranks are obtained by the two CMIP5 versions (see Figure 4). The small differences between the results for GISS-E2-H and R might stem from internal model variability (see also Section 4.9) and from the use of two distinct OGCMs. Unfortunately, all GISS-E2 model versions considered here are plagued by pronounced performance differences from one region to another, meaning that they are less balanced than e.g. GFDL-CM4.

CCSM4 (Gent et al., 2011; Craig et al., 2012) is the model version considered here; it was used in CMIP5 and includes interactive vegetation properties and land carbon cycle processes, whereas aerosols are prescribed. During the course of the last decade, CCSM4 has been further developed into CESM1 and 2 (Hurrell et al., 2013; Danabasoglu et al., 2020) which, due to data availability issues, can unfortunately not be assessed here.

The Canadian Earth System Model version 2 (CanESM2) is composed of the CanAM4 AGCM, the CLASS2.7 land surface model, the CanOM4 OGCM and the CanSIM1 sea-ice model (Chylek et al., 2011). It contributed to CMIP5 and comprises interactive vegetation properties, land and ocean carbon cycle processes and aerosols, whilst the ice sheet area is prescribed.
Results indicate a comparatively poor performance for both CCSM4 and CanESM2. Exceptions are found along the North American west coast and in the Labrador Sea, where both models perform well; in the central to eastern subtropical Pacific and in northwestern Russia plus Finland, where CCSM4 performs well; and in Quebec, Scandinavia and eastern Siberia, where CanESM2 ranks well (see Figure 4). As for the GISS models, both CCSM4 and CanESM2 are also plagued by large regional performance differences.
Regarding the models' reanalysis affinity, GFDL-CM3 tends towards ERA-Interim in the seas around Greenland and towards JRA-55 in the Gobi desert, while being almost insensitive to the reanalysis choice in the southwestern U.S. (compare Figure 4 with the "figs-refjra55/maps/rank" folder in the supplementary material). GFDL-CM4 has similar reanalysis affinities.

A performance gain is obtained for CNRM-ESM2-1, whereas a performance loss is observed for CNRM-CM6-1-HR. This is surprising since, in addition to improved parametrization schemes, the model resolution in the atmosphere and ocean was particularly increased in the latter model version.
All IPSL-CM model versions participating in CMIP5 and 6 comprise interactive vegetation properties and terrestrial carbon cycle processes, as well as prescribed aerosols and atmospheric chemistry. Ocean biogeochemistry processes are simulated online, but do not feed back on other components of the climate system. A simple representation of ice sheet dynamics was included in IPSL-CM6A-LR (Boucher et al., 2020; Hourdin et al., 2020; Lurton et al., 2020), but is absent in IPSL-CM5A-LR and MR (Dufresne et al., 2013). The two model versions used in CMIP5 were run with a modest horizontal resolution in the atmosphere (LMDZ) and ocean (NEMO). This changed for the better in IPSL-CM6A-LR, where a more competitive resolution was applied and all component models were improved. The result is a considerable performance increase from CMIP5 to CMIP6. Whereas both IPSL-CM5A-LR and IPSL-CM5A-MR perform poorly, IPSL-CM6A-LR does much better virtually anywhere in the NH mid-to-high latitudes, a finding that is insensitive to the effects of internal model variability (see Section 4.9).
The quite different results between the CNRM and IPSL models indicate that the common ocean component (NEMO) only marginally affects the simulated atmospheric circulation as defined here. All CNRM models, and also IPSL-CM6A-LR, tend towards ERA-Interim in the southwestern U.S. and towards JRA-55 in the seas around Greenland and the Gobi desert. IPSL-CM5A-LR and MR are virtually insensitive to the reanalysis choice (compare Figure 5 with the "figs-refjra55/maps/rank" folder in the supplementary material).

Model contributions from China, Taiwan and India
Several climate system components are prescribed in the FGOALS-g2 model configuration. For FGOALS-g3, the model version contributing to CMIP6, the AGCM was updated to GAMIL3, including convective momentum transport, stratocumulus clouds, anthropogenic aerosol effects and an improved boundary layer scheme as new features (Li et al., 2020). The OGCM and coupler were also updated (to LICOM3 and CPL7), and a modified version of CLM4.5 (called CAS-LSM) is used as land surface model, whereas the sea-ice model is practically identical to that used in the g2 version. In the g3 version, vegetation properties, terrestrial carbon cycle processes and aerosols are prescribed. While FGOALS-g2 is one of the worst performing models considered here, FGOALS-g3 performs considerably better, particularly over the northwestern and central North Atlantic Ocean, western North America and the North Pacific.
The Nanjing University of Information Science and Technology Earth System Model version 3 (NESM3) is a new CMIP participant and is entirely built upon component models from other institutions (Cao et al., 2018). Namely, the AGCM, land-surface model, coupling software and atmospheric resolution are adopted from MPI-ESM1.2-LR (see Section 4.6), whereas NEMO3.4 and CICE4.1 are taken from IPSL and NCAR, respectively (Cao et al., 2018).

Results indicate a systematic performance increase from MIROC5 to MIROC6 in the presence of large performance differences from one region to another (see Figure 6). Both models perform very well over the Mediterranean and northwestern North America.

The CMIP5 version MRI-ESM1 comprises active component models for terrestrial carbon cycle processes, aerosols, atmospheric photochemistry and ocean biogeochemistry, whereas vegetation properties are prescribed (Yukimoto et al., 2011). In the CMIP6 version (MRI-ESM2), terrestrial and ocean carbon cycle processes are no longer interactive but prescribed from external files (Yukimoto et al., 2019). Notably, each model component and also the coupler have been originally developed by MRI, and the coupling applied in these models is particularly comprehensive (Yukimoto et al., 2011). The comparatively high model resolution applied in MRI-ESM1 was further refined in MRI-ESM2 by adding more vertical layers, particularly in the atmosphere (see Table 1).

This might be due to the use of different ocean models (see Table 1), or precisely due to the effects of the particular parametrization schemes mentioned above. Although the error magnitude of SAM0-UNICON is similar to that of CMCC-CM2-SR5, SAM0-UNICON exhibits weaker regional performance differences, making it the more balanced model of the two. In most regions of the NH mid-to-high latitudes, SAM0-UNICON yields better results than NorESM2-LM but is outperformed by NorESM2-MM.
The MRI models generally agree more closely with ERA-Interim than with JRA-55, which is surprising since JRA-55 was also developed at JMA (compare Figure 7 with the "figs-refjra55/maps/rank" folder in the supplementary material).

Results show that the vertical resolution increase in the atmosphere undertaken from MPI-ESM-LR to MR (the CMIP5 versions) sharpens the regional performance differences rather than contributing to an improvement (see Figure 8).

The resolution of CMCC-CM is comparable to other CMIP5 models, except for the very few vertical layers used in the atmosphere (see Table 1). CMCC-CM2-SR5 is likewise one of the best models considered here. Note that this model, due to identical model components for all realms except the ocean, is a good estimator of the performance of CESM1, which unfortunately cannot be assessed here due to data availability issues. The error and ranking patterns of CMCC-ESM2 are similar to those of CMCC-CM2-SR5, yielding fewer regional differences and a much better performance over the central eastern North Atlantic Ocean. Hence, CMCC-ESM2 is not only the most sophisticated but also the best performing model version in this family.

The Norwegian Earth System Model (NorESM) shares substantial parts of its source code with the NCAR model family. In NorESM2, vegetation properties and atmospheric chemistry are prescribed, and the coupler has been updated from CPL7 to CIME, which is also used in CESM2. In the present study, the basic configuration NorESM2-LM is evaluated together with NorESM2-MM, the latter using a much finer horizontal resolution in the atmosphere (see Table 1). The corresponding maps in Figure 10 reveal an error maximum over the Urals which, further to the east, re-emerges over the Baikal region. In the higher-resolution version NorESM2-MM, these errors are further reduced to a large degree, with the overall effect of obtaining one of the best models considered here.
In the 3 regions of pronounced reanalysis uncertainty, CMCC-CM is in closer agreement with JRA-55, whereas CMCC-CM2-SR5 and CMCC-ESM2 are more similar to ERA-Interim, reflecting the profound change in the model components from CMIP5 to 6 (compare Figure 10 with the "figs-refjra55/maps/rank" folder in the supplementary material). For the NorESM family, different reanalysis affinities are obtained for the 3 regions. While NorESM1 is closer to JRA-55 in all of them, NorESM2-LM is closer to ERA-Interim in the southwestern U.S., but closer to JRA-55 in the Gobi desert. NorESM2-MM is generally less sensitive to reanalysis uncertainty, with some affinity to ERA-Interim in the southwestern United States.

Summary boxplot, role of model resolution, model complexity and internal variability
For each model version listed in Table 1, the spatial distribution of the pointwise MAE values can also be represented by a boxplot instead of a map, which allows for an overarching performance comparison visible at a glance (see Figure 11 for the evaluation against ERA-Interim). Here, the standard configuration of the boxplot is applied. For a given sample of MAE values corresponding to a specific model, the box refers to the interquartile range (IQR) of that sample and the horizontal bar to the median. Whiskers are drawn at the 75th percentile + 1.5 × IQR and at the 25th percentile − 1.5 × IQR. All values outside this range are considered outliers (indicated by dots). Four additional boxplots are provided for the joint MAE samples of the more complex model versions (reaching a score ≥ 14) and the less complex versions used in CMIP5 and 6. In these 4 cases, outliers are not plotted for the sake of simplicity. The acronyms of the coupled model configurations, as well as their participation in either CMIP5 or 6 (indicated by the final integer), are shown below the x-axis. Along the x-axis, the names of the coupled models' atmospheric components are also shown, since some of them are shared by various research institutions (see also Table 1).
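The quartile, whisker and outlier definitions given above can be reproduced numerically, e.g. to count outliers per model without plotting. This is a sketch with invented names, not the study's code:

```python
import numpy as np

def boxplot_summary(mae_values, whis=1.5):
    """Median, IQR, whisker limits and outliers following the standard
    boxplot convention (whiskers at Q3 + 1.5*IQR and Q1 - 1.5*IQR)."""
    x = np.asarray(mae_values, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - whis * iqr, q3 + whis * iqr    # whisker limits
    outliers = x[(x < lo) | (x > hi)]            # values plotted as dots
    return {"median": float(med), "iqr": float(iqr),
            "whisker_low": float(lo), "whisker_high": float(hi),
            "outliers": outliers.tolist()}
```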
Results indicate a performance gain for most model families when switching from CMIP5 to 6 (available model pairs are located next to each other in Figure 11). The largest improvements are obtained for those models performing relatively poorly in CMIP5 (see Figure 11 and Table 1).
A virtual lack of outliers is another remarkable advantage of NorESM2-MM. MRI-ESM2 and GFDL-CM4 also produce relatively few outliers, but more than NorESM2-MM. The fewest outliers among all models are obtained for EC-Earth, irrespective of the model version.

The model evaluation against JRA-55 reveals similar results (see "figs-refjra55/as-figure-10-but-wrt-jra55.pdf" in the supplementary material), indicating that the uncertain reanalysis data in the 3 relevant regions detected above do not substantially affect the hemisphere-wide statistics. What is noteworthy, however, is the slight but nevertheless visible performance loss for the EC-Earth model family, bringing EC-Earth3 approximately to the performance level of HadGEM3-GC31-MM. If evaluated against JRA-55, all EC-Earth model versions also comprise more outlier results. EC-Earth's affinity to ERA-Interim might be explained by the fact that this reanalysis was also built with the ECMWF IFS. Table 2 provides the rank correlation coefficients between the median MAE w.r.t. ERA-Interim for each model, corresponding to the horizontal bars within the boxes in Figure 11, and several measures of model resolution. Note that, due to an unstructured grid in one ocean model, the breakdown into zonal and meridional resolution cannot be made in this realm.
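The rank (Spearman) correlations reported in Table 2 can be computed as sketched below. This pure-NumPy variant assumes no tied values (ties would require averaged ranks, as provided e.g. by scipy.stats.spearmanr); the function name is invented for this example:

```python
import numpy as np

def spearman_rank_corr(x, y):
    """Spearman rank correlation, e.g. between the models' median MAE and
    the number of atmospheric grid points. Assumes no tied values, in
    which case argsort-of-argsort yields the ranks directly."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rx = np.argsort(np.argsort(x)).astype(float)  # 0-based ranks
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(np.sum(rx * ry) / np.sqrt(np.sum(rx**2) * np.sum(ry**2)))
```

Being rank-based, the coefficient captures monotone but non-linear relationships such as the one described below for performance versus atmospheric mesh size.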
As can be seen from Table 2, average model performance is more closely related to the horizontal than to the vertical resolution in the atmosphere. Associations with the ocean resolution are weaker, as expected, but nevertheless significant. Since the resolution increase for most models has gone hand in hand with improvements in the internal parameters (parametrizations, model physics, bug fixes), it is difficult to say which of these two effects is more influential on model performance. However, most of the models undergoing a version change without a resolution increase do not experience a clear performance gain either. This is observed for the 3 ACCESS versions using the same AGCM (i.e. GA in 1.3, CM2 and ESM1-5) and also for the 3 model versions from GISS, all comprising the same horizontal resolution in the atmosphere within their respective model family. Likewise, CNRM-CM6-1 and MPI-ESM1-2-LR even perform slightly worse than their predecessors (CNRM-CM5 and MPI-ESM-LR), meaning that the update is counterproductive for their performance (see Figure 11). This points to the fact that resolution is likely more influential on performance than model updates, as long as the latter are not too substantial.
Interestingly, the relationship between the models' median performance and the horizontal mesh size of their atmospheric component is non-linear (rs = −0.72), with an abrupt shift towards better results at approximately 25,000 grid points (see Figure 13a).
Figure 13b shows the complexity score described in Section 3.3, plotted against the coupled models' median performance.
The figure reveals that the best performing model family (EC-Earth) is not the most complex one, and that some model configurations performing less well are particularly complex (e.g. CNRM-ESM2-1). Also, performance is generally unrelated to complexity; since adding component models apparently does not degrade circulation performance, this is an argument in favour of including more of them to reach a more complete representation of the climate system.

In comparison with the inter-model variability discussed above, the internal model variability (or "intra-model variability") is much smaller and only marginally affects the results: for all runs of a given model version, the results are in close agreement, even for the outliers (see Figure 12). Although the use of alternative model runs might lead to slight shifts in the ranking order at the grid-box scale, a "good" rank would not change into an "average" or even "bad" one. However, while internal model variability only plays a minor role in the context of the present study, some specific models indeed seem to be more sensitive to initial conditions uncertainty (which is where the ensemble spread stems from in the experiments considered here) than others, with NorESM2-LM (the lower-resolution version only) and NESM3 seemingly being less stable in this sense. Remarkably, MPI-ESM1.2-HR is found to be stable in spite of the fact that it is considered a more "unstable" configuration by its development team, because the carbon cycle had not been run to equilibrium for this version (Mauritsen et al., 2019). It is also good news that HadGEM2-ES, known to perform well for r1i1p1 and consequently used as a baseline for many downscaling applications and impact studies in the past (Gutiérrez et al., 2013; Perez et al., 2014; San-Martín et al., 2016), performs nearly identically for r2i1p1. Lastly, the large performance increase from IPSL-CM5A-LR to IPSL-CM6A-LR is likewise robust to the effects of internal variability.

Specific model performance for each Lamb weather type
In Figures 14 to 16, the simulated, hemisphere-wide frequency pattern for a given model and LWT is compared with the respective quasi-observed frequency pattern obtained from ERA-Interim by means of a normalized Taylor diagram (Taylor, 2001).
The first thing to note here is that, for most LWTs, the models tend to cluster in a region that would generally be considered a good result. Except for some outlier models and individual LWTs, the pattern correlation lies between 0.6 and 0.9, the standard deviation ratio is not too far from unity (the best result) and the centred normalized RMSE ranges between 0.25 and 0.75 times the standard deviation of the observed frequency pattern.

It is also found that all members of the EC-Earth model family yield the best results for any LWT (note the proximity of the yellow cluster to the perfect score indicated by the black half circle). Within the group of the more complex models, NorESM2-MM (the rose triangle pointing to the left) performs best and actually lies in close proximity to the EC-Earth cluster for most LWTs. The Hadley Centre and ACCESS models (filled with orange and dark blue) form another cluster that generally performs very well for most LWTs. However, the spatial standard deviation of the three easterly LWTs (cyclonic, anticyclonic and directional) is overestimated by these models, as indicated by a standard deviation ratio of ≈1.25, while values close to unity or below are obtained for the remaining models. It is also worth mentioning that not only ACCESS1.0 but also the other, more independently developed ACCESS versions belong to this cluster, which reflects the common origin of their atmospheric component (the Met Office Hadley Centre) even at the level of detail of specific weather types. For all other models, the LWT-specific results do not deviate largely from the overall MAE results shown in Section 4, meaning that overall performance is generally also a good indicator of LWT-specific performance. As an example, MIROC-ESM (the blue-green cross) as well as IPSL-CM5A-LR and IPSL-CM5A-MR (the grey cross and grey plus) are located in the "weak" area of the Taylor diagram for each of the 27 LWTs, which is in line with the likewise weak overall performance obtained for these models in Section 4.
The corresponding results for the model evaluation against JRA-55 are generally in close agreement with those mentioned above, except that the EC-Earth model family performs slightly less favourably (see the "figs-refjra55/taylor" folder in the supplementary material to this article).

Summary and Conclusions
In the present study, 56 coupled general circulation model versions contributing historical experiments to CMIP5 and 6 have been evaluated in terms of their capability to reproduce the observed frequencies of the 27 atmospheric circulation types originally proposed by Lamb (1972), as represented by the ERA-Interim and JRA-55 reanalyses. The outcome is an objective, regional-scale ranking catalogue that is expected to be of interest for the model development teams themselves, and also for the downscaling and regional climate modelling community asking for model selection criteria. In this context, the present study is a direct response to the call for a circulation-based model performance assessment made by Maraun et al. (2017). In addition, a straightforward method to describe the complexity of the coupled model configurations, in terms of the climate system components they include, has been proposed.
On average, the model versions used in CMIP6 perform better than their CMIP5 predecessors. This finding is in line with Cannon (2020).

For a subgroup of 13 out of 56 models, the impact of internal model variability on the performance was assessed with 72 additional historical model integrations, each one initialized from a unique starting date of the corresponding pre-industrial control run. The initial-conditions uncertainty created in this way has little effect on the overall results. Although the point-wise ranking order might change by a few integers when alternative runs are evaluated (which is why a "best model" map is intentionally not provided here), a well performing model would not change into an "intermediate" one, or vice versa, if another ensemble member were put to the test. A similarly small effect was found for changing the reference reanalysis from ERA-Interim to JRA-55, with the exception of three regions where the choice of reanalysis alters the models' ranking order: the southwestern United States, the Gobi desert, and Greenland plus the surrounding seas.
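The robustness of the ranking order to small score perturbations (from alternative runs or a different reference reanalysis) can be sketched as follows. The MAE values are invented for illustration and do not correspond to any model in the study.

```python
import numpy as np

def rankings(errors):
    """Rank models from best (rank 1) to worst given error scores."""
    order = np.argsort(errors)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(errors) + 1)
    return ranks

# Hypothetical MAE scores of five models against two reference
# datasets; the second differs only by small perturbations.
mae_ref_a = np.array([0.10, 0.25, 0.18, 0.40, 0.12])
mae_ref_b = mae_ref_a + np.array([0.01, -0.02, 0.01, 0.00, -0.005])

ranks_a, ranks_b = rankings(mae_ref_a), rankings(mae_ref_b)
# Small score perturbations leave the ranking order unchanged here,
# i.e. a "good" model does not turn into an "intermediate" one.
print((ranks_a == ranks_b).all())  # → True
```

When inter-model differences exceed the perturbation size, as in the study, the ranking is stable; only near-ties can swap by a few integers.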
Since the inclusion of more component models in a coupled model configuration provides a more complete representation of the climate system and also yields distinguishable future scenarios (Séférian et al., 2019; Jones, 2020), it would make sense to consider model complexity as an additional selection criterion in future studies. The approach proposed here is intended to be a straightforward starting point for measuring this criterion. It should be further refined as soon as more detailed model documentation, already provided for some climate system components (Séférian et al., 2020), becomes available in a systematic way, e.g. via the Earth System Documentation project (https://es-doc.org/).
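A complexity measure of this kind can be as simple as counting the interactively coupled components of a configuration. The component list and example configurations below are illustrative assumptions, not the paper's actual metadata scheme.

```python
# Candidate climate system components; which ones count, and how
# they are documented, is an assumption for this sketch.
COMPONENTS = ("atmosphere", "ocean", "sea_ice", "land",
              "aerosols", "atmos_chemistry", "ocean_bgc", "land_carbon")

def complexity(config):
    """Number of interactively simulated components in a configuration."""
    return sum(1 for c in COMPONENTS if config.get(c, False))

# A physical-core model vs. a full Earth System Model configuration.
physical_model = dict(atmosphere=True, ocean=True, sea_ice=True, land=True)
earth_system_model = dict(atmosphere=True, ocean=True, sea_ice=True,
                          land=True, aerosols=True, atmos_chemistry=True,
                          ocean_bgc=True, land_carbon=True)

print(complexity(physical_model))      # → 4
print(complexity(earth_system_model))  # → 8
```

Such a count could be refined with per-component weights once systematic documentation (e.g. via ES-DOC) becomes available.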
Complementary to Brunner et al. (2020), the metadata provided here about the participating component models can also be used to estimate the a priori degree of dependence between the numerous coupled model configurations used in CMIP.

Appendix A
The ocean grids referred to in Table 1 …

Acknowledgements. I would also like to thank the Agencia para la Modernización Tecnológica de Galicia (AMTEGA) and the Centro de Supercomputación de Galicia (CESGA) for providing the necessary computational resources.

Cannon, A. J.: Reductions in daily continental-scale atmospheric circulation biases between generations of global climate models: CMIP5 to CMIP6, Environmental Research Letters, 15, 2020.