An improved multivariable integrated evaluation method and NCL code for multimodel intercomparison (MVIETool version 1.0)

An evaluation of a model's overall performance in simulating multiple fields is fundamental to model intercomparison and development. A multivariable integrated evaluation (MVIE) method was previously proposed based on a vector field evaluation (VFE) diagram, which provides a quantitative and comprehensive evaluation of multiple fields. In this study, we improve this method in the following respects. (1) We take area weighting into account in the definition of the statistics in the VFE diagram and the MVIE method, which is particularly important for a global evaluation. (2) We allow a combination of multiple scalar fields and vector fields to be evaluated, whereas the previous MVIE method supported multiple scalar fields alone. (3) A multivariable integrated skill score (MISS) is proposed as a flexible index to measure a model's ability to simulate multiple fields. Compared with the multivariable integrated evaluation index (MIEI) proposed in the previous study, MISS is a normalized index that can adjust the relative importance of different aspects of model performance. (4) A simple and straightforward tool, the Multivariable Integrated Evaluation Tool (MVIETool version 1.0), is developed to facilitate an intercomparison of the performance of various models. The tool is coded in the open-source NCAR Command Language (NCL) and supports both the calculation of MVIE statistics and plotting. With the support of this tool, one can easily evaluate model performance in terms of each individual variable and/or multiple variables.

https://doi.org/10.5194/gmd-2020-310 Preprint. Discussion started: 27 November 2020. © Author(s) 2020. CC BY 4.0 License.

Climate models are commonly evaluated in terms of their ability to simulate the historical climate against observational or reanalysis data, using performance metrics (Pincus et al., 2008; Flato et al., 2013). A set of useful metrics and diagrams has been developed for model evaluation. The widely used Taylor diagram summarizes model performance in simulating a scalar field using the correlation coefficient (CORR), standard deviation (SD), and root-mean-square difference (Taylor, 2001).
Objective performance metrics (e.g., relative error and portrait diagrams) have been proposed for the evaluation of various variables (Gleckler et al., 2008). Xu et al. (2016) devised a vector field evaluation (VFE) diagram, which can be regarded as a generalized Taylor diagram. The VFE method allows an evaluation of a model's ability to simulate a vector field (Huang et al., 2019). Based on the VFE diagram, Xu et al. (2017) further developed a multivariable integrated evaluation (MVIE) method to evaluate model performance in terms of multiple fields by grouping various normalized scalar fields into an integrated vector field. The MVIE method also defined a multivariable integrated evaluation index (MIEI) to summarize a model's overall performance in simulating multiple fields. The MIEI, the VFE diagram, and the performance metrics of individual scalar variables constitute a hierarchical model evaluation framework, which can provide a quantitative and comprehensive evaluation of model performance.
However, the MVIE method proposed by Xu et al. (2017) considered only the integrated evaluation of various scalar fields. Under certain circumstances, both scalar and vector variables (e.g., air temperature and vector wind fields) warrant evaluation together. Moreover, most previous model performance metrics, including the Taylor diagram, VFE diagram, and MVIE method, did not consider spatial weighting, which is potentially important for the evaluation of a global field. This paper aims to improve the MVIE method in the following respects: (1) We take spatial weighting into account in all the statistics involved in the VFE diagram and the MVIE method. (2) The improved MVIE method allows a mixed evaluation of scalar and vector fields. (3) Based on the MIEI, a multivariable integrated skill score (MISS) for a climate model is proposed, which allows us to adjust the relative importance of different aspects of model performance. Finally, we develop a Multivariable Integrated Evaluation Tool (MVIETool version 1.0) to facilitate multimodel intercomparison.

The paper is organized as follows. Section 2 defines statistical metrics that take spatial weighting into account. Section 3 introduces the improved MVIE method with a combination of scalar and vector fields and interprets the performance metrics. Section 4 gives an overview of the MVIETool and describes its technical workflow, including the setting of the arguments in the scripts. In Section 5, applications of the tool are demonstrated with three examples using ten CMIP Phase 5 (CMIP5) models. Finally, a summary is given in Section 6.

Statistical metrics
The MVIE method primarily consists of three statistical quantities, the root-mean-square length (RMSL), vector similarity coefficient (VSC), and root-mean-square vector difference (RMSVD), that measure model performance in simulating a vector field from various aspects (Xu et al., 2017). The RMSL measures the magnitude of a vector field, the VSC measures the similarity of two vector fields, and the RMSVD measures the overall difference between two vector fields. The MIEI was defined using the root-mean-square (rms) values of all variables and the VSC, and is a concise metric for ranking models in terms of their performance in simulating multiple fields. However, the definition of these statistical quantities did not consider area weighting, which could to a certain extent misrepresent the relative contribution of different latitudes to the statistics. Here, we redefine these statistical quantities taking area weighting into account. Each field is composed of N vectors in time and/or space, and M is the dimension of the integrated vector field.

Uncentered statistics
Similar to the weighted statistics defined by Watterson (1996), we define the weighted RMSL (L_A, L_O), VSC (R_v), and RMSVD as follows:

$$L_A = \sqrt{\sum_{j=1}^{N} w_j\,|\mathbf{A}_j|^2}, \qquad L_O = \sqrt{\sum_{j=1}^{N} w_j\,|\mathbf{O}_j|^2}, \tag{1}$$

$$R_v = \frac{\sum_{j=1}^{N} w_j\,\mathbf{A}_j\cdot\mathbf{O}_j}{L_A\,L_O}, \tag{2}$$

$$\mathrm{RMSVD} = \sqrt{\sum_{j=1}^{N} w_j\,|\mathbf{A}_j-\mathbf{O}_j|^2}, \tag{3}$$

where A_j and O_j are the modeled and reference vectors, respectively, w_j is the spatial weighting factor, and the sum of w_j is equal to 1. In the case of equal weighting, w_j is equal to 1/N for all j.
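The weighted statistics can be sketched in code. The tool itself is written in NCL; the following illustrative Python/NumPy sketch computes the weighted RMSL, VSC, and RMSVD of two vector fields under the definitions given above (L = sqrt(Σ_j w_j |v_j|²), VSC = Σ_j w_j A_j·O_j/(L_A L_O), RMSVD = sqrt(Σ_j w_j |A_j − O_j|²)):

```python
import numpy as np

def weighted_stats(A, O, w):
    """Area-weighted RMSL, VSC, and RMSVD for two vector fields.

    A, O : arrays of shape (M, N) -- M-dimensional vectors at N grid points
    w    : weights of shape (N,); normalized internally so that sum(w) = 1
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                                        # enforce sum(w) = 1
    L_A = np.sqrt(np.sum(w * np.sum(A**2, axis=0)))        # weighted RMSL of A
    L_O = np.sqrt(np.sum(w * np.sum(O**2, axis=0)))        # weighted RMSL of O
    R_v = np.sum(w * np.sum(A * O, axis=0)) / (L_A * L_O)  # weighted VSC
    RMSVD = np.sqrt(np.sum(w * np.sum((A - O)**2, axis=0)))
    return L_A, L_O, R_v, RMSVD
```

For a latitude-longitude grid, w would typically be proportional to the cosine of latitude. The three quantities satisfy the cosine law RMSVD² = L_A² + L_O² − 2 L_A L_O R_v by construction.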
Note that the RMSVD does not decrease monotonically with an improvement in model performance. To measure model performance more accurately, Xu et al. (2017) devised a multivariable integrated evaluation index, termed MIEI, of climate model performance:

$$\mathrm{MIEI}^2 = \frac{1}{M}\sum_{m=1}^{M}\left(\mathrm{rms}_m - 1\right)^2 + 2\left(1 - R_v\right), \tag{7a}$$

where rms_m is the rms value of the mth normalized variable. Note that the first and second terms on the right-hand side of Eq. (7a) can vary from 0 to +∞ and from 0 to 4, respectively.
Thus, the MIEI may be too sensitive to rms bias and insensitive to pattern bias. To fix this problem, we redefine the MIEI by replacing rms_m in Eq. (7a) with the normalized amplitude L*_Am, defined as

$$L^{*}_{Am} = \begin{cases} L_{Am}, & L_{Am} \le 1,\\ 1/L_{Am}, & L_{Am} > 1, \end{cases} \tag{8}$$

so that L*_Am lies between 0 and 1 and equals 1 for a perfect amplitude. The relative importance of a model's ability to simulate the pattern similarity and the amplitude of variables depends on the application. Hence, a weighting factor F is added to the MIEI to adjust the relative importance of rms and VSC:

$$\mathrm{MIEI}_F^2 = \frac{1}{M}\sum_{m=1}^{M}\left(L^{*}_{Am} - 1\right)^2 + F\left(1 - R_v\right). \tag{9}$$

We can further define a multivariable integrated skill score (MISS) of a climate model:

$$\mathrm{MISS} = 1 - \frac{\mathrm{MIEI}_F^2}{F + 1}. \tag{10}$$

MISS varies from −F/(F+1) to 1, and it reaches its minimum value of −F/(F+1) when the VSC equals −1 and L*_Am equals 0 for all variables.
Note that the VSC is usually greater than 0 in model evaluation, so MISS usually varies from 0 to 1; MISS falls below 0 only if the VSC is negative. MISS varies monotonically with respect to model performance and reaches its maximum value of 1 when the model performs best. With an increase in F, MISS becomes less sensitive to the model's ability to simulate amplitudes and more sensitive to its ability to simulate patterns.
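These bounds can be checked numerically. The exact published formulas are not fully reproduced in this excerpt, so the sketch below assumes the forms MIEI_F² = (1/M)Σ(L* − 1)² + F(1 − VSC) and MISS = 1 − MIEI_F²/(F + 1), chosen to match the stated properties (minimum −F/(F+1) at VSC = −1, maximum 1, and positivity whenever VSC > 0):

```python
def miss(l_star, vsc, F=1.0):
    """MISS under the ASSUMED form MISS = 1 - [mean((L*-1)^2) + F*(1-VSC)]/(F+1).

    l_star : normalized amplitude ratios, each in [0, 1] (1 = perfect amplitude)
    vsc    : vector similarity coefficient, in [-1, 1]
    F      : weight of pattern similarity relative to amplitude
    """
    amp_term = sum((x - 1.0) ** 2 for x in l_star) / len(l_star)
    return 1.0 - (amp_term + F * (1.0 - vsc)) / (F + 1.0)
```

A perfect model (all L* = 1, VSC = 1) gives MISS = 1; the worst case (all L* = 0, VSC = −1) gives −F/(F+1); and any VSC > 0 yields a positive score, consistent with the text.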

Centered statistics
As well as the uncentered statistics, centered statistics are important when the anomaly field is the main concern of model evaluation. For the centered mode, the centered RMSL (cRMSL), centered VSC (cVSC), and centered RMSVD (cRMSVD) with spatial weights are defined to evaluate model performance in terms of anomaly fields. The centered statistics are the same as the uncentered statistics, except that the original field is replaced by the anomaly field:

$$cL_A = \sqrt{\sum_{j=1}^{N} w_j\,|\mathbf{A}_j-\bar{\mathbf{A}}|^2}, \qquad cL_O = \sqrt{\sum_{j=1}^{N} w_j\,|\mathbf{O}_j-\bar{\mathbf{O}}|^2}, \tag{11}$$

$$cR_v = \frac{\sum_{j=1}^{N} w_j\,(\mathbf{A}_j-\bar{\mathbf{A}})\cdot(\mathbf{O}_j-\bar{\mathbf{O}})}{cL_A\,cL_O}, \tag{12}$$

$$\mathrm{cRMSVD} = \sqrt{\sum_{j=1}^{N} w_j\,\bigl|(\mathbf{A}_j-\bar{\mathbf{A}})-(\mathbf{O}_j-\bar{\mathbf{O}})\bigr|^2}, \tag{13}$$

where the overbars denote the weighted mean vectors, e.g., $\bar{\mathbf{A}}=\sum_{j=1}^{N} w_j\mathbf{A}_j$.
One can additionally use the vector mean error (VME) to measure the difference between two mean vector fields, since the mean difference is removed from the centered statistics above. The VME can also be written as the root-mean-square error of the two mean fields:

$$\mathrm{VME} = \bigl|\bar{\mathbf{A}}-\bar{\mathbf{O}}\bigr| = \sqrt{\sum_{i=1}^{M}\left(\bar{A}_i-\bar{O}_i\right)^2}, \tag{14}$$

where $\bar{A}_i$ and $\bar{O}_i$ are the ith components of the weighted mean vectors. As the uncentered statistics (Eqs. (1)-(3)) can be transformed into the centered statistics (Eqs. (11)-(13)) by replacing the original field with the anomaly field, cRMSL, cVSC, and cRMSVD also satisfy the cosine law:

$$\mathrm{cRMSVD}^2 = cL_A^2 + cL_O^2 - 2\,cL_A\,cL_O\,cR_v. \tag{15}$$

Furthermore, with the aid of Eq. (3), we can decompose the RMSVD of the original field; with Eqs. (13)-(14), the RMSVD, cRMSVD, and VME satisfy the Pythagorean theorem:

$$\mathrm{RMSVD}^2 = \mathrm{cRMSVD}^2 + \mathrm{VME}^2. \tag{16}$$

Clearly, these statistics for vector variable evaluation satisfy the cosine law and the Pythagorean theorem. Such relationships are also valid for scalar variables (Taylor, 2001; Xu and Han, 2019).
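The Pythagorean decomposition holds because the weighted mean of an anomaly field is zero, so the cross term vanishes. A minimal numerical check (Python/NumPy, for illustration only; the tool itself is NCL):

```python
import numpy as np

def decompose_rmsvd(A, O, w):
    """Return (RMSVD, cRMSVD, VME) with weighted means removed, so that
    RMSVD^2 = cRMSVD^2 + VME^2."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    Abar = (A * w).sum(axis=1, keepdims=True)   # weighted mean vector of A
    Obar = (O * w).sum(axis=1, keepdims=True)   # weighted mean vector of O
    rmsvd = np.sqrt(np.sum(w * np.sum((A - O)**2, axis=0)))
    crmsvd = np.sqrt(np.sum(w * np.sum(((A - Abar) - (O - Obar))**2, axis=0)))
    vme = np.sqrt(np.sum((Abar - Obar)**2))     # error of the mean vectors
    return rmsvd, crmsvd, vme
```

The identity holds for any weights that sum to 1, since Σ_j w_j (A'_j − O'_j) = 0 for the anomaly fields.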
Similar to rms_std (Eq. (6)), the standard deviation of the SDs (SD_std) is also defined to describe the dispersion of the SD over all variables, where cL*_Am is the same as L*_Am except that it is the ratio of SDs.

MVIE with a combination of multiple scalar and vector fields
The MVIE method proposed by Xu et al. (2017) considers only multiple scalar fields. Under some circumstances, one may want to evaluate both scalar and vector fields simultaneously. Here, the MVIE method is improved to meet this need.
Assume there are M individual variables to be evaluated, which are either scalar or vector fields (the upper left part of Fig. 1).
The variable d_m denotes the dimension of the mth variable: d_m is equal to 1 for a scalar field (e.g., temperature), 2 for a two-dimensional vector field (e.g., a vector wind), and so on. Hereafter, the vector field of an individual variable (e.g., an 850-hPa wind field) is termed an individual vector field, to distinguish it from vector fields grouped from multiple fields. Following the idea of the MVIE, these variables are normalized by the respective rms values of the reference. Note that an individual scalar field is normalized by dividing by its rms value, whereas an individual vector field is normalized as a whole by dividing by its RMSL. These normalized scalar and/or vector fields can be grouped into a multivariable field with dimension D×N, where D is the sum of the d_m. The multivariable field derived from the model can be evaluated against that derived from observations using the various performance metrics in the uncentered or centered mode (Fig. 1).

The uncentered mode focuses on the whole original field, while the centered mode separately evaluates the anomaly field and the mean field. Each mode consists of three levels of statistics: statistics for individual variables (yellow boxes in Fig. 1), multivariable integrated statistics (green boxes), and an index summarizing the overall model performance (orange boxes). The definitions of the centered and uncentered statistics are the same as those of Xu et al. (2017), except that the area weighting is considered here. To calculate the statistics for individual variables (e.g., the root-mean-square difference (RMSD), centered RMSD (cRMSD), or uncentered CORR (uCORR)), we can also use the formulas of the multivariable integrated statistics (Eqs. (1)-(3) and (11)-(14)) by setting M equal to d_m, as summarized in the right-hand part of the boxes in Fig. 1.
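The normalization-and-grouping step described above can be sketched as follows (Python/NumPy for illustration; the tool itself is NCL). Each variable, stored as a (d_m, N) array, is divided by the reference amplitude (rms for a scalar, RMSL for a vector) and the results are stacked into a D×N multivariable field:

```python
import numpy as np

def build_multivariable_field(fields, ref_fields, w):
    """Normalize each variable by the reference amplitude and stack into a
    D x N multivariable field (D = sum of the variable dimensions d_m).

    fields, ref_fields : lists of arrays, each of shape (d_m, N)
    w                  : spatial weights of shape (N,); normalized internally
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    rows = []
    for f, r in zip(fields, ref_fields):
        # RMSL of the reference (reduces to the rms for a scalar, d_m = 1)
        rmsl_ref = np.sqrt(np.sum(w * np.sum(r**2, axis=0)))
        rows.append(f / rmsl_ref)
    return np.vstack(rows)
```

By construction, each block of the normalized reference field has unit RMSL, so all variables contribute on a comparable scale to the integrated statistics.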
Note that the mean error (ME) is additionally computed for the centered statistics, as the centered statistics exclude the mean error. For a scalar variable, the ME is calculated with Eq. (14) by setting M to 1, but it retains its sign. Because the VME is a function of the differences in both vector magnitude and direction, we provide two additional statistical metrics, the mean error of vector magnitude (MEVM) and the mean error of vector direction (MEVD), to separate the magnitude error from the directional error. The MEVM (MEVD) is the mean of the magnitude error (direction difference) between the modeled and observed vectors over all evaluated grid cells. Note that the MEVD is valid only for vector fields. The direction difference ranges from −180° to 180°, and a positive (negative) value represents a counter-clockwise (clockwise) directional error of the model mean vector.

To summarize the overall model skill in simulating multiple variables, the uncentered MISS (uMISS) and centered MISS (cMISS) are provided for the uncentered and centered modes, respectively. The uMISS is calculated with Eqs. (9) and (10) using the original fields. The cMISS can also be calculated with Eqs. (9) and (10), but with the rms and VSC replaced by the SD and cVSC, respectively. With the support of these statistics, the improved MVIE method can provide a more comprehensive and precise evaluation of model performance. All the statistics defined in this paper, together with their acronyms, are summarized in Table A1 in the Appendix.
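The signed direction difference with wrapping into the stated range can be sketched as follows (Python; an illustrative helper, not the tool's NCL implementation):

```python
import math

def direction_error_deg(u_m, v_m, u_o, v_o):
    """Signed direction difference, model minus observation, in degrees.

    Positive = counter-clockwise error of the model vector; the result is
    wrapped so that it always lies in (-180, 180]."""
    d = math.degrees(math.atan2(v_m, u_m) - math.atan2(v_o, u_o))
    d = (d + 180.0) % 360.0 - 180.0    # wrap into [-180, 180)
    return 180.0 if d == -180.0 else d  # map -180 to +180
```

For example, a modeled wind blowing east against an observed wind blowing north is a 90° clockwise error, i.e., a direction difference of −90°.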

Brief overview
The Multivariable Integrated Evaluation Tool (MVIETool) consists of two main scripts and several function scripts. All scripts are written in NCL and can easily be used on Linux and macOS systems. The two main scripts are Calculate_MVIE.ncl and Plot_MVIE.ncl. The execution of the MVIETool can be reduced to two runs, which work independently but in sequence (Fig. 2). Users can modify the arguments in a module at the beginning of the main scripts according to the application. The scripts assume that the model data and observation data are saved in Network Common Data Form (NetCDF) format. The Calculate_MVIE.ncl script calculates the statistical metrics defined in this paper. Its output is saved in a NetCDF file, which is used as the input to Plot_MVIE.ncl for plotting the VFE diagram and the metrics table.

Preparing the input data
The MVIETool requires two groups of datasets as input, the model data and the observations, both saved in NetCDF format.
Each model or observational data file includes all the variables to be evaluated. If the variables are saved separately, as with CMIP data, one can easily merge them into one data file using third-party software, e.g., the Climate Data Operators (CDO) or the NetCDF Operators (NCO). The main script also assumes that all model and observation data files are stored in the same directory; users therefore need to move or link the various data files into one directory. Variables stored in a data file need to have the same resolution. For a vector variable, each component should be stored as an independent variable. If users want area weighting to be considered in the statistics, the variables should be saved with their coordinate information (e.g., time, latitude, and longitude), because this information is needed to calculate the area weights. Currently, the tool can only identify the time and geographical coordinates, i.e., time, latitude, longitude, and level. Figure 3 illustrates an example with ten models (M1-M10) to be evaluated and two sets of reanalysis data (REA1, REA2) as the reference in the same directory. Each data file includes eight variables: Q600, SLP, SST, T850, u850, v850, u200, and v200. Among these, u850 (u200) and v850 (v200) are the zonal and meridional components of the 850-hPa (200-hPa) vector winds.

Usage and workflow of the MVIETool
Once the datasets have been prepared, one can use the MVIETool to evaluate model performance. Users should set some arguments at the beginning of Calculate_MVIE.ncl according to the application. The arguments are summarized in Table 1 and discussed in this section; the rightmost column of Table 1 gives an example setting for each argument. Arguments 1-5 define the data file information, as in the example of Fig. 3. Note that, in the argument Varname in Table 1, a vector variable is identified by enclosing its components in parentheses, e.g., (u200, v200). If users want spatial weights to be added to the statistics, the data should have a latitude coordinate, and arguments 9-11 should be set accordingly. Some arguments are mandatory (e.g., arguments 1-5 in Table 1), while others are optional. We provide a default value for most of the arguments, and the default is used if users do not specify the argument, except for Range_time, Coords_geo, Range_geo, and VarLev; these four arguments must be set when the input variables have the corresponding coordinates.

After reading the data, the reference data are derived from the observational or reanalysis data. As shown in Fig. 4, if users provide only one observational dataset, it is used directly as the reference. Otherwise, the mean of the multiple observational datasets is used as the reference, and each observational dataset is also evaluated against the ensemble mean to measure the observational uncertainty. Considering that some variables may contain missing values while others may not, and to make model performances comparable with each other, the script standardizes the missing points across all variables of each model dataset and the reference data. For example, if a variable has a missing value in a grid cell, the other variables of that model are also set to missing in that grid cell.
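The missing-value standardization amounts to applying the union of the missing masks of all variables to every variable, so that all statistics are computed over the same grid cells. An illustrative sketch (Python/NumPy, not the tool's NCL code):

```python
import numpy as np

def standardize_missing(variables):
    """Apply the union of missing-value masks across all variables, so every
    variable is evaluated over exactly the same set of grid cells.

    variables : dict of name -> array of identical shape, with NaN = missing
    """
    combined = np.zeros(next(iter(variables.values())).shape, dtype=bool)
    for arr in variables.values():
        combined |= np.isnan(arr)                 # union of all missing masks
    return {name: np.where(combined, np.nan, arr)
            for name, arr in variables.items()}
```

After this step, a grid cell that is missing in any one variable is treated as missing in all of them, which keeps the models' statistics mutually comparable.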
The script can calculate the statistics either for a single variable or for multiple variables (Fig. 4). The left blue dotted box in Fig. 4 shows the calculation process for single-variable evaluation (SVE). The centered and uncentered modes yield the centered and uncentered statistics, respectively. Note that the statistical metrics calculated depend on the type of input variable. For a scalar field, the uCORR (CORR), rms (SD), and RMSD (cRMSD) are calculated in the uncentered (centered) mode. For a vector field, the VSC (cVSC), RMSL (cRMSL), and RMSVD (cRMSVD) are calculated in the uncentered (centered) mode. In addition, two skill scores defined by Taylor (2001), S1 and S2, are computed in both the centered and uncentered modes for a scalar variable; similarly, Sv1 and Sv2 are calculated for a vector variable (Xu et al., 2016). After the calculation, these statistics are saved in a file that can be used to generate a Taylor diagram or VFE diagram.

For the MVIE (green dotted box of Fig. 4), the statistics of the individual variables (i.e., scalar and vector variables) are calculated first. After the evaluation of each individual variable, all variables are normalized by the respective rms values of the reference and grouped into a multivariable integrated field for the calculation of the multivariable statistics.
To account for the relative importance of the various variables, weights can be applied to each variable after normalization through the argument Wgt_var (Table 1). In the centered mode, either the VME or the MEVM is computed for each vector variable and for the multivariable field, depending on the argument Cal_VME (Table 1). When the MEVM is chosen, the MEVD is also calculated for any individual 2D vector variable. Finally, both the centered and uncentered MISS are calculated. If more than one observational dataset is available, the statistics between each observation and the reference are also calculated to take the observational uncertainty into account. After the calculation, the statistics are written to a new NetCDF file specified by the MVIE_filename argument. Meanwhile, the ranges of these statistics can be printed on the screen if Print_stats_r is set to True, helping users to set the color levels in the metrics table.

Interpretations of the plots are discussed in detail with examples in Section 5. In addition, all the arguments of the MVIETool and their descriptions are summarized in readme.namelist for users' reference. We also provide a graphic user guide to help users quickly become familiar with the MVIETool.

Application of the tool
To illustrate the application of the tool, monthly mean datasets of ten CMIP5 models, derived from the first ensemble member of each model, are used. All datasets are regridded to a common resolution of 2.5° × 2.5° using bilinear interpolation before the evaluation.

Metrics table
The metrics table shows various model performance metrics for the individual variables, the multivariable integrated field, and the overall model skill scores in either the centered or the uncentered mode. Figure 5 shows the metrics table of various statistical metrics evaluating six climatological mean fields, SLP, SST, Q600, T850, uv850, and uv200, with the centered statistics. Statistics filled with a lighter color represent better model performance, and those with a darker color represent worse performance. The different types of statistics are separated from each other by thick black lines. To facilitate the comparison of metrics from different variables, in the centered mode the SD (cRMSL), cRMSD (cRMSVD), and ME (VME) of the models are normalized by dividing by the corresponding SD (cRMSL) of the reference. In the uncentered mode, the rms (RMSL) and RMSD (RMSVD) are normalized by the rms (RMSL) of the reference.
The metrics table of the centered statistics decomposes the original field into mean and anomaly fields for evaluation. The anomaly fields are further evaluated in terms of pattern similarity, variance, and the overall difference between model and observation. The metrics table can thus show how much of the overall error comes from the mean error (ME, VME), the amplitude error of the anomaly field (SD, cRMSL), or the error in the pattern similarity of the anomaly field (CORR, cVSC). For example, the ME and cRMSD of M1 in simulating SLP are 0.129 and 0.566, respectively, indicating that the overall error is caused mainly by the error in the anomaly field (Fig. 5). The cRMSD can be further attributed to the poor amplitude (1.269) and pattern similarity (0.902), which can be shown more clearly in a Taylor or VFE diagram.
Similarly, one can also decompose the model errors into the mean error (VME) and the overall error of the anomaly field (cRMSVD) for the simulation of multiple variables.
To summarize and rank the overall performance of a model in simulating multiple fields, the MISS in both the centered and uncentered modes is provided in the metrics table and is expected to give a more accurate evaluation than the MIEI. Figure 5 shows that M2 ranks eighth out of the ten models according to the centered MIEI (cMIEI), but fifth according to the cMISS. The main reason is that the cMIEI is overly sensitive to errors in the SDs, particularly for an SD greater than 1. For example, the SD of SLP in M2 is 1.396, which contributes about 0.026 to the first term on the right-hand side of Eq. (7a), while the cVSC of SLP in M2 is 0.953, which contributes about 0.094 to the second term. Referring to the definition of L*_Am (Eq. (8)), an SD of 1.396 is equivalent to its reciprocal, 0.716, in terms of model performance; however, an SD of 0.716 contributes only about 0.013 to the first term on the right-hand side of Eq. (9). Thus, the MISS is equally sensitive to the model's abilities to simulate pattern similarity and amplitude. In addition, the cMISS (uMISS) allows us to adjust the relative importance of the SD (rms) values and the cVSC (VSC) according to the application (Eq. (10)).
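The quoted contributions can be reproduced arithmetically under the term forms implied by the text, namely a first term of (1/M)Σ(SD − 1)² with M = 6 variables and a second term of 2(1 − cVSC) in Eq. (7a); these forms are inferred here, not quoted from the original equations:

```python
# Hypothetical check of the quoted contributions for model M2 (M = 6 variables)
M = 6
sd_slp = 1.396             # SD of SLP in M2 (from the text)
cvsc_slp = 0.953           # cVSC of SLP in M2 (from the text)

first_term_7a = (sd_slp - 1.0) ** 2 / M    # contribution to Eq. (7a), ~0.026
second_term_7a = 2.0 * (1.0 - cvsc_slp)    # contribution to Eq. (7a), ~0.094
sd_star = 1.0 / sd_slp                     # Eq. (8): a ratio > 1 maps to its reciprocal
first_term_9 = (sd_star - 1.0) ** 2 / M    # contribution to Eq. (9), ~0.013

print(round(first_term_7a, 3), round(second_term_7a, 3),
      round(sd_star, 3), round(first_term_9, 3))
# -> 0.026 0.094 0.716 0.013
```

All four values agree with the numbers quoted in the text, which supports the interpretation of the term structure.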
The MVIETool can be used flexibly by modifying the arguments in the main scripts; a comparison of Fig. 5 with Fig. 6 illustrates this flexibility. For example, users can choose which statistics to display: in Fig. 6, only a few statistical metrics are shown compared with Fig. 5. Unlike Fig. 5, Fig. 6 divides each grid cell into four triangles, representing model performance in each of the four seasons; currently, a grid cell can be divided into two or four triangles. If no metric values are printed in the grid cells, color bars and a box legend are provided for reference. Moreover, considering that the number of models relative to the number of variables and statistical metrics may vary with the application, the MVIETool allows users to transpose the metrics table into a portrait or landscape orientation; for example, the model labels (statistical metrics and variables) can be arranged along the top or the left of the metrics table. A more detailed technical introduction to plotting is provided in the graphic user guide.

VFE diagram
The VFE diagram measures a model's ability to simulate the original (anomaly) vector or multivariable fields in terms of three statistics: RMSL (cRMSL), VSC (cVSC), and RMSVD (cRMSVD). Figure 7 is a VFE diagram for a vector variable generated by the MVIETool. It assesses the climatological mean 850-hPa vector winds of the Northern Hemisphere in autumn derived from the ten CMIP5 models (M1-M10) during the period 1961-2000. Since the anomaly field of the 850-hPa vector winds is considered, cRMSL, cVSC, and cRMSVD are shown in the diagram. The construction of the VFE diagram is based on the geometric relationship between the three statistics: the azimuthal position gives the VSC (cVSC), the radial distance from the origin gives the RMSL (cRMSL), and the distance between the model and reference points gives the RMSVD (cRMSVD). As in the metrics table, the RMSL (cRMSL) and RMSVD (cRMSVD) are normalized by the RMSL (cRMSL) of the reference to facilitate an intercomparison between different variables. The VFE diagram can clearly show how much of the overall difference between model and observation is caused by poor pattern similarity and how much by the difference in field amplitude (Xu et al., 2017). Fig. 7a is the same as Fig. 7b, except that Fig. 7a applies spatial weighting to the statistics and Fig. 7b does not. Note that for some models (e.g., M2, M5, and M8) there is a relatively large difference between the statistics with and without spatial weighting. We therefore recommend taking area weighting into account when evaluating a spatial field, especially for regions covering a broad span of latitudes. Furthermore, the tool can also generate a VFE diagram with SD (rms), CORR (uCORR), and cRMSD (RMSD); in this case it is identical to the Taylor diagram (Taylor, 2001; Xu et al., 2016).
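The geometric construction can be illustrated with a short sketch (Python/NumPy, not the tool's plotting code): placing the model at polar coordinates (RMSL ratio, arccos(VSC)) and the reference at (1, 0), the Euclidean distance between the two points equals the normalized RMSVD, by the cosine law RMSVD² = L_A² + L_O² − 2 L_A L_O · VSC:

```python
import numpy as np

# Synthetic model and reference fields: 2-D vectors at 100 points
rng = np.random.default_rng(1)
A = rng.normal(size=(2, 100))       # "model" field
O = rng.normal(size=(2, 100))       # "reference" field
w = np.full(100, 1.0 / 100)         # equal weights, for simplicity

L_A = np.sqrt(np.sum(w * (A**2).sum(axis=0)))
L_O = np.sqrt(np.sum(w * (O**2).sum(axis=0)))
VSC = np.sum(w * (A * O).sum(axis=0)) / (L_A * L_O)
RMSVD = np.sqrt(np.sum(w * ((A - O)**2).sum(axis=0)))

# Plot coordinates: radius = RMSL ratio, angle = arccos(VSC);
# the reference sits on the x-axis at radius 1
r, theta = L_A / L_O, np.arccos(VSC)
model_pt = np.array([r * np.cos(theta), r * np.sin(theta)])
ref_pt = np.array([1.0, 0.0])
dist = np.linalg.norm(model_pt - ref_pt)   # equals RMSVD / L_O
```

This is why a single point in the diagram simultaneously encodes all three statistics: the radius, the azimuth, and the distance to the reference point.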

Summary
In this paper, we have improved the MVIE method and developed the MVIETool to support the evaluation of model performance with this method. The improved MVIE method can evaluate overall model performance in simulating multiple scalar and vector variables. In addition, we consider spatial weighting in the definition of the statistics, which is important for the evaluation of spatial fields on a longitude-latitude grid. Based on the MIEI, we further define a more accurate multivariable integrated skill score, termed MISS, to evaluate and rank overall model performance in simulating multiple variables. Like the MIEI, MISS takes both the amplitude and the pattern similarity into account, but it is a normalized index whose maximum value of 1 represents a perfect model. MISS is also flexible, allowing the relative importance of pattern similarity and amplitude to be adjusted.

[Figure caption fragment: 600-hPa specific humidity (Q600), sea level pressure (SLP), sea surface temperature (SST), 850-hPa temperature (T850), 850-hPa winds (uv850), and 200-hPa winds (uv200) in the Northern Hemisphere in autumn (September-October-November). ME (VME) is the error of the mean scalar or vector (multivariable) fields.]

Table 1 (fragment). Arguments of Calculate_MVIE.ncl.

Argument: Varname
Description: Names of the variables stored in the data file. For an individual vector variable, the names of its components must be enclosed in parentheses, e.g., (u200, v200).

Argument: *hasLevel
Description: If the variables have a level coordinate, set this argument to the name of the level coordinate; otherwise set it to False. The level coordinate is required to read data at a specific level.