Multivariable integrated evaluation of model performance with the vector field evaluation diagram

This paper develops a multivariable integrated evaluation (MVIE) method to measure the overall performance of climate models in simulating multiple fields. The general idea of MVIE is to group various scalar fields into a vector field and compare the constructed vector field against the observed one using the vector field evaluation (VFE) diagram. The VFE diagram was devised based on the cosine relationship between three statistical quantities: root mean square length (RMSL) of a vector field, vector field similarity coefficient, and root mean square vector deviation (RMSVD). The three statistical quantities can reasonably represent the corresponding statistics between two multidimensional vector fields. Therefore, one can summarize the three statistics of multiple scalar fields using the VFE diagram and facilitate the intercomparison of model performance. The VFE diagram can illustrate how much the overall root mean square deviation of various fields is attributable to the differences in the root mean square value and how much is due to the poor pattern similarity. The MVIE method can be flexibly applied to full fields (including both the mean and anomaly) or anomaly fields depending on the application. We also propose a multivariable integrated evaluation index (MIEI) which takes the amplitude and pattern similarity of multiple scalar fields into account. The MIEI is expected to provide a more accurate evaluation of model performance in simulating multiple fields. The MIEI, VFE diagram, and commonly used statistical metrics for individual variables constitute a hierarchical evaluation methodology, which can provide a more comprehensive evaluation of model performance.

2 Constructing VFE diagram for multidimensional vector fields
Xu et al. (2016) constructed the VFE diagram in terms of 2-dimensional vector fields. There are three statistical quantities in the VFE diagram, i.e., root mean square length (RMSL) of a vector field, vector similarity coefficient (VSC), and root mean square vector deviation (RMSVD) between two vector fields. In this section, each quantity will be defined and interpreted from the viewpoint of MVIE. Thereafter, we will construct the VFE diagram for multidimensional vector fields.

Root mean square length of a vector field
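Consider two vector fields A and B, which can be spatial and/or temporal fields. Assume that vector fields A and B are derived from a climate model simulation and observation, respectively. Without loss of generality, vector fields A and B can be written as a pair of vector sequences:

$$\mathbf{A}_j = (a_{1j}, a_{2j}, \ldots, a_{Mj}), \qquad \mathbf{B}_j = (b_{1j}, b_{2j}, \ldots, b_{Mj}); \qquad j = 1, 2, \ldots, N$$

where $M$ is the dimension of the vectors and $N$ is the length of the sequences. The root mean square lengths (RMSLs) for vector fields A and B are respectively defined as:

$$L_A = \sqrt{\frac{1}{N}\sum_{j=1}^{N}|\mathbf{A}_j|^2} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{M} a_{ij}^2} \tag{1}$$

and

$$L_B = \sqrt{\frac{1}{N}\sum_{j=1}^{N}|\mathbf{B}_j|^2} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{M} b_{ij}^2} \tag{2}$$

The square of $L_A$ is written as:

$$L_A^2 = \sum_{i=1}^{M}\left(\frac{1}{N}\sum_{j=1}^{N} a_{ij}^2\right) = \sum_{i=1}^{M} rms_{Ai}^2 \tag{3}$$

Similarly, we have

$$L_B^2 = \sum_{i=1}^{M} rms_{Bi}^2 \tag{4}$$

where $rms_{Ai}$ and $rms_{Bi}$ are the RMS values of the i-th component of vector fields A and B, respectively.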
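As a minimal sketch of the RMSL definition (Eqs. 1 and 3), the RMSL can be computed directly from a sequence of vectors and compared with the component-wise RMS values. The function names and the toy numbers below are illustrative assumptions, not part of the original study.

```python
import math

def rmsl(field):
    """Root mean square length of a vector field given as N vectors of M components."""
    n = len(field)
    return math.sqrt(sum(sum(c * c for c in vec) for vec in field) / n)

def component_rms(field):
    """RMS value of each of the M components, taken over the N samples."""
    n, m = len(field), len(field[0])
    return [math.sqrt(sum(vec[i] ** 2 for vec in field) / n) for i in range(m)]

# Toy 2-component field with N = 3 samples (hypothetical values).
A = [(3.0, 4.0), (0.0, 5.0), (1.0, 2.0)]
L_A = rmsl(A)
rms_A = component_rms(A)

# The squared RMSL equals the sum of the squared component RMS values.
assert math.isclose(L_A ** 2, sum(r * r for r in rms_A))
print(L_A, rms_A)
```

The check illustrates why the RMSL summarizes the RMS values of all components at once.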

Vector similarity coefficient between two vector fields
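In the same way as for the vector similarity coefficient (VSC) for 2-dimensional vector fields (Xu et al., 2016), the VSC for M-dimensional vector fields can be defined as:

$$R_v = \frac{\sum_{j=1}^{N}\mathbf{A}_j\cdot\mathbf{B}_j}{\sqrt{\sum_{j=1}^{N}|\mathbf{A}_j|^2}\,\sqrt{\sum_{j=1}^{N}|\mathbf{B}_j|^2}}$$

The normalized vectors are written as:

$$\mathbf{A}_j^* = \frac{\mathbf{A}_j}{L_A} = (a_{1j}^*, a_{2j}^*, \ldots, a_{Mj}^*); \qquad j = 1, 2, \ldots, N$$

$$\mathbf{B}_j^* = \frac{\mathbf{B}_j}{L_B} = (b_{1j}^*, b_{2j}^*, \ldots, b_{Mj}^*); \qquad j = 1, 2, \ldots, N$$

With the aid of Eqs. (1) and (2), we have

$$R_v = \frac{1}{N}\sum_{j=1}^{N}\frac{\mathbf{A}_j\cdot\mathbf{B}_j}{L_A L_B} \tag{7}$$

We can also represent Eq. (7) in the following form:

$$R_v = \frac{1}{N}\sum_{j=1}^{N}\mathbf{A}_j^*\cdot\mathbf{B}_j^* \tag{8}$$

The VSC can be interpreted as the mean of the inner products between the normalized and paired vectors $\mathbf{A}_j^*$ and $\mathbf{B}_j^*$. The squared Euclidean distance (SED) between $\mathbf{A}_j^*$ and $\mathbf{B}_j^*$ is defined as follows:

$$SED_j = |\mathbf{A}_j^* - \mathbf{B}_j^*|^2$$

With the aid of Eqs. (9) and (10), the sum of all SEDs can be written as:

$$\sum_{j=1}^{N} SED_j = \sum_{j=1}^{N}|\mathbf{A}_j^*|^2 + \sum_{j=1}^{N}|\mathbf{B}_j^*|^2 - 2\sum_{j=1}^{N}\mathbf{A}_j^*\cdot\mathbf{B}_j^* = 2N - 2\sum_{j=1}^{N}\mathbf{A}_j^*\cdot\mathbf{B}_j^*$$

With the aid of Eq. (8), we obtain

$$R_v = 1 - \frac{1}{2N}\sum_{j=1}^{N} SED_j$$

The RMSVD between vector fields A and B is the root mean square length of the difference vectors, and its square can be expanded as:

$$RMSVD^2 = \frac{1}{N}\sum_{j=1}^{N}|\mathbf{A}_j - \mathbf{B}_j|^2 = \frac{1}{N}\sum_{j=1}^{N}|\mathbf{A}_j|^2 + \frac{1}{N}\sum_{j=1}^{N}|\mathbf{B}_j|^2 - \frac{2}{N}\sum_{j=1}^{N}\mathbf{A}_j\cdot\mathbf{B}_j \tag{17}$$

With the aid of Eqs. (1), (2), and (7), Eq. (17) can be written as:

$$RMSVD^2 = L_A^2 + L_B^2 - 2 L_A L_B R_v \tag{18}$$

The RMSVD, $L_A$, $L_B$, and $R_v$ are related by the law of cosines (Eq. 18). We can construct the VFE diagram for M-dimensional vector fields based on Eq. (18). The VFE diagram and the geometric relationship between $L_A$, $L_B$, $R_v$, and the RMSVD are shown in Fig. 1. As for the case of 2-dimensional vectors (Xu et al., 2016), the RMSLs, i.e., $L_A$ and $L_B$, measure the mean and variance of the lengths of vector fields A and B, respectively (Eqs. A2, A3). $R_v$ reflects the pattern similarity between two vector fields. The RMSVD describes the overall difference between two vector fields. Thus, three statistical quantities can be indicated by a single point on the VFE diagram (Fig. 1).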
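The law-of-cosines relation among the RMSLs, the VSC, and the RMSVD can be checked numerically. The sketch below assumes the usual definitions, i.e., the VSC as the mean inner product divided by the product of the two RMSLs and the RMSVD as the RMS length of the difference vectors. Helper names and data are hypothetical.

```python
import math

def rmsl(field):
    """Root mean square length of a sequence of vectors."""
    return math.sqrt(sum(sum(c * c for c in v) for v in field) / len(field))

def vsc(a, b):
    """Vector similarity coefficient: mean inner product scaled by both RMSLs."""
    mean_dot = sum(sum(x * y for x, y in zip(u, v)) for u, v in zip(a, b)) / len(a)
    return mean_dot / (rmsl(a) * rmsl(b))

def rmsvd(a, b):
    """Root mean square vector deviation: RMSL of the difference field."""
    diff = [tuple(x - y for x, y in zip(u, v)) for u, v in zip(a, b)]
    return rmsl(diff)

# Toy 3-component fields with N = 3 samples (hypothetical values).
A = [(1.0, 2.0, 0.5), (0.8, 1.5, 0.7), (1.2, 2.2, 0.4)]
B = [(0.9, 2.1, 0.6), (1.0, 1.4, 0.5), (1.1, 2.0, 0.6)]

L_A, L_B, R_v = rmsl(A), rmsl(B), vsc(A, B)
lhs = rmsvd(A, B) ** 2
rhs = L_A ** 2 + L_B ** 2 - 2 * L_A * L_B * R_v  # law of cosines
assert math.isclose(lhs, rhs)
print(L_A, L_B, R_v, math.sqrt(lhs))
```

Because the identity holds for any pair of fields, a single point on the diagram fixes all three quantities.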

Methodology
To evaluate model performance in terms of the simulation of multiple variables, one can group various scalar fields into a vector field and compare the constructed vector field against the observed one using the VFE diagram. For example, we can construct a vector field with temperature and precipitation as its x- and y-components, respectively. One can certainly use more variables as needed to construct the vector field. Note that the statistical quantities RMSL, VSC, and RMSVD in the VFE diagram are defined in an orthogonal coordinate system in which the axes are perpendicular to each other. There is no requirement for the independence of the variables to be evaluated, e.g., temperature and precipitation, which are represented by coordinate values of individual axes. Thus, the VFE diagram can be applied to evaluate any combination of modeled variables against corresponding observational estimates. Given the differences in units and order of magnitude of various variables, we need to normalize all variables before grouping them into a vector field. The normalization can be done by dividing by the RMS value of each observational estimate as follows:

$$a_{ij}^\star = \frac{a_{ij}}{rms_{Bi}}; \qquad i = 1, 2, \ldots, M; \quad j = 1, 2, \ldots, N \tag{19}$$

$$b_{ij}^\star = \frac{b_{ij}}{rms_{Bi}}; \qquad i = 1, 2, \ldots, M; \quad j = 1, 2, \ldots, N \tag{20}$$

where $rms_{Bi}$ is the RMS value for the i-th component of vector field B obtained from observational estimates.
Each component of the normalized vector field is dimensionless and on the order of 1. Thus, the statistics of each component are equally important to the total statistics of the vector fields. The normalization is especially necessary when the variables are of different orders of magnitude. For example, the surface air temperature (SAT) is typically on the order of $10^{2}$ K, but precipitation is generally on the order of $10^{-5}$ to $10^{-4}$ mm s$^{-1}$. Under this circumstance, the differences in the RMSL, VSC, and RMSVD between various models would be primarily determined by the SAT and barely affected by the precipitation if no normalization was applied. Therefore, in terms of the MVIE of the model performance, the RMSLs, VSC, and RMSVD should be computed using the normalized vector fields $\mathbf{A}^\star$ and $\mathbf{B}^\star$. As interpreted in section 2, three statistical quantities in the VFE diagram represent the overall statistics across all components between two vector fields. If the vector fields are grouped by various scalar fields, the VFE diagram can summarize the three statistics of model performance in simulating multiple scalar fields.
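The normalization step can be sketched as follows: every component of both the modeled and the observed field is divided by the RMS value of the corresponding observed component. The function names and the SAT and precipitation values are hypothetical and only chosen to have very different magnitudes.

```python
import math

def component_rms(field):
    """RMS value of each component over all samples."""
    n, m = len(field), len(field[0])
    return [math.sqrt(sum(v[i] ** 2 for v in field) / n) for i in range(m)]

def normalize_by_obs(model, obs):
    """Divide each component of both fields by the observed RMS of that component."""
    rms_obs = component_rms(obs)

    def scale(field):
        return [tuple(x / r for x, r in zip(v, rms_obs)) for v in field]

    return scale(model), scale(obs)

# Hypothetical (SAT in K, precipitation in mm/s) pairs at three grid cells.
obs = [(288.0, 3.0e-5), (275.0, 1.0e-5), (300.0, 8.0e-5)]
model = [(289.5, 4.0e-5), (273.0, 1.5e-5), (301.0, 6.0e-5)]

model_n, obs_n = normalize_by_obs(model, obs)

# After normalization each observed component has an RMS value of 1,
# so no single variable dominates the vector statistics.
assert all(math.isclose(r, 1.0) for r in component_rms(obs_n))
print(component_rms(model_n))
```

The printed values are the ratios of modeled to observed RMS for each variable, which is the form in which RMS values enter the later statistics.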

Application of multivariable integrated evaluation of model performance
Without loss of generality, we choose the climatological mean SAT and precipitation as well as the temporal standard deviation of the SAT and precipitation as the variables to interpret the MVIE method. Four variables derived from climate models are examined against the corresponding observational estimates. The evaluation is based on the monthly mean datasets derived from the first ensemble run of CMIP5 historical experiments during the period from 1961 to 2000 (Taylor, 2012). Three pairs of observed SAT and precipitation datasets are used in this study. The first pair of datasets is the Climatic Research Unit (CRU) gridded SAT and precipitation (Harris et al., 2014). The second pair is the University of Delaware air temperature and precipitation (Willmott and Matsuura, 2001). The third pair is composed of the Global Historical Climatology Network (GHCN) temperature (Fan and van den Dool, 2008) and Global Precipitation Climatology Centre (GPCC) precipitation (Schneider et al., 2014). All observational data are available at 0.5°×0.5° resolution. We take the average of three pairs of SAT and precipitation values as the reference data in this study, unless stated otherwise. The observational uncertainty can be roughly estimated by comparing each observational estimate to the reference data (Xu et al., 2016). All datasets were regridded to a common resolution of 2.5°×2.5° using a box averaging (bilinear interpolation) method that regrids data to a coarser (finer) resolution. All datasets were weighted by the area of the grid cell to make the statistics more representative of the global mean values. Both the model and observational data are normalized by the RMS value of each observed field before computing their statistics (Eqs. 19, 20).
Table 1 shows the various statistics of 9 CMIP5 models in terms of the climatological mean summer (June-July-August) SAT, precipitation, and the temporal standard deviation of SAT and precipitation over the global land area (60°S-60°N). The standard deviation reflects the amplitude of interannual variation. The models can generally simulate the climatological mean SAT well, as characterized by the close correspondence of the RMS values, high uncentered correlation, and small RMSD between the model and observation. In contrast, models show a relatively poor performance in simulating other variables, i.e., climatological mean precipitation and the standard deviations of SAT and precipitation. These statistics vary from one model to the next. It is difficult to compare the overall performances of various models because there are too many variables and models to distinguish one from another (Table 1). It is very useful to summarize the statistics of multiple variables with fewer indices, which enables an objective evaluation of the overall model performance in simulating multiple variables. To achieve this goal, we grouped the four normalized scalar fields into a four-dimensional vector field. Afterwards, we computed the statistical quantities, i.e., RMSL, VSC, and RMSVD, with the four-dimensional vector fields derived from model and observational data. As interpreted in section 2, the RMSL (RMSVD) measures the overall RMS values (RMSDs) of all scalar fields (Eqs. 3, 16). The VSC represents the weighted average of uncentered correlation coefficients across all scalar fields (Eq. 13). Thus, each model's performance in simulating multiple variables can be summarized by a single point that is determined by 12 statistical quantities (4 variables × 3 statistics) derived from the various scalar fields (Table 1, Fig. 2).
As shown in Fig. 2, the VSC varies from 0.90 to 0.95, indicating which models can better reproduce the overall spatial pattern of various variables and which cannot. For example, model 1 shows the maximum VSC, indicating that model 1 can generally better reproduce the spatial pattern of the four variables relative to other models. This can be confirmed by Table 1. The uncentered pattern correlation coefficients for the four scalar fields are generally higher in model 1 than in the other models. Fig. 2 also clearly shows which model overestimates or underestimates the overall RMS values. For example, models 5 and 7 overestimate the RMSLs of the four-dimensional vector fields, suggesting that both models generally overestimate the RMS values of the four scalar fields. This can also be confirmed by Table 1. Similarly, the RMSVD between two vector fields can also reasonably represent the overall RMSDs of multiple scalar fields, as shown in Fig. 2.

The length of the line segment is equal to twice the standard deviation of RMS values of multiple scalar fields. Thus, the length of the line segment can measure the dispersion of various RMS values relative to their mean. A shorter line indicates that the RMS values are close to the mean. In contrast, a longer line segment indicates that the RMS values are spread out over a wider range. To measure the accuracy of the modeled RMS values relative to the observed ones, one can use the root mean square deviation of the RMS values of various variables:

$$RMSD_L = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(rms_{Ai}^\star - rms_{Bi}^\star\right)^2} \tag{21}$$

where $rms_{Ai}^\star$ and $rms_{Bi}^\star$ are the RMS values of the i-th normalized component of vector fields A and B, respectively. With the support of Eq. (6), we have $rms_{Bi}^\star = 1$ for all i (1 ≤ i ≤ M). $RMSD_L^2$ can be further written as:

$$RMSD_L^2 = \frac{1}{M}\sum_{i=1}^{M}\left(rms_{Ai}^\star - 1\right)^2 = \left(\overline{rms_A^\star} - 1\right)^2 + \frac{1}{M}\sum_{i=1}^{M}\left(rms_{Ai}^{\star\prime}\right)^2 \tag{22}$$

where $\overline{rms_A^\star}$ and $rms_{Ai}^{\star\prime}$ are the mean and anomaly of $rms_{Ai}^\star$, respectively. The RMS value of $rms_{Ai}^{\star\prime}$ is written as follows:

$$\sigma_{rms^\star} = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(rms_{Ai}^{\star\prime}\right)^2} \tag{23}$$

which is the centered RMS value or the standard deviation of $rms_{Ai}^\star$.
Thus, $RMSD_L$ can be decomposed into the mean error and the variance of RMS values of normalized scalar fields (Eq. 22). $RMSD_L$ measures the overall deviation of modeled RMS values from the observed ones. The modeled RMS values of various scalar fields are exactly equal to the corresponding observed ones only when $RMSD_L$ is equal to 0.
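The decomposition into a mean error and a variance term can be illustrated with a short sketch. It assumes that the observed normalized RMS values all equal 1, as stated above. The function names and the four modeled values are hypothetical.

```python
import math

def rmsd_l(rms_model):
    """RMSD between normalized modeled RMS values and the observed ones (all 1)."""
    return math.sqrt(sum((r - 1.0) ** 2 for r in rms_model) / len(rms_model))

def mean_and_std(values):
    """Mean and population standard deviation of a list of values."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std

# Hypothetical normalized RMS values of four modeled scalar fields.
rms_star = [1.02, 1.30, 0.85, 1.10]
mean, std = mean_and_std(rms_star)

# Squared RMSD_L = squared mean error + variance of the RMS values.
assert math.isclose(rmsd_l(rms_star) ** 2, (mean - 1.0) ** 2 + std ** 2)
print(rmsd_l(rms_star), mean - 1.0, std)
```

A model can therefore have a small mean error and still a large $RMSD_L$ if its RMS values scatter widely around their mean.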

Multivariable integrated evaluation index for model performance
In general, the model results get closer to the observational estimate as the RMSVD decreases. It is noteworthy that for a given VSC at a relatively low value, the RMSVD does not strictly decrease monotonically as the simulated RMSL approaches the observed one (Fig. 3). For example, model B shows the same VSC as that of model A but a smaller bias in the RMSL, which suggests that model B performs better than model A. However, the RMSVD is greater in model B than in model A (Fig. 3). Thus, the decrease in the RMSVD may not necessarily indicate an improvement in model performance. On the other hand, given the drawback of the RMSL in measuring the accuracy of RMS values, the model skill score, defined based on the RMSL and VSC in Xu et al. (2016), is also not well suited for measuring the model performance in simulating multiple scalar fields. To better measure model performance, we define a multivariable integrated evaluation index (MIEI) based on the VFE diagram (Fig. 3):

$$MIEI^2 = RMSD_L^2 + c^2$$

where $c$ is the side of the triangle in Fig. 3 that lies opposite the angle $\arccos R_v$ between two sides of unit length. Based on the law of cosines, we have

$$c^2 = 1 + 1 - 2R_v = 2(1 - R_v)$$

Thus, the MIEI can be written as:

$$MIEI = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(rms_{Ai}^\star - 1\right)^2 + 2(1 - R_v)} = \sqrt{\left(\overline{rms_A^\star} - 1\right)^2 + \sigma_{rms^\star}^2 + 2(1 - R_v)} \tag{24}$$

Clearly, the MIEI takes both the amplitudes and pattern similarities of various variables into account and therefore can provide a comprehensive evaluation of model performance (Eq. 24). In contrast to the RMSVD, the MIEI satisfies the monotonic property of an index with respect to model performance. Specifically, for any given $\sigma_{rms^\star}$ and $\overline{rms_A^\star}$, the MIEI decreases monotonically with the increase in $R_v$. For any given $\sigma_{rms^\star}$ and $R_v$, the MIEI decreases monotonically as $\overline{rms_A^\star}$ approaches 1. For any given $\overline{rms_A^\star}$ and $R_v$, the MIEI decreases monotonically with the decrease in $\sigma_{rms^\star}$. The MIEI is equal to 0 only when $\sigma_{rms^\star} = 0$, $\overline{rms_A^\star} = 1$, and $R_v = 1$, which define a perfect model. In other words, modeled multiple fields are exactly the same as the observed ones when the MIEI is equal to 0.
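The monotonic behavior described above can be explored with a small sketch. It assumes that the MIEI has the form sqrt(RMSD_L^2 + 2(1 - R_v)), where RMSD_L^2 is the mean squared deviation of the normalized modeled RMS values from 1. That form is an assumption made for illustration, consistent with the properties listed in the text. The function name and input values are hypothetical.

```python
import math

def miei(rms_model, r_v):
    """MIEI under the assumed form sqrt(mean((rms* - 1)^2) + 2 * (1 - R_v))."""
    amplitude_term = sum((r - 1.0) ** 2 for r in rms_model) / len(rms_model)
    pattern_term = 2.0 * (1.0 - r_v)
    return math.sqrt(amplitude_term + pattern_term)

# Hypothetical normalized RMS values of four modeled scalar fields.
rms_star = [1.02, 1.30, 0.85, 1.10]

# For fixed amplitudes, the index shrinks as the pattern similarity grows.
for r_v in (0.90, 0.95, 1.00):
    print(r_v, miei(rms_star, r_v))

# A perfect model (all RMS ratios equal to 1 and R_v = 1) gives an index of 0.
assert miei([1.0, 1.0, 1.0, 1.0], 1.0) == 0.0
assert miei(rms_star, 0.95) < miei(rms_star, 0.90)
```

Under this form the amplitude and pattern contributions add in quadrature, so neither can compensate for a deficiency in the other.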
As interpreted in section 2, the RMSVD is determined based on the sum of quadratic RMSDs of various scalar fields (Eq. 16). Thus, the RMSVD is equivalent to the model climate performance index used in previous studies (e.g., Gleckler et al., 2008; Radić and Clarke, 2011; Chen and Sun, 2015). In general, both the RMSVD and MIEI can be used to measure the model performance. However, the MIEI is expected to provide a more accurate evaluation of model performance than the RMSVD. For example, model 3 shows a smaller RMSVD but a larger MIEI compared to model 2 (Table 1, Fig. 2). The RMSVD and MIEI give opposite ranks for the performances of models 2 and 3. Note that model 3 shows a much greater standard deviation of RMS values (0.20) than that of model 2 (0.04), suggesting that model 3 poorly simulates the relative amplitude of the four variables. Such information is not considered by the RMSVD but can be captured by the MIEI (Eq. 21).
The values of the MIEI derived from various models are also shown in Fig. 2. A smaller MIEI generally indicates a better performance of the climate model. For example, models 1 and 6 show smaller MIEIs than those of other models. Models 1 and 6 show higher VSC values, smaller $RMSD_L$ values, and a close correspondence of RMS values with the observed ones (Fig. 2). The MIEI can serve as an index to determine the rank of climate model performance in simulating multiple fields.
In comparison with the MIEI, the VFE diagram can provide a more detailed evaluation of the model performance by explicitly showing multiple statistics, i.e., pattern similarity, RMS values and their variances, and RMSVD.
The issue of how to take the observational uncertainties into account is of particular importance in model evaluation and ranking, especially as more and more observational datasets provide estimates of the observational uncertainty. The statistics derived from each group of observational estimates are also shown in Table 1, which can roughly quantify the observational uncertainties and their impact on model evaluation. Generally, the colors are clearly lighter for the statistics of individual observed variables than for the modeled variables (Table 1). This indicates that the observational uncertainties are relatively small and should have less impact on the evaluation of model performance. To further quantify the impacts of observational uncertainty on ranking model performance, we calculate the MIEIs of various climate models by taking each group of observational estimates as the reference data. Three groups of observational estimates generate three groups of MIEIs. Afterwards, we calculate Spearman's rank correlation coefficient between each group of MIEIs and the MIEIs derived using the ensemble mean of the multiple observational estimates as the reference. The Spearman's rank correlation coefficients are 0.996, 0.996, and 0.904, respectively, suggesting that the ranks are very close to each other no matter which group of observational estimates is used as reference data. Thus, the observational uncertainty should have less impact on ranking model performance in this case. One can use the average of Spearman's rank correlation coefficients to quantify the consistency of various ranks when a number of observational estimates are available.
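The rank comparison can be sketched with a plain implementation of Spearman's rank correlation coefficient. The sketch assumes there are no tied values, so the simplified formula based on squared rank differences applies. The MIEI values below are invented for illustration and are not taken from Table 1.

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for position, k in enumerate(order, start=1):
        result[k] = position
    return result

def spearman(x, y):
    """Spearman's rank correlation for untied data: 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical MIEIs of five models against two different reference datasets.
miei_ref = [0.31, 0.45, 0.52, 0.40, 0.60]  # reference: ensemble mean of observations
miei_alt = [0.30, 0.47, 0.44, 0.42, 0.58]  # reference: one observational group

print(spearman(miei_ref, miei_alt))
```

A coefficient close to 1 indicates that the choice of reference data barely changes the model ranking.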

Summary and discussion
The multivariable integrated evaluation (MVIE) method proposed here provides a concise way of representing the multiple statistics of multiple fields on a two-dimensional plot, i.e., the VFE diagram. The VFE diagram includes three statistical quantities, i.e., RMSL, VSC, and RMSVD, representing different aspects of model performance. Specifically, the RMSL (RMSVD) represents the total mean value and variance (total RMSDs) of all scalar fields. The VSC measures the overall pattern similarity across all scalar fields. As shown in the example, each of the three statistical quantities can reasonably represent the corresponding statistics of multiple scalar fields. Moreover, the VFE diagram can illustrate how much the overall RMSD of various fields is attributable to the difference in RMS values and how much is due to poor pattern similarity. Thus, one can summarize multiple statistics of multiple variables for various models in a diagram and facilitate the intercomparison of model performances in simulating multiple variables. The MVIE method can be applied to spatial and/or temporal fields. It can also simultaneously evaluate various temporal variabilities simulated by models, e.g., climatological mean state and the amplitude of interannual variability as shown in section 3.2. Based on the VFE diagram, we also developed a multivariable integrated evaluation index (MIEI) which takes the amplitude and pattern similarity of multiple fields into account. The MIEI satisfies the criterion that a model performance index should vary monotonically as the model performance improves. The MIEI provides a more concise evaluation than the VFE diagram of model performance in simulating multiple fields.
The statistical metrics presented in this paper can be divided into three different levels and their relationships are summarized in a pyramid chart (Fig. 4). The first level of metrics, i.e., correlation coefficient, RMS value, and RMSD, measures model performance in terms of individual variables. These metrics can be illustrated by a table of metrics (Table 1) or other model performance metrics as needed.
As shown in section 2, the VFE diagram can be constructed by using uncentered statistics, which are computed using the full scalar fields, including both mean and anomaly. The VFE diagram can also be computed by using centered statistics (Appendix A). The centered RMSL of a vector represents the overall variance of all components of a vector field (Eq. A3). The centered VSC can be interpreted as a weighted average of Pearson's correlation coefficients, which measures the overall pattern similarity across all paired anomaly fields (Eq. A9). The centered RMSVD measures the sum of centered RMSDs across all paired components between two vector fields (Eq. A12). The type of statistics, i.e., centered or uncentered statistics, that should be used depends on the application. The uncentered statistics should be used if both the mean and anomaly need to be evaluated. In contrast, the centered statistics should be used if the anomaly fields are the primary concern.
The centered correlations alone are not sufficient for detection studies (Legates and Davis, 1997). It has been argued that the uncentered statistics are better suited for detection because they incorporate the response of the mean value. In contrast, the centered statistics are more appropriate for attribution because they better measure the similarity between spatial patterns (Hegerl et al., 2001). The VFE diagram provides us flexibility in model evaluation. In terms of model evaluation aimed at a detection study, one can compute the uncentered statistics with full fields. In contrast, one can use centered statistics by computing the statistical quantities with vector anomaly fields if an attribution study is the major concern of model evaluation.

In practice, one may want to weight different fields based on their relative importance. If some variables to be evaluated are dependent on each other, e.g., skin temperature and surface air temperature, one may also want to weight these variables properly because the dependent variables contain redundant information. Otherwise, the evaluation may overestimate the importance of the dependent variables. Determining the weight coefficient depends on the application and therefore is beyond the scope of this study. Here, we only discuss how the weight can be considered in the multivariable integrated evaluation (Appendix B). The MVIE method presented in this study requires the normalization of each modeled and observed variable by dividing by the corresponding RMS value of the observed variable (Eqs. 19, 20). Therefore, one should weight different variables after the normalization (Eqs. B1, B2); otherwise the normalization process will remove the weight coefficient. Weighting each normalized field leads to a quadratic weighting of the quadratic RMS values, quadratic RMSDs, and correlation coefficient (Eqs. B1, B5, B8, B11).

The VFE diagram and MIEI may also provide some guidance in weighting various climate models to constrain future climate projection. A recent study suggested that model weighting should take both model performances and model interdependencies into account to improve climate projections (Knutti et al., 2017). On the one hand, the VFE diagram can summarize model performances in terms of multiple statistics of multiple variables. On the other hand, the VFE diagram can also clearly show the differences between model and observation as well as the differences between various models. This information provided by the VFE diagram may be used in weighting climate models, which warrants further study.

Code availability
The code used in the production of Figure 2 and Table 1 is available in the supplement to the article.

Appendix A: Decomposition of RMSL, VSC, and RMSVD
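To further interpret the RMSL, VSC, and RMSVD, we break down the full vector fields A and B into the mean and anomaly:

$$\mathbf{A}_j = \bar{\mathbf{A}} + \mathbf{A}'_j, \qquad \mathbf{B}_j = \bar{\mathbf{B}} + \mathbf{B}'_j; \qquad j = 1, 2, \ldots, N$$

The squared RMSL of vector field A is written as follows:

$$L_A^2 = \frac{1}{N}\sum_{j=1}^{N}|\bar{\mathbf{A}} + \mathbf{A}'_j|^2 = |\bar{\mathbf{A}}|^2 + \frac{2}{N}\,\bar{\mathbf{A}}\cdot\sum_{j=1}^{N}\mathbf{A}'_j + \frac{1}{N}\sum_{j=1}^{N}|\mathbf{A}'_j|^2$$

Given $\sum_{j=1}^{N}\mathbf{A}'_j = 0$, $L_A^2$ can be written as:

$$L_A^2 = L_{\bar{A}}^2 + L_{A'}^2$$

where $L_{\bar{A}} = |\bar{\mathbf{A}}|$ is the RMSL of the mean vector field, $L_{A'} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}|\mathbf{A}'_j|^2} = \sqrt{\sum_{i=1}^{M}\sigma_{Ai}^2}$ is the RMSL of the vector anomaly field, and $\sigma_{Ai}$ is the centered RMS value (or standard deviation) of the i-th component of vector field A.

Equation (A1) can be written as:

$$L_A^2 = \sum_{i=1}^{M}\bar{a}_i^2 + \sum_{i=1}^{M}\sigma_{Ai}^2$$

The RMSL of vector field A, $L_A$, measures the overall mean value and variance of all components of the vector field.

Similarly, we have

$$L_B^2 = L_{\bar{B}}^2 + L_{B'}^2 = \sum_{i=1}^{M}\bar{b}_i^2 + \sum_{i=1}^{M}\sigma_{Bi}^2$$

where $\sigma_{Bi}$ is the centered RMS value (or standard deviation) of the i-th component of vector field B.

With the support of Eq. (13), the VSC can be written as:

$$R_v = \frac{1}{L_A L_B}\sum_{i=1}^{M}\left(\bar{a}_i\bar{b}_i + \frac{1}{N}\sum_{j=1}^{N}a'_{ij}b'_{ij}\right) \tag{A7}$$

Given the Cauchy-Schwarz inequality, Eq. (A7) can be rewritten as:

$$R_v = \frac{1}{L_A L_B}\sum_{i=1}^{M}\left(\bar{a}_i\bar{b}_i + \sigma_{Ai}\sigma_{Bi}R'_i\right) \tag{A8}$$

Eq. (A8) can be rewritten as:

$$R_v = \frac{1}{L_A L_B}\left(\sum_{i=1}^{M}\bar{a}_i\bar{b}_i + L_{A'}L_{B'}R'_v\right), \qquad R'_v = \frac{\sum_{i=1}^{M}\sigma_{Ai}\sigma_{Bi}R'_i}{L_{A'}L_{B'}} \tag{A9}$$

where $\sigma_{Ai}$ and $\sigma_{Bi}$ are the centered RMS values (or standard deviations) of the i-th component of vector fields A and B, respectively. $R'_i$ represents the centered correlation coefficient between the i-th paired components of vector fields A and B. $R'_v$ can be interpreted as a weighted average of the centered correlation coefficients across all paired components between two vector fields. The weight coefficients are proportional to the product of standard deviations between paired variables. Clearly, the VSC is simultaneously determined based on the correlation of various mean fields and the overall correlation of anomaly fields across all paired components between two vector fields (Eqs. A6, A7, A9).

The RMSVD between two vector fields can also be represented by the mean and anomaly fields:

$$RMSVD^2 = \frac{1}{N}\sum_{j=1}^{N}\left|(\bar{\mathbf{A}} - \bar{\mathbf{B}}) + (\mathbf{A}'_j - \mathbf{B}'_j)\right|^2 = \overline{RMSVD}^{\,2} + RMSVD'^{\,2}$$

where $\overline{RMSVD} = |\bar{\mathbf{A}} - \bar{\mathbf{B}}|$ is the RMSVD between mean vector fields A and B, which represents the mean difference of all fields, and $RMSVD' = \sqrt{\frac{1}{N}\sum_{j=1}^{N}|\mathbf{A}'_j - \mathbf{B}'_j|^2}$ is the centered RMSVD between two vector fields, which represents the overall RMSD across all paired components of vector anomaly fields A and B. From the viewpoint of MVIE, the RMSVD can be interpreted as the overall mean difference of all fields plus the overall RMSD of all anomaly fields.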

The statistics can be computed based on the full vector fields or anomaly vector fields depending on the concern of evaluation. The statistical quantities, i.e., RMSL, VSC, and RMSVD, computed based on the full vector fields represent the uncentered pattern statistics, which include the statistics from both the mean and anomaly fields. Alternatively, three statistics can also be computed based on the anomaly fields, yielding centered statistics, which only measure the anomaly fields. The full vector fields should be used if both the mean and anomaly need to be evaluated. In contrast, the anomaly vector fields should be used if anomaly fields are the primary concern.
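The split between mean and anomaly contributions can be verified numerically. The sketch below assumes that the anomaly of each component is taken relative to its mean over all samples. It checks that the squared uncentered RMSL and RMSVD each equal a mean part plus a centered part. Helper names and data are hypothetical.

```python
import math

def rmsl(field):
    """Root mean square length of a sequence of vectors."""
    return math.sqrt(sum(sum(c * c for c in v) for v in field) / len(field))

def mean_vector(field):
    """Component-wise mean over all samples."""
    n, m = len(field), len(field[0])
    return tuple(sum(v[i] for v in field) / n for i in range(m))

def anomalies(field):
    """Field with the component-wise mean removed."""
    mean = mean_vector(field)
    return [tuple(x - mu for x, mu in zip(v, mean)) for v in field]

def difference(a, b):
    """Sample-by-sample difference of two fields."""
    return [tuple(x - y for x, y in zip(u, v)) for u, v in zip(a, b)]

# Toy 2-component fields with N = 3 samples (hypothetical values).
A = [(1.0, 2.0), (0.6, 1.5), (1.4, 2.5)]
B = [(0.9, 2.2), (0.8, 1.4), (1.1, 2.1)]

# RMSL: squared full value = squared mean part + squared centered part.
L_mean = math.sqrt(sum(c * c for c in mean_vector(A)))
L_anom = rmsl(anomalies(A))
assert math.isclose(rmsl(A) ** 2, L_mean ** 2 + L_anom ** 2)

# RMSVD: squared mean difference + squared centered RMSVD.
full = rmsl(difference(A, B)) ** 2
mean_part = sum((x - y) ** 2 for x, y in zip(mean_vector(A), mean_vector(B)))
centered = rmsl(difference(anomalies(A), anomalies(B))) ** 2
assert math.isclose(full, mean_part + centered)
```

Choosing centered statistics amounts to dropping the mean parts and keeping only the terms computed from the anomaly fields.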
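The square of the RMSVD between normalized vector fields $\mathbf{A}^\star$ and $\mathbf{B}^\star$ can be written as follows:

$$\left(RMSVD^\star\right)^2 = \sum_{i=1}^{M}\left(RMSD_i^\star\right)^2 \tag{B6}$$

where $RMSD_i^\star$ is the RMSD of the i-th paired components between normalized vector fields $\mathbf{A}^\star$ and $\mathbf{B}^\star$. Similarly, the square of the RMSVD between weighted vector fields $\mathbf{A}^w$ and $\mathbf{B}^w$ can be written as follows:

$$\left(RMSVD^w\right)^2 = \sum_{i=1}^{M}\left(RMSD_i^w\right)^2 \tag{B7}$$

where $RMSD_i^w$ is the RMSD of the i-th paired components between weighted vector fields $\mathbf{A}^w$ and $\mathbf{B}^w$. With the aid of Eqs. (B1), (B2), (B6), and (B7), we obtain

$$\left(RMSVD^w\right)^2 = \sum_{i=1}^{M} w_i^2\left(RMSD_i^\star\right)^2 \tag{B8}$$

where $w_i$ is the weight applied to the i-th normalized component. The RMSVD between two vector fields is determined based on the weighted RMSDs across all paired components of two vector fields. The contribution of the i-th RMSD to the quadratic RMSVD between two vector fields is weighted by $w_i^2$.

Based on Eq. (13), the VSC between normalized vector fields $\mathbf{A}^\star$ and $\mathbf{B}^\star$ can be written as follows:

$$R_v^\star = \frac{\sum_{i=1}^{M} rms_{Ai}^\star\, rms_{Bi}^\star\, R_i^\star}{L_A^\star L_B^\star} \tag{B9}$$

Similarly, the VSC between weighted vector fields $\mathbf{A}^w$ and $\mathbf{B}^w$ can be written as follows:

$$R_v^w = \frac{\sum_{i=1}^{M} rms_{Ai}^w\, rms_{Bi}^w\, R_i^w}{L_A^w L_B^w} \tag{B10}$$

where $rms_{Ai}^w$, $rms_{Bi}^w$, and $R_i^w$ are the same as $rms_{Ai}^\star$, $rms_{Bi}^\star$, and $R_i^\star$, respectively, except they are computed based on the weighted vector fields $\mathbf{A}^w$ and $\mathbf{B}^w$. With the aid of Eqs. (B1), (B2), (B9), and (B10), we obtain

$$R_v^w = \frac{\sum_{i=1}^{M} w_i^2\, rms_{Ai}^\star\, rms_{Bi}^\star\, R_i^\star}{L_A^w L_B^w} \tag{B11}$$

The VSC is determined based on the sum of the products of the uncentered correlation coefficients and the RMS values. The contribution of the i-th product term, $rms_{Ai}^\star\, rms_{Bi}^\star\, R_i^\star$, to the VSC is weighted by $w_i^2$.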
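The quadratic effect of the weights on the RMSVD can be illustrated with a short sketch. It assumes that weighting means multiplying the i-th normalized component of both fields by a weight w_i. The function names, the weights, and the toy fields (treated here as already normalized) are hypothetical.

```python
import math

def component_rmsd(a, b):
    """RMSD of each paired component between two fields."""
    n, m = len(a), len(a[0])
    return [math.sqrt(sum((u[i] - v[i]) ** 2 for u, v in zip(a, b)) / n)
            for i in range(m)]

def rmsvd(a, b):
    """Root mean square vector deviation between two fields."""
    total = sum(sum((x - y) ** 2 for x, y in zip(u, v)) for u, v in zip(a, b))
    return math.sqrt(total / len(a))

def apply_weights(field, weights):
    """Multiply the i-th component of every vector by weights[i]."""
    return [tuple(w * x for w, x in zip(weights, v)) for v in field]

# Toy fields treated as already normalized (hypothetical values).
A_star = [(1.1, 0.9), (0.8, 1.2), (1.0, 1.1)]
B_star = [(1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
w = (1.0, 0.5)  # hypothetical weights: the second variable counts half

# Squared weighted RMSVD equals the w_i^2-weighted sum of squared RMSDs.
lhs = rmsvd(apply_weights(A_star, w), apply_weights(B_star, w)) ** 2
rhs = sum((wi * d) ** 2 for wi, d in zip(w, component_rmsd(A_star, B_star)))
assert math.isclose(lhs, rhs)
print(lhs, rhs)
```

Because the weights enter squared, halving a weight reduces the contribution of that variable to the squared RMSVD by a factor of four.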

Author contribution
Z. Xu devised the evaluation method and wrote the paper. All of the authors discussed the results and commented on the manuscript.

Table 1. Multiple statistics of CMIP5 models in simulating surface air temperature and precipitation in terms of climatological mean state and interannual variability. Tm (Pm): climatological mean surface air temperature (precipitation) in summer (June-July-August). Ta (Pa): temporal standard deviation of summer surface air temperature (precipitation). CMIP5 simulations and three individual groups of observational datasets are compared with the ensemble mean of three groups of SAT and precipitation data observed during the period from 1961 to 2000. RMS: the ratio of modeled to observed root mean square (RMS) values of the spatial pattern for each variable. CORR (RMSD): uncentered spatial correlation coefficient (root mean square deviation) between model and observational fields. RMSL, Rv, and RMSVD measure the statistics of two vector fields, which can represent the overall statistics of all fields (Eqs. 3, 13, 16). RMSL is shown as the ratio of the model-simulated RMSL to the observed RMSL. RMS_stddev is the standard deviation of four RMS values, which describes the dispersion of the RMS values of Tm, Pm, Ta, and Pa (Eq. 23). MIEI: multivariable integrated evaluation index (Eq. 24).

Tables
Model performance is indicated by the color scale: lighter colors denote better model performance.