From R-squared to coefficient of model accuracy for assessing "goodness-of-fits"

Modelers tend to focus more on advancing methods of statistical and mathematical modeling than developing novel techniques for comparing modeled results with observations or establishing metrics for model performance assessment. Perhaps solely the most extensively applied "goodness-of-fit" measure especially for assessing performance of regression models is the coefficient of determination R. Normally, high R tends to be associated with an efficient model. Nevertheless, R has been cited to have no importance in the classical model of regression. Even in its use in descriptive 10 statistics, R is known to have questionable justification. R is inadequate in assessing model performance because it does not give any information on the model residuals. Furthermore, R can be low for an effective model. Contrastingly, a very poor model fit can yield high R. Regressing X on Y yields R which is the same as that if Y is regressed on X thereby invalidating its use as a coefficient of determination. Taking into account the drawbacks of using R, this paper introduces coefficient of model accuracy (CMA) the derivation of which comprises an analogy to the R. However, instead of simply squaring an 15 ordinary Pearson's product-moment correlation coefficient to obtain R, CMA comprises the product of nonparametric sample correlation and model bias. Acceptability of the introduced method can be found demonstrated through comparison of results from simulations by hydrological models calibrated using CMA and other existing objective functions. MATLAB and R codes as well as an illustrative MS Excel file to compute the CMA were provided.


Introduction 20
Despite the advances in methods of statistical and mathematical modeling, Alexander et al. (2015) asserts that there continues to exist considerable lack of focus on improving ways to judge the quality of the models. Here, quality can be thought of in terms of how well a model fits a set of observations and this can be described as "goodness-of-fit". The coefficient of determination (hereinafter denoted as R 2 or interchangeably used with R-squared) is perhaps solely the most extensively applied measure of "goodness-of-fit" especially for regression models (Kvålseth, 1985). In various studies from 25 the various disciplines of Geosciences (also called Earth sciences) such as atmospheric science, hydrology, and environmental science, R-squared is commonly applied in many studies. For instance in hydrology, some of the studies which applied correlation and/or R 2 to evaluate model performance include Chen et al. (2019), Lane et al. (2019), Faiz et al. (2018), Krysanova et al. (2018), Nagraj et al. (2018), Unduche et al. (2018), Bennett et al. (2013), and Ritter and Muñoz-https://doi.org/10.5194/gmd-2020-51 Preprint. Discussion started: 4 May 2020 c Author(s) 2020. CC BY 4.0 License. Carpena (2013). In various studies, researchers also commonly use correlation and/or R 2 to evaluate reanalyses or satellite 30 precipitation products. Examples of such studies include Fallah et al. (2020), Shaodan et al. (2020), Irvem and Ozbuldu (2019), Trinh-Tuan et al. (2019), Zandler et al. (2019), Derin et al. (2017), Jiang et al. (2017), Wang et al. (2017), Peña-Arancibia et al. (2013), and Ward et al. (2011).
In assessments of "goodness-of-fits", there is the common tendency to associate a high value of R 2 to an effective model (Quinino et al., 2013). However, there have been very strong statements made by some researchers such as Goldberg 35 (1991), and Cameron (1993) on the use of R 2 to measure "goodness-of-fit". Goldberg (1991) asserted that "the most important aspect of R 2 is that it has no importance in the classical model of regression". Furthermore, R 2 is not a statistical test, and there seems to be no intuitive justification for its use as a descriptive statistic" (Cameron 1993), suggesting that the value of R 2 should not even be reported (Quinino et al., 2013). The use of correlation-based approaches (such as R 2 ) for evaluating model performance has been argued by a number of researchers such as Legates and Davis (1997) and Willmott 40 (1981) to be an inappropriate practice. Further insights on the limitations of R 2 were given by some researchers such as Sherri and McGuire (2019), Li (2017), Alexander et al. (2015), Krause et al. (2005), Schemper (2003), Golbraikh (2002), Weglarczyk (1998), and Cameron and Windmeijer (1997). A few reasons why R 2 is inadequate to assess predictive power of models are that: R 2 can be low for an accurate model, and on the other hand, an inaccurate model can yield high R 2 (Shalizi, 2015). Another problem is that the value of R 2 when we regress X on Y turns out to be the same as the R 2 attainable by 45 regressing Y on X. This means that R 2 cannot be taken to be indicative of the variance in observations explained by a model, thereby invalidating the use of R 2 as the coefficient of determination. Furthermore, the value of R 2 does not give any information on the model residuals. In other words, R 2 only quantifies dispersion but not bias in the data. For instance, if we take 10% of observed data to act as modeled results, the R 2 value will still be 100% despite the 90% bias. This means that R 2 should always be accompanied with residual analyses. The common practice is to use residual plots to give an insight on the 50 bias. Furthermore, support for the analyses of residuals tend to be independently obtained using Root Mean Squared Error (RMSE), and the Model Average Bias (MAB). By the time of writing this paper, there was no formula available which addresses the shortcomings of R 2 by comprehensively quantification of dispersion and the measure of bias. Therefore, this study aimed at introducing coefficient of model accuracy (CMA) the derivation of which bears an analogy to the well-known R 2 . The formula of CMA was carefully derived as a product of non-parametric correlation 55 coefficient and the measure of model bias. CMA varies over the range 0-1. The remaining parts of this paper were organized as follows. Section 2 comprises a stepwise derivation of CMA. Comparison of CMA and other "goodness-of-fit" measures was made using a modeling and simulation case study as presented in Section 3. Results and discussion of the case study were presented in Section 4. Finally, conclusions were made in Section 5. https://doi.org/10.5194/gmd-2020-51 Preprint. Discussion started: 4 May 2020 c Author(s) 2020. CC BY 4.0 License.

Step-wise derivation of CMA
The first step is the transformation of modeled data Y. To do so, consider u i as the number of times the i th data point is 80 exceeded by others within the given sample. Again, for the given sample, take v i as the number of times the i th data point exceeds others. Let e i denote the number of times the i th data point appears within the sample. We can non-parametrically transform Y and X in terms of the difference between exceedance and non-exceedance counts of data points to obtain series d y and d x , respectively using , , , where u y, v y , and e y denote u , v, and e applied to y's. Similarly, the u x, v x , and e x represent u , v, and e applied to X.
It is worth noting that the d y and d x from Eqs. (5) and (6), respectively, can be comparable to simply ranking Y and X in descending order. The difference between simply ranking the data in descending order and applying transformation using 95 Eqs. (5) or (6) is that the ordinary data ranks would all be positive and with their mean not equal to zero; however, d y or d x has a mean of zero and varies over the range  

 n
Thus, the f in Eq. (7) is simpler to compute using d y and d x than when ordinary ranks of Y and X are considered. The values of f ranges from -1 to 1. When 0, f  it means X and Y are totally uncorrelated or the non parametric linear regression slope is zero. The cases  YX and  YX yield 1 f  and 1,  f respectively. Importantly, unlike r from Eq. (1), f (Eq. 7) is not susceptible to possible outliers in the data. 100 The third step of deriving CMA includes quantification of the measure of model bias. Ordinarily, model bias or error tends to be in terms of the difference between the modeled and observed data points. In this case, summation of the errors requires normalization by a measure of the variability in the observations. This is the basis for a number of metrics such as MAB, RMSE, and NSE which all present the common issue of not having their values within the "standard" range of zero to one.
Let the comparison baseline  be a function of x for instance ,   x and 3 .
  x Here, x as defined in Eq. (1) 110 is based on the original or untransformed X . Consider the deviations 1  and 2  to be used in expressing the measure of model bias β. If min and max denote the minimum and maximum of any two values, Normally errors of opposite signs can cancel each other during their summation. This was avoided in Eqs.  and 2  were considered such that for 1 ≤ i ≤ n, The values of β ranges from zero to one. An ideal model yields β = 1. When β = 0, it means the outputs of the model can simply be represented by the mean of observed data. Analogous to Eq. (4), the CMA can be computed using, The values of CMA ranges from 0 to 1. CMA equal to one indicates a perfect model (no errors). However, CMA equal to zero indicates that the model is not better than the comparison baseline (such as the mean of observed data). An important note on selection of  is that putting   x in Eqs. (9) and (10)    x was adopted in this paper.
3 Case study

Data and selected models
It is a common practice to compare a new method with existing ones in geosciences. To do so, quality controlled daily hydro-meteorological data consisting of the Blue Nile flow observed at El Diem, as well as potential evapotranspiration 135 (PET) and rainfall over and around the basin were adopted in a catchment-wide form from a previous study (Onyutha, 2016). The adopted PET, river flow and rainfall were daily series covering the period 1980-2000. Two hydrological models selected were selected to generate "goodness-of-fits" for comparison. These models included the Hydrological Model focusing on Sub-flows' Variation (HMSV) of Onyutha (2019), and Nedbør-Afstrømnings-Model (NAM) (Danish Hydraulic Institute DHI, 2007;Madsen, 2000;Nielsen and Hansen, 1973). These models were adopted for illustration because of their lumped 140 conceptual frameworks or structure which are compatible with the adopted catchment-wide averaged PET and rainfall. Daily PET and rainfall were used as model inputs. The model output was runoff and this was compared with the observed river flow.

Comparison of the CMA with other "Goodness-of-fits"
For the calibration of HMSV and NAM, the strategy for automatically changing the model parameters was based on the 145 Generalized Likelihood Uncertainty Estimation (GLUE) of Beven and Binley (2001). As a Bayesian approach, GLUE required several sets of model parameters which were randomized within stipulated limits. For each set of parameters, an objective function (also herein taken as the "goodness-of-fits") was selected and the HMSV or NAM run to obtain 5000 sets of simulations. The objective functions included the CMA (Eq. 14), the ordinary coefficient of determination R 2 (Eq. 4), https://doi.org/10.5194/gmd-2020-51 Preprint. Discussion started: 4 May 2020 c Author(s) 2020. CC BY 4.0 License. RMSE (Eq. 15), MAB (Eq. 16), NSE (Nash and Sutcliffe, 1970) Efficiency NSE (Eq. 17), and IoA (Willmott, 1981) IoA 150 (Eq. 18) proposed the such that The number of parameter sets for which each hydrological model was run was set to 5000. The optimal parameters were those in the set which yielded the best value of the objective function. In other words, the best set of parameters was obtained with the maximum values of CMA and R-squared, while RMSE and MAB were required to at their minimum. The calibration was able to yield 5000 values of each "goodness-of-fit" for comparison.
For further analyses, simulated modeled was obtained based on the calibration using each of the objective functions. 160 Comparison of the "goodness-of-fits" was made in terms of the difference between observed and modeled flow computed using 8 criteria or metrics annual maximum series (MMaxS), annual minimum series (MMinS), long-term mean of the daily series, Coefficient of Variation (CV), skewness, kurtosis, and Inter-Quartile range (IQR).
The "goodness-of-fits" were ranked such that the one with the lowest absolute difference was given a rank of 1. In other words, a rank of 6 was given to the "goodness-of-fit" which yielded the largest absolute difference. The sum of ranks from 165 all the 8 criteria was obtained. The "goodness-of-fit" with the lowest sum of ranks was considered to yield the best modeled results. relationship between RMSE and MAB is linear with a negative slope (Fig. 1a). This means the values of both RMSE and https://doi.org/10.5194/gmd-2020-51 Preprint. Discussion started: 4 May 2020 c Author(s) 2020. CC BY 4.0 License.
MAB are large in magnitude for a poorly fit model. However, with increasing performance of the model, both RMSE and MAB tend toward zero. However, based on whether the model over-estimates or under-estimates the observed data points, MAB can take any positive or negative value, respectively. From Fig.1b, it is evident that for MAB of large magnitude, the CMA is zero indicating a model with a very poor fit. However, as the magnitude of MAB reduces, CMA increases such that 175 for the set of optimal parameters, MAB and CMA tend o zero and one, respectively. The relationship between RMSE and CMA ( Fig. 1c) was found to follow a power function. Just like for the MAB, CMA is zero for large values of RMSE (Fig.   1c) indicating poor model fit. With improvement in the model performance, RMSE reduces while CMA tends towards 1.
An ideal model yields MAB or RMSE of zero. However, both RMSE and MAB do not have standard ranges. RMSE takes 180 any value from zero to positive infinity. MAB ranges from negative infinity to positive infinity. Thus, when MAB or RMSE is not zero (which is often the case in normal modeling practices), the judgment of the model performance becomes subjective. Sums-of-squares-based error or deviation statistics such as the mean-absolute deviation and RMSE yield values which are counterintuitive and are therefore inappropriate measures of the typical errors (Mielke and Berry, 2001). This problem stems from the fact that each squared error may not meaningfully be comparable with other squared errors which 185 lead to the set of squared errors; in other words, the triangle inequality may not be satisfied (Mielke and Berry, 2001).
Eventually, there are interpretational difficulties or ambiguities inherent in sums-of-squares-based error statistics (such as RMSE) due to their dependence on the average error and the variability within the set of error magnitudes (Wilmott, 2009).
It is noticeable that R-squared remained high whether RMSE or MAB was high or low ( Fig. 1e-f). Furthermore, when CMA 190 remained zero, R-squared varied over a wide range with some values close to 1 (Fig. 1d). This was because the R-squared does not quantify model bias as done by CMA. However, as CMA increased from zero towards one, the variability of the Rsquared reduced. This showed that with improvements in the model performance, the variability in observed and simulated data become increasingly comparable. Nevertheless, the general poor performance by the R-squared depicted its limitations (see Introduction Section) in case it is to be used for assessing model performance. capture model bias (an element which 195 basically influences performance of models).
Generally, the results from the HMSV agree with those of NAM with respect to the overall relationships between the "goodness-of-fits". However, because the HMSV and NAM have different structures and parameters, some slight differences between results from the two selected models the in terms of the magnitudes of the "goodness-of-fits" are noticeable. This difference is expected and cannot be surprising. In fact, even if one model was to be calibrated against observations from two 200 systems (i.e. catchments in this case), there would still be some differences in model results. This is because, the model inputs (such as rainfall and PET) and physiographic characteristics such as catchment area would differ and eventually, the parameter spaces to lead to model optimality would never be the same. With respect to comparison of R 2 and CMA, an important question to answer is: If X is regressed on Y, is it similar to regressing Y on X? Answering this question requires a close look at Eqs. (1), (4), (7) and (13) yy is commutative thereby ensuring that the value of r (and eventually the R 2 ) remains the same whether we are considering X versus Y or Y versus X. Similarly, f from Eq. (7) remains unchanged when we consider X versus Y or regressing Y on X. However, because of the differences in the means of X and Y, the 1  (Eq. 9) obtained by regressing X on Y is different from that when Y is regressed on X. Again, the 2  (Eq. 10) obtained through 210 regression of X on Y is not the same at that for when Y is regressed on X. Therefore, so long as means of X and Y are not the same, β (Eq. 13) and eventually CMA (Eq. 14) obtained by regressing X on Y is always different from that when Y is regressed on X. This explains why CMA is superior to R 2 . Figure 1 (Fig. 2a). However, with improvement in the model fit, both NSE and CMA tended towards 1. It is worth noting that NSE has a wide range of values. In Fig. 2a, c-d, NSE was as low as −60. Actually, NSE can yield values down to −∞. NSE is known to be sensitive to bias in model prediction and is normally influenced by the 220 outliers if present in the series (McCuen et al. 2006). The suitability of NSE has been on the modelers' radar for decades (see, for instance, Garrick et al.(1978), Martinec and Rango (1989), Legates and McCabe (1999), Krause et al.(2005), Criss and Winston (2008), Le Moine (2008), Gupta et al. (2009), Ehret andZehe (2011), Legates and McCabe (1999), and Legates and McCabe (2013)). As a result there are several variants of NSE based on its modifications to address some of the issues related to the use of the original version of Nash and Sutcliffe (1970). For instance, to overcome the oversensitivity of NSE 225 to peak high values stemming from the influence of squaring the error terms, logarithmic NSE is widely used (Krause et al. 2005). Another improvement was by Legates and McCabe (1999) by considering the ratio of the sum of absolute (instead of squared) differences between modeled and observed series divided by the sum of absolute (and not the squared) deviations of the observed data points from their mean. Despite the improvements, the original NSE continues to be more applicable than its variants, and thus its adoption in this study. 230

Comparison of CMA with NSE and IoA 215
The IoA increased towards its maximum value of one more rapidly than CMA (Fig. 2b-c). When CMA was zero (or for negative NSE), the values of IoA were noticeably high up to about 0.8 (Fig. 2b-c). In the same line, for negative NSE https://doi.org/10.5194/gmd-2020-51 Preprint. Discussion started: 4 May 2020 c Author(s) 2020. CC BY 4.0 License.
(indicating very poor model fit as depicted by zero CMA), the values of R-squared were high or close to 1 conversely showing good model performance (Fig. 2d-e). From Fig. 2e, it can be seen that for a small IoA, R-squared exhibited large values and vice versa. In the same line, for MAB of large magnitude, the IoA was still large (Fig. 2f). Unlike the NSE, other 235 metrics IoA, and R 2 range from 0 to 1. The issues on the use of R 2 for model assessment were already given in the Introduction Section. The original version of IoA (Willmott, 1981) which was adopted for comparison purpose has a major drawback of giving high values (close to 1) even for poorly fit models (Krause et al., 2005). This was the reason why when CMA was zero, the IoA was as high as 0.6 (Fig. 2b). To address the problems related to the use of IoA, Willmott et al. (2012) reformulated the IoA such that the refined metric is bounded by −1 and 1 like the correlation coefficient. In a relevant 240 communication by Legates and McCabe (2013), they remarked that the refinement of the original IoA by extending its bound over the range −1 to 0 was unnecessary. Other limitations of the refined of the refined IoA can be found elaborately given by Legates and McCabe (2013). Figure 2  245 Figure 3 shows observed versus modeled series. The optimal parameters obtained based on calibration by the various objective functions can be seen from Appendix A1. The corresponding values for the sets of optimal parameters based on the various objective functions were summarized in Appendix A2. As often done in hydrological modeling, the full time series is normally divided into two periods. The first and second periods are for calibration and validation, respectively. However, 250 in this study, the entire or full time series was used for calibration. This was because the focus of this study was more on the evaluation of "goodness-of-fits" or objective functions than transferability of model parameters from calibration period to perform simulations over the validation period. The bias in capturing peak high flows varied among the objective functions ( Fig. 3a-f). For instance, the use of MAB for calibration of the HMSV and NAM respectively led to over-estimations and under-estimations of high peak flow events (Fig. 3e). Possible contrast in the peak flows from the two models was due to the 255 differences in model structures and also the parameter spaces considered for calibration. Nevertheless, the results from the models calibrated using the various objective functions adequately captured the variation in observed flow. This indicated the acceptability of CMA as a "goodness-of-fit" measure.

Comparison of observed and modeled series
>>INSERT Figure 3 Figure 4 shows a results of observed and modeled flow in terms of distribution parameter and hydrological extremes. Apart 260 from the over-estimation based on MAB, variance and long-term mean of daily flow, CV, MMinS, and IQR were well captured based on calibrations using the various objective functions (Fig. 4a, d-f, h). It was found that the minimum difference in variance of observed flow and NAM-based modeled series was obtained when calibration was performed using the IoA as the objective function. R-squared was the best in reproducing the IQR when used to calibrate HMSV. However for NAM, the largest difference between the IQR of observed and that of modeled series was obtained using R-squared. 265 Again, the use of MAB for calibrating NAM (HMSV) reproduced the most (least) biased MMaxS. Over-estimations for skewness and actual excess kurtosis were slight and large, respectively. Usually when both skewness and actual excess kurtosis are zero, it means that the data follows a normal distribution. As realized from Fig. 4b-c, the observed flow was slightly positively skewed (skewness = 1.60) and leptokurtic (kurtosis = 1.79). This showed that the central peak of the distribution of observed data was higher and sharper while its tails were fatter and longer than the normal distribution. The 270 under-estimations of skewness and kurtosis could be due to failure of the models to capture the large intermittency (or large difference between maximum and minimum values) in the river flow data. In a data scarce region (like where the study area is located), extreme high and low peak flow events (like those falling outside the IQR) tend to be difficult to capture by hydrological models because of the data limitation and poor quality.
Results also show that on one hand, the use R-squared for calibration led to larger over-estimations of observed variance, 275 MMinS, and IQR (Fig. 4a, f, h) by NAM than HMSV. On the other hand, over-estimations of observed variance, MMinS, and IQR were larger for HMSV than those of NAM (Fig. 4a, f, h). Outputs from two models can not be the same because of the difference in model structures. For HMSV and NAM, the minimum difference between MMaxS of observed and modeled series was obtained using R-squared and MAB, respectively. For an ideal (or completely unbiased) model, the difference between observed and modeled flow is zero. Model bias can be brought about by various factors such as 280 unrealistic assumptions in developing the modeling concept, observation errors of model inputs, etc. When we consider only one model, results from Fig. 4 were comparable. This, again, showed the acceptability of the introduced "goodness-of-fit" metric.

Conclusions
This study introduced the CMA for assessing model performance. CMA is obtained as the squared product of nonparametric correlation and the measure of bias. CMA ranges from 0 to 1. CMA = 1 indicates an ideal model (no errors).
However, CMA = 0 shows that the model is not better than the comparison baseline (such as the mean of observed data).
Unlike R-squared, regressing X on Y yields CMA which is different from that when Y is regressed on X. To demonstrate the 290 suitability of the CMA, two hydrological models HMSV and NAM were calibrated using the GLUE strategy (Beven and Binley, 1992) based on a number of objective functions including CMA, R 2 , RMSE, MAB, NSE, and IoA. Results from the https://doi.org/10.5194/gmd-2020-51 Preprint. Discussion started: 4 May 2020 c Author(s) 2020. CC BY 4.0 License. calibration were compared among the objective functions. Because the CMA combines quantifications of both the dispersion in the data and model bias, it outperformed R 2 and was highly competitive with the other "goodness-of-fit" measures. Therefore, CMA can be considered as an alternative to R 2 in evaluation of model performance. 295

Conflict of interest
The author declares no conflict of interest and no competing financial interests.

Acknowledgment
The author is grateful to Copernicus Publications for granting the waiver of article processing charges for this paper.

Author Contribution statement 300
The entire work in this paper was based on the effort of the sole author.