Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not
Download
- Final revised paper (published on 19 Jul 2022)
- Preprint (discussion started on 11 Mar 2022)
Interactive discussion
Status: closed
RC1: 'Comment on gmd-2022-64', Anonymous Referee #1, 08 Apr 2022
Review comments for “Root mean square error (RMSE) or mean absolute error (MAE): when to use them or not”
General:
The manuscript provides an interesting discussion on the choice of RMSE and MAE from the likelihood perspective. When neither RMSE nor MAE is optimal, the author lists several options: refining the model, transforming the data, using robust statistics, and constructing a better likelihood. However, the suggested options are not quite helpful for users of the standard statistical metrics. For instance, the likelihood-based inference is deemed the most versatile by the author, but most readers are not expected to explore this option in their future applications. If possible, providing some examples with illustrations will be very helpful.
Overall the paper is well written and quite informative, although there are some questionable remarks and inaccurate statements as detailed below.
Major points:
The abstract mostly states the motivation of the paper and a not well-substantiated opinion. The author might need to rewrite the abstract to better reflect the actual content of this paper.
In Section 6 of the paper, “Why not use both RMSE and MAE?”, the author argues against using both RMSE and MAE. It is stated that presenting both metrics is “unnecessary and potentially confusing” “If the evidence strongly supports one over the other”. In most applications, such evidence strongly supporting one metric over the other is not easily attainable. While decomposing one metric into several independent components, as suggested by the author, is viable, using multiple metrics is still a practical way to avoid mistaken conclusions caused by relying on only one metric.
Specific points:
Abstract Lines 3-4, “Some of this confusion arises from a recent debate between Willmott and Matsuura (2005) and Chai and Draxler (2014), in which either side presents their arguments for one metric over the other. Neither:
While Chai and Draxler (2014) argued against Willmott and Matsuura's (2005) preference for MAE over RMSE, they did not favor RMSE over MAE. That is clearly stated in their abstract, as quoted below.
"The RMSE is more appropriate to represent model performance than the MAE when the error distribution is expected to be Gaussian. In addition, we show that the RMSE satisfies the triangle inequality requirement for a distance metric, whereas Willmott et al. (2009) indicated that the sums-of-squares-based statistics do not satisfy this rule. In the end, we discussed some circumstances where using the RMSE will be more beneficial. However, we do not contend that the RMSE is superior over the MAE. Instead, a combination of metrics, including but certainly not limited to RMSEs and MAEs, are often required to assess model performance."
Lines 33-34: “That recent shift may explain why Willmott and Matsuura (2005) and Chai and Draxler (2014) were unaware of the historical justification for MAE and RMSE; neither were they the first to overlook it”:
The author might be entitled to have such a judgement of others’ unawareness or overlook, but it is better to avoid such opinions in a scientific paper.
Equation 10: A closing parenthesis is missing.
Line 180: Please add a comma after “Laplace”.
Line 209. “… though wrongly suggest that MAE only applies to uniformly distributed errors”:
This is not accurate. Although Chai and Draxler (2014) gave one example with “uniformly distributed errors” to show that the MAE would be a good metric for such cases, it was not suggested that these are the ONLY cases where MAE would be appropriate. This statement is a misinterpretation of the paper.
Citation: https://doi.org/10.5194/gmd-2022-64-RC1
AC1: 'Reply on RC1', Timothy Hodson, 12 Apr 2022
Thank you for taking the time to review my manuscript. You make several good points that have helped me to clarify and strengthen my arguments. I don't agree with some of them, but they are all important to consider, and they illustrate the confusion left by earlier papers on this topic.
Major points
1. Regarding the abstract:
Yes, the abstract is reductive, particularly in the case of Chai and Draxler, but their paper is somewhat inconsistent in its arguments. They do state clearly their belief that neither metric is inherently better but offer little explanation for why and, instead, list several reasons for preferring RMSE to MAE. For the abstract, I would be willing to frame Chai and Draxler in this manner, rather than as favoring RMSE.
Chai and Draxler (2014) state their objective as "to clarify the interpretation of the RMSE and the MAE." I agree that this is an incredibly important topic, but I also believe their paper has important flaws. Rather than providing a point-by-point rebuttal to their work, my paper focuses on the classic proofs for why and when RMSE and MAE work. Besides settling some aspects of the debate, these proofs prepare the reader to understand how formal likelihoods can address the limitations of RMSE and MAE.
In responding to RC1, I will present some of that point-by-point rebuttal, focusing on three of Chai and Draxler's arguments (listed in their order of occurrence):
Argument 1: "The sensitivity of the RMSE to outliers is the most common concern with the use of this metric. In fact, the existence of outliers and their probability of occurrence is well described by the normal distribution underlying the use of the RMSE. Table 1 shows that with enough samples (n = 100), including those outliers, one can closely re-construct the error distribution."
Argument 2: "The MAE is suitable to describe uniformly distributed errors. Because model errors are likely to have a normal distribution rather than a uniform distribution, the RMSE is a better metric to present than the MAE for such a type of data."
Argument 3: "any single metric provides only one projection of the model errors and, therefore, only emphasizes a certain aspect of the error characteristics. A combination of metrics, including but certainly not limited to RMSEs and MAEs, are often required to assess model performance."
2. Regarding my warning against using multiple metrics:
The reviewer argues that in most applications, evidence to support one metric over the other is not easily attainable. This is not so. The law of likelihood states that the evidence for one metric versus another is simply the likelihood ratio. I've shown how to compute the likelihoods associated with MAE and RMSE (the Laplace and normal, respectively). The "evidence" is simply the ratio of the two, which is easily attainable. In practice, one must also adjust for differences in degrees of freedom (yielding the AIC), which is described in detail in Burnham and Anderson (B&A). I cited B&A, but I will add a statement to this effect.
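To make that concrete, here is a minimal sketch (synthetic residuals; the function names are my own, not from the manuscript) of how the two log-likelihoods, their AICs, and the resulting evidence ratio could be computed:

```python
import numpy as np

def gaussian_loglik(residuals):
    """Log-likelihood of residuals under a zero-mean normal whose scale is the RMSE (its MLE)."""
    n = len(residuals)
    sigma = np.sqrt(np.mean(residuals**2))          # RMSE
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * n

def laplace_loglik(residuals):
    """Log-likelihood of residuals under a zero-mean Laplace whose scale is the MAE (its MLE)."""
    n = len(residuals)
    b = np.mean(np.abs(residuals))                  # MAE
    return -n * np.log(2 * b) - n

# Hypothetical residuals from some model
rng = np.random.default_rng(1)
e = rng.normal(0, 1, 200)

# AIC = 2k - 2 log L; each error model fits one scale parameter, so k = 1
aic_normal = 2 - 2 * gaussian_loglik(e)
aic_laplace = 2 - 2 * laplace_loglik(e)

# Evidence ratio (relative likelihood) between the two error models
delta = abs(aic_normal - aic_laplace)
print(f"AIC (normal)  = {aic_normal:.1f}")
print(f"AIC (Laplace) = {aic_laplace:.1f}")
print(f"evidence ratio = {np.exp(-0.5 * delta):.3g}")
```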
As the likelihood ratio approaches unity, it is reasonable to consider multiple metrics weighted by their "evidence."
I agree with Argument 3 that each metric presents a different measure (transformation) of the error, but there is an infinite variety of such transformations. How do we arrive at the best one? Why not error to the fourth power, etc? The standard approach is to select several candidates based on prior knowledge of the system, then weight them by the evidence (B&A). I discuss both steps in the paper, though I neglect to describe how they are used in conjunction. I will briefly mention that.
Specific points
1. Abstract Lines 3-4:
The abstract is reductive, but abstracts have some license to be so. Chai and Draxler (2014) do state that neither metric is inherently better (Argument 3); however, they go on to list several arguments for why they prefer RMSE to MAE (Arguments 1, 2, and others). These two sides are never completely reconciled in their paper. Many of their arguments for favoring RMSE are flawed, and beyond Argument 3, they offer none of the theory underlying their claim that "neither metric is better," only a simulation, which they use to simultaneously claim RMSE is better, and neither is better, ultimately advocating that it's best to present both metrics, as well as others.
Consider Arguments 1 and 2. In Argument 1, they simulate a normal with a standard deviation of 1, then verify that the standard deviation of the result is 1. While this confirms the RMSE is appropriate for normals, it gives little explanation of why this is so. They go on to claim this as evidence that RMSE is robust to outliers, thereby suggesting that MAE, which is often preferred for its robustness, is superfluous. But the "outliers" in the normal distribution are not the "outliers" of robust statistics, which deals with fatter-tailed distributions or other deviations from the normal that are common in practice. Argument 2 is similarly unclear. It seems to say that MAE is suited only for uniform distributions, which are atypical, so RMSE is better, while simultaneously saying that RMSE is only better for normal distributions.
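As a rough numerical illustration of that distinction (a sketch with an assumed 5% contamination rate, not a result from either paper), RMSE and MAE respond very differently to fat-tailed contamination:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Pure standard-normal errors (the case Chai and Draxler simulate)
e_normal = rng.normal(0, 1, n)

# Contaminated normal: 5% of errors come from a much wider normal
# (assumed mixture, chosen only to illustrate fat tails)
wide = rng.random(n) < 0.05
e_contam = np.where(wide, rng.normal(0, 10, n), rng.normal(0, 1, n))

for label, e in [("normal", e_normal), ("contaminated", e_contam)]:
    print(f"{label:>12}: RMSE = {np.sqrt(np.mean(e**2)):.2f}, "
          f"MAE = {np.mean(np.abs(e)):.2f}")
# RMSE inflates much more than MAE under contamination.
```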
2. Lines 33-34 “That recent shift may explain why Willmott and Matsuura (2005) and Chai and Draxler (2014) were unaware of the historical justification for MAE and RMSE; neither were they the first to overlook it”:
The reviewer suggests I omit this opinion. I am willing to do so, but neither paper references the proofs, which would have negated this debate. Chai and Draxler were aware that RMSE assumes a normal error distribution, though they do not describe or reference why. However, they seem unaware of the classic proof for MAE; or else why would they associate MAE with the uniform case, a case which they later claim is irrelevant (Argument 2), rather than focusing on the Laplacian or the contaminated normal, for which MAE is optimal and superior, respectively?
My intent with this comment was to remind the reader that the RMSE-MAE debate has come up several times before. I do not wish to offend the other authors. Prestigious scientists, including R.A. Fisher, made a similar oversight, and I believe reminding readers of this fact is important context. I would also like to be clear that the intended target of my critique is not individual authors but a longstanding failure by the Earth-science community to fully integrate probability theory into its modeling practices. I'd prefer to make that claim directly, but I think it would draw additional criticism. In the interest of keeping this paper short and instructive, I defer that debate.
3. Equation 10: revised
4. Line 180: revised
5. Line 209 suggesting that Chai and Draxler argue that MAE only applies to uniform errors:
The reviewer is correct that Chai and Draxler don't explicitly make this claim, but I'm probably not alone in interpreting them this way (see Argument 2, for example), nor do they describe a case for which MAE would be better suited other than the uniform distribution, which they dismiss as not being useful. They mention the classic argument that MAE works better in the presence of outliers, but dismiss this as well. After claiming the sensitivity of RMSE is not a practical concern (Argument 1), they go on to say "in practice, it might be justifiable to throw out the outliers that are several orders larger than the other samples when calculating the RMSE." The first statement seems to contradict the second, which admits the concern is well warranted. Why do Chai and Draxler argue for throwing out data points, rather than acknowledging this as a case where MAE (or MAD) may be better? I cannot say, but it is another example of the dissonance within their paper.
Citation: https://doi.org/10.5194/gmd-2022-64-AC1
AC2: 'Reply on RC1 (Amendment)', Timothy Hodson, 15 Apr 2022
Lines 33-34: “That recent shift may explain why Willmott and Matsuura (2005) and Chai and Draxler (2014) were unaware of the historical justification for MAE and RMSE; neither were they the first to overlook it”:
On this point, I could say that "neither cites nor explains the historical justification for MAE and RMSE," rather than "they overlook it".
Citation: https://doi.org/10.5194/gmd-2022-64-AC2
RC2: 'Comment on gmd-2022-64', Anonymous Referee #2, 21 Apr 2022
Title: Root mean square error (RMSE) or mean absolute error (MAE): when to use them or not
Author(s): Timothy O. Hodson
MS No.: gmd-2022-64
MS type: Review and perspective paper
This is a generally interesting and nicely written paper. Nice to see a presentation that looks to the source material; these days, too many people cite their own work, their friends' work, and derivative work. It would be nice to see more authors placing their work in a more correct perspective.
I think it is useful to thoroughly explain reasoning as is done through much of the paper; there are places where more explanation could be provided. Some have been indicated in the detailed comments provided below.
I question the suggestion that there is a debate between Willmott’s papers and Chai and Draxler (2014). In my mind, a debate has some back and forth and here we only see a comment by Chai & Draxler, but nothing in response from Willmott. Perhaps an alternate word would be more precise.
While I found the paper interesting, I returned several times considering the question “Who is the audience?” I think that presenting in a broader fashion for a wider audience would widen its appeal. Probably it is just me being ‘old school’, but the use of “we” is sometimes confusing as it is not clear whether it refers to only the author or to the wider community. Similarly, there seems to be unnecessary use of value-laden words without sufficient support and discussion [e.g. better, best, simpler, and harder]. Many of these could be avoided, and the presentation would benefit from logical explanation and support rather than an implied author opinion.
What is the metric for? How does it need to be applied? How does this affect how my model can be used? Too often, the metrics are simply a checklist and little thought seems to be applied to questioning whether the model is fit for purpose. How does any metric help address “Do you get the right answer for the right reasons?”
Another area that needs mentioning is that models are fit to data; data that often is assumed to be without error.
One area that needs clarification is the application to models. The manuscript never clearly suggests the modelling framework intended; the key issue is often that the observations and model output are time series and the residuals are unlikely to be iid but strongly autocorrelated. This is particularly true since the hydrological examples are for rainfall-runoff modelling. But, the arguments regarding MAE and MSE apply to random sampling as well.
Detailed comments:
Line 15 “L1-norm and L2-norm” would be better to explain that L1-norm is Manhattan distance and L2-norm is Euclidean distance. Could explain more fully here.
Line 18 is it actually a ‘debate’?
Line 27. I like the “historical” presentation.
Line 31. Would be good to add a bit more guidance.
Line 37 insert after observations “and models”
Line 37 for choosing between MAE and RMSE
Line 38 “for the complex error …”
Line 39 I would choose a better word than “Occasionally” “Where more concrete examples were needed, the examples were drawn from hydrology, particularly rainfall-runoff modelling.”
Line 43 delete “In their debate,”
[but there are observations, models, and theory/understanding]
Line 59 delete “Despite its simplicity,”
There are several places in the text where value-laden words are used loosely.
Line 61 “simpler problem of deduction” “harder problem of induction” Not sure that these value words help as they take a stance that is not necessary.
Line 72 “under certain conditions” It would be better to explain those conditions.
Line 81. Replace “Subsequent sections will …” with something like “Ways of relaxing these assumptions will be introduced below.” [The subsequent text covers much more than what was indicated here.]
Line 90 Choose a better word than “trick”. Perhaps “operation”? There was a case in recent memory where emails from East Anglia referred to a ‘trick’ that the media seized on as evidence of subterfuge and dishonesty.
Line 125-127 seems awkward. The text regarding using likelihood functions to inform choices of model structure seems tangential.
Line 127 insert a period after “… 2001)”
Line 132 “often approximately log normal” and it would be good to specify that the underlying assumption is for only perennial streams.
Line 135 sentence requires citing a reference.
Also, here the converse is the real problem: interpreting the “results” without the transformation. While the units would be ‘correct’ the model assumptions would not be. It is also possible to transform, analyse, and retransform so the units are ‘correct’ but you face asymmetric confidence limits.
Line 139 “Student’s-t”
Line 144 use “normal distribution” rather than “normal condition”
Line 145 Perhaps the text regarding Tukey’s contributions in general is tangential?
Line 148 “Neither option is ideal” “Neither option is acceptable”?
Line 150 “Since Tukey’s work, some alternatives have emerged.” “better” seems to be a convenient opinion.
Line 156 “RMSE is inefficient (or more inefficient) for error…”
“Lacking an alternative, MAD is a popular choice.”
Line 164 “Log transforming”
Line 165 Not sure I would agree that this provides “reasonable” results. There are methods for dealing with the zero elements and non-zero elements separately.
Line 167 “… so the difference between 0.001 and 1 is the same as that between 1 and 1000.”
Line 172 “Log-likelihood is equivalent to the concept of entropy from information theory”. While true, it seems tangential to the main argument and distracts the reader.
Line 180 “… normal versus Laplace; Burnham and Anderson …”
Line 189 use “non-zero” instead of ‘positive’
Line 195 choose a word other than “naively” “without considerable thought”
Line 200 combining metrics is certainly not meaningful. Neither is a checklist of metrics without criteria or benchmarks. This is a place where the issue of “how can I use my model?” follows. What do the metrics tell you about the limits of model applicability?
Line 201 replace “best” with “most important with respect to model application.”
Citation: https://doi.org/10.5194/gmd-2022-64-RC2
AC3: 'Reply on RC2', Timothy Hodson, 26 Apr 2022
Thank you for taking the time to review my manuscript. Your comments were insightful, and I will address all of them except one that I didn't understand, which I note in my response.
Major Comments
--------------
RC2: I question the suggestion that there is a debate between Willmott’s papers and Chai and Draxler (2014). In my mind, a debate has some back and forth and here we only see a comment by Chai & Draxler, but nothing in response from Willmott. Perhaps an alternate word would be more precise.
AC: I will frame the debate as between RMSE and MAE, in which Chai and Willmott are a recent installment.
RC2: While I found the paper interesting, I returned several times considering the question “Who is the audience?” I think that presenting in a broader fashion for a wider audience would widen its appeal. Probably it is just me being ‘old school’ but the use of “we” is sometimes confusing as it is not specific that it is referring to only the author, or the wider community.
AC: My audience is readers of Willmott and Chai (>3000 citations each): Earth scientists with limited statistical training who use least squares (MSE) and least absolute deviations (MAE) extensively but have little-to-no awareness of formal likelihood methods. To paraphrase Burnham and Anderson's (2001) comparison of least squares to likelihood methods, "likelihood methods are much more general and far less taught."
I attempted to write a brief pedagogical paper that describes how the familiar terms like RMSE and MAE arise from likelihood theory, gives examples of likelihoods' greater generality, then refers the reader to the important textbooks on this topic. The discussion of MAD is somewhat tangential to likelihood methods, but robustness is a common point in the “debate” between RMSE and MAE, and MAD is relevant in that respect.
RC2: On the use of value-laden words [e.g. better, best, simpler, and harder]:
AC: Good point. I'll find better descriptors.
RC2: What is the metric for? How does it need to be applied? How does this affect how my model can be used? Too often, the metrics are simply a checklist and little thought seems to be applied to questioning whether the model is fit for purpose. How does any metric help address “Do you get the right answer for the right reasons?”
AC: The task is simple, in theory: choose the metric that will identify the most likely (“realistic”) model; for normal iid errors, minimizing the MSE yields the most likely model.
I agree with the reviewer that too often we blindly apply “checklists.” This paper introduces a more theory-based approach to evaluation, which has long been the standard in other fields like ecology and economics.
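For completeness, the standard derivation behind that statement (not new material, just the classical identity): with independent, identically distributed normal errors, the log-likelihood of a model θ is

```latex
\log \mathcal{L}(\theta, \sigma)
  = -\frac{n}{2}\,\log\!\left(2\pi\sigma^{2}\right)
    - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}(\theta)\right)^{2},
```

so for any fixed σ, the most likely θ is exactly the one that minimizes the sum of squared errors, i.e., the MSE.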
RC2: Another area that needs mentioning is that models are fit to data; data that often is assumed to be without error.
AC: Observational error is a common application of likelihood and Bayesian methods. I could mention some relevant texts, though I'm less familiar with the history.
RC2: One area that needs clarification is the application to models. The manuscript never clearly suggests the modeling framework intended; the key issue is often that the observations and model output are time series and the residuals are unlikely to be iid but strongly autocorrelated. This is particularly true since the hydrological examples are for rainfall-runoff modeling. But, the arguments regarding MAE and MSE apply to random sampling as well.
AC: I suppose the main framework would be “likelihood methods,” though not exclusively. All rational frameworks (Bayesian, significance testing, etc.) can be derived from probability theory. A point of the paper is to remind readers that MAE versus MSE is a false dichotomy, and whatever framework they choose, they should understand how it derives from probability theory. Autocorrelation is an important topic that could be addressed within the likelihood framework, but may be a bit advanced. Like Chai and Willmott, I sought to write a short readable paper but also to orient readers to the existing literature. I reference several papers that discuss autocorrelation in the context of rainfall-runoff modeling. I will think about a better general reference that I could cite.
Specific Comments
-----------------
RC2: Line 15 “L1-norm and L2-norm” would be better to explain that L1-norm is Manhattan distance and L2-norm is Euclidean distance. Could explain more fully here.
AC: I'll note that, but I'd prefer to omit the equations. The conceptual link is important, but the equations are tangential here, I think.
RC2: Line 18 is it actually a ‘debate’?
AC: Would ‘discussion’ or ‘discourse’ be better? I'm not sure. According to one source, a debate is "a formal discussion on a particular topic in a public meeting or legislative assembly, in which opposing arguments are put forward." I believe that definition is consistent with Willmott and Chai. I will also try to frame the ‘debate’ as the two-century-long debate, in which Willmott and Chai are a recent iteration.
RC2: Line 27. I like the “historical” presentation.
AC: Thank you.
RC2: Line 31. Would be good to add a bit more guidance.
AC: I'm uncertain what sort of guidance was intended. This line describes how many reference works give the proofs behind MSE and MAE but neglect to give a primary reference.
RC2: Line 37 insert after observations “and models”
AC: revised
RC2: Line 37 for choosing between MAE and RMSE
AC: revised
RC2: Line 38 “for the complex error …”
AC: revised
RC2: Line 39 I would choose a better word than “Occasionally” “Where more concrete examples were needed, the examples were drawn from hydrology, particularly rainfall-runoff modeling.”
AC: revised
RC2: Line 43 delete “In their debate,”
AC: revised
RC2: Line 59 delete “Despite its simplicity,”
AC: revising as "This simple equation provides..."
RC2: There are several places in the text where value-laden words are used loosely.
RC2: Line 61 “simpler problem of deduction” “harder problem of induction” Not sure that these value words help as they take a stance that is not necessary.
AC: I believe this is an accurate summary of the frequentist argument. I try not to take a strong stance on this subject, but I would say I'm Bayesian in theory but sometimes frequentist in practice.
RC2: Line 72 “under certain conditions” It would be better to explain those conditions.
AC: The next sentence transitions into a more detailed discussion, which states these conditions.
RC2: Line 81. Replace “Subsequent sections will …” with something like “Ways of relaxing these assumptions will be introduced below.” [The subsequent text covers much more than what was indicated here.]
AC: revised
RC2: Line 90 Choose a better word than “trick”. Perhaps “operation”? There was a case in recent memory where emails from East Anglia referred to a ‘trick’ that the media seized on as evidence of subterfuge and dishonesty.
AC: Good point. This is a "convenient practice."
RC2: Line 125-127 seems awkward. The text regarding using likelihood functions to inform choices of model structure seems tangential.
AC: revised
RC2: Line 127 insert a period after “… 2001)”
AC: revised
RC2: Line 132 “often approximately log normal” and it would be good to specify that the underlying assumption is for only perennial streams.
AC: revised
RC2: Line 135 sentence requires citing a reference.
AC: I've found several papers that claim this, but none of them are primary. I think the point is that we're comfortable thinking in linear or proportional scales, but it's more difficult to reason about other nonlinear scales.
RC2: Also, here the converse is the real problem: interpreting the “results” without the transformation. While the units would be ‘correct’ the model assumptions would not be. It is also possible to transform, analyze, and retransform so the units are ‘correct’ but you face asymmetric confidence limits.
AC: The confidence limits are symmetric in geometric space; the error is multiplicative rather than additive. See Limpert et al. (2001): Log-normal distributions across the sciences.
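A small numerical sketch of that point (the log-scale mean and standard deviation are assumed values, chosen only for illustration): an interval that is symmetric on the log scale back-transforms to limits that are asymmetric in the original units but symmetric as multiplicative factors:

```python
import numpy as np

# Assumed log-scale residual statistics, for illustration only
mu, sigma = 0.0, 0.5

# 95% interval: symmetric in log space ...
lo_log, hi_log = mu - 1.96 * sigma, mu + 1.96 * sigma

# ... but asymmetric (multiplicative) after back-transforming to original units
lo, hi = np.exp(lo_log), np.exp(hi_log)

print(f"log space:        ({lo_log:+.2f}, {hi_log:+.2f})")  # symmetric about 0
print(f"back-transformed: ({lo:.2f}, {hi:.2f})")            # ~ (0.38, 2.66): times/divide ~2.7 about 1
```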
RC2: Line 139 “Student’s-t”
AC: revised
RC2: Line 144 use “normal distribution” rather than “normal condition”
AC: revised
RC2: Line 145 Perhaps the text regarding Tukey’s contributions in general is tangential?
AC: True, Tukey was seminal but also intermediary. I did include several historical tangents lest we repeat them.
Many people use MAE as a robust metric because of Tukey's work, including Willmott, so I wanted to place Tukey in context. However, the primary purpose of the paper is pedagogical, so I'd consider omitting this comment if it is distracting.
RC2: Line 148 “Neither option is ideal” “Neither option is acceptable”?
AC: I prefer ideal (or optimal). I want to advocate that these are better practices, but I don't want to go so far as to claim that all others are unacceptable, though some arguably are (by “better”, I mean more efficient, more accurate, etc.).
RC2: Line 150 “Since Tukey’s work, some alternatives have emerged.” “better” seems to be a convenient opinion.
AC: Better in respect to the preceding statement "their performance degrades as deviation grows." Will revise as "more robust"
RC2: Line 156 “RMSE is inefficient (or more inefficient) for error…”
AC: revised
RC2: “Lacking an alternative, MAD is a popular choice.”
AC: revised
RC2: Line 164 “Log transforming”
AC: revised
RC2: Line 165 Not sure I would agree that this provides “reasonable” results. There are methods for dealing with the zero elements and non-zero elements separately.
AC: Those methods are discussed later, though you're right that “reasonable” was a poor choice as it implies they are “logical” or “rational.” Describing these methods as “practicable” or “sometimes satisfactory” would be better.
RC2: Line 167 “… so the difference between 0.001 and 1 is the same as that between 1 and 1000.”
AC: revised
RC2: Line 172 “Log-likelihood is equivalent to the concept of entropy from information theory”. While true, it seems tangential to the main argument and distracts the reader.
AC: Will omit.
RC2: Line 180 “… normal versus Laplace; Burnham and Anderson …”
AC: revised
RC2: Line 189 use “non-zero” instead of ‘positive’
AC: This sentence is describing the zero-inflated lognormal, so it is strictly positive. True, streamflow can take negative values, so lognormal isn't ideal, but that is an open problem. My intent is only to demonstrate how formal likelihoods offer additional flexibility beyond MSE, which assumes errors are normal and equal variance.
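For illustration, a minimal sketch of such a likelihood (the function and parameter names are my own, not from the manuscript): a zero-inflated lognormal can be written as a point mass at zero plus a lognormal density for the non-zero values:

```python
import numpy as np

def zi_lognormal_loglik(y, p_zero, mu, sigma):
    """Log-likelihood of a zero-inflated lognormal: a point mass at zero with
    probability p_zero and a lognormal(mu, sigma) density for the non-zero values."""
    y = np.asarray(y, dtype=float)
    zeros = y == 0
    ll = np.sum(zeros) * np.log(p_zero)                 # contribution of the zeros
    yz = y[~zeros]                                       # strictly positive values
    ll += np.sum(np.log(1 - p_zero)
                 - np.log(yz * sigma * np.sqrt(2 * np.pi))
                 - (np.log(yz) - mu) ** 2 / (2 * sigma ** 2))
    return ll

# Synthetic intermittent "streamflow" record with several zero values
flows = [0.0, 0.0, 1.2, 3.4, 0.0, 0.8, 2.1]
print(zi_lognormal_loglik(flows, p_zero=3 / 7, mu=0.3, sigma=0.8))
```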
RC2: Line 195 choose a word other than “naively” “without considerable thought”
AC: revised
RC2: Line 200 combining metrics is certainly not meaningful. Neither is a checklist of metrics without criteria or benchmarks. This is a place where the issue of “how can I use my model?” follows. What do the metrics tell you about the limits of model applicability?
AC: I'll add some brief guidance here. I focused on methods for determining the most likely model. Assessing confidence or applicability is a related topic (via probability theory), but I hadn't planned on addressing it in this paper.
RC2: Line 201 replace “best” with “most important with respect to model application.”
AC: Will simply say "while exposing readers to several alternatives for when they fail"
Citation: https://doi.org/10.5194/gmd-2022-64-AC3