Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not

Hodson, Timothy O.

doi:https://doi.org/10.5194/gmd-15-5481-2022

Articles | Volume 15, issue 14

© Author(s) 2022. This work is distributed under
the Creative Commons Attribution 4.0 License.

Review and perspective paper | Highlight paper | 19 Jul 2022

Final revised paper (published on 19 Jul 2022)
Preprint (discussion started on 11 Mar 2022)

Interactive discussion

Status: closed

RC1:
'Comment on gmd-2022-64', Anonymous Referee #1, 08 Apr 2022

Review comments for “Root mean square error (RMSE) or mean absolute error (MAE): when to use them or not”

General:

The manuscript provides an interesting discussion on the choice of RMSE and MAE from the likelihood perspective. When neither RMSE nor MAE is optimal, the author lists several options: refining the model, transforming the data, using robust statistics, and constructing a better likelihood. However, the suggested options are not quite helpful for users of the standard statistical metrics. For instance, the likelihood-based inference is deemed the most versatile by the author, but most readers are not expected to explore this option in their future applications. If possible, providing some examples with illustrations will be very helpful.

Overall the paper is well written and quite informative, although there are some questionable remarks and inaccurate statements as detailed below.

Major points:

The abstract mostly states the motivation of the paper and a not well-substantiated opinion. The author might need to rewrite the abstract to better reflect the actual content of this paper.

In Section 6 of the paper, “Why not use both RMSE and MAE?”, the author argues against using both RMSE and MAE. It is stated that presenting both metrics is “unnecessary and potentially confusing” “If the evidence strongly supports one over the other”. In most applications, evidence that strongly supports one metric over the other is not easily attainable. While decomposing one metric into several independent components, as suggested by the author, is viable, using multiple metrics is still a practical way to avoid mistaken conclusions caused by relying on a single metric.

Specific points:

Abstract Lines 3-4, “Some of this confusion arises from a recent debate between Willmott and Matsuura (2005) and Chai and Draxler (2014), in which either side presents their arguments for one metric over the other. Neither:

While Chai and Draxler (2014) argued against favoring MAE over RMSE by Willmott and Matsuura (2005), they did not favor RMSE over MAE. That is clearly stated in the abstract, as quoted below.

"The RMSE is more appropriate to represent model performance than the MAE when the error distribution is expected to be Gaussian. In addition, we show that the RMSE satisfies the triangle inequality requirement for a distance metric, whereas Willmott et al. (2009) indicated that the sums-of-squares-based statistics do not satisfy this rule. In the end, we discussed some circumstances where using the RMSE will be more beneficial. However, we do not contend that the RMSE is superior over the MAE. Instead, a combination of metrics, including but certainly not limited to RMSEs and MAEs, are often required to assess model performance."

Lines 33-34: “That recent shift may explain why Willmott and Matsuura (2005) and Chai and Draxler (2014) were unaware of the historical justification for MAE and RMSE; neither were they the first to overlook it”:

The author might be entitled to have such a judgement of others’ unawareness or overlook, but it is better to avoid such opinions in a scientific paper.

Equation 10: A closing parenthesis is missing.

Line 180: Please add a comma after “Laplace”.

Line 209. “… though wrongly suggest that MAE only applies to uniformly distributed errors”:

It is not accurate. Although Chai and Draxler (2014) gave one example with “uniformly distributed errors” to show that the MAE would be a good metric for such cases, it was not suggested that they are the ONLY cases where MAE would be appropriate. This statement is a misinterpretation of the paper.

Citation: https://doi.org/10.5194/gmd-2022-64-RC1
- AC1: 'Reply on RC1', Timothy Hodson, 12 Apr 2022
  
  Thank you for taking the time to review my manuscript. You make several good points that have helped me to clarify and strengthen my arguments. I don't agree with some of them, but they are all important to consider and illustrate the confusion left by earlier papers on this topic.
  
  Major points
  1. Regarding the abstract:
  
  Yes, the abstract is reductive, particularly in the case of Chai and Draxler, but their paper is somewhat inconsistent in its arguments. They do state clearly their belief that neither metric is inherently better but offer little explanation for why and, instead, list several reasons for preferring RMSE to MAE. For the abstract, I would be willing to frame Chai and Draxler in this manner, rather than as favoring RMSE.
  Chai and Draxler (2014) state their objective as "to clarify the interpretation of the RMSE and the MAE." I agree that this is an incredibly important topic, but I also believe their paper has important flaws. Rather than providing a point-by-point rebuttal to their work, my paper focuses on the classic proofs for why and when RMSE and MAE work. Besides settling some aspects of the debate, these proofs prepare the reader to understand how formal likelihoods can address the limitations of RMSE and MAE.
  In responding to RC1, I will present some of that point-by-point rebuttal, focusing on three of Chai and Draxler's arguments (listed in their order of occurrence):
  Argument 1: "The sensitivity of the RMSE to outliers is the most common concern with the use of this metric. In fact, the existence of outliers and their probability of occurrence is well described by the normal distribution underlying the use of the RMSE. Table 1 shows that with enough samples (n = 100), including those outliers, one can closely re-construct the error distribution."
  Argument 2: "The MAE is suitable to describe uniformly distributed errors. Because model errors are likely to have a normal distribution rather than a uniform distribution, the RMSE is a better metric to present than the MAE for such a type of data."
  Argument 3: "any single metric provides only one projection of the model errors and, therefore, only emphasizes a certain aspect of the error characteristics. A combination of metrics, including but certainly not limited to RMSEs and MAEs, are often required to assess model performance."
  2. Regarding my warning against using multiple metrics:
  The reviewer argues that in most applications, evidence to support one metric over the other is not easily attainable. This is not so. The law of likelihood states the evidence for one metric versus another is simply the likelihood ratio. I've shown how to compute the likelihoods associated with MAE and RMSE (the Laplace and normal, respectively). The "evidence" is simply the ratio of the two, which is easily attainable. In practice, one must also adjust for differences in degrees of freedom (yielding the AIC), which is described in detail in Burnham and Anderson (B&A). I cited B&A, but I will add a statement to this effect.
  
  As the likelihood ratio approaches unity, it is reasonable to consider multiple metrics weighted by their "evidence."
  
  I agree with Argument 3 that each metric presents a different measure (transformation) of the error, but there is an infinite variety of such transformations. How do we arrive at the best one? Why not error to the fourth power, etc? The standard approach is to select several candidates based on prior knowledge of the system, then weight them by the evidence (B&A). I discuss both steps in the paper, though I neglect to describe how they are used in conjunction. I will briefly mention that.
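The comparison described in this reply can be sketched in a few lines. This is a minimal illustration, not code from the manuscript: the function names, the synthetic residuals, the seed and the choice of one free scale parameter per model (`k=1`) are all assumptions. It uses the fact that the maximum-likelihood scale of a zero-centered normal is the RMSE and that of a zero-centered Laplace is the MAE.

```python
import numpy as np


def normal_loglik(residuals):
    """Maximized log-likelihood of zero-mean normal errors (scale set by RMSE)."""
    e = np.asarray(residuals, dtype=float)
    n = e.size
    sigma2 = np.mean(e**2)  # MLE of the variance, i.e. RMSE squared
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)


def laplace_loglik(residuals):
    """Maximized log-likelihood of zero-centered Laplace errors (scale set by MAE)."""
    e = np.asarray(residuals, dtype=float)
    n = e.size
    b = np.mean(np.abs(e))  # MLE of the Laplace scale, i.e. MAE
    return -n * (np.log(2 * b) + 1)


def aic(loglik, k=1):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * loglik


# Synthetic residuals (assumed for illustration only).
rng = np.random.default_rng(0)
e_normal = rng.normal(size=5000)
e_laplace = rng.laplace(size=5000)

for name, e in [("normal residuals", e_normal), ("Laplace residuals", e_laplace)]:
    # Log of the likelihood ratio: positive values favor the normal (RMSE) model.
    log_ratio = normal_loglik(e) - laplace_loglik(e)
    delta_aic = aic(laplace_loglik(e)) - aic(normal_loglik(e))
    print(f"{name}: log likelihood ratio = {log_ratio:.1f}, delta AIC = {delta_aic:.1f}")
```

Because both candidates have the same number of parameters here, the AIC difference is just twice the log likelihood ratio; the adjustment matters only when the candidates differ in complexity.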
  
  Specific points
  1. Abstract Lines 3-4:
  
  The abstract is reductive, but abstracts have some license to be so. Chai and Draxler (2014) do state that neither metric is inherently better (Argument 3); however, they go on to list several arguments for why they prefer RMSE to MAE (Arguments 1, 2, and others). These two sides are never completely reconciled in their paper. Many of their arguments for favoring RMSE are flawed, and beyond Argument 3, they offer none of the theory underlying their claim that "neither metric is better," only a simulation, which they use to claim simultaneously that RMSE is better and that neither is better, ultimately advocating that it's best to present both metrics, as well as others.
  Consider Arguments 1 and 2. In Argument 1, they simulate a normal with a standard deviation of 1, then verify that the standard deviation of the result is 1. While this confirms the RMSE is appropriate for normals, it gives little explanation of why this is so. They go on to claim this as evidence that RMSE is robust to outliers, thereby suggesting that MAE, which is often preferred for its robustness, is superfluous. But the "outliers" in the normal distribution are not the "outliers" of robust statistics, which deals with fatter-tailed distributions or other deviations from the normal that are common in practice. Argument 2 is similarly unclear. It seems to say that MAE is suited only for uniform distributions, which are atypical, so RMSE is better, while simultaneously saying that RMSE is only better for normal distributions.
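The difference between the two kinds of "outliers" can be illustrated with a small simulation. The sketch below is illustrative only: the 5% contamination fraction, the tenfold scale of the contaminating component, the seed and the helper names are assumptions, not values taken from Chai and Draxler (2014) or from the manuscript.

```python
import numpy as np


def rmse(errors):
    """Root-mean-square error of a residual vector."""
    return float(np.sqrt(np.mean(np.square(errors))))


def mae(errors):
    """Mean absolute error of a residual vector."""
    return float(np.mean(np.abs(errors)))


rng = np.random.default_rng(42)
n = 100_000

# Pure normal errors with standard deviation 1.
clean = rng.normal(0.0, 1.0, size=n)

# Contaminated normal: a small fraction of errors is drawn from a much
# wider normal, giving fatter tails than the pure normal.
is_outlier = rng.random(n) < 0.05
contaminated = np.where(is_outlier, rng.normal(0.0, 10.0, size=n), clean)

print("clean:        RMSE =", round(rmse(clean), 3), " MAE =", round(mae(clean), 3))
print("contaminated: RMSE =", round(rmse(contaminated), 3), " MAE =", round(mae(contaminated), 3))
print("RMSE inflation:", round(rmse(contaminated) / rmse(clean), 2))
print("MAE inflation: ", round(mae(contaminated) / mae(clean), 2))
```

For the pure normal, RMSE recovers the standard deviation and MAE is close to sqrt(2/pi) times that value, so both metrics describe the same distribution. Under contamination, RMSE is inflated by a larger factor than MAE, which is the sense in which MAE is called more robust.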
  2. Lines 33-34 “That recent shift may explain why Willmott and Matsuura (2005) and Chai and Draxler (2014) were unaware of the historical justification for MAE and RMSE; neither were they the first to overlook it”:
  The reviewer suggests I omit this opinion. I am willing to do so, but neither paper references the proofs, which would have negated this debate. Chai and Draxler were aware that RMSE assumes a normal error distribution, though they do not describe or reference why. However, they seem unaware of the classic proof for MAE, or else why would they associate MAE with the uniform case, a case which they later claim is irrelevant (Argument 2), rather than focusing on the Laplacian or the contaminated normal, for which MAE is optimal and superior, respectively?
  My intent with this comment was to remind the reader that the RMSE-MAE debate has come up several times before. I do not wish to offend the other authors. Prestigious scientists, including R.A. Fisher, made a similar oversight, and I believe reminding readers of this fact is important context. I would also like to be clear that the intended target of my critique is not individual authors but a longstanding failure by the Earth-science community to fully integrate probability theory into its modeling practices. I'd prefer to make that claim directly, but I think it would draw additional criticism. In the interest of keeping this paper short and instructive, I defer that debate.
  3. Equation 10: revised
  4. Line 180: revised
  5. Line 209 suggesting that Chai and Draxler argue that MAE only applies to uniform errors:
  
  The reviewer is correct that Chai and Draxler don't explicitly make this claim, but I'm probably not alone in interpreting them this way (see Argument 2, for example); nor do they describe a case for which MAE would be better suited other than the uniform distribution, which they dismiss as not being useful. They mention the classic argument that MAE works better in the presence of outliers, but dismiss this as well. After claiming the sensitivity of RMSE is not a practical concern (Argument 1), they go on to say "in practice, it might be justifiable to throw out the outliers that are several orders larger than the other samples when calculating the RMSE." The first statement seems to contradict the second, which admits the concern is well warranted. Why do Chai and Draxler argue for throwing out data points, rather than acknowledging this as a case where MAE (or MAD) may be better? I cannot say, but it is another example of the dissonance within their paper.
  
  Citation: https://doi.org/10.5194/gmd-2022-64-AC1
- AC2: 'Reply on RC1 (Amendment)', Timothy Hodson, 15 Apr 2022
  
  Lines 33-34: “That recent shift may explain why Willmott and Matsuura (2005) and Chai and Draxler (2014) were unaware of the historical justification for MAE and RMSE; neither were they the first to overlook it”:
  On this point, I could say that "neither cite nor explain the historical justification for MAE and RMSE," rather than "they overlook it."
  
  Citation: https://doi.org/10.5194/gmd-2022-64-AC2
RC2:
'Comment on gmd-2022-64', Anonymous Referee #2, 21 Apr 2022

Title: Root mean square error (RMSE) or mean absolute error (MAE): when to use them or not

Author(s): Timothy O. Hodson

MS No.: gmd-2022-64

MS type: Review and perspective paper

This is a generally interesting and nicely written paper. Nice to see a presentation that looks to the source material; these days, too many people cite their own, their friends, and derivative work. Would be nice to see more authors placing their work in a more correct perspective.

I think it is useful to thoroughly explain reasoning as is done through much of the paper; there are places where more explanation could be provided. Some have been indicated in the detailed comments provided below.

I question the suggestion that there is a debate between Willmott’s papers and Chai and Draxler (2014). In my mind, a debate has some back and forth and here we only see a comment by Chai & Draxler, but nothing in response from Willmott. Perhaps an alternate word would be more precise.

While I found the paper interesting, I returned several times considering the question “Who is the audience?” I think that presenting in a broader fashion for a wider audience would widen its appeal. Probably it is just me being ‘old school’ but the use of “we” is sometimes confusing as it is not specific that it is referring to only the author, or the wider community. Similarly, there seems to be unnecessary use of value laden words without sufficient support and discussion [e.g. better, best, simpler, and harder]. Many of these could be avoided and the presentation would benefit from logical explanation and support rather than an implied author opinion.

What is the metric for? How does it need to be applied? How does this affect how my model can be used? Too often, the metrics are simply a checklist and little thought seems to be applied to questioning whether the model is fit for purpose. How does any metric help address “Do you get the right answer for the right reasons?”

Another area that needs mentioning is that models are fit to data; data that often is assumed to be without error.

One area that needs clarification is the application to models. The manuscript never clearly suggests the modelling framework intended; the key issue is often that the observations and model output are time series and the residuals are unlikely to be iid but strongly autocorrelated. This is particularly true since the hydrological examples are for rainfall-runoff modelling. But, the arguments regarding MAE and MSE apply to random sampling as well.

Detailed comments:

Line 15 “L1-norm and L2-norm” would be better to explain that L1-norm is Manhattan distance and L2-norm is Euclidean distance. Could explain more fully here.

Line 18 is it actually a ‘debate’?

Line 27. I like the “historical” presentation.

Line 31. Would be good to add a bit more guidance.

Line 37 insert after observations “and models”

Line 37 for choosing between MAE and RMSE

Line 38 “for the complex error …”

Line 39 I would choose a better word than “Occasionally” “Where more concrete examples were needed, the examples were drawn from hydrology, particularly rainfall-runoff modelling.”

Line 43 delete “In their debate,”

[[ but there are observations, models, and theory/understanding

Line 59 delete “Despite its simplicity,”

There are several places in the text where value laden words are used loosely.

Line 61 “simpler problem of deduction” “harder problem of induction” Not sure that these value words help as they take a stance that is not necessary.

Line 72 “under certain conditions” It would be better to explain those conditions.

Line 81. Replace “Subsequent sections will …” with something like “Ways of relaxing these assumptions will be introduced below.” [The subsequent text covers much more than what was indicated here.]

Line 90 Choose a better word than “trick”. Perhaps “operation”? There was a case in recent memory where emails from East Anglia referred to a “trick” that the media seized on as evidence of subterfuge and dishonesty.

Line 125-127 seems awkward. The text regarding using likelihood functions to inform choices of model structure seems tangential.

Line 127 insert a period after “… 2001)”

Line 132 “often approximately log normal” and it would be good to specify that the underlying assumption is for only perennial streams.

Line 135 sentence requires citing a reference.

Also, here the converse is the real problem: interpreting the “results” without the transformation. While the units would be ‘correct’ the model assumptions would not be. It is also possible to transform, analyse, and retransform so the units are ‘correct’ but you face asymmetric confidence limits.

Line 139 “Student’s-t”

Line 144 use “normal distribution” rather than “normal condition”

Line 145 Perhaps the text regarding Tukey’s contributions in general is tangential?

Line 148 “Neither option is ideal” “Neither option is acceptable”?

Line 150 “Since Tukey’s work, some alternatives have emerged.” “better” seems to be a convenient opinion.

Line 156 “RMSE is inefficient (or more inefficient) for error…”

“Lacking an alternative, MAD is a popular choice.”

Line 164 “Log transforming”

Line 165 Not sure I would agree that this provides “reasonable” results. There are methods for dealing with the zero elements and non-zero elements separately.

Line 167 “… so the difference between 0.001 and 1 is the same as that between 1 and 1000.”

Line 172 “Log-likelihood is equivalent to the concept of entropy from information theory”. While true, it seems tangential to the main argument and distracts the reader.

Line 180 “… normal versus Laplace; Burnham and Anderson …”

Line 189 use “non-zero” instead of ‘positive’

Line 195 choose a word other than “naively” “without considerable thought”

Line 200 combining metrics is certainly not meaningful. Neither is a checklist of metrics without criteria or benchmarks. This is a place where the issue of “how can I use my model?” follows. What do the metrics tell you about the limits of model applicability?

Line 201 replace “best” with “most important with respect to model application.”

Citation: https://doi.org/10.5194/gmd-2022-64-RC2
- AC3: 'Reply on RC2', Timothy Hodson, 26 Apr 2022
  
  Thank you for taking the time to review my manuscript. Your comments were insightful and I will address all of them, except one that I didn't understand, which I note in my response.
  
  Major Comments
  --------------
  RC2: I question the suggestion that there is a debate between Willmott’s papers and Chai and Draxler (2014). In my mind, a debate has some back and forth and here we only see a comment by Chai & Draxler, but nothing in response from Willmott. Perhaps an alternate word would be more precise.
  AC: I will frame the debate as between RMSE and MAE, in which Chai and Willmott are a recent installment.
  
  RC2: While I found the paper interesting, I returned several times considering the question “Who is the audience?” I think that presenting in a broader fashion for a wider audience would widen its appeal. Probably it is just me being ‘old school’ but the use of “we” is sometimes confusing as it is not specific that it is referring to only the author, or the wider community.
  AC: My audience is readers of Willmott and Chai (>3000 citations each): Earth scientists with limited statistical training who use least squares (MSE) and least absolute deviations (MAE) extensively but have little-to-no awareness of formal likelihood methods. To paraphrase Burnham and Anderson's (2001) comparison of least squares to likelihood methods, "likelihood methods are much more general and far less taught."
  
  I attempted to write a brief pedagogical paper that describes how the familiar terms like RMSE and MAE arise from likelihood theory, gives examples of likelihoods' greater generality, then refers the reader to the important textbooks on this topic. The discussion of MAD is somewhat tangential to likelihood methods, but robustness is a common point in the “debate” between RMSE and MAE, and MAD is relevant in that respect.
  
  RC2: On the use of value laden words [e.g. better, best, simpler, and harder]:
  AC: Good point. I'll find better descriptors.
  
  RC2: What is the metric for? How does it need to be applied? How does this affect how my model can be used? Too often, the metrics are simply a checklist and little thought seems to be applied to questioning whether the model is fit for purpose. How does any metric help address “Do you get the right answer for the right reasons?”
  AC: The task is simple, in theory: choose the metric that will identify the most likely (“realistic”) model; for normal iid errors, minimizing the MSE yields the most likely model.
  
  I agree with the reviewer that too often we blindly apply “checklists.” This paper introduces a more theory-based approach to evaluation, which has long been the standard in other fields like ecology and economics.
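The claim that minimizing the MSE yields the most likely model under normal iid errors follows in two lines. The notation here is assumed for illustration: $y_i$ are observations, $\hat{y}_i(\theta)$ are model predictions with parameters $\theta$, and $\sigma$ is the error standard deviation.

```latex
\begin{align}
L(\theta, \sigma)
  &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}}
     \exp\!\left( -\frac{\bigl(y_i - \hat{y}_i(\theta)\bigr)^{2}}{2\sigma^{2}} \right), \\
\log L(\theta, \sigma)
  &= -\frac{n}{2}\log\bigl(2\pi\sigma^{2}\bigr)
     - \frac{n}{2\sigma^{2}}\,\mathrm{MSE}(\theta),
\qquad
\mathrm{MSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i(\theta)\bigr)^{2}.
\end{align}
```

For any fixed $\sigma$, the log-likelihood is a decreasing function of $\mathrm{MSE}(\theta)$, so the $\theta$ that minimizes the MSE (and hence the RMSE) also maximizes the likelihood. Replacing the normal density with a Laplace density gives the same argument with the MAE in place of the MSE.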
  
  RC2: Another area that needs mentioning is that models are fit to data; data that often is assumed to be without error.
  AC: Observational error is a common application of likelihood and Bayesian methods. I could mention some relevant texts, though I'm less familiar with the history.
  
  RC2: One area that needs clarification is the application to models. The manuscript never clearly suggests the modeling framework intended; the key issue is often that the observations and model output are time series and the residuals are unlikely to be iid but strongly autocorrelated. This is particularly true since the hydrological examples are for rainfall-runoff modeling. But, the arguments regarding MAE and MSE apply to random sampling as well.
  AC: I suppose the main framework would be “likelihood methods,” though not exclusively. All rational frameworks (Bayesian, significance testing, etc.) can be derived from probability theory. A point of the paper is to remind readers that MAE versus MSE is a false dichotomy and that, whatever framework they choose, they should understand how it derives from probability theory. Autocorrelation is an important topic that could be addressed within the likelihood framework, but may be a bit advanced. Like Chai and Willmott, I sought to write a short readable paper but also to orient readers to the existing literature. I reference several papers that discuss autocorrelation in the context of rainfall-runoff modeling. I will think about a better general reference that I could cite.
  
  Specific Comments
  -----------------
  RC2: Line 15 “L1-norm and L2-norm” would be better to explain that L1-norm is Manhattan distance and L2-norm is Euclidean distance. Could explain more fully here.
  
  AC: I'll note that, but I'd prefer to omit the equations. The conceptual link is important, but the equations are tangential here, I think.
  RC2: Line 18 is it actually a ‘debate’?
  
  AC: Would ‘discussion’ or ‘discourse’ be better? I'm not sure. According to one source, a debate is "a formal discussion on a particular topic in a public meeting or legislative assembly, in which opposing arguments are put forward." I believe that definition is consistent with Willmott and Chai. I will also try to frame the ‘debate’ as the two-century-long debate, in which Willmott and Chai are a recent iteration.
  RC2: Line 27. I like the “historical” presentation.
  
  AC: Thank you.
  
  RC2: Line 31. Would be good to add a bit more guidance.
  
  AC: I'm uncertain what sort of guidance was intended. This line describes how many reference works give the proofs behind MSE and MAE but neglect to give a primary reference.
  RC2: Line 37 insert after observations “and models”
  
  AC: revised
  
  RC2: Line 37 for choosing between MAE and RMSE
  
  AC: revised
  RC2: Line 38 “for the complex error …”
  
  AC: revised
  RC2: Line 39 I would choose a better word than “Occasionally” “Where more concrete examples were needed, the examples were drawn from hydrology, particularly rainfall-runoff modeling.”
  
  AC: revised
  RC2: Line 43 delete “In their debate,”
  
  AC: revised
  RC2: Line 59 delete “Despite its simplicity,”
  
  AC: revising as "This simple equation provides..."
  RC2: There are several places in the text where value laden words are used loosely.
  RC2: Line 61 “simpler problem of deduction” “harder problem of induction” Not sure that these value words help as they take a stance that is not necessary.
  
  AC: I believe this is an accurate summary of the frequentist argument. I try not to take a strong stance on this subject, but I would say I'm Bayesian in theory but sometimes frequentist in practice.
  
  RC2: Line 72 “under certain conditions” It would be better to explain those conditions.
  
  AC: The next sentence transitions into a more detailed discussion, which states these conditions.
  RC2: Line 81. Replace “Subsequent sections will …” with something like “Ways of relaxing these assumptions will be introduced below.” [The subsequent text covers much more than what was indicated here.]
  
  AC: revised
  
  RC2: Line 90 Choose a better word than “trick”. Perhaps “operation”? There was a case in recent memory where emails from East Anglia referred to a “trick” that the media seized on as evidence of subterfuge and dishonesty.
  
  AC: Good point. This is a "convenient practice."
  
  RC2: Line 125-127 seems awkward. The text regarding using likelihood functions to inform choices of model structure seems tangential.
  
  AC: revised
  
  RC2: Line 127 insert a period after “… 2001)”
  
  AC: revised
  
  RC2: Line 132 “often approximately log normal” and it would be good to specify that the underlying assumption is for only perennial streams.
  
  AC: revised
  
  RC2: Line 135 sentence requires citing a reference.
  
  AC: I've found several papers that claim this, but none of them are primary. I think the point is that we're comfortable thinking in linear or proportional scales, but it's more difficult to reason about other nonlinear scales.
  RC2: Also, here the converse is the real problem: interpreting the “results” without the transformation. While the units would be ‘correct’ the model assumptions would not be. It is also possible to transform, analyze, and retransform so the units are ‘correct’ but you face asymmetric confidence limits.
  
  AC: The confidence limits are symmetric in geometric space; the error is multiplicative rather than additive. See Limpert et al. (2001): Log-normal distributions across the sciences.
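The symmetry in geometric space can be shown numerically. This is a minimal sketch under assumed values: the log-scale mean and standard deviation, the sample size, the seed and the use of a nominal 95% interval (`z = 1.96`) are chosen for illustration only. The interval describes the spread of the data, built on the log scale and then back-transformed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic log-normal data (parameters assumed for illustration).
x = rng.lognormal(mean=2.0, sigma=0.5, size=50_000)

# Work on the log scale, where the data are normal.
log_x = np.log(x)
mu = log_x.mean()
s = log_x.std(ddof=1)

z = 1.96                 # nominal 95% interval on the log scale
median = np.exp(mu)      # geometric mean, which estimates the median
factor = np.exp(z * s)   # multiplicative half-width

lower = median / factor
upper = median * factor
coverage = np.mean((x >= lower) & (x <= upper))

print("additive distances:      ", upper - median, median - lower)
print("multiplicative distances:", upper / median, median / lower)
print("empirical coverage:      ", coverage)
```

In original units the upper limit lies farther from the median than the lower limit, so the interval looks asymmetric; expressed as ratios, the two limits are the same factor away from the median, which is the multiplicative symmetry referred to above.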
  
  RC2: Line 139 “Student’s-t”
  
  AC: revised
  
  RC2: Line 144 use “normal distribution” rather than “normal condition”
  
  AC: revised
  
  RC2: Line 145 Perhaps the text regarding Tukey’s contributions in general is tangential?
  
  AC: True, Tukey was seminal but also intermediary. I did include several historical tangents lest we repeat them.
  
  Many people use MAE as a robust metric because of Tukey's work, including Willmott, so I wanted to place Tukey in context. However, the primary purpose of the paper is pedagogical, so I'd consider omitting this comment if it is distracting.
  
  RC2: Line 148 “Neither option is ideal” “Neither option is acceptable”?
  
  AC: I prefer ideal (or optimal). I want to advocate that these are better practices, but I don't want to go so far as to claim that all others are unacceptable, though some arguably are (by “better”, I mean more efficient, more accurate, etc.).
  
  RC2: Line 150 “Since Tukey’s work, some alternatives have emerged.” “better” seems to be a convenient opinion.
  
  AC: Better in respect to the preceding statement "their performance degrades as deviation grows." Will revise as "more robust"
  
  RC2: Line 156 “RMSE is inefficient (or more inefficient) for error…”
  
  AC: revised
  RC2: “Lacking an alternative, MAD is a popular choice.”
  
  AC: revised
  
  RC2: Line 164 “Log transforming”
  
  AC: revised
  
  RC2: Line 165 Not sure I would agree that this provides “reasonable” results. There are methods for dealing with the zero elements and non-zero elements separately.
  
  AC: Those methods are discussed later, though you're right that “reasonable” was a poor choice as it implies they are “logical” or “rational.” Describing these methods as “practicable” or “sometimes satisfactory” would be better.
  
  RC2: Line 167 “… so the difference between 0.001 and 1 is the same as that between 1 and 1000.”
  
  AC: revised
  
  RC2: Line 172 “Log-likelihood is equivalent to the concept of entropy from information theory”. While true, it seems tangential to the main argument and distracts the reader.
  
  AC: Will omit.
  
  RC2: Line 180 “… normal versus Laplace; Burnham and Anderson …”
  
  AC: revised
  
  RC2: Line 189 use “non-zero” instead of ‘positive’
  
  AC: This sentence is describing the zero-inflated lognormal, so it is strictly positive. True, streamflow can take negative values, so lognormal isn't ideal, but that is an open problem. My intent is only to demonstrate how formal likelihoods offer additional flexibility beyond MSE, which assumes errors are normal and equal variance.
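To make the added flexibility concrete, one common form of a zero-inflated lognormal log-likelihood can be written as below. This is a sketch under stated assumptions, not the formulation used in the manuscript (which is not reproduced in this discussion): the function name, the parameter names `p_zero`, `mu`, `sigma` and the example values are hypothetical.

```python
import numpy as np


def zero_inflated_lognormal_loglik(y, p_zero, mu, sigma):
    """Log-likelihood of non-negative data under a zero-inflated lognormal.

    With probability p_zero an observation is exactly zero; otherwise it is
    drawn from a lognormal with log-scale mean mu and standard deviation sigma.
    """
    y = np.asarray(y, dtype=float)
    if np.any(y < 0):
        raise ValueError("observations must be non-negative")
    if not 0 < p_zero < 1 or sigma <= 0:
        raise ValueError("require 0 < p_zero < 1 and sigma > 0")

    is_zero = y == 0
    log_pos = np.log(y[~is_zero])

    # Contribution of the exact zeros.
    ll_zero = is_zero.sum() * np.log(p_zero)

    # Contribution of the positive values: (1 - p_zero) times the lognormal density.
    ll_pos = np.sum(
        np.log1p(-p_zero)
        - log_pos
        - 0.5 * np.log(2 * np.pi * sigma**2)
        - (log_pos - mu) ** 2 / (2 * sigma**2)
    )
    return float(ll_zero + ll_pos)


# Hypothetical example: two zero values and two positive values.
flows = np.array([0.0, 0.0, 1.0, np.e])
print(zero_inflated_lognormal_loglik(flows, p_zero=0.5, mu=0.0, sigma=1.0))
```

Candidate models can then be compared by this log-likelihood in the same way that normal and Laplace likelihoods are compared, rather than by MSE alone.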
  
  RC2: Line 195 choose a word other than “naively” “without considerable thought”
  
  AC: revised
  
  RC2: Line 200 combining metrics is certainly not meaningful. Neither is a checklist of metrics without criteria or benchmarks. This is a place where the issue of “how can I use my model?” follows. What do the metrics tell you about the limits of model applicability?
  
  AC: I'll add some brief guidance here. I focused on methods for determining the most likely model. Assessing confidence or applicability is a related topic (via probability theory), but I hadn't planned on addressing it in this paper.
  
  RC2: Line 201 replace “best” with “most important with respect to model application.”
  
  AC: Will simply say "while exposing readers to several alternatives for when they fail"
  
  Citation: https://doi.org/10.5194/gmd-2022-64-AC3

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Timothy Hodson on behalf of the Authors (18 May 2022) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (18 May 2022) by Riccardo Farneti

RR by Anonymous Referee #2 (01 Jun 2022)

Suggestions for revision or reasons for rejection

T.O. Hodson
Root mean square error (RMSE) or mean absolute error (MAE): when to use them or not

Second review.
I remain concerned that the author uses ‘debate’ when there is no formal discussion. I think this is a typical written scientific discussion which I would term a discourse. It may be suitable to use debate in places where the history is the issue, as in the abstract and introduction, but not for the papers by Willmott and Chai & Draxler alone.

I am satisfied that the author has adequately addressed my comments on the first draft. While I appreciate that he is focused on minimizing errors as a criterion for choosing between models, I remain of the opinion that what the model will be used for needs to be a consideration and some comment in the discussion would be appropriate.

Detailed comments:
The quoted text at line 20-23 should be inside double quotation marks.

Line 24: try
The statement may have adequately characterized the application in geosciences but not in statistics.

Line 31
“… references to the primary literature.”

Line 33
“… otherwise, any inference will be biased.”

Line 41
“… arguments both for and against MAE …”

Line 42-44 Difficult long sentence.
Instead, my focus is on an important omission in these papers: neither explains the theoretical justification for MSE and MAE, which derives from the laws of probability, which themselves derive from the laws of logic (Jaynes, 2003); that is to say, there are logical reasons for choosing one over the other.

Neither of these two papers explains the theoretical justification for MSE and MAE. MSE and MAE are derived from the laws of probability, which themselves are derived from the laws of logic (Jaynes, 2003); thus, there are logical reasons for choosing one over the other, which is my focus here.

Line 71
“… is also the more likely, but …”

Line 84
It might help an uninformed reader if the use of Π was explained.

Line 126 ??
“these candidates (Burnham and Anderson, 2001, e.g.,).”

“… these candidates (e.g., Burnham and Anderson, 2001).”

Line 150 I am still of the opinion that this is not a debate, so suggest that:

“In their debate, Willmott et al. (2009) recognize robustness as an important advantage of MAE, though Chai and Draxler (2014) never directly acknowledge …”

“In this discourse, Willmott et al. (2009) recognize robustness as an important advantage of MAE, though Chai and Draxler (2014) never directly acknowledge …”

Line 225

Replace ‘debate’ with ‘discourse’

Line 226-228 suggest changing:
Whereas, Willmott and Matsuura (2005) and Willmott et al. (2009) were correct that MAE is more robust, though there are better alternatives.

To

Though Willmott and Matsuura (2005) and Willmott et al. (2009) were correct that MAE is more robust, there are better alternatives.

Line 229-230 I think that the final statement should be more affirmative:

Hopefully, this paper fills that gap by explaining why and when these metrics work, while exposing readers to several alternatives for when they fail.

This paper fills that gap by explaining why and when these metrics work, and exposes readers to several alternatives for when they don’t.

ED: Publish subject to minor revisions (review by editor) (03 Jun 2022) by Riccardo Farneti

AR by Timothy Hodson on behalf of the Authors (17 Jun 2022) Author's response Author's tracked changes

EF by Una Miškovic (21 Jun 2022) Manuscript

ED: Publish as is (22 Jun 2022) by Riccardo Farneti

ED: Publish as is (23 Jun 2022) by David Ham (Executive editor)

AR by Timothy Hodson on behalf of the Authors (23 Jun 2022)

Short summary

The task of evaluating competing models is fundamental to science. Models are evaluated based on an objective function, the choice of which ultimately influences what scientists learn from their observations. The mean absolute error (MAE) and root-mean-squared error (RMSE) are two such functions. Both are widely used, yet there remains enduring confusion over their use. This article reviews the theoretical justification behind their usage, as well as alternatives for when they are not suitable.