The root-mean-squared error (RMSE) and mean absolute error (MAE) are widely used metrics for evaluating models. Yet there remains enduring confusion over their use, so much so that a standard practice is to present both, leaving it to the reader to decide which is more relevant. In a recent reprise of the 200-year debate over their use,

The root-mean-squared error (RMSE) and mean absolute error (MAE) are two standard metrics used in model evaluation. For a sample of

In what have become two classic papers in the geoscientific modeling literature,

The RMSE has been used as a standard statistical metric to measure model performance in meteorology, air quality, and climate research studies. The MAE is another useful measure widely used in model evaluation. While they have both been used to assess model performance for many years, there is no consensus on the most appropriate metric for model errors.

The statement may have accurately characterized practice in the geosciences, but not in statistics. Among statisticians, the answer was common knowledge, at least to the extent that there can be no consensus: different types of models have different error distributions and thus necessitate different error metrics. In fact, the debate over squared versus absolute error terms had emerged, been forgotten, and re-emerged over the preceding 2 centuries

It is unclear exactly when this “no-solution solution” became common knowledge, in part because contemporary authors rarely cite their sources. While
reviewing the literature, I found proofs in several reference works, including the venerable

As this review will show, the choice of error metric should conform with the expected probability distribution of the errors; otherwise, any inference will be biased. The choice of error metric is, therefore, fundamental in determining what scientists learn from their observations and models. This paper reviews the basic justification for choosing between RMSE or MAE and discusses several alternatives better suited for the complex error distributions that are encountered in practice. The literature on this topic is vast, and I try to emphasize classic papers and textbooks from the statistical literature. To make that discussion more concrete, I include several examples from hydrology and rainfall–runoff modeling, though none of the techniques are exclusive to that field. The discussion is primarily written for Earth scientists who use RMSE or MAE but have little-to-no awareness of formal likelihood methods.

Like all inference problems, the justification begins with Bayes' theorem,

In the absence of any prior information, the prior distribution

This relation provides the basis for “frequentist” statistics, first recognized by

First, we treat the case of normally distributed (Gaussian) errors. Consider a normally distributed variable

To find the most likely model, we begin with the likelihood given by the normal distribution,

A convenient practice is to take the logarithm of the likelihood, thereby converting the products to sums
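
Though the derivation is developed in the text, the equivalence is easy to verify numerically. The sketch below (with made-up observations and a grid of constant candidate predictions, both assumptions of this example) confirms that the candidate maximizing the Gaussian log-likelihood is the same one minimizing the MSE:

```python
import math

# Made-up observations (an assumption of this example).
obs = [2.1, 1.9, 2.4, 2.0, 1.6, 2.3, 1.8, 2.2]
n = len(obs)
sigma = 1.0  # fixed error scale, for simplicity

def gauss_loglik(mu):
    """Log of the product of normal densities = sum of log densities."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in obs)

def mse(mu):
    return sum((x - mu)**2 for x in obs) / n

# Evaluate both criteria over a grid of candidate constant models.
grid = [i / 100 for i in range(150, 251)]
best_by_loglik = max(grid, key=gauss_loglik)
best_by_mse = min(grid, key=mse)

print(best_by_loglik, best_by_mse)  # both equal the grid point nearest the mean
```

The same holds off the grid: for Gaussian errors, both criteria are optimized by the sample mean.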

Now consider an exponentially distributed random variable, a concrete example being daily precipitation, whose distribution is often approximately exponential. If both model predictions and observations are

Assuming the Laplace distribution better represents the error than the normal, we should prefer the model maximizing the Laplacian likelihood
function,
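
As a numerical complement, the following sketch (again with made-up data, here including one outlier) verifies that the value maximizing the Laplacian likelihood is the one minimizing the MAE, namely the median:

```python
import math
import statistics

# Made-up observations with one gross outlier (an assumption).
obs = [1.0, 2.0, 3.0, 4.0, 100.0]
b = 1.0  # fixed Laplace scale parameter

def laplace_loglik(mu):
    """Sum of log Laplace densities: -log(2b) - |x - mu| / b per point."""
    return sum(-math.log(2 * b) - abs(x - mu) / b for x in obs)

def mae(mu):
    return sum(abs(x - mu) for x in obs) / len(obs)

grid = [i / 10 for i in range(0, 1001)]
best = max(grid, key=laplace_loglik)

# The MLE equals the median (3.0); the mean (22.0) is dragged by the outlier.
print(best, statistics.median(obs), statistics.mean(obs))
```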

To summarize the previous two sections: for normal errors, minimizing either MSE or RMSE yields the most likely model, whereas for Laplacian errors,
minimizing MAE yields the most likely model. Normally distributed variables tend to produce normally distributed errors, and exponentially distributed
variables tend to produce Laplacian-like errors, meaning that RMSE and MAE are reasonable first choices for each case, respectively. Technically both also
assume the errors are

The first option is to refine the structure of the model; in other words, make the model more physically realistic. While this option is the most
important for the advancement of science, it is not relevant to the choice of error metric, and thus I will not discuss it further, other than to note that
likelihoods can also be used to evaluate model structure: first, determine the maximum likelihood for each candidate model structure, then select (or
prefer) the most likely among these candidates

The second option is to transform the data to a Laplace or, more commonly, normal distribution and minimize the RMSE or MAE of the transformed
data to yield the most likely model. For example, the streamflow distribution of a perennial stream is approximately lognormal. Logging a lognormal variable yields a normal one, so in log space the error is the difference between two normally distributed variables, which will also tend to be normal. If the errors can be made normal by transformation, then minimizing the MSE of the transformed variable will yield the most likely model. Many statistical methods assume normality, and the standard family of transformations for converting a non-normal variable into a normal one is the Box–Cox transformation
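
To make the idea concrete, here is a minimal sketch of a one-parameter Box–Cox fit by grid search over the profile log-likelihood; the synthetic lognormal "streamflow" data and the grid of candidate exponents are assumptions of this example:

```python
import math
import random

random.seed(1)
# Synthetic lognormal "streamflow" (an assumption of this example).
flow = [math.exp(random.gauss(0.0, 1.0)) for _ in range(500)]

def boxcox(x, lam):
    """One-parameter Box-Cox transform; lam = 0 reduces to the log."""
    return math.log(x) if lam == 0 else (x**lam - 1) / lam

def profile_loglik(lam):
    """Normal log-likelihood of the transformed data, profiled over the
    mean and variance, plus the Jacobian term of the transformation."""
    y = [boxcox(x, lam) for x in flow]
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean)**2 for v in y) / n
    return -0.5 * n * math.log(var) + (lam - 1) * sum(math.log(x) for x in flow)

lams = [i / 20 for i in range(-20, 41)]  # candidate lambdas in [-1, 2]
best_lam = max(lams, key=profile_loglik)
print(best_lam)  # near 0 for lognormal data, i.e., the log transform
```

With truly lognormal data the fitted lambda sits near zero, recovering the log transform; in practice, library routines such as scipy.stats.boxcox perform this fit directly.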

The third option is to use “robust” methods of inference. The term “robust” signifies that a technique is less sensitive to violations of its assumptions; in practice, this typically means being less sensitive to extreme outliers. To achieve this, robust techniques replace the Gaussian likelihood with one having thicker tails, such as the Laplace or the Student's

While

In this discourse,

Since Tukey's work, more robust alternatives have emerged, including the median absolute deviation or MAD,

In addition to being “robust,” MAD and MAE also preserve scale, unlike the formal likelihood-based approach discussed next. Unless combined with a transformation, scale-preserving error metrics have the same units as the data, such that their magnitude roughly corresponds to the magnitude of the typical error. While MAD and MAE are easy to interpret and implement, they are somewhat limited in scope in that they are only appropriate for “contaminated” distributions – mixtures of normal or Laplace distributions with a common midpoint, which are also symmetric by implication.
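
A small deterministic example illustrates the robustness of MAD relative to the standard deviation; the data are made up, with a single gross outlier appended:

```python
import statistics

def mad(data):
    """Median absolute deviation: median of |x - median(x)|."""
    m = statistics.median(data)
    return statistics.median(abs(x - m) for x in data)

clean = [9.8, 9.9, 10.0, 10.1, 10.2]
contaminated = clean + [100.0]  # one gross outlier

# MAD barely moves (0.1 -> ~0.15); the standard deviation
# explodes (~0.14 -> ~33.5).
print(mad(clean), statistics.pstdev(clean))
print(mad(contaminated), statistics.pstdev(contaminated))
```

Note that this is the raw MAD; it is often rescaled by about 1.4826 so that it estimates the standard deviation of a normal distribution.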

More complicated error distributions are frequently encountered in practice. For example, errors in rainfall–runoff models are typically heteroscedastic. Log transforming the data can correct this for positive streamflow values, but the log is undefined when streamflow is zero or negative. Simple workarounds, such as setting zeros to a small positive value, may be satisfactory when zero and near-zero values are relatively rare but blow up as those values become more frequent. Recall that in log space errors are proportional, and thus the difference between 0.001 and 1 is the same as that between 1 and 1000.
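
A toy calculation (values invented for illustration) shows how the arbitrary replacement value, not the model, ends up determining the log-space error at a zero-flow observation:

```python
import math

observed = 0.0  # a zero-flow day
modeled = 1.0   # the model predicts a flow of 1 (units arbitrary)

# Workaround: replace the zero with a small positive value eps before logging.
# The resulting log10-space error depends entirely on the arbitrary eps.
errors = [math.log10(modeled) - math.log10(observed + eps)
          for eps in (1e-3, 1e-6, 1e-9)]
print(errors)  # roughly [3.0, 6.0, 9.0]: the error is set by our choice of eps
```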

The final option, likelihood-based inference, is the most versatile and subsumes the others, in that each can be incorporated within its framework. Its main drawback is one of interpretation: the absolute value of a likelihood is meaningless, unlike RMSE or MAE, which measure the typical error. Relative likelihoods are meaningful, however, in that the likelihood ratio represents the evidence for one model relative to another. For an accessible introduction to likelihood-based model selection, the reader is referred to

Metrics like RMSE and MAE are sometimes referred to as “informal” likelihoods because in certain circumstances they yield results equivalent to
those obtained by the “formal” likelihood

For rainfall–runoff modeling, examples of the formal likelihood-based approach include

RMSE and MAE are not independent, so how should we weigh their relative importance when evaluating a model if both are presented? Assuming no prior information, the logical approach is to weight them by their likelihoods. According to the law of likelihood, the evidence for one hypothesis versus another corresponds to the ratio of their likelihoods

If the evidence strongly supports one over the other, presenting both metrics is unnecessary and potentially confusing. If their evidence is similar,
it may be appropriate to present a weighted average or present both metrics along with their weights
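
As an illustration, the sketch below weights a normal and a Laplace error model by their likelihoods for a set of made-up residuals, with each scale parameter set to its maximum-likelihood estimate (all values here are assumptions of the example):

```python
import math

# Made-up model residuals with one heavy-tailed value (an assumption).
resid = [0.1, -0.2, 0.05, 0.3, -0.1, 2.5]
n = len(resid)

# Maximized log-likelihoods with each scale parameter at its MLE:
# normal: sigma^2 = mean squared residual; Laplace: b = mean |residual|.
sigma2 = sum(r * r for r in resid) / n
ll_normal = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

b = sum(abs(r) for r in resid) / n
ll_laplace = -n * (math.log(2 * b) + 1)

# Likelihood weights: normalized likelihood ratios.
m = max(ll_normal, ll_laplace)
unnorm = [math.exp(ll - m) for ll in (ll_normal, ll_laplace)]
total = sum(unnorm)
w = [v / total for v in unnorm]
print(w)  # here the Laplace model carries most of the weight
```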

Although the likelihood can provide an objective measure of model performance, we are often concerned with multiple facets of a model, such that any one performance metric is insufficient. A common solution is to define and compute several metrics, each chosen to characterize a different aspect of the model's performance. For example, in rainfall–runoff modeling a modeler may compute the error in flow volume (the model bias) and the errors at a range of flow quantiles.
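
For instance, a minimal sketch of such ancillary metrics, using hypothetical observed and simulated flows (all values invented for illustration), might look like:

```python
import statistics

# Hypothetical daily streamflow (units arbitrary; values made up).
obs = [0.8, 1.2, 3.5, 10.0, 2.2, 0.9, 5.5, 1.1, 0.7, 4.0]
sim = [1.0, 1.0, 3.0, 8.0, 2.5, 1.0, 6.0, 1.3, 0.9, 3.5]

# Bias: relative error in total flow volume.
bias = (sum(sim) - sum(obs)) / sum(obs)

# Errors at a few flow quantiles (here the quartiles of each series).
obs_q = statistics.quantiles(obs, n=4)
sim_q = statistics.quantiles(sim, n=4)
quantile_errors = [s - o for s, o in zip(sim_q, obs_q)]

print(f"bias: {bias:+.2%}")
print("quartile errors:", quantile_errors)
```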

When evaluating these metrics, there is a tendency to combine them into an overall score, but such scores are not inherently meaningful, at least in a
maximum-likelihood sense. A better, or at least safer, approach is to focus on a single objective function, e.g., MSE for normally distributed
errors. For the normal case, minimizing the MSE (or normal log likelihood) is optimal because it minimizes the information loss (as information and
negative log likelihood are equivalent). Typically we want to know more about a model than its general performance, however, such as how well it
performs at specific tasks. For that reason, we may choose to compute ancillary metrics or (more formally) decompose the likelihood into components
representing specific aspects of a model's performance

This review has focused primarily on how probability theory can answer the question “which model is better?”, thereby guiding the task of model selection. But that task amounts to asking “how accurate is my model?” of each candidate, comparing the answers, and selecting the most accurate. That first step – quantifying the uncertainty in a model – is important in its own right, especially if we base decisions on predictions from our models. Just as the Gaussian likelihood provides the theoretical basis for using RMSE to quantify model uncertainty when errors are normally distributed, other likelihood functions provide the basis for evaluating model accuracy and constructing confidence intervals under other error distributions.

Probability theory provides a logical answer to the choice between RMSE and MAE. Either metric is optimal in its correct application, though neither may be sufficient in practice. In such cases, refining the model, transforming the data, using robust statistics, or constructing a better likelihood can yield better results. Arguably the last of these is the most versatile, though there are pragmatic reasons for preferring the others.

Returning to the discourse over MAE and RMSE,

No data sets were used in this article.

The author has declared that there are no competing interests.

Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Funding for this research was provided by the Hydro-terrestrial Earth Systems Testbed (HyTEST) project of the U.S. Geological Survey Integrated Water Prediction program.

This research has been supported by the U.S. Geological Survey (Hydro-terrestrial Earth Systems Testbed (HyTEST) project).

This paper was edited by Riccardo Farneti and David Ham, and reviewed by Paul Whitfield and one anonymous referee.