Introduction
Integrated assessment models (IAMs) couple human and physical Earth systems
to explore the impacts of economic and environmental policies. IAMs are usually calibrated to
a historical base year and simulate forward in time by incorporating changes
in quantities such as population, GDP, technology, and policy to produce
outputs that include land use, emissions, and commodity prices. Hindcast
experiments use a model to produce a forecast simulation over a time period
for which observational data are available. The ability to compare simulation
data with observational data presents new opportunities for understanding a
model's strengths and identifying avenues for improvement, and raises new
research questions to explore. A variety of hindcast studies on IAMs of
varying scale have used different evaluation metrics, often driven by the
research question of interest.
However, no community standard for evaluation of IAMs currently exists,
making it more difficult to compare results of hindcast experiments from
different models. This work outlines goals for evaluating IAM hindcast
experiments.
The Global Change Assessment Model version 3.0 (GCAM) was
recently used to conduct a hindcast experiment. That study, hereafter
referred to as Paper 1, used skill scores to
compare performance of the land use module of GCAM under structurally
different operating assumptions to an observational dataset. The different
scenarios represent different extremes of information for decision making
given to the GCAM economic agents. One finding of this hindcast experiment
with GCAM 3.0 was that the highly aggregated nature that makes the skill
scores examined convenient also masks important deficiencies, limiting the
insight they can provide for model development. A key question raised by this
experiment, and which this work examines in greater detail, is how to
actually define “improvement”. The ease of use of skill scores must be
balanced against the need to reveal as many model deficiencies as possible. Only once
a definition of improvement has been decided upon can parameter estimation
studies be undertaken, as ranging over parameter values is only a useful task
if one can quantitatively identify the parameter values that give the best
agreement with historical data.
From this work, four goals for development of an IAM hindcast evaluation
scheme were identified. A desirable evaluation method will provide information
about the absolute performance of a single model run and may be used to
measure the relative performance of multiple model runs (from a single model or
across many models of the same variables). Additionally, we seek a method
that can describe multiple aspects of model performance on multiple scales,
providing a flexible organizational structure for analyzing the large amount
of data generated by IAMs while investigating particular hypotheses of
interest. And finally, the method should include at least one metric that can
be used as a cost function in future Monte Carlo-style parameter estimation
experiments. Given these goals, it is unlikely that any single metric can
satisfy all four. Rather, we seek a condensed set of related metrics that
together accomplish all four goals. The result of applying the set of
metrics to model runs may be
interpreted to identify future avenues for model improvement of a
particular IAM. The implementation of such improvements is outside the scope
of this paper.
Our evaluation goals are not independent of each other. A metric that
provides absolute performance insight can be calculated for multiple model
runs and compared to provide relative performance information. A metric
evaluating a particular aspect of model performance may be used to estimate
parameters to improve that aspect of model performance.
Several other works in the IAM hindcasting literature do not meet all four
of our goals. For example, in the hindcast experiment performed for the
energy component of the AIM/CGE (Asia-Pacific Integrated Models/Computable
General Equilibrium) model, the authors present two statistics: a regression
technique and an error statistic for global aggregates. The regression
technique identifies regions and variables for which model performance may be
improved. While the regression technique can produce desirable
region-specific information about model performance and shortcomings for
multiple variables, it unfortunately cannot be leveraged as a performance
metric for future Monte Carlo-style parameter estimation exercises. It is
also difficult to efficiently and comprehensively compare the regression
results of multiple different scenarios to evaluate whether one scenario
represents an overall better performance than another.
A finding common to both of these hindcast experiments is that the global
performance for a variable is often substantially better than the performance
in individual regions. Therefore, while this work will explore global
aggregates as previous analyses did, we find that global aggregates alone are
not sufficient to evaluate IAMs that require global supply to equal global
demand at each time period. GCAM is only one example of such an IAM.
The analysis scheme outlined below is designed with the four evaluation goals
in mind and focuses on deviation-based measures of model performance and the
extent of conclusions that may be drawn from them. While many other model
performance statistics exist, many operate on a pass–fail basis and therefore
provide little insight into the reasons a model may fail. The scheme is then
used to reexamine the land use data from Paper 1 to demonstrate application
of the evaluation method and the resulting expanded results relative to
application of skill scores.
Evaluation methods
A proposed scheme to meet the four evaluation goals inspired by past IAM
hindcasting experiments is outlined below. This work explores the extent of
conclusions that may be drawn from the root mean square error (RMSE) measure
of model performance and finds that different uses of RMSE allow the
possibility of addressing all four evaluation goals. While arguments against
RMSE in favor of the mean absolute error (MAE) exist, RMSE is chosen because
it can be decomposed into errors from different sources.
If only a single deviation measure were being examined, the types of
conclusions that could be drawn would not differ appreciably whether RMSE or
MAE is used. However, the ability to decompose RMSE provides unique
opportunities to understand different aspects of simulation performance.
Indices of agreement are popular in the literature and generally involve the
comparison of a deviation measure between simulated and observed time series
with some reference measure.
Common reference measures include deviation measures between the observed
data points and the mean of observations, or deviation measures between the
observed data points and a baseline or naive model of the variable being
simulated. Consistent with the idea of examining different reference
measures, we normalize the root mean square error in different ways to
capture different facets of model performance. Other members of the
geoscientific modeling community are also moving to assess model performance
with multiple normalized statistics, although we differ in specific
techniques. These indices of agreement are
particularly useful for evaluating model scenario performance in absolute
terms due to the informative performance benchmarks outlined in Sect. . Other goodness-of-fit statistics such as correlation or
a reduced chi-squared statistic were not chosen because they offer less
information to guide improvements when a model displays poor performance.
Background: root mean square error decomposition
In the statistics outlined below, the value of variable $i$ in region $j$ at
timestep $t$ is denoted by $s_{tij}$ for simulation and $o_{tij}$ for
observation. Each time series contains $N$ discrete time points. The
deviation measure of error chosen for model evaluation is the root mean
square error, denoted for variable $i$ in region $j$ by
$$e_{ij} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(s_{tij}-o_{tij}\right)^{2}}.$$
The root mean square error is the total deviation error in the model, and it decomposes as
$$e_{ij}^{2} = b_{ij}^{2} + v_{ij}^{2},$$
where $b_{ij}$ represents bias and $v_{ij}$ represents errors due to variability. The bias of variable $i$ in region $j$ is given by
$$b_{ij} = \overline{s_{ij}} - \overline{o_{ij}},$$
where $\overline{s_{ij}}$ is the mean of the simulated time series and
$\overline{o_{ij}}$ is the mean of the observed time series. The errors due
to variability are those remaining after bias is accounted for by subtracting
the means of the simulation and observation. The centered root mean square
error quantifies this error and is denoted by
$$v_{ij} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left[\left(s_{tij}-\overline{s_{ij}}\right)-\left(o_{tij}-\overline{o_{ij}}\right)\right]^{2}}.$$
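The decomposition can be verified numerically. The following sketch (using
NumPy, with made-up series values) computes $e_{ij}$, $b_{ij}$, and $v_{ij}$
for a single variable–region pair and checks that
$e_{ij}^{2} = b_{ij}^{2} + v_{ij}^{2}$ holds exactly:

```python
import numpy as np

# Illustrative simulated and observed time series for one variable/region
s = np.array([1.0, 2.0, 3.5, 4.0, 5.5])
o = np.array([1.2, 1.9, 3.0, 4.4, 5.0])

# Total RMSE e_ij
e = np.sqrt(np.mean((s - o) ** 2))

# Bias b_ij: difference of the time means
b = s.mean() - o.mean()

# Centered RMSE v_ij: error remaining after the means are removed
v = np.sqrt(np.mean(((s - s.mean()) - (o - o.mean())) ** 2))

# The decomposition e^2 = b^2 + v^2 holds exactly
print(np.isclose(e**2, b**2 + v**2))
```

The identity follows because the squared bias is the squared mean of the
error series and the squared centered RMSE is its population variance.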
Metrics for model evaluation
Past hindcast experiments in integrated assessment models have implied that
errors across regions cancel each other out, leading to better performance at
the global level than in most individual regions. We define the time series
for the global region, $G$, by concatenating the time series for each
individual region. Therefore, for $J$ total regions whose time series each
contain $N$ data points, the global time series contains $JN$ data points.
To quantify the extent to which cancellation across regions occurs, bias is
examined at the global level in two ways. First, the bias for the global
region is examined, noting that it is mathematically equivalent to averaging
the individual regional biases:
$$b_{iG} = \overline{s_{iG}} - \overline{o_{iG}} = \frac{1}{J}\sum_{j=1}^{J} b_{ij}.$$
Second, we define the global absolute bias as
$$|b_{iG}| = \frac{1}{J}\sum_{j=1}^{J} |b_{ij}|.$$
By comparing the magnitudes of Eqs. () and (),
the extent of cancellation occurring across regions may be quantified for
each variable $i$.
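To illustrate how the two global statistics together quantify cancellation,
consider a short sketch with invented regional biases:

```python
import numpy as np

# Illustrative regional biases b_ij for one variable across J = 4 regions;
# positive and negative errors partially cancel in the signed average
b_regions = np.array([0.8, -0.6, 0.3, -0.4])

b_global = b_regions.mean()              # global bias b_iG: signed average
b_global_abs = np.abs(b_regions).mean()  # global absolute bias |b_iG|

# The gap between the two quantifies cross-regional cancellation
cancellation = b_global_abs - abs(b_global)
print(b_global, b_global_abs, cancellation)
```

Here the signed global bias is small (0.025) while the absolute bias is
large (0.525), so most of the regional error is being hidden by cancellation.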
On a regional level, normalization provides context for interpreting the
errors in Sect. . The conventional normalization of root mean square error
uses the standard deviation of the observed time series, $\sigma_{o_{ij}}$.
The normalized RMSE of variable $i$ in region $j$ is given by
$$e'_{ij} = \frac{e_{ij}}{\sigma_{o_{ij}}}.$$
The quantity $e'_{ij}$ is dimensionless: the total error as a fraction of the
standard deviation of the observations of variable $i$ in region $j$.
Similarly, the centered RMSE may be normalized by the standard deviation of
the observations to give the errors due to variability as a fraction of the
observed standard deviation. The normalized centered RMSE of variable $i$ in
region $j$ is given by
$$v'_{ij} = \frac{v_{ij}}{\sigma_{o_{ij}}}.$$
The normalization used in Eqs. () and ()
compares deviation measures to the observed variance about the temporal mean.
However, that variance encompasses the trend line behavior. Therefore, we
also normalize the RMSE for variable $i$ in region $j$ by the observed
variance about the trend line, following the convention of comparing
deviation measures to a selected baseline to provide more targeted
information about model performance.
For each variable $i$ in each region $j$, let $\hat{y}(t)$ be the trend line
fitted to the observational data, with $\hat{y}_t$ the values at the discrete
time steps considered. Then we define the standard deviation of the
observations about the trend line as
$$\hat{\sigma}_{o_{ij}} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left[\left(o_{tij}-\hat{y}_t\right)-\overline{\left(o_{tij}-\hat{y}_t\right)}\right]^{2}}.$$
For the true trend line, $\hat{y}(t)$, the mean $\overline{o_{tij}-\hat{y}_t} = 0$. However, in numerically fitting the trend line, the mean is
often not precisely 0. We then define the revised normalized RMSE by
normalizing with the standard deviation about the trend line rather than
about the time mean:
$$\hat{e}_{ij} = \frac{e_{ij}}{\hat{\sigma}_{o_{ij}}}.$$
One advantage of this refined measure is that $\hat{e}_{ij}$ penalizes poor
simulation of the observed trend line more heavily than $e'_{ij}$ does.
Another advantage is that, if the trend line is believed to be true to
reality, the variance about the trend line will encapsulate natural
variations (such as those due to weather) as well as observational
uncertainty.
For the GCAM land use case study defined in Sect. , FAO
observational data for each crop–region combination were individually
detrended using the function loess.as from the R package fANCOVA
to fit the LOESS trend line, selecting the bias-corrected Akaike information
criterion (AICC) method for generating the span parameter.
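As an illustration of the trend-line normalization, the sketch below fits a
trend to synthetic observations and computes $\hat{\sigma}_{o_{ij}}$ and
$\hat{e}_{ij}$. A linear least-squares fit stands in for the LOESS fit used
in the paper, and all series values are fabricated for illustration:

```python
import numpy as np

# Illustrative observed and simulated series on yearly time steps
t = np.arange(1990, 2011, dtype=float)
o = 100.0 + 1.5 * (t - 1990) + np.sin(t)   # observation with a trend
s = 98.0 + 1.2 * (t - 1990)                # simulation that misses the trend

# Fit a trend line to the observations. The paper uses a LOESS fit
# (loess.as from the R package fANCOVA, AICC span selection); a linear
# least-squares fit on centered years is used here purely as a stand-in.
x = t - t.mean()
coef = np.polyfit(x, o, deg=1)
y_hat = np.polyval(coef, x)

resid = o - y_hat
# Standard deviation of the observations about the trend line; the residual
# mean is subtracted because a numerical fit rarely leaves it exactly zero
sigma_hat = np.sqrt(np.mean((resid - resid.mean()) ** 2))

e = np.sqrt(np.mean((s - o) ** 2))   # total RMSE
e_hat = e / sigma_hat                # revised normalized RMSE
print(round(e_hat, 3))
```

Because the simulation drifts away from the observed trend while the
residual spread about the trend is small, $\hat{e}_{ij}$ comes out well
above 1, flagging the missed trend.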
Informative performance benchmarks
While the time series statistics outlined in Sect. have
clear values corresponding to perfect model performance (i.e., a value of 0),
specific criteria for acceptable and good model performance are more
difficult to define objectively. In this section, we outline ways in which to
contextualize the values achieved by each statistic outlined above to
identify opportunities for model improvement.
For $e'_{ij}$ and $e_{ij}$, a helpful performance benchmark is defined as
$$e'_{ij} = \frac{e_{ij}}{\sigma_{o_{ij}}} < 1 \iff e_{ij} < \sigma_{o_{ij}}.$$
Recall that the definition of the standard deviation is
$$\sigma_{o_{ij}} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(o_{tij}-\overline{o_{ij}}\right)^{2}}.$$
The right-hand side of this expression is also the root mean square error of
a model taking $s_{tij} = \overline{o_{ij}}$ at each time step $t$.
Satisfying Eq. () gives some sense of whether the total error is
small enough without achieving a perfect value of 0. It is common to say
that if $e'_{ij} > 1$, using the mean of the observed time series as a
model would outperform the current model. This interpretation
is identical to that of the Nash–Sutcliffe efficiency. However,
for a nonstationary distribution of observations, the observed mean can only
be calculated after the simulation period and therefore cannot be used as a
model. When $e'_{ij} > 1$, either the bias or the variability component
of the RMSE (or both) is too large. Therefore, when $e'_{ij} > 1$, it is
most useful to examine whether $v'_{ij} < 1$. In that case, improving the
bias may allow the model to satisfy Eq. ().
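The benchmark logic above can be expressed as a small diagnostic. This is an
illustrative helper, not part of the evaluation scheme itself, and the
series values are invented:

```python
import numpy as np

def diagnose(s, o):
    """Classify performance using the e' and v' benchmarks (illustrative)."""
    e = np.sqrt(np.mean((s - o) ** 2))
    v = np.sqrt(np.mean(((s - s.mean()) - (o - o.mean())) ** 2))
    sigma_o = o.std()                      # population standard deviation
    e_prime, v_prime = e / sigma_o, v / sigma_o
    if e_prime < 1:
        return "meets benchmark"
    if v_prime < 1:
        return "bias-dominated: reducing bias may meet the benchmark"
    return "variability errors also too large"

o = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
biased = o + 5.0        # large bias, but variability is matched perfectly
print(diagnose(biased, o))
```

For the biased series, $e'_{ij} > 1$ while $v'_{ij} = 0 < 1$, so the helper
reports a bias-dominated failure.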
Statistics for model evaluation.

| Abbreviation | Description | Normalized by | Notes |
|---|---|---|---|
| $b_{iG}$ | global bias | – | lacks absolute performance info |
| $\lvert b_{iG}\rvert$ | global absolute bias | – | lacks absolute performance info |
| $e'_{ij}$ | regional normalized RMSE | standard deviation around time mean of observation | |
| $v'_{ij}$ | regional normalized centered RMSE | standard deviation around time mean of observation | |
| $\hat{e}_{ij}$ | revised regional normalized RMSE | standard deviation around trend line of observation | |
A case study of GCAM 3.0 land allocation
The data described below and analyzed in Sect. are from
the first GCAM land use system hindcast experiment, Paper 1. The land
allocation data are reanalyzed using the method outlined in Table in order to determine whether this method is more likely to
achieve our four goals than the skill scores originally used. This
demonstration is why we have chosen to reevaluate existing experiments
rather than repeat or develop new experiments in a more up-to-date version of
GCAM. The full complement of resulting statistics and figures is available
online with code and data; see the data availability section.
GCAM background and data for reanalysis
GCAM is an integrated assessment model capturing the interactions between
human and Earth systems. GCAM includes energy, economic, and land
use sectors that interact with each other and with a climate model. It is
designed for long-term forecasting and is typically operated in 5-year
timesteps. Model behavior is calibrated to a historical base year using
observational data, and forecasts evolve in time from the base year.
Therefore, social, economic, and environmental policies in place during the
base year are implicitly reflected in GCAM's performance. Policies that begin
later, or change over time, must be more thoughtfully included, often
explicitly.
Full details of the GCAM land use system, including equations, are provided
in the online documentation. Full details of different aspects of GCAM's
structure are published in a variety of papers.
Briefly, the land use system of GCAM has a nested structure. In each
subregion within a geopolitical region, a nested structure is implemented
with data specific to the subregion. The land allocation choice at each
branch in the nest is parameterized to reflect that subregion's
characteristics and may vary in response to economic, policy, and
technological changes.
Economic agents in each subregion operate to maximize the difference between
revenue (including any taxes and subsidies) and the cost of production. The
land use system assumes a distribution of costs, in which the share of land
allocated to each use is the probability that the land type is the most
profitable within its nest, thereby avoiding winner-take-all behavior. That
is, land is allocated among the possible uses via a logit distribution
function at each branch of the nest. All references to GCAM within this work
may be assumed to refer to GCAM version 3.0, unless otherwise specified.
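A schematic example of a logit share rule at a single branch of the nest is
sketched below. The functional form and parameter value are illustrative
only, not GCAM's actual parameterization:

```python
import numpy as np

def logit_shares(profits, beta):
    """Illustrative logit share rule: shares proportional to profit**beta.

    A schematic stand-in for GCAM's nested logit allocation, not its exact
    functional form or calibrated parameters.
    """
    p = np.asarray(profits, dtype=float)
    w = p ** beta
    return w / w.sum()

# Three competing land uses at one branch of the nest
profits = [1.0, 1.2, 0.9]
shares = logit_shares(profits, beta=4.0)
# The most profitable use gets the largest share, but not all of the land
print(shares)
```

Because shares are a smooth function of relative profitability, a small
profit advantage yields a larger, but not total, land share, which is the
winner-take-all avoidance described above.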
Historical data prior to 1990 were used to calibrate GCAM 3.0, and then GCAM was run for a period from 1990 to 2010 without using
additional historical data (i.e., GCAM is used to forecast agricultural land
use from 1990 to 2010). There are 9 GCAM crops (of 12) with historical
data reported by the United Nations Food and Agriculture Organization (FAO) during the period 1990 to 2010. The same analysis scheme outlined
in Sect. and demonstrated here could just as easily be
used to examine any variable output by an IAM with historical data available
for validation.
The reference setup of GCAM 3.0 (and all subsequent versions to date) for
forecast into the 21st century uses smoothed FAO projections of yields as
exogenous yield input information that is used by GCAM to simulate land
allocation. The smoothing is performed as a 5-year rolling average
including past and future years (i.e., the smoothed 2040 data point is
generated as the average of data from 2038 to 2042).
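The centered 5-year rolling average can be sketched as follows (the yield
values are illustrative):

```python
import numpy as np

# Illustrative yearly yield series; the reference setup smooths with a
# centered 5-year rolling average (e.g., the smoothed 2040 value averages
# the raw values for 2038-2042), sketched here with a simple convolution.
years = np.arange(2036, 2045)
yields = np.array([3.0, 3.2, 2.9, 3.5, 3.1, 3.4, 3.3, 3.6, 3.2])

window = np.ones(5) / 5.0
smoothed = np.convolve(yields, window, mode="valid")  # centered 5-yr means

# smoothed[k] corresponds to years[k + 2]: the 2040 entry averages 2038-2042
print(dict(zip(years[2:-2], np.round(smoothed, 3))))
```

With `mode="valid"`, the first and last two years have no smoothed value,
matching the need for both past and future years in the centered window.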
Because GCAM requires global supply to equal global demand to solve for
market prices at each time step, it is possible for GCAM economic agents to
implicitly optimize land allocation to meet global demand at minimum cost,
even though GCAM is a dynamic recursive rather than an optimization model.
When the economic agents are given unrealistic foreknowledge of the impacts
of weather events, for example, this implicit optimization may become
particularly problematic. GCAM endogenously calculates a global market price
(where global supply equals global demand) during the simulation period. This
global market price is used to set producer prices used by economic agents in
profit calculations underlying land allocation decisions, and every land use
region shares the same global producer price. A global market price is needed
for model calibration in the base year. Since such data do not currently
exist, the USA producer price is used as the global price for calibration.
This choice could lead to incorrectly incorporating or missing impacts of
policies like subsidies or crop insurance programs. On the demand side, the
price is sterilized in the GCAM calibration procedure.
Paper 1 featured experiments designed to investigate the possibility of
unrealistic implicit optimization and examined two extremes of exogenous
yield inputs via different parameterizations. The extremes also emphasize
different aspects of the GCAM reference setup, and so the reference setup
behavior is assumed to lie between the behaviors of the two extremes. The
first extreme features increased variability in exogenous yield inputs
compared to the GCAM reference. This is referred to as the actual yield case:
GCAM makes planting decisions (allocates land) in 2005 based on knowing what
the yield at the end of the year in 2005 will be, a case of economic agents
having unrealistic levels of information for making planting decisions. There
is no smoothing at all, and there is no explicit memory of past years'
performance. The other extreme features a lack of variability and no updates
to exogenous yield inputs during the simulation period 1990–2010, as opposed
to the reference setup. This is referred to as the forecast yield case: a
linear regression is fit to the historical yields over 1961–1990 and
extrapolated linearly for the simulation period 1990–2010. There is no
variation about this linear trend, and the economic agents have no
foreknowledge, in contrast to the actual yield case.
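The forecast yield construction can be sketched as follows (the historical
yields are synthetic and the trend parameters invented):

```python
import numpy as np

# Fit a linear trend to "historical" yields (here 1961-1990) and extrapolate
# it over the simulation period (1990-2010) with no variability about the
# trend, mirroring the forecast yield (FY) construction.
hist_years = np.arange(1961, 1991, dtype=float)
rng = np.random.default_rng(0)
hist_yields = (2.0 + 0.03 * (hist_years - 1961)
               + rng.normal(0, 0.05, hist_years.size))

slope, intercept = np.polyfit(hist_years - 1961, hist_yields, deg=1)

sim_years = np.arange(1990, 2011, dtype=float)
forecast = intercept + slope * (sim_years - 1961)  # smooth extrapolation

# The forecast carries no year-to-year variation by construction
print(np.allclose(np.diff(forecast), slope))
```

A series built this way cannot respond to any change in yield behavior
during the simulation period, which is exactly the limitation discussed for
the FY and FYB scenarios below.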
To examine the impact of missing or incorrectly characterizing a policy,
Paper 1 examined the US Renewable Fuel Standards implemented in 2005. The
standards, among other things, increased demand for corn. GCAM runs without
any implementation of the policy were compared with GCAM runs in which the
increased demand for corn was explicitly included. Future scenarios
interested in deeper analysis of the impacts of the US Renewable Fuel
Standards may use a more detailed implementation or may make use of the
metrics outlined in Sect. to perform a Monte Carlo-style
parameter estimation for parameters related to the fuel standards.
Global bias, $b_{iG}$ (Eq. ). The black circle corresponds to $b_{iG} = 0$.
These considerations result in the following four test cases (scenarios) examined in Paper 1:
GCAM makes annual land allocations given data for population, income, and actual crop yields (denoted by AY);
GCAM makes annual land allocations given data for population, income, and actual crop yields and includes an
estimate of the additional demand for corn resulting from the implementation of the US Renewable Fuel Standards (denoted by AYB);
GCAM makes annual land allocations given data for population and income, but crop yields are forecasted based
on an annual time trend for the years 1961 to 1990 (denoted by FY);
GCAM makes annual land allocations given data for population and income, but crop yields are forecasted based on
an annual time trend for the years 1961 to 1990, and includes an estimate of the additional demand for corn resulting
from the implementation of the US Renewable Fuel Standards (denoted by FYB).
The simulated regional data in each of these four scenarios are compared to
data reported by the FAO during the period 1990 to 2010 for the
nine GCAM crops with FAO data available. Paper 1 found
that the FYB case performed as well as or better than the other scenarios
across the skill scores considered: Reichler–Kim,
normalized mean absolute error,
and Taylor skill. Scenarios AY and
AYB generally performed the worst.
Results
We present a selection of results demonstrating how the evaluation method
summarized in Table can be used to analyze multiple aspects of model
performance at multiple scales, and how the metrics can make the analysis of
the large amounts of data produced by IAMs more tractable. The results were
chosen both to illustrate the general types of insights that may be drawn
from applying the evaluation scheme and to highlight the areas of strong and
weak GCAM performance; the full results for all variables on all scales by
all metrics lie somewhere in between the results presented in this section.
Each metric in Table is used to reexamine the Paper 1 data,
demonstrating the interactive and complementary nature of the metrics
selected. With this approach, we are able to verify and expand the previous
GCAM land hindcast results arrived at using skill scores in Paper 1. The
analysis scheme does appear to be more capable of achieving all four evaluation
goals than the skill scores. The full complement of resulting statistics and
figures are available online with code and data; see data availability
section
for details.
Figure shows the global bias, Eq. (), which is
equivalent to the average of the individual regional biases. Because it is a
signed quantity, a black circle is included at $b_{iG} = 0$ for visual
reference. Each scenario models global supply well for each crop with
observational data available, as measured by the global bias $b_{iG}$. The
primary exceptions are that the scenarios AY (red) and AYB (green) model
MiscCrop and OtherGrain poorly. This is not surprising, given that each of
those crops is an aggregate of a large number of real-world crops, varying
across regions.
Global absolute bias, $|b_{iG}|$ (Eq. ).
Figure shows the global absolute bias, Eq. (). For each crop, the magnitude of the global absolute bias
in Fig. is larger than the magnitude of the global bias in
Fig. , indicating that errors are canceling each other out across regions.
Because there are no regional constraints on supply to supplement the
requirement that global supply equal global demand, there are numerous
regional supply solutions that may satisfy the global constraint. This
provides ample opportunity for error cancellation across regions in any
integrated assessment model with a similar global constraint.
The FYB scenario (purple) displays the smallest absolute bias for all crops,
with the exception of Rice and OtherGrain, in Fig. . In
other words, the FYB scenario is most successful at modeling global supply
when cancellation across regions is prohibited.
Normalized RMSE, $e'_{ij}$ (Eq. ), in each region for the land allocated to
Wheat. The black circle is at the performance benchmark $e'_{ij} = 1$
(Eq. ); $e'_{ij}$ compares the RMSE with the standard deviation of the
observations for each crop.
The compensating errors across regions can be further studied by examining
the normalized RMSE, $e'_{ij}$ (Eq. ), for a single
crop. Figure displays the individual regional errors for
Wheat. A black circle denotes the performance benchmark
$e'_{ij} = 1$ (Eq. ). With the exception of Southeast
Asia, the forecast yield scenarios (FY, blue, and FYB, purple) outperform the
scenarios using actual yield information (AY, red, and AYB, green). Scenarios
FY and FYB show that compensating performance is occurring: the good
performance in Canada, eastern Europe, and the USA is balanced by the poorer
performance in Australia, New Zealand, India, Latin America, and Southeast
Asia. Similar trends hold when examining other crops.
To further understand the role of compensating errors in GCAM land
allocation, we examine bias as a contributing factor. Because the
root mean square error decomposes into bias and centered root mean square
error (Eq. ), a sense of whether the bias is too large can be
gained by comparing $e'_{ij}$ (Eq. ) and
$v'_{ij}$ (Eq. ). If $e'_{ij} > 1$ and
$v'_{ij} < 1$, bias may be considered a problematic source of error.
This is generally what occurs in GCAM.
Normalized RMSE, $e'_{ij}$ (Eq. ), for each crop in the United States.
A black circle is included at $e'_{ij} = 1$; $e'_{ij}$ compares the RMSE
with the standard deviation of the observations for each crop.
Normalized centered RMSE, $v'_{ij}$ (Eq. ), for each crop in the United
States. A black circle is included at $v'_{ij} = 1$; $v'_{ij}$ compares the
centered RMSE with the standard deviation of the observations for each crop.
Figure displays the normalized RMSE, $e'_{ij}$, for
each crop in the United States. A black circle is included at
$e'_{ij} = 1$. In the FYB scenario (purple), $e'_{ij} > 1$ for
every crop except Wheat.
Figure displays the normalized centered RMSE,
$v'_{ij}$, for each crop in the United States. A black circle is
included at $v'_{ij} = 1$.
The FYB scenario (purple) displays $v'_{ij} < 1$ for all crops except
Rice and Root Tuber. Compared with the larger values of $e'_{ij}$ in
Fig. , this indicates that bias is a major contributor
to performance issues. This general trend – that scenario FYB performs
best and that bias is the major contributor to model performance issues for
most crops – holds across regions.
It would be preferable for the bias to be improved intrinsically through
structural or parametric model changes rather than through bias-correction
techniques. Therefore, we examine which factors contribute to bias. The
revised normalized RMSE, $\hat{e}_{ij}$ (Eq. ), compares
GCAM performance to variations of the observed time series about the trend
line. Figure displays this metric for each crop in the
USA. A black circle is included at $\hat{e}_{ij} = 1$. Each crop in each
scenario misses the trend line behavior. With the exception of Rice, scenario
FYB (purple) comes closest to capturing the trend line behavior. This result
holds for most crops in most regions. Therefore, scenario FYB is one possible
starting place for making structural improvements to GCAM.
To further examine the ways in which simulations may improve at capturing
trend lines, time series for corn and wheat for multiple
regions are depicted in Fig. a and b, respectively. The black curves are FAO
observational data for land allocation in each region, and the colored time
series correspond to the different GCAM scenarios.
Revised normalized RMSE, $\hat{e}_{ij}$ (Eq. ), for each crop in the United
States. A black circle is included at $\hat{e}_{ij} = 1$; $\hat{e}_{ij}$
compares the RMSE with the standard deviation about the observed trend line
for each crop.
The time series for both corn and wheat illustrate a key issue: GCAM tends to
incorrectly simulate whether land allocation should increase or decrease in
time. The FYB scenario for Wheat (Fig. b) tends to be the
most accurate, consistent with the results depicted in Fig. . Notably, the actual yield scenarios (AY, red,
and AYB, green) are also susceptible to inaccurate discrimination between
increasing and decreasing land allocation, showing that this is not improved
by economic agents in GCAM having perfect information about year-end yields
when making planting decisions.
One possibility for the incorrect direction of simulated trends is that the
parameters involved in the land allocation decision may be improved, by
changing the calibration process and/or by using parameter estimation to
adjust the logit exponents governing competition. Another option may be to
explore the impacts of using different distributions to govern competition.
That the AY (red) scenario performs differently than the AYB
(green) scenario reinforces the importance of careful implementation of
policies: explicitly including the effects of policies (as in AYB) leads
to different performance than assuming policies are implicitly included in
the information provided to the model (as in AY, where the real-world
yields supplied as inputs should implicitly reflect the increased demand due
to the US Renewable Fuel Standards).
Finally, the time series for corn in the former Soviet Union and wheat in
China both suggest an opportunity for structural changes to improve the land
allocation performance of GCAM. The yields for both of these crops display
different slopes during the simulation period than the historical period.
Therefore, the extension of the historical yield trends used in the FY and
FYB scenarios has no hope of correctly capturing the different yield behavior
during the simulation period. In turn, GCAM has no hope of capturing the
different land allocation decisions in response to those yield changes. In
contrast to the FY and FYB scenarios, the AY and AYB scenarios lead to GCAM's
land allocation being very responsive to variability in yield inputs. One
hypothesis is that this is because the economic agents in GCAM have
unrealistic access to year-end harvest amounts when making their planting
decisions. This local-yield input information may allow GCAM to meet global
demand without matching historical data due to the lack of regional supply
constraints.
Time series for land allocated to corn (a) and wheat (b) in units of thousands of square
kilometers across select
regions. The black time series in each panel represents FAO observational data. The colored time series correspond
to different GCAM scenarios.
GCAM-specific conclusions
Using the evaluation method outlined in Table , we expand the
results presented in Paper 1. Like many IAMs, GCAM requires that global
supply equal global demand for each commodity in each time period. The FYB
scenario in GCAM models global supply well, as measured by the global bias
$b_{iG}$ (Fig. ). GCAM, at least, has no regional constraints on
supply to supplement the global supply and demand constraint. As a result,
there are numerous regional supply solutions that may satisfy the global
constraint. This provides ample opportunity for error cancellation across
regions, demonstrated in Fig. .
We find that the main opportunity to improve land allocation decisions in
GCAM is to make structural and parametric changes to improve the trend line
for each simulated time series and therefore improve bias. The scenario using
yields forecasted from the historical period and modeling the US Renewable
Fuel Standards (scenario FYB) generally performs best across all metrics
and is the most reasonable starting point for model improvements.
Specifically, updating the yield forecast as new information becomes
available each year in the simulation period would allow the yield to capture
changes occurring during the simulation period while avoiding the
over-responsiveness of the scenarios using actual yields as inputs (scenarios
AY and AYB). Changes to parameters, calibration methods, and data sources for
producer prices may also improve the land use system's ability to discern
whether land allocation trend lines should increase or decrease in time for a
given crop–region combination. The metrics in Table may be
used for parameter estimation studies. In using GCAM to forecast into the
future (where an AY scenario is not possible), providing the ability to adapt
to shifts in yield occurring during a simulation period and the ability to
better predict whether a land allocation trend line should increase or
decrease in response to a yield shift would both be improvements.
Because the GCAM reference exogenous yield inputs lie between the two
extremes examined in Paper 1 and here, one expects a hindcast experiment with
the reference setup to have errors between those of the AY and FY cases.
However, because the reference scenario's exogenous yield inputs are based on
FAO yield forecasts, it could perform substantially worse than any of the
cases examined in this work if those forecasts are dramatically inaccurate.
Because planting decisions are not subject to the kind of “vintaging” seen with
power plant construction, errors are unlikely to compound in unexpected ways:
a planting decision in GCAM lasts only for the year in which it is made,
whereas a power plant, once built, operates for 30 or more years. This lack
of vintaging makes the land sector simpler to evaluate than other sectors
of GCAM. Therefore, while the evaluation method outlined in this work can
still be applied to sectors that feature vintaging, the results must be
interpreted much more carefully. It is possible that additional metrics may
have to be implemented for sectors with vintaging, and rigorous studies
designed to specifically test the extent to which vintaging causes errors to
compound may be undertaken in the future.
Conclusions
Examination of past hindcasting exercises in the IAM community has suggested
that global aggregate metrics are often not well suited to evaluating IAM
hindcast performance. This work has outlined a suite of metrics designed to
counteract this problem and has demonstrated that the family of metrics
presented is able to provide richer insight into model performance than
global skill scores by reevaluating the results of a past hindcast
experiment in GCAM.
Further, applying the evaluation method outlined in Table
provides insight into evaluating IAMs beyond GCAM. While global results in GCAM
are largely consistent with observations, cancellation of errors is present
at the global level, a finding implied by previous hindcasting work in two
different IAMs . Any IAM
requiring globally balanced supply and demand without additional regional
constraints will likely encounter this same issue. This suggests a larger
challenge in evaluating integrated assessment models: replicating global
aggregates is a necessary but in no way sufficient constraint on model
performance. Indeed, many IAMs force global supply to equal global demand,
and so global aggregates of many variables in IAMs simply reflect this forced
behavior. Therefore, a family of validation metrics is necessary for
evaluating IAM hindcast experiments. The option to evaluate results
both relatively and absolutely should lead to more robust model improvements
in the future by identifying the best-performing scenarios for a single
model, as well as aid the IAM community in conducting hindcast
intercomparison studies.
A sector-by-sector application of a family of metrics may be necessary to
evaluate an IAM hindcast experiment as a whole. Determining whether this is
the case will require future research into more tractable methods for
evaluating all IAM sectors simultaneously without masking deficiencies the
way global aggregates do. Such work is complicated by the lack of historical data
against which to validate many IAM variables. Additionally, one may question
whether the observational data being used for validation are reliable.
Collecting global economic data is difficult and there is no opportunity for
repeated measurements to obtain measurement uncertainty. When fitting trend
lines to the FAO data for use in the revised normalized RMSE metric,
ê_ij (Eq. ), it became clear that in at least
some regions the data may not reflect reality: for some
crops in Korea and Japan (among other regions), there is almost no variation
about the trend line. There also were no FAO data available to validate three
crops and other land types modeled by GCAM. Therefore, a better sense of
observational uncertainty is necessary before parameter estimation based on
observational data can take place.
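The instability caused by near-zero variation about the trend line can be sketched with an illustrative stand-in for a trend-normalized error metric. This is not the paper's exact revised normalized RMSE; it only assumes the metric divides an RMSE by the observations' spread about their own fitted linear trend:

```python
import numpy as np

def trend_normalized_rmse(obs, sim):
    """Illustrative stand-in: RMSE of sim vs. obs, normalized by the
    observations' variability about their own fitted linear trend."""
    t = np.arange(len(obs))
    obs_trend = np.polyval(np.polyfit(t, obs, 1), t)  # degree-1 fit
    rmse = np.sqrt(np.mean((sim - obs) ** 2))
    obs_spread = np.sqrt(np.mean((obs - obs_trend) ** 2))
    return rmse / obs_spread  # blows up as obs hug their trend line

# Hypothetical series: observations with modest variation about a trend,
# and a uniformly biased simulation of them.
obs = np.array([1.0, 2.0, 2.0, 3.0])
sim = obs + 1.0
score = trend_normalized_rmse(obs, sim)
```

When an observed series shows almost no variation about its trend, as noted above for some crops in Korea and Japan, the denominator approaches zero and any metric of this shape becomes unstable, reinforcing the need for observational uncertainty estimates before parameter estimation.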