A crop yield change emulator for use in GCAM and similar models: Persephone v1.0

. Future changes in Earth system state will impact agricultural yields and, through these changed yields, can have profound impacts on the global economy. Global gridded crop models estimate the inﬂuence of these Earth system changes on future crop yields but are often too computationally intensive to dynamically couple into global multi-sector economic models, such as the Global Change Assessment Model (GCAM) and other similar-in-scale models. Yet, generalizing a faster site-speciﬁc crop model’s results to be used globally will introduce inaccuracies, and the question of which model to use is unclear given the wide variation in yield response across crop models. To examine the feedback loop among socioeconomics, Earth system changes, and crop yield changes, rapidly generated yield responses with some quantiﬁcation of crop response uncertainty are desirable. The Persephone v1.0 response functions presented in this work are based on the Agricultural Model Intercomparison and Improvement Project (AgMIP) Coordinated Climate-Crop Modeling Project (C3MP) sensitivity test data set and are focused on providing GCAM and similar models with

Abstract. Future changes in Earth system state will impact agricultural yields and, through these changed yields, can have profound impacts on the global economy. Global gridded crop models estimate the influence of these Earth system changes on future crop yields but are often too computationally intensive to dynamically couple into global multisector economic models, such as the Global Change Assessment Model (GCAM) and other similar-in-scale models. Yet, generalizing a faster site-specific crop model's results to be used globally will introduce inaccuracies, and the question of which model to use is unclear given the wide variation in yield response across crop models. To examine the feedback loop among socioeconomics, Earth system changes, and crop yield changes, rapidly generated yield responses with some quantification of crop response uncertainty are desirable. The Persephone v1.0 response functions presented in this work are based on the Agricultural Model Intercomparison and Improvement Project (AgMIP) Coordinated Climate-Crop Modeling Project (C3MP) sensitivity test data set and are focused on providing GCAM and similar models with a tractable number of rapid to evaluate dynamic yield response functions corresponding to a range of the yield response sensitivities seen in the C3MP data set. With the Persephone response functions, a new variety of agricultural impact experiments will be open to GCAM and other economic models: for example, examining the economic impacts of a multi-year drought in a key agricultural region and how economic changes in response to the drought can, in turn, impact the drought.

Introduction
Agricultural yields are susceptible to changes in temperature, precipitation, growing season length, CO 2 concentrations, and other Earth system factors. While both the nature of the future climate and its impact on agricultural yields are uncertain Pirttioja et al., 2015;Fronzek et al., 2018;Asseng et al., 2013Asseng et al., , 2015Martre et al., 2015;Lobell, 2013), it is clear that there is potential for identifying the important effects on agriculture and, in turn, the economic state of the world at large. The global multi-sector economic Global Change Assessment Model (GCAM) 1 (Kyle et al., 2011;Wise et al., 2014;Calvin et al., 2019;Hartin et al., 2015) and other similar-in-scale models (Nelson et al., 2014) are ideal for understanding the far-reaching impacts of this climate-agriculture-economic cycle but rely on external projections of agricultural yields to quantify these effects (Fig. 1a). This asynchronous process results in inconsistencies between the economic and biophysical world, and overlooks feedbacks and unintended consequences as the future shifts .
Several modeling groups, including the GCAM model development team, are interested in explicitly modeling and understanding bidirectional feedbacks between the Earth and the human systems (e.g., Fig. 1c). Agriculture is one important pathway (of many) through which these systems directly interact. A prime example would be to study the impacts of a multi-year drought in a key agricultural region. The drought A. Snyder et al.: Persephone v1.0 would affect yields, which would affect the agricultural supply to the global economic market. In a model like GCAM, this would lead to price changes and shifting land to more profitable crops. The new spatial distribution of agricultural land would change land-related emissions, which will in turn affect climate and therefore yields moving forward. Being able to model each component of this process and the interactions among them is key to considering important questions like this one.
Currently, GCAM operates on a 5-year time step and is coupled with a physical Earth system emulator, Hector (Hartin et al., 2015) (as in Fig. 1a, b), to explore global change questions in rapid enough evaluation times to allow for large numbers of simulations to be analyzed as part of a wide range of experiments. GCAM is a recursive dynamic partial equilibrium model that is calibrated to a historical base year of 2010 and used to simulate forward in time by incorporating changes in quantities such as population, GDP, and technology to produce outputs that include land, water, and energy use as well as emissions and commodity prices. For agricultural production in GCAM, yield change trends representing generally positive change assumptions over time due to non-climate factors (changes in management, new seed genetics, new technologies, use of chemicals/fertilizers, adaptation, etc.) are used to calculate the profitability of a crop-irrigation-fertilizer combination in each of 384 GCAM land units at each time step based on the global crop price. This profitability determines land allocated to each crop, and the combination of exogenous yields and land allocation gives production of each crop-irrigationfertilizer combination such that global supply and global demand are met on each time step. The details of this allocation are provided in Kyle et al. (2011), Wise et al. (2014), and Calvin et al. (2019). Shifting land allocation among different crop-irrigation-fertilizer combinations leads to a degree of endogenous yield intensification within GCAM.
Past agricultural impact studies using GCAM (Calvin and Fisher-Vanden, 2017) have focused on using outputs of global gridded crop model (GGCM) studies (e.g., Rosenzweig et al., 2014;Elliott et al., 2015;Müller et al., 2017) in a strictly feed-forward way (Fig. 1a). Direct coupling of a GGCM to GCAM would result in a computationally expensive modeling framework, limiting the number of simulations that could be performed. Yet, large ensembles of simulations are necessary to explore and understand future response options, so there is great need for a computationally efficient model that could explore the uncertainty space. While GCAM is already coupled to a simple climate model, Hector (Hartin et al., 2015), this coupling is one way: emissions are passed to the climate model, but to date dynamic feedbacks between climate and humans at each time step are missing. In this paper, we describe the first version of Persephone (v1.0), a simple representation of mean agricultural response and uncertainty to future climate that can be incorporated into GCAM and similar models. Further details of the desired studies this yield change emulator would be used for are given in Sect. 2.1 and discussed at length in Ruane et al. (2017).
An ideal solution to the computational expense of coupling a GGCM to GCAM is a yield response emulator, which uses past crop yield model runs to predict what the model would have done under different conditions, had it been run. However, previous work in this area has been restricted to either emulating crop model results under fixed [CO 2 ]-temperature pathways such as the various Representative Concentration Pathways (RCPs) (Oyebamiji et al., 2015;Blanc, 2017;Ostberg et al., 2018) or building statistical models from empirical and historical data (Lobell, 2013;Moore et al., 2017;Mistry, 2017;Mistry et al., 2017). While an emulator trained on RCP-driven scenarios can be used to estimate yield change in any future climate, the RCPs only span a subset of possible future climates. In particular, should one want to consider the impacts of [CO 2 ]-temperature pathways that substantially differ from the RCPs, these emulators would face the difficult task of predicting yield changes outside of the conditions of the training data. Statistical models of empirical and historical data also must predict yield changes in response to future climate outside of the conditions of the training data, especially in response to large [CO 2 ] increases. Substantial departure from the RCPs and historical values of [CO 2 ] is very possible in the bidirectional coupled human-Earth system applications outlined above and an emulator equipped to handle that is desirable. Finally, many of these past studies have lacked a way to capture aspects of uncertainty that would be useful for the GCAM bidirectional feedback experiments described in Sect. 2.1.
The Agricultural Model Intercomparison and Improvement Project (AgMIP)  took steps to begin addressing these issues with the Coordinated Climate-Crop Modeling Project (C3MP), a modeling study specifically designed to, among other things, provide the data necessary to develop a flexible and dynamic crop yield emulator McDermid et al., 2015). C3MP invited point-based crop modelers from across the AgMIP community to simulate their calibrated agricultural system's response to 99 sensitivity tests in which 1980-2009 baseline climate data were modified to synthesize changes in mean carbon dioxide concentration ([CO 2 ]), temperature, and precipitation. The 99 carbon-temperature-precipitation (denoted CTW, W for "water" rather than P for "precipitation") tests that make up the C3MP protocol were selected using a Latin hypercube to ensure that future scenarios through the end of the 21st century, including all RCPs, fall within the training model simulation data over the vast majority of agricultural lands . The full space of CTW changes that these 99 tests represent is 330-900 ppm global [CO 2 ], −1 to +8 • C from local baseline temperature, and −50 % to +50 % from local baseline precipitation (applied as a multiplicative factor). A particular CTW perturbation could be associated with a specific time slice; Figure 1. The current method for incorporating agricultural impacts into GCAM and two experimental designs for using Persephone v1.0 with GCAM. (a) The current method for incorporating yield changes from a global gridded crop model into GCAM. (b) A partially coupled feed-forward study incorporating yield changes from a predetermined climate scenario into GCAM. (c) A fully coupled feedback loop that iteratively updates agricultural yield impacts.
for example, the 2050s climate changes from a given Earth system model (ESM) RCP4.5 projection or from a climate condition generated within GCAM as a result of interactions between socioeconomic development and the natural environment. Finally, the C3MP study featured broad spatial coverage (albeit not uniform) of a wider variety of crop models, crops, and management practices than has been incorporated into past GGCM or emulator work. More than 50 participating crop modelers helped C3MP record yield response simulation results from a total of 1135 sites, differing by location, crop species, cultivars, crop model, farm management, etc.
The Persephone framework presented in this work is designed to develop yield response functions to CTW changes from a given data set. The Persephone v1.0 response functions, based on the C3MP data set, provide a computationally inexpensive estimate of the change in agricultural yield due to a change in the Earth system and make use of the promising data relating yield changes to CTW changes collected in C3MP. Specifically, we present biologically reasonable response functions that are rapid to evaluate and more dynamic than past options for incorporating crop responses into models like GCAM. We strictly considered responses to long-term Earth system changes. The C3MP results or other appropriate data sets could be further used to examine the effect of interannual variability on yields in Persephone v2.0 and beyond, although this would require additional complexities in seasonal yield variations that are largely averaged out in long-term trends. The response functions also represent the uncertainty in yield response across crop models in the C3MP data set to a given change in local Earth system state, for use in three types of agricultural impact studies with GCAM: 1. The first is a partially coupled feed-forward study ( Fig. 1b) similar to methodology in Ruane et al. (2018). A future climate time series of interest (a nontraditional RCP, climate stabilization level, or hypothetical drought, for instance) is input to the yield response functions, returning yield changes. These yield changes are applied as multipliers to GCAM input files and GCAM is run forward for the entire time period of interest in order to trace the broad impacts on energy, water, and land use of the future climate time series. In this type of study, we only capture the implications of climate for human systems.
2. The second type of study is a fully coupled feedback loop that updates on every model time step to understand how societal pressures drive environmental impacts which in turn create or reduce societal pressures (Fig. 1c). In this case, the yield changes must be calculated very quickly in order to evaluate on each step and interact with GCAM. In this type of study, we can capture the effects of humans on climate and climate on humans, simultaneously.
3. The third type of study is joint climate-crop uncertainty studies of the above two experiments. For tractability, the GCAM development team specifically seeks a mean response function as well as two additional response functions that represent a range of yield response uncertainty. Persephone also stores the full predictive distributions of yield changes for any given CTW change that these three response functions span. If a user desires a different representation of uncertainty, the distribution may be sampled.

C3MP data set
Full details of the C3MP protocols, design, and the location output archive can be found in Ruane et al. (2014) andMcDermid et al. (2015. Here, we highlight some of the key features of the data set and outline our processing of C3MP data for using the Persephone framework to train v1.0 response functions with the Persephone framework. C3MP recorded yield response simulation results from a total of 1135 sites (differing by location, crops, crop model, management, etc.) for each of 99 CTW sensitivity tests designed to cover a range of CTW changes that most future climates would fall into. For each site, each CTW test is applied to change a local time series of weather data from 1980 to 2009 and then the crop model is run to produce 30 years of impacted yields for the CTW test, which are then averaged.
The C3MP design resulted in a wider range of crops than had been previously sampled in a coordinated agricultural modeling study. We separate the C3MP data into 25 different production groups for training in the Persephone framework to create v1.0 response functions. A total of 24 of the 25 groups for this paper are collections of sites corresponding to different crop-irrigation-latitude combinations: irrigated and rain-fed versions of six key crops (maize, rice, wheat, soybeans, a C 3 -photosynthesis average, and a C 4 -photosynthesis average [CO 2 ]), based on sites at the extended tropics (30 • S to 30 • N) and the midlatitudes (30-70 • S, 30-70 • N) (see Sect. 2.1.1 for more details on spatial scales). It is also noteworthy that the majority of C3MP sites had high rates of fertilizer application, even in the extended tropics. These six crop groups were chosen because most integrated assessment models (IAMs) already have experience incorporating such impacts from previous AgMIP exercises (e.g., Ruane et al., 2017;Calvin and Fisher-Vanden, 2017;Nelson et al., 2014;Wiebe et al., 2015;Ruane et al., 2018), they cover the major agricultural commodities globally, and they offer additional benchmarks for evaluating emulator success. In particular, the C 3 -photosynthesis production groups represent an average response of a very wide range of C 3 crops, including wheat, rice, and soybeans. The C 4 -photosynthesis average is similarly defined, with sugarcane considered separately. The 25th production group is rain-fed sugarcane in the extended tropics: no sugarcane sites outside of 30 • S to 30 • N were submitted to C3MP and only one irrigated sugarcane site was submitted.
We cull the 1135 contributed C3MP output data sets according to a range of criteria: 1. Sites simulated with notably older versions of crop models are eliminated. We thus eliminated uses of the Decision Support System for Agrotechnology Transfer (DSSAT) crop model v3 (and prior), given that important updates in crop physiology were added in version 4 (Jones et al., 2003).
2. Site simulations that exclude CO 2 fertilization responses, a fundamental variable examined here, were eliminated. We thus eliminated the SarraH-Hv32 crop model (primarily millet and sorghum sites in west Africa).
3. When C3MP modelers provided simulation sets that were identical other than the use of local weather data or the NASA Modern-Era Retrospective Analysis for Research and Applications (MERRA) for AgMIP (Ag-MERRA) climate forcing data , we used only the local data set to avoid double counting. AgMERRA was provided for all data sets given frequent data gaps and governmental restrictions .
These steps together eliminate more than 550 of the C3MP sites. Finally, for each production group, outliers are statistically identified and eliminated (Davies and Gather, 1993;Bond-Lamberty et al., 2014), in addition to those previously identified by the C3MP steering team. A total of 575 unique sites remain after culling, maps of which are included in Fig. 2. These remaining sites cover 43 countries, 85 models, and 17 crop species. More than half of the C3MP sites have been eliminated, but this still results in a larger number of diverse sites, models, and crop species performing coordinated sensitivity tests than in any previous study Pirttioja et al., 2015;Fronzek et al., 2018). Since C3MP, the AgMIP-Wheat team has conducted an extensive analysis of temperature response at 30 wheat sites with 30 models , but this only captures one of the CTW dimensions.

Known caveats of the C3MP data set
Additional discussion of the C3MP data set in the context of other AgMIP modeling efforts is presented in Ruane et al. (2017). One relevant point to this work is that, while C3MP spatial coverage is not spatially uniform or production weighted for any of the crops under consideration, sites for many of the major production regions are represented for each crop (Fig. 2). A major advantage of using site-specific crop models run voluntarily by experts is that the individual baseline runs at each site have been configured against local information in the historical period. However, the application of crop yield response from these sites to estimate response in any given grid cell with temperature and precipitation data is imperfect by its methodological nature. Yet, this extension is necessary for use with GCAM: gridded yield changes for a subset of crops must be aggregated and converted to yield impact multipliers for each GCAM commodity in each land unit, defined as water basins in GCAM (Calvin et al., 2019).
Given the size and details of the C3MP data set, production groups were formed based on two latitude zones as a way to account for baseline local temperature (which is important in addition to the change from local temperature) without having to eliminate the many valid C3MP sites that could not report local weather data due to data gaps or local government restrictions. As this breakdown already results in some production groups with small sample sizes (see Table 1 and Sect. 3.1.1), further spatial disaggregation of production group is unjustified in this data set. While this means there will be limited spatial granularity in yield response functions, there can still be appreciable spatial granularity in yield changes due to variation in the gridded fields of temperature and precipitation changes. Future data sets with more comprehensive spatial coverage than the C3MP data may be used rather to create v2.0 response functions.
The site-specific percent change in yield from the 1980-2009 baseline yield is the dependent variable used to train our emulator (see Sect. 2.2). Baseline yields differ widely across the C3MP archive due to regional and system differences; however, the percent change in yield from baseline is more consistent across sites for each CTW. Further, by training on change in yield rather than yield, we are able to introduce additional, scientifically grounded constraints to the functional forms we fit (Eqs. 4-6). However, no baseline simulation was requested under the C3MP protocols. Therefore, for each individual set of output yields corresponding to each of the 575 simulation sites, we perform ordinary least squares regression for eight different functional forms relating the sitespecific output yield to the input CTW values and select the best-performing regression to estimate baseline yield (details in Appendix B, Eqs. B1-B8).
It is also worth noting that the C3MP experimental protocols McDermid et al., 2015) do not account for changing growing seasons, either through changes of within-season distribution of temperature and rainfall or in the possible autonomous adaptation of farmers to shift planting and harvest dates. Ruane et al. (2014) showed that within-season distribution changes had a small effect and the possible shift in planting and harvest dates are a topic of adaptation. Modeling autonomous adaptation behaviors is a challenging area for coordinated agricultural efforts and is only beginning to be addressed in coordinated sensitivity intercomparison studies as a scenario option, with no publicly available data sets at this time.

Emulation
The majority of past agricultural yield emulator work has used ordinary least squares regression to estimate coefficients of functional forms. Given a set of predictors, x, and given a particular value of the predictors x i with corresponding training data y i , an emulator would be some linearin-parameter function f (x) that returns an emulated value f (x i ) for comparison with y i . Ordinary least squares regression requires that residuals r i = y i −f (x i ) ∼ N (0, σ 2 ) for all i (e.g., Williams and Rasmussen, 2006, Sect. 2.1.1). A key requirement is that σ is a constant value across all i. Figure 3 displays the spread of yield responses across sites for each CTW test for one production group, rain-fed soybeans between 30-70 • S and 30-70 • N (the midlatitudes). A successful emulator will produce the mean response (Fig. 3, black dots) across sites for each CTW. Therefore, examining the spread of the individual site yield changes about the mean yield gives some sense of the behavior of residuals in the most successful emulation case.The spread of yield change across sites relative to the mean response is different for each CTW test and appears to change in a systematic way -larger Figure 2. Maps of the C3MP data set culled sites. Each site represents a site-specific model of a single crop, with differing management practices. The sites are overlaid on Monfreda et al. (2008) harvested area data, except for the C 3 and C 4 averages. magnitude changes in yield are correlated with greater spread across sites. In light of this, a classic, ordinary least squares regression is not an appropriate approach for this emulator. We also desire more than just the mean response: we desire a measure of how this variation of site responses changes with CTW. With these considerations in mind, we take a slightly different approach to creating the Persephone v1.0 response functions, working from texts such as Gelman et al. (2013), Sivia andSkilling (2006), andMcElreath (2016).
We create the Persephone v1.0 response functions to emulate the mean yield response and two additional yield response scenarios spanning a range of individual site responses. For a given production group (crop-irrigationlatitude zone combination), we collect the data for the 99 CTW tests for each of K C3MP simulation sets drawn from the culled-down archive. In other words, for each of 99 CTW combinations, there exist K 30-year average yield percent changes from the baseline (no changes in CTW) for a group. This ensemble of 99K yield changes is used to calculate the posterior densities for every parameter of µ CTW and σ CTW in the model defined by Eqs. (1)-(7) according to Bayes' theorem (posterior ∝ likelihood × prior). From the posteriors, the maximum a posteriori (MAP) estimates of parameters, the most plausible value for each parameter given both the model being used and the training data, is returned.
We define our likelihood as a normal distribution with mean µ CTW and variance σ 2 CTW : For a production group with site-specific yield responses that are normally distributed for each CTW value, µ CTW is the mean response across sites for that CTW value (the black points in Fig. 3), and σ 2 CTW is a measure of agreement (or disagreement) of responses across sites for that CTW value. We Figure 3. A plot of the percent yield change at each rain-fed soybeans in the midlatitudes site (blue points) for each CTW test (each horizontal line of points is a different test). The black dot for each test represents the mean response across the sites for that test.
present results for our most broadly optimal mean and variance functional form combination in this paper, and present the details of our selection criteria among the different functional forms in the Appendix A.
To have unitless coefficients in our emulator, all predictor variables are standardized. Defining the collection of 99T changes sampled by C3MP as T C3MP , the collection of precipitation changes as W C3MP , and the collection of CO 2 concentrations as C C3MP , we have T baseline is a change of 0 • C from baseline, W baseline is a 0 % change in precipitation from baseline, and C baseline is 360 ppm. Plugging these baseline values into Eq.
We exploit the fact that we are emulating change in yield (and not yield) and the fact that T baseline = W baseline = C baseline = 0 in constructing Eqs. (4)- (7), which relate the mean and standard deviation of the likelihood in Eq. (1) to our unitless predictor values C, T , W . By definition, percentage change in yield in response to no change in CTW is 0 % at baseline for every individual C3MP site. This implies that both mean and variance at baseline are 0 for all production groups, and we must construct the Persephone response functions to reflect this, independent of the estimated baseline yield at each site: Implementing this constraint for the mean is straightforward. Any functional form representation of µ CTW that does not include a constant parameter a 0 will force µ baseline = 0 % yield change precisely because T baseline = W baseline = C baseline = 0.
Constraining the variance to be 0 at baseline as in Eq.
(3) should be equally easy by simply not considering any functional form that includes a constant parameter. However, this approach leads to numerical stability issues when estimating parameters. Therefore, we estimate the variance using the following functional form: This results in the following functional form representation for the standard deviation: This functional form estimates parameters that may individually be negative but which together result in a nonnegative standard deviation for any CTW value being considered. At baseline, this functional form representation has standard deviation σ baseline = |b 0 | as opposed to the required σ baseline = 0 in Eq. (3). This is done for numerical reasons and is addressed with the prior for b 0 ∼ N (0, 0.01 2 ). This constrains the value of b 0 to be between −0.02 % and 0.02 % with 95.45 % probability, reflecting that b 0 should be as close to 0 as possible without causing numerical solver issues. This results in σ baseline values between 0 % and 0.02 %, and therefore 0 % ≤ Y emulated baseline ≤ 0.02 %, which we judge acceptable for incorporating into GCAM as a multiplier. All other parameters have very broad priors: The functional form for µ CTW is equivalent to estimating the coefficients of a third-order Taylor polynomial, which can approximate a wide variety of functions fairly well. Similarly, the functional form for σ CTW is conceptually related to estimating the coefficients of a second-order Taylor polynomial. Because of the C3MP experimental design, emulating yield changes throughout the 21st century using Eqs. (1)- (7) does not require extending beyond the range of mean growing season CTW values used to train the Persephone v1.0 response functions. These functional forms are an evolution from C3MP's hybrid polynomial ). An exploration of other functional forms to address potential overfitting is included in Appendix A. Ruane et al. (2014) also reviews previous emulator forms across the literature, including discussion of the potential to look at non-linear terms such as killing degree days used in Schlenker and Roberts (2009).
From the model defined by Eqs.
(1)- (7), we construct the three Persephone v1.0 response functions for each production group: The default high and low responses are at 1 standard deviation of the production group yield responses (as opposed to 2 or 3) because we are interested in scenarios that capture a range of the simulated site responses but not the most extreme simulated site response. This does not affect how µ and σ are fit in Persephone v1.0, only how they are used. The Persephone v1.0 code is written flexibly enough that a user more interested in capturing the most extreme simulated site response could certainly add a multiplicative factor (e.g., µ + 2|σ |) when using µ and σ without having to spend the computational time refitting.

Evaluation
We primarily present figures and analysis using the model and response functions defined by Eqs. (1)-(8) because we found these functional forms to be the most broadly optimal of those considered. To investigate overfitting, we also examined nine other possible functional form combinations of µ CTW and σ CTW for each production group, defined in Eqs. (A1)-(A7). Details of the cross-validation experiments used as a method of functional form selection are in the Appendix A. Briefly, because we are interested in the ability of any given response function to accurately predict yield changes in response to CTW values not used for training, we perform leave-one (CTW test)-out cross-validation experiments for each production group. The best-performing functional form at the cross-validation experiments is then the selected functional form. This can be done to find the most broadly optimal functional form (using the same functional form for all production groups; Fig. A1) or to find the best functional form for each production group (if a user wishes to vary the functional form for each production group; Table A10). This choice does not introduce additional fitting or computational time. It is changed only by the calls to each function in the Persephone R package by the user.
Here, we quantitatively evaluate the performance of the Persephone v1.0 response functions (Eq. 8) trained on the full span of CTW values that the 99 tests represent for each production group (Sect. 3.1). We also present heuristic evaluations of mean response function performance (Sect. 3.2).
Files with the point estimate, as well as the standard deviation of the posterior distribution, for each coefficient in µ and σ for all 10 functional form combinations for all production groups are available (archived at https://doi.org/10.5281/zenodo.1414423; Snyder et al., 2018) and as part of the Persephone v1.0 R package (https: //github.com/JGCRI/persephone).

Quantitative
We categorize the performance of the Persephone v1.0 response functions trained on the full span of CTW values (mean, high, and low response; Eq. 8) for each production group based on comparing the 99 emulated yields output from the response functions to the 99 corresponding values from the C3MP simulation data: the in-sample measurement of error. These are the actual response functions an end user would have and it is important to have a performance measure for them, although this is not the performance measure used to select functional forms.
The categorization is based on the normalized root mean square error (NRMS) and the comparison for each response function is as follows: -The 99 emulated yields returned by the mean response function are compared to the mean yield response across the production group C3MP sites for each of the 99 sensitivity tests (what we call the simulated mean yields).
- As noted in Willmott (1984), Legates and McCabe (1999), and Snyder et al. (2017), NRMS < 1 is one benchmark for adequate model performance, NRMS < 0.5 is a benchmark for good model performance, and NRMS = RMSE = 0 is perfect model performance. We further subdivide these categories and define excellent in-sample performance as NRMS ≤ 0.25 for all three response functions; good performance is 0.25 < NRMS ≤ 0.5 for at least one response function and NRMS ≤ 0.25 for at least one response function; adequate performance to be all three response functions having NRMS < 1 but at least one response function with 0.5 < NRMS < 1; and finally poor performance occurs when any one of the three response functions has NRMS ≥ 1. The mean response function performs excellently for all of our production groups, although the performance of the high and low response functions differs. These measures are presented in Table 1 for the response functions defined using cubic µ CTW (Eq. 4) and quadratic σ CTW (Eq. 6) for all production groups. The excellent performance of the mean response function holds across all functional form combinations explored (Tables A1-A9). In the event that a user is only concerned with a mean response scenario, a shared functional form for all production groups is acceptable. A user interested in the high and low response functions may wish to use the production-group-specific functional form combinations listed in Table A10, which includes the in-sample performance metric for the optimal functional form for each production group. The majority of production groups (17/25) feature excellent in-sample performance, while the remaining eight production groups feature good overall performance. For more details than the summary tables presented here, files of results for the leave-one-out cross-validation exercises for all functional form combinations for all production groups are available in the paper analysis archive.
We also present a dashboard of quantitative evaluation plots for 4 of our 25 production groups in Figs. 5 and 4 to provide a visual interpretation of the four in-sample perfor-mance categories. Each dashboard is organized to address the following questions: -(a) For a given group, do the three representative responses span the range of sites? In this plot, individual site yield changes for each test (blue dots) are overlaid with the emulated mean, high, and low response functions evaluated for each test (black dots). Each horizontal line of points represents one of the 99 CTW sensitivity tests. Figure 4 displays one performance dashboard from each in-sample performance category for the broadly optimal, shared functional form cubic µ CTW and quadratic σ CTW (Eqs. 4-6), to aid interpretation of Table 1 (and Tables A1-A9).
As indicated in Table A10, any production group can be fit to result in response functions with an in-sample performance of good or excellent, if a user is willing to vary the functional forms used for each production group. Figure 5a presents the dashboard for one of the production groups that featured poor performance when the common functional form cubic µ CTW and quadratic σ CTW (Eqs. 4-6) were used for all production groups: rain-fed sugarcane in the extended tropics. Figure 5b presents the dashboard when the response functions are based on the production-group-specific functional forms selected by cross-validation (Table A10): C3MP µ CTW (Eq. A2) and cubic σ CTW (Eq. A7). The high and low response functions perform better in the latter case, though it is at the cost of a slightly worse (but still excellent) mean response function performance. Examination of the sugarcane entry in Tables 1, A1-A9 indicates that a cubic description of σ CTW (Eq. A7) leads to better high and low response function performance than a quadratic representation (Eq. A6), regardless of functional form used for µ CTW (Eqs. A1-A5). In other words, the uncertainty across C3MP site responses for each CTW test requires a more detailed Taylor series approximation to describe. This is also generally the case for the other production groups that rated adequate or poor insample performance in Table 1: sometimes, the C3MP individual site yield responses are distributed in such a way for each CTW test that a more flexible fit for σ CTW is necessary. Perhaps unsurprisingly, this usually occurs for either very broad production groups (such as those based on C 3 photosynthesis) or for production groups with very few C3MP site outputs (irrigated rice in the midlatitudes) rather than due to a discernible biophysical trend or requirement.

Production groups with small sample size
It is worth noting that 7 of the 25 production groups considered here are characterized by fewer than 10 C3MP sites (Table 1). For all of these groups, it is possible to fit high and low response functions that capture the spread of the group's C3MP site responses well (Figs. 6 and 7). For many of these groups, the spread in response is relatively small. The Persephone framework does not fail; rather, the data upon which the v1.0 response functions are trained are imperfect and would be improved by greater density in spatial sampling.
Had the spatial disaggregation used in forming production groups resulted in small sample size groups with more significant spread in site response, the Persephone framework is unlikely to represent the full spread of the sample. As this is not the case, it is left to an eventual user to judge whether such responses serve their purpose. Figure 6 highlights this fact for the production group with smallest sample size: irrigated soybeans in the extended tropics. The spread of C3MP sites as well as the performance dashboards for the shared optimal functional form (as from Table 1) and for the group-specific optimal functional form (Table A10) are available in the paper analysis data set . While the shared optimal functional form ( Fig. 6b) overestimates the small spread between the two C3MP sites, the group-specific optimal functional form ( Fig. 6c) captures the spread well. Figure 7 repeats this analysis for the next three smallest sample size groups: rain-fed wheat in the extended tropics (top), rain-fed rice in the midlatitudes (middle), and irrigated rice in the midlatitudes (bottom). In all three cases, the groupspecific optimal functional form represents the spread of the data well. This is also the case for the two remaining production groups with fewer than 10 C3MP training sites: irrigated wheat in the extended tropics and rain-fed soybeans in the extended tropics (not shown).

Qualitative
One motivation for the 25 production groups based on combinations of crops (maize, wheat, rice, soybeans, C 3 , C 4 (minus sugarcane), and sugarcane), irrigation technology (irrigated or rain-fed), and latitude band (extended tropics or  Fig. 5a). Vertical error bars indicate the 95 % credible interval for each of mean, high, and low emulated responses. midlatitudes) is to evaluate emulator performance beyond the quantitative. Given that some GCAM users will only be interested in the mean response functions, it is particularly important to validate that these functions capture key biological features of each crop, beyond the quantitative agreement for the 99 C3MP tests measured by the in-sample performance metric in Sect. 3.1. In particular, these are features motivated by biophysical intuition and present in most of the C3MP sites. Therefore, we verify that these features are retained in the emulator.
We use impact response surfaces to visualize these features, examples of which are given in Figs. 8 and 9. The three-dimensional CTW space is most easily examined by looking at cross sections where one of the CTW dimensions   (Table 1) for this production group. (c) The performance dashboard of the group-specific optimal functional form (Table A10) for this production group.
is kept constant while the other two vary. The brown to blue color bar in each of these figures depicts contours for the value of the mean yield response (µ CTW ), while the overlaid labeled black lines depict contours representing uncertainty (σ CTW , used to create the high and low response functions).
We first identify three important relationships we would expect a successful emulation of C3MP mean responses (brown to blue color bar) to obey: -C 3 crops respond strongly and positively to increases in global CO 2 concentrations; C 4 crops have noticeably less benefit from [CO 2 ] increases. -Agriculture in the tropics tends to respond more negatively/less positively to changes in temperature than agriculture in the higher latitudes as the extended tropics correspond to a higher baseline temperature.
-Irrigated crops have almost no response to changes in precipitation, whereas rain-fed crops do.
These benchmarks are met: Fig. 8 features impact response surfaces that highlight the C 3 -photosynthesis and C 4photosynthesis difference, the rain-fed and irrigated difference, and the latitude difference. The full collection of impact response surfaces for all production groups is included in the paper analysis archive. These benchmarks for the mean response are met in those as well. When there are exceptions, we have investigated to find that the mean response function is faithfully representing the underlying C3MP data and that it is the sampling of C3MP sites making up the production group responsible for the discrepancy. Note that, in Fig. 8, uncertainty is greatest in the [CO 2 ]-precipitation and [CO 2 ]temperature slices, and it increases with larger changes from the baseline condition. This follows with current practices for the process-based crop models forming the C3MP data set: [CO 2 ] is clearly related to yields but the details of this relationship are highly uncertain and implemented differently across process-based, site-specific crop models.
The pattern of yield response to CTW changes appears to be more qualitatively consistent across C3MP sites than the quantitative differences across sites (for example, Fig. 3). Figure 9 displays this pattern for one cross section of CTW space for 12 of 66 rain-fed maize sites in the midlatitudes and for the emulated mean response. While the actual numerical values of the response surfaces differ at each site, the pattern of response seen at most sites (increasing yield with high [CO 2 ] and low temperature changes in the upper left, decreasing yields elsewhere) is consistent and shared by the emulated mean response. The high and low response functions are able to capture much of the quantitative spread in site responses, though, as noted in Sect. 2.3, not the most extreme sites. We specifically included the sites in Ames, Iowa, USA; Naousa, Greece; and Lublin, Poland, because they feature the most qualitatively different patterns. The patterns at the 54 sites not displayed closely resemble those of the other nine sites in Fig. 9. This pattern is seen in the broader impact response surfaces literature Pirttioja et al., 2015;Fronzek et al., 2018) as well, further improving confidence in the emulated mean response. All individual site impact response surfaces are included in the paper analysis archive. Figure 10 demonstrates the basic procedure followed in using Persephone within GCAM (using the average of 2071-2100 HadGEM2-ES RCP8.5 projections as an example). The first requirement is a global gridded file of local precipitation and local temperature drawn from climate projections, along with a global CO 2 concentration level. Temperature and precipitation changes are calculated only for the relevant local growing season months in comparison to a 1980-2009 baseline value. The different maps of local temperature and precipitation changes on the left side of Fig. 10 reflect that there are differences in the dates of the local growing season for rainfed maize and wheat. Note that this includes a global CO 2 concentration of 812 ppm, compared to the baseline level of 360 ppm. The [CO 2 ] change alone leads to increased yields for rain-fed wheat in the midlatitudes even in the absence of changes in temperature and precipitation. Indeed, the higher [CO 2 ] elevates yields (compared to the baseline) across all but the most extreme hot and dry conditions. Conversely, the yield response for rain-fed tropical maize is barely helped by elevated [CO 2 ].

Applications
In a typical RCP8.5 scenario, there are sometimes a few grid cells with local precipitation changes that are out of sample. We convert these out-of-sample points to the extreme of our sample so that we avoid extrapolation (e.g., a 74 % local increase in precipitation gets the response of 50 % increase in precipitation -the maximum response to increased precipitation). Note that many of these large percentage changes in precipitation are actually the symptoms of ESM biases or small precipitation changes in arid regions that are unlikely to have agriculture. Holding to 50 % precipitation change likely improves the fidelity of these estimates .
The second step in using Persephone for GCAM is that CTW changes for each grid cell with climate data are passed into the Persephone v1.0 response functions (depending on species/management/latitude zone) to create the desired global gridded map of yield changes that would represent the likely agricultural response. The abrupt changes in behavior across 30 • N and 30 • S (particularly noticeable for wheat in southern Asia) are due to our division of training data into midlatitudes and extended tropics production groups. Those abrupt changes will soften as these impacts are aggregated to the larger GCAM land region level before being applied as multipliers in the experiments outlined in Fig. 1. Figure 11 presents the rain-fed maize impact response surfaces and yield change maps for the bias-corrected Inter-Sectoral Impact Model Intercomparison Project (ISIMIP) entry of HadGEM2-ES RCP8.5 (Warszawski et al., 2014) 2071-2100 average CTW changes (displayed in Fig. 10) for the low (left), mean (center), and high (right) response functions. The high and low response surfaces result from adding or subtracting the gray uncertainty contours to or from the brown-blue mean yield response contours in the mean response surfaces (Eq. 8). Note that, under the high response function, there are a few regions that experience increased yields due to large increases in precipitation offsetting temperature increases. The differences in these three response functions will allow the boundaries of crop response uncertainty to be run through GCAM, resulting in a spread of so-   cioeconomic and environmental impacts in response to a particular future climate.

Comparison to other crop modeling results
We further examine the Persephone v1.0 response functions driven by HadGEM2-ES RCP8.5 CTW changes by comparing our results with previous AgMIP GGCM yield change data released under ISIMIP Warszawski et al., 2014). In order to compare the best possible emulation of the C3MP data set to the range of Ag-MIP/ISIMIP GGCM results, the production-group-specific optimal functional forms provided in Table A10 are used here. To have the most direct comparison possible, the ISIMIP GGCM yield time series were converted from actual yield values to percent changes from 1980 to 2009 baseline yields. It is important to note that the GGCMs were driven by historical climate data from 1980 to 2004. The years 2005-2009 yield data for each GGCM that was driven by HadGEM2-ES RCP8.5, given that this was considered a "future" simulation according to the GCM projections from 2005 forward. The results from the GGCMs which include model-specific [CO 2 ] effects were used. Both Persephone v1.0 and the ISIMIP GGCM yield change data are compared only on grid cells with harvested area greater than 10 ha in the SPAM 2005 data set (You et al., 2014). Figure 11. Low, mean, and high response surfaces for the midlatitudes (top row) and extended tropics (middle row) for rain-fed maize, as well as the resulting maps of yield changes under the same HadGEM2-ES RCP8.5 2071-2100 CTW changes as Fig. 10.
As the ISIMIP GGCMs did not directly participate in the C3MP exercise, no version of these GGCMs was used in the training data that produced the Persephone v1.0 response functions, and there is no a priori reason to expect the Persephone v1.0 range of yield changes to match the ISIMIP range. The site-specific simulations using various versions of DSSAT submitted to the C3MP exercise feature different configurations and model versions than the ISIMIP GGCM pDSSAT (a global gridded implementation of DSSAT). Given this fact, and that the C3MP archive includes results from non-DSSAT site-specific crop models, there is again no expectation of replicating pDSSAT results even though the fundamental crop responses are similar. Finally, it is also worth noting is that the 1980-2009 historical/RCP8.5 HadGEM2-ES simulation is not the same as the historical, site-specific, and AgMERRA data used by modelers submitting to C3MP. This combination of different responses and different baselines across C3MP and the ISIMIP GGCMs means there could be considerable differences in interannual variability and mean yields, which may be a reason that the Persephone v1.0 response functions may predict different yield changes from the ISIMIP range for some crops.
However, it is still worth evaluating our results against the GGCM data. Figure 12 compares the range of aggregated (via MIRCA2000 harvested area; Portmann et al., 2010), time-averaged 2071-2100 yield changes from Persephone v1.0 response functions to the range of ISIMIP yield changes at the global level (Fig. 12a), in the extended tropics latitude band (Fig. 12b), and in the midlatitude band (Fig. 12c) for both irrigated and rain-fed maize, rice, soybeans, and wheat. For context, in the time since the AgMIP/ISIMIP results were published, the IMAGE-LEITAP model has been largely abandoned. Further, IMAGE-LEITAP, the Lund-Potsdam-Jena General Ecosystem Simulator with Managed Land (LPJ-GUESS), and the Lund-Potsdam-Jena managed Land Dynamic Global Vegetation and Water Balance Model (LPJmL) feature relatively unlimited nutrient constraints, resulting in frequent yield increases given an unconstrained CO 2 response. For many of the production groups, the range of Persephone v1.0 yield changes lies at least partially within the ISIMIP range, suggesting that the response functions for those production groups result in yield changes consistent with ISIMIP. Those production groups that differ substantially from the ISIMIP yield range are due to underlying differences in the C3MP data set versus those produced from the ISIMIP GGCMs. That is, while the Persephone framework emulates the C3MP data well, response functions based on a different data set may behave more consistently with the ISIMIP GGCMs given differences between the model selection and local farm system configurations of the C3MP and GGCM ensembles.
Of the production groups with yield ranges much smaller than the range of ISIMIP yield changes, several (irrigated and rain-fed soybeans in the extended tropics and irrigated and rain-fed rice in the midlatitudes) are small sample size groups (Table 1, Sect. 3.1.1). Future coordinated sensitivity studies of site-specific crop models would ideally include more participation in a broader range of regions, but this is a current limitation of the Persephone v1.0 response functions. This adds additional support to the call for a designed network of site-based crop models, intended to cover all regions and systems, to participate in coordinated sensitivity studies raised in Ruane et al. (2017).
The Persephone v1.0 range of yield changes for irrigated maize also noticeably departs from the ISIMIP range of yield changes in the midlatitudes (Fig. 12c), which in turn drives the disagreement at the global level (Fig. 12a). This is not due to an error in emulation or due to small sample sizes (Table 1, Fig. 7, bottom) but rather due to a fundamental disagreement in the predicted maize response among the sitespecific crop models of C3MP and the ISIMIP GGCMs. It is worth noting that yield changes predicted by Persephone v1.0 response functions are consistent with work examining maize site data. Namely, a 2014 site-specific model comparison study by the AgMIP-Maize team found irrigated and rain-fed maize yield changes in response to a local temperature +6 • C of similar values to the Persephone v1.0 range of responses (see Fig. 3 of Bassu et al., 2014). The HadGEM2-ES RCP8.5 local growing season temperature 2071-2100 change map for irrigated maize used to drive the Persephone v1.0 response functions is shown in Fig. 13, and it is worth noting that many major producers of maize see temperature increases of at least +6 • C. Further, recent analysis of freeair carbon dioxide enrichment (FACE) experiments and crop model results suggests that maize primarily benefits from high [CO 2 ] during drought, indicating that models of the effects of [CO 2 ] fertilization on irrigated maize (and rain-fed maize during non-drought periods) may be overly beneficial (Durand et al., 2018). This suggests that the more pessimistic irrigated maize yield changes predicted by the C3MP sites and therefore Persephone v1.0 are more consistent with sitespecific crop models and FACE experiments than they are with the ISIMIP GGCM range of results. While it would be ideal to have GGCM results from more GGCMs and more recent model versions for comparison, such results are not yet public. This discrepancy between the results of site-specific crop models and FACE experiments versus GGCMs supports the call in Leakey et al. (2012) for further investigation to understand regional and system-specific variation in [CO 2 ] response. Figure 14 includes a spatial comparison of the Persephone v1.0 low, mean, and high yield changes for irrigated maize (analogous to Fig. 11) with the range of ISIMIP responses in each grid cell. Specifically, maps of the minimum, median, and maximum irrigated maize yield change across the ISIMIP GGCMs are plotted in each grid cell; no individual GGCM would produce any of these maps. As noted above, the Persephone range of yield changes in each grid cell is generally more pessimistic than the ISIMIP range, but there does appear to be spatial consistency in terms of response strength in several regions between the Persephone v1.0 range and the ISIMIP range. C3MP, and therefore the Persephone v1.0 projections, capture a strong temperature dependence and a lesser response to precipitation (particularly for irrigated crops). Because warmer temperatures are nearly universal in the HadGEM2-ES RCP8.5 projection (Fig. 13), there is limited irrigated crop response to precipitation changes, and [CO 2 ] response for maize is small among the mechanistic models that are more prominent in C3MP than in the GGCMs; there is nothing but yield decreases in the Persephone projection. In the ISIMIP range of GGCMs, there are models that are more positively responsive to precipitation and in the C 4 -photosynthesis maize crop, so wetter conditions and/or higher [CO 2 ] are much more beneficial in the ISIMIP maximum map . Figure 15 presents the same analysis for a production group that Persephone v1.0 matches the ISIMIP global average range well: rain-fed wheat. For reference, the HadGEM2-ES RCP8.5 local growing season temperature and precipitation projections for rain-fed wheat are included in Fig. 10b. Again, there is noticeable spatial consistency in response strength between the Persephone v1.0 range and the ISIMIP range. For wheat, the C3MP models and therefore the Persephone v1.0 projections are in closer agreement with the ISIMIP GGCMs on C 3 -photosynthesis [CO 2 ] response and water limitations in many regions. Additionally, the harvested area mask used for rain-fed wheat includes many more regions that are limited by cool baseline temperatures and thus stand to gain from warmer conditions than the regions considered for irrigated maize. Put together, these observations indicate that both Persephone v1.0 and ISIMIP are capable of the large gains in the optimistic maximum model response scenario. Together with Fig. 14, this suggests that the Persephone v1.0 response functions are spatially consistent with the ISIMIP range of yield changes even when the global average ranges may disagree.

Conclusions and discussion
We have presented the Persephone emulator framework that results in three v1.0 response functions to emulate a range of crop yield changes in response to future CTW changes for 25 production groups. The response functions are inexpensive to evaluate, open doors to new feedback loops between society and the natural environment (Fig. 1), and represent multiple models and farming systems. The Persephone v1.0 response functions agree well with the underlying C3MP training data and are rapid to evaluate, with in-sample performance metrics being particularly strong for the mean response in each production group. The rapid evaluation time of the response functions, relative to a global gridded crop model, is extremely important given that models such as GCAM are designed to be run rapidly to trace the impacts of future scenarios (at most hours per scenario). The GCAM model development team prioritizes staying on this order of computation time, even for the planned experiments outlined in Sect. 2.1, because it results in a nimble, flexible model  that allows multiple iterations for probability, uncertainty, and process understanding. In addition to the good quantitative agreement of our response functions with all C3MP crop-irrigation-latitude ensembles, we further evaluated our mean response function heuristically, finding that the mean responds to changes in CTW as one would expect for comparisons across C 3 /C 4 photosynthesis mechanisms, rain-fed versus irrigated management, and latitude zones. Finally, the range of v1.0 yield changes were evaluated against a variety of past global gridded crop modeling, site-specific crop modeling, and/or empirical studies for many of the production groups and found to be consistent.
As a result of the culling methods outlined in Sect. 2.2, 575 C3MP sites are used for training the Persephone func-tions. These sites account for many major crops where they are typically grown, as well as a wider variety of crops than has been examined in past studies. One key observation is that, if one were only concerned with capturing the mean response, any of the functional forms examined for µ CTW (Eqs. A1-A5) in the Appendix A would be excellent, with all five forms featuring in-sample NRMS < 0.2 for all production groups (Table 1). The challenge is in defining a pair of response functions, µ and σ , to characterize a range of uncertainty across C3MP site responses to each CTW change. It should also be noted that such a range of uncertainty will capture only a portion of the uncertainty in response in national and multi-national GCAM units. The Persephone framework may be used with future more spatially dense data sets to characterize this uncertainty more fully.
The modeling choices made in this study introduce a variety of caveats. Foremost, it is likely that future versions of Persephone response functions, trained on different data sets, will almost certainly result in different response functions. Yet, this work has shown that the Persephone framework is well suited for this kind of problem, and that the v1.0 response functions developed from the C3MP data emulate those data well. They also perform reasonably well on heuristic metrics and in comparison to other crop modeling efforts. Another important caveat is that GCAM operates on 5-year time steps. Therefore, the response functions in this work only characterize yield responses to long-term, local Earth system state changes. Capturing interannual variability and responses to abrupt weather shocks is an area that may form future phases of this research. We note that this is a more difficult task, given that year-to-year variability depends on many factors that tend to average out over longer terms (e.g., intra-seasonal variability such as heat waves or dry spells). Additionally, this work did not account for differing nitrogen application rates across different C3MP sites. Nitrogen data are included in the C3MP archive, but the sites are heavily biased to high nitrogen application (this is likely a function of the most commonly simulated sites also being systems with higher input investment). There are also a number of sites with no recorded nitrogen information, which were kept for this study. With so few sites featuring low ni-trogen application rates, we considered examining the nitrogen dimension of yield responses to be its own intellectual challenge reserved for future work, the methods of which will likely be determined by the desired use. Similarly, exploration of forming production groups based on different crop groups, different latitudinal zones, Köppen-Geiger, or temperature zones would require trivial changes, limited only by the number of sites available to sort into different production groups. Finally, it is worth noting that any emulator is only as good as the data upon which it is trained. If crop modeling studies that provide data to an emulator do not account for real-world behaviors, the emulator will not capture such behaviors either.
For clear analysis in this paper, we have presented results for the functional form combination that performed best at the cross-validation experiments described in the Appendix A for the most production groups. Therefore, one remedy to the presence of ensembles with poorer emulator performance on in-sample metrics (Table 1) would be to use different functional forms for each production group to create a more globally optimal set of response functions. These are laid out for each production group in Table A10, along with the in-sample performance of the group-specific optimal functional forms. Some analysis with these productiongroup-specific optimal models is included in Sects. 3.1.1 and 4.1. The data processing, emulator fitting, and analysis techniques presented in this paper are agnostic of the actual functional forms used for µ CTW and σ CTW as long as they are A. Snyder et al.: Persephone v1.0 linear in parameters. Varying functional form by production group will only require different inputs to the Persephone R functions, not refitting of any parameters.
The most immediate future work involving Persephone v1.0 will be to fully implement the feedback loop sketched in Fig. 1. Specifically, using GCAM to examine the broad impacts of a sustained drought, hypothetical or emergent from the feedback loop sketched in Fig. 1, would be an excellent application of this yield change emulator. Once the illustrated links have been implemented and full runs of the loop have been timed, future development may take place. In addition to the exploration of the nitrogen dimension of yield response and allowing response functional form to differ by production group, Persephone version 2 may incorporate other predictors as data are available, explore more dynamic feature selection algorithms for functional form selection for µ CTW and σ CTW such as L1 regularization (which favors sparse models), and/or be trained with data sets that may be released in the future. Which of these is explored next will depend on the outcomes of the initial full feedback loop studies with GCAM. This study represents the first vital, necessary step in better identifying a pathway in which society can develop with balanced consideration of the natural environment and managed environments like agriculture through connecting Persephone and GCAM.
Code and data availability. Software implementing this technique is available as an R package released under the GNU General Public License. Full source can be found in the project's GitHub repository (https://github.com/JGCRI/persephone and https://doi.org/10. 5281/zenodo.1415487;Snyder, 2018). Release version 1.0.0 of the package was used for all of the work in this paper.
The data and analysis code for the results presented in this paper are archived at https://doi.org/10.5281/zenodo.1414423 (Snyder et al., 2018).

Appendix A: Model selection and performance
We fit the likelihood presented in Eq. (1) with five different functional forms for µ CTW (Eqs. A1-A5) and two different functional forms for σ CTW (Eqs. A6 and A7), resulting in data from a total of 10 emulator models (each with different likelihoods based on µ CTW , σ CTW ) to compare to the C3MP data set.
The five functional forms for µ were selected intentionally. The first (Eq. A1) is a second-order Taylor polynomial approximation of mean yield response. Equation (A2) is the functional form for mean response used in Ruane et al. (2014), differing from the second-order Taylor polynomial by only one third-order CTW interaction term, a 10 . Equations (A3) and (A4) continue to build up from the secondorder Taylor polynomial, examining the impacts of adding third-order CTW interaction terms and the impacts of adding pure third-order CTW terms, respectively. Finally, Eq. (A5) is the full third-order Taylor polynomial, a flexible approximation for many complicated functions. The two functional forms for σ (Eqs. A6 and A7) are simply the second-and third-order Taylor polynomials: C3MP: µ CTW = a 1 T + a 2 ( T ) 2 + a 3 W + a 4 ( W ) 2 + a 5 C + a 6 ( C) 2 + a 7 T W + a 8 T C + a 9 W C + a 10 T W C (A2) cross: µ CTW = a 1 T + a 2 ( T ) 2 + a 3 W + a 4 ( W ) 2 + a 5 C + a 6 ( C) 2 + a 7 T W + a 8 T C + a 9 W C + a 10 T W C + a 11 ( T ) 2 W + a 12 ( T ) 2 C + a 13 T ( W ) 2 + a 14 T ( C) 2 + a 15 ( W ) 2 C + a 16 W ( C) 2 (A3) pure: µ CTW = a 1 T + a 2 ( T ) 2 + a 3 W + a 4 ( W ) 2 + a 5 C + a 6 ( C) 2 + a 7 T W + a 8 T C +a 9 W C + a 10 ( T ) 3 + a 11 ( W ) 3 + a 12 ( C) 3 (A4) cubic: µ CTW = a 1 T + a 2 ( T ) 2 + a 3 W + a 4 ( W ) 2 + a 5 C + a 6 ( C) 2 + a 7 T W + a 8 T C We selected the model presented in the paper from the 10 combinations above based on leave-one (CTW test)-out cross-validation experiments to estimate out-of-sample prediction error for each production group. We do also include the in-sample performance metric defined in Sect. 3.1 for a more complete picture of model performance for all 10 functional form combinations for all 25 production groups (Tables A1-A9).
First, to test each model's validity and robustness at predicting yield changes for CTW values not included in the training data for each group, we ran leave-one-out crossvalidation experiments (Gelman et al., 2014) to analyze the performance of each model for each production group. For each group separately, one CTW test datum was withheld and the model was fit on the remaining 98 CTW tests. Then, the mean, high, and low response functions resulting from the model were evaluated on the C3MP site data for the withheld test. This process was repeated withholding each CTW test, and the results were averaged resulting in an RMSE measure of performance for each of the mean, high, and low response functions. Leave-on-out cross validation used in this way answers the question: for a particular production group and model, on average, how do the emulated (mean, high, and low) yield changes compare against the C3MP (mean, high, and low) yield changes for CTW values not in the training set?
The Latin hypercube design of the C3MP sensitivity tests lends confidence to this leave-one-out exercise because the cross validation has covered the full space of CTW combinations. The results are summarized in Fig. A1: each row represents the average leave-one-out cross-validation RMSE measures for each functional form across all production groups for the high, low, or mean response functions, and then the average across all three (total, bottom row Fig. A1). We find that cubic µ CTW , quadratic σ CTW performs the best at this cross-validation experiment for the highest number of ensembles across the three response functions we defined in Eq. (8) (that is, the high, low, and mean response functions). We repeat these calculations for each production group separately (rather than averaging across production groups) to determine the group-specific optimal functional form, listed in Table A10 for each group.
Because cubic µ CTW , quadratic σ CTW performs the best at out-of-sample error measurements for the highest number of ensembles across mean, low, and high response functions, and is quite good (though not the best) at in-sample error measurements (Table 1), this is the form used throughout the body of the paper as the most broadly optimal functional form combination. We particularly value performance on the cross-validation (out-of-sample error) experiments because most CTW changes that may arise in application are likely to differ from the 99 C3MP tests. We also repeat the in-sample measurement of error presented in Sect. 3.1 for all functional form combinations. These results are summarized in Tables A1 to A9, and we find that, purely based on the in-sample measurements, cubic µ CTW , cubic σ CTW (Table A9) is the best functional form for the most production groups. Specifically, it only performs poorly for one crop: rain-fed wheat in the midlatitudes. However, it is very poor for that important ensemble. The in-sample performance information from these tables is included in Table A10 for each production-group-specific optimal functional form combination.
Appendix B: C3MP baseline yield estimate functional forms As mentioned in Sect. 2.1.1, the eight different functional forms used to relate site-specific output yield in response to input CTW values are presented here in Eqs. (B1)-(B8). Each functional form was used with each specific C3MP site's data in order to provide a best estimate of baseline yield for that site. The form with the smallest root mean square er- ror across the 99 tests for the site is the one used to provide a best estimate of baseline yield. This best estimate of baseline yield is used to convert the C3MP output yields at the site to percent changes in yield from baseline for emulator training.