Validation of terrestrial biogeochemistry in CMIP6 Earth system models: a review

. The vital role of terrestrial biogeochemical cycles in inﬂuencing global climate change is explored by modelling groups internationally through land surface models (LSMs) coupled to atmospheric and oceanic components within Earth system models (ESMs). The sixth phase of the Coupled Model Intercomparison Project (CMIP6) provided an opportunity to compare ESM output by provid-ing common forcings and experimental protocols. Despite these common experimental protocols, a variety of terrestrial biogeochemical cycle validation approaches were adopted by CMIP6 participants, leading to ambiguous model performance assessment and uncertainty attribution across ESMs. In this review we summarize current methods of terrestrial biogeochemical cycle validation utilized by CMIP6 participants and concurrent community model comparison studies. We focus on variables including the dimensions of evaluations, observation-based reference datasets, and metrics of model performance. To ensure objective and thorough validations for the seventh phase of CMIP (CMIP7), we recommend the use of a standard validation protocol employing a broad suite of certainty-weighted observation-based reference datasets, targeted model performance metrics, and comparisons across a range of spatiotemporal scales.


Introduction
The terrestrial biosphere is presently responsible for sequestering about a quarter of anthropogenic carbon emissions, substantially reducing the severity of ongoing climate change . The future capacity of the terrestrial biosphere to sequester CO 2 emissions is uncertain due to non-linear feedbacks such as CO 2 fertilization, growing season extension in cold-limited regions, enhanced heterotrophic respiration, and potentially other feedbacks, as well as environmental and physiological constraints such as moisture availability, nutrient limitations, and stomatal closure (Fleischer et al., 2019;Green et al., 2019;Xu et al., 2016;Wieder et al., 2015). Earth system models (ESMs) are a means to simulate past, present, and future terrestrial biogeochemical cycles, examine the influence of changes in climate and atmospheric CO 2 concentration on CO 2 uptake, explore feedbacks and limitations, and estimate anthropogenic carbon emissions compatible with avoiding a given threshold in global temperature change. ESMs simulate global exchanges of matter and energy through the coupling of land, atmospheric, and oceanic components. Through concerted efforts, successive generations of ESMs have improved in terms of spatiotemporal resolution, complexity, and process representation (Anderson et al., 2016). Despite this progress, terrestrial biogeochemical cycles remain a major source of uncertainty in future climate projections Lovenduski and Bonan, 2017). This uncertainty stems from limited process understanding, lacking observational constraints, inherent cycle variability, temporal discrepancy between forcings and responses (Sellar et al., 2019;Ciais et al., 2013), and uncertain stock quantifications (Ito et al., 2020;Wieder et al., 2015), which together compound uncertainty within models. Among models, this uncertainty is amplified by artefacts in the form of inconsistent model structure, boundary conditions, forcing datasets, experimental protocols, and benchmarking observational datasets, which is magnified by the increasing number, diversity, and complexity of ESMs . Subsequently, a study on uncertainty in projected terrestrial carbon uptake based upon 12 Coupled Model Intercomparison Project phase 5 (CMIP5) ESMs indicated that uncertainty stemming from model structure may be 4 times greater than uncertainty from different emission scenarios and internal variability (Lovenduski and Bonan, 2017). Some progress has been made in addressing the large uncertainty associated with the terrestrial biogeochemistry in ESMs, as comparison of the carbon-climate and carbon-concentration feedback among ESMs participating in the sixth phase of CMIP (CMIP6) by Arora et al. (2020) shows a reduced model spread amongst models that included a nitrogen cycle, which provided a realistic constraint on photosynthesis in the context of elevated atmospheric CO 2 concentration. However, the spread in estimated feedback parameters across ESMs overall has not been significantly reduced from CMIP6 relative to CMIP5 (Arora et al., , 2013).
To answer scientific questions regarding climate change, the CMIP was initiated in 1995 by the World Climate Research Programme's (WCRP) Working Group of Coupled Modelling (WCRP, 2020). The CMIP designates standard experimental protocols, model output formats, and model forcings to diagnose climate change variability, predictability, and uncertainty following various scenarios within a multi-model framework. CMIP6 began in 2013 with 3 years of planning and community consultation to address knowledge gaps, prior to the conduction of simulations and analyses in 2016 and onwards. Model validation in the context of CMIP consists of demonstrating sufficient agreement between model output data and historical observation-based reference data following model development and is a crucial process in model advancement. Such comparison facilitates model improvement by identifying model limitations in performance or sources of model-data uncertainty (Lovenduski and Bonan, 2017) and informs the weighting of different ESMs in influencing climate projections and policy (Eyring et al., 2019). CMIP6 specified detailed experimental protocols for modelling group participants to facilitate objective comparisons of the output of different models with common forcings (Eyring et al., 2016a).
Here we focus on validations of the stocks and biological fluxes of fully coupled ESMs and associated land surface model (LSM) releases from 2017 onwards with explicit terrestrial biogeochemical cycle representation contributed by participating CMIP6 modelling groups (hereafter participants; Table 1; Arora et al., 2020). Validations are analyzed in terms of variables included, spatiotemporal scales, reference datasets, and metrics of performance. Section 2 compares the methods of historical terrestrial biogeochemical cycle validation used by participants. Section 3 summarizes the methods used in community analyses of CMIP5 era models and provides a critique of these methods. A future outlook is presented in Sect. 4.

Participant methods of validating terrestrial biogeochemical cycles
To participate in CMIP6, participants had to submit four Diagnosis, Validation, and Characterization of Klima (DECK) experimental simulations, which included a control simulation with prescribed idealized pre-industrial (1850) forcing for at least 500 years to demonstrate stability in global climate and biogeochemical exchanges. Additionally, participants had to conduct historical simulations from 1850 to 2014 using designated CMIP6 forcings (available at https:// esgf-node.llnl.gov/search/input4MIPs/, last access: 8 February 2021) and initialization from the pre-industrial forcing control run (Eyring et al., 2016a). Each modelling group demonstrated stability in the global carbon cycle, with global net carbon exchange below the suggested limit of 0.1 Pg C yr −1 by Jones et al. (2016), while no suitable preindustrial simulation global nitrogen or phosphorus flux was specified for CMIP6, though these were generally below 2.0 Pg yr −1 (Ziehn et al., 2020). Each modelling group validated terrestrial biogeochemical cycle components for the historical simulation in a unique fashion, which is summarized below and detailed in Appendix A.

Variables included in validations
The number of terrestrial biogeochemical cycle variables evaluated against observation-based estimates by participants varied considerably (from 0 to 21), with a total of 38 unique variables evaluated by all participants combined. The variable validated most often was gross primary production (GPP), which was validated by all but one participant. The next nine most validated variables in descending order were soil carbon, the global land carbon sink, leaf area index (LAI), vegetation carbon, ecosystem respiration, global land-atmosphere CO 2 flux, surface CO 2 concentrations, total biomass, and burned area (Fig. 1). For a list of variable definitions, see Table 2. The majority of variables were validated by just one or two participants (Fig. 2). Danabasoglu et al. (2020) and Lawrence et al. (2019) validated a relatively extensive suite of variables with the International Land Model Benchmarking (IL-AMB) package version 2.1 (ILAMBv2.1; Collier et al., 2018, Fig. 3), including an explicit uncertainty analysis of the influences of interannual variability, forcing datasets, and model structure in the form of prescribed versus prognostic vegetation phenology. While no nitrogen cycle variable was validated by more than one group, soil N 2 O flux and total N 2 O emissions were evaluated by Hajima et al. (2020) and Lawrence et al. (2019), respectively.
A variety of spatiotemporal scales of these variables were considered in validations both within and among participants. Spatial scales consisted of site-level, model grid cell, degree of latitude, region, and global scales, with the latter being the most common across participants. Temporal Gross primary production (GPP) The quantity of CO 2 removed from the atmosphere by vegetation.
Net primary productivity (NPP) The quantity of CO 2 removed from the atmosphere by vegetation minus the quantity of CO 2 from autotrophic respiration.
Autotrophic respiration (AR) The quantity of CO 2 from cellular respiration in plants.

Ecosystem respiration (ER)
The quantity of CO 2 from autotrophic respiration and heterotrophic respiration.
Heterotrophic respiration (HR) The quantity of CO 2 from cellular respiration by heterotrophs.
Net ecosystem production (NEP) The quantity of CO 2 removed from the atmosphere by vegetation minus the quantity of CO 2 from autotrophic and heterotrophic respiration.
Net biome production (NBP) The net rate of organic carbon accumulation minus autotrophic and heterotrophic respiration and nonrespiratory losses from disturbance.
Net ecosystem carbon balance (NECB) The net rate of organic carbon accumulation in an ecosystem (independent of scale).  scales included daily, seasonal, annual, decadal, select periods, and long-term trends, accumulations, or averages over the whole historical simulation period from 1850 to 2014. For more detail on the spatiotemporal scales of validation used by each participant, readers should refer to Appendix A.
Dynamic variables such as LAI were subject to a detailed assessment, including annual maximum and minimum magnitude  and month , seasonality (Ziehn et al., 2020), and seasonal average, as well as global averages. GPP was also evaluated across a variety of scales, including in terms of the daily, seasonal, and annual magnitude on a plant functional type (PFT), spatial, and global basis against site-level observations (Vuichard et al., 2019), as well as globally in terms of functional relationships with temperature and precipitation (Swart et al., 2019) and the relative contribution of drivers of variation (Vuichard et al., 2019). Biomass and carbon stock variables were evaluated in terms of spatial distributions or global averages over the chosen time periods, often on a decadal scale . Global vegetation and soil carbon turnover times were also evaluated for selected time periods (Delire et al., 2020;Lawrence et al., 2019).

Reference datasets
For variables which were validated by more than one modelling group, such as GPP, a variety of observation-based reference datasets were utilized. For example, across participants several different GPP reference datasets were used (Table 3), though most participants utilized model tree ensemble (MTE) machine-learning upscaled ground eddycovariance, meteorological, and satellite observation-based estimates of GPP from Jung et al. (2011). Interestingly, one group, the Centre National de Recherches Météorologiques (CNRM; Delire et al., 2020), used a more recent Fluxnetbased GPP dataset (FluxComv1; Jung et al., 2016; and further used the mean of 12 products therein. CNRM and the Institut Pierre Simon Laplace (IPSL, Vuichard et al., 2019) were the only groups to include a comparison to site-level GPP observations. A variety of reference datasets were also utilized for the second most frequently validated variable, soil carbon (Table 4), spanning a 12-year publication range (Batjes, 2016;Global Soil Data Task Group, 2002). Several participants used more than one reference dataset for evaluation of soil carbon depending upon regional or global focus, such as the Northern Circumpolar Soil Carbon Database provided by Hugelius et al. (2013) for mid-high latitudes, while global soil carbon estimates were obtained from Batjes (2016), , Todd-Brown et al. (2013), andFAO (2012). While biomass and carbon stocks were predominantly compared to present-day observations, Delire et al. (2020) used records from the Global Database of Litterfall Mass and Litter Pool Carbon and Nutrients database, which extends from 1827 to 1997 (Holland et al., 2015).

Statistical metrics of model performance
A variety of statistical metrics were used to quantify model performance in simulating historical variables in comparison   including three different climate forcing data products (individual columns) and two forms of model structure (column groups). CLM5SP denotes MODIS-prescribed (Zhao et al., 2005) vegetation phenology, while CLM5GBC denotes prognostic phenology. Climate forcing data products include WATCH/WFDEI from Mitchell and Jones (2005) to observations, though chosen metrics were more consistent than selected variables. The comparison of simulated and observation-based averages calculated over space and time was the most common metric used by all but two participants (Table 5). The next most commonly used metric was root-mean-squared error (RMSE), followed by bias (simulated − observed) on a spatial or global basis. Evaluations of global accumulations, seasonal phase, seasonal maximum and/or minimum, and global totals were also used. The Taylor diagram, which geometrically combines spatiotemporal correlation, standard deviation, and root mean square (rms) difference (Taylor, 2001), was used to summarize model performance by three participants Collier et al., 2018;Goll et al., 2017). The correlation coefficient (r) was also used by three participants (Swart et al., 2019;Mauritsen et al., 2019;Goll et al., 2017). RMSE normalized by the standard deviation of observations (NRMSE) was only used by Swart et al. (2019), while the coefficient of determination (r 2 ) was only used by Mauritsen et al. (2019). A targeted metric in the form of dissected mean squared deviation (Kobayashi and Salam, 2000), the sum of squared bias, squared difference between standard deviations, and lack of Table 3. The source for gross primary production (GPP) data referenced by each modelling group for ESM or LSM simulations. Adjacent contributions from the same modelling group are banded in a common fashion for readability. LSM-focused validations by each modelling group are presented with the associated ESM in brackets.   correlation weighted by standard deviation, was used to distinguish model sources of error by Vuichard et al. (2019). In addition to quantitative metrics, the qualitative aspects of simulations were compared to observational reference data, such as in demonstrating source or sink behaviour over time (Danabasoglu et al., 2020) or in visual comparison of spatial distribution maps.

Community methods of validating terrestrial biogeochemical cycles
A variety of software and projects have been dedicated to the communal evaluation of ESM (Gleckler et al., 2016) and LSM performance (Kumar et al., 2012;Gulden et al., 2008), with CMIP6-era collaborative efforts including the Earth System Model Evaluation Tool version 2 (ESMValToolv2.0; Eyring et al., 2016b) and ILAMBv2.1 (Danabasoglu et al., 2020;Lawrence et al., 2019;Collier et al., 2018). Both ES-MValToolv2.0 and ILAMBv2.1 are openly available tools for the evaluation of a variety of model output against reprocessed observations (https://www.esmvaltool.org/, https: //pypi.org/project/ILAMB/, last access: 1 May 2021; Eyring et al., 2020Eyring et al., , 2016bCollier et al., 2018). The observationbased reference datasets for each are displayed in Table 6. For the ESMValToolv2.0 dataset, re-processing for compatible comparison in space and masking of missing observations is detailed in Righi et al. (2020). The analysis of the land carbon cycle in ESMValToolv2.0  is based upon the approach of Anav et al. (2013) for considering long-term trends, interannual variability, and seasonal cycles. A variety of tailored model performance metrics are available with ESMValToolv2.0 . The relative space-time root-mean-square deviation (RMSD) indicates model success relative to the multi-model median in simulating the seasonal cycle of key variables that was originally detailed in Flato et al. (2013) and allows simultaneous comparison to more than one observational reference for each simulated variable (where available). ES-MValTool2.0's AutoAssess function provides a highly resolved model performance evaluation for 300 individual variables, originally developed by the UK Met Office. Further, land cover can be comprehensively evaluated with ESM-ValToolv2.0 in terms of area, mean fraction, and bias on a L. Spafford and A. H. MacDougall: Validation of terrestrial biogeochemistry in CMIP6 regional and global basis, accommodating different model representations of land cover. ILAMBv2.1 was used to validate terrestrial biogeochemical cycle components in CESM2 (Danabasoglu et al., 2020) and CLM5 ( Fig. 3; . ILAMBv2.1 was also used to demonstrate the absolute and relative performance of Dynamic Global Vegetation Models (DGVMs) within several iterations of the Global Carbon Project (Friedlingstein et al., , 2019Le Quéré et al., 2018). In addition to variables presented in Table 6, functional relationships between these variables and temperature and precipitation are provided for validation purposes in ILAMBv2.1. ILAMBv2.1 employs a weighting system to assign scores to observation-based datasets, which encompasses certainty measures, spatiotemporal-scale appropriateness, and process implications. In computing statistical model performance scores, ILAMBv2.1 acknowledges how reference observations represent discontinuous constants in time and space. For example, if a reference dataset contains average information across a span of years, the annual cycle of such a dataset is assumed to be undefined and is therefore not used as a reference. The calculation of averages over time in ILAMBv2.1 addresses spatiotemporally discontinuous data by performing calculations over specific intervals for which data are considered valid. For each variable evaluation, ILAMBv2.1 generates a series of graphical diagnostics, including spatial contour maps, time series plots, and Taylor diagrams (Taylor, 2001), as well as statistical model performance scores, including period mean, bias, RMSE, spatial distribution, interannual coefficient of variation, seasonal cycle, and long-term trend. These scores are then scaled based upon the weighting of reference observation-based datasets, and for multi-model comparisons they are presented across metrics and datasets to provide a single score.

Critique of validation approaches
While standard protocols were used by participants for historical simulations in CMIP6, no standard protocol in terms of variables evaluated, reference data, performance metrics, or acceptable performance threshold was adopted for terrestrial biogeochemical cycle validation. The validation of particular variables by different participants occasionally employed the same datasets, though in many cases inconsistent reference datasets were used for the same variable, and the spatial and temporal dimension of validations was often distinct. This contrasts with other works employing multiple models such as the Global Carbon Project (Friedlingstein et al., , 2019Le Quéré et al., 2018), which provides explicit validation criteria, such as simulating recent historical net land-atmosphere carbon flux within a particular range and within the 90 % confidence interval of specified observations. The stringency of such criteria must be carefully chosen to acknowledge the role of observational uncertainty and uncertainty stemming from potential model tuning to forcing datasets. The use of different validation approaches impedes the comparison of performance across models; however, it also provides a diverse collection of example methods.

Variable choice
A comprehensive validation of a process-based model should include all simulated interacting variables for which a reliable empirical reference is available. Improvement in the simulation of one variable through altered parameters, structure, or algorithms may translate into degradation for other variables, which would be otherwise obscured in a restricted variable analysis Ziehn et al., 2020;Lawrence et al., 2019). Given the scope of CMIP6 publications in demonstrating model improvements relative to previous versions and the results of CMIP6 experiments, it is understandable that most participants validated a few select variables and that more extensive validations may be in preparation. Essential climate variables (ECVs) prioritized for land evaluation in the ESMValToolv2.0 included GPP, LAI, and NBP (Eyring et al., , 2016b, as these variables intersect with other ESM components in matter and energy exchanges (Reichler and Kim, 2008). In contrast, LAI and NBP were not as frequently validated as GPP by CMIP6 participants ( Fig. 1), though the third most validated variable, the global land carbon sink, is equivalent to NBP minus land use emissions. The most common variable chosen for validation by participants was GPP, which is advantageous as it represents a crucial carbon cycle flux. GPP designates the quantity of CO 2 removed from the atmosphere and assimilated into structural and non-structural carbohydrates during photosynthesis by vegetation, part of which is later respired back to the atmosphere. This quantity is limited by nutrient availability, light, soil moisture, stomatal response to atmospheric CO 2 concentration, and other environmental factors (Davies- Barnard et al., 2020) and is the largest carbon flux between the land biosphere and atmosphere (Xiao et al., 2019). Over-or under-estimations of GPP can lead to biases in carbon stocks, which are exacerbated through time ). An emergent ecosystem property that integrates a variety of influential model processes is carbon turnover time calculated as the ratio of a long-term average total carbon stock compared to GPP or NPP Yan et al., 2017;Carvalhais et al., 2014). Carbon turnover times can be the source of pervasive uncertainty within ESMs, and their misrepresentation can lead to long-term drifts in carbon stocks, fluxes, and feedbacks (Koven et al., 2017). The evaluation capacity of turnover times was seldom utilized by CMIP6 participants, despite soil carbon being a relatively commonly validated variable. Many CMIP5 models were found to under-estimate turnover times both globally and on a latitudinal basis , while two participants here, Delire et al. (2020) and Lawrence et Table 6. Select observation-based reference dataset sources for ESMValToolv2.0  and ILAMBv2.1 , including net biome production (NBP), leaf area index (LAI), land cover (LC), gross primary production (GPP), net ecosystem exchange (NEE), soil carbon (SC), vegetation carbon (VC), ecosystem carbon turnover (ECT), vegetation biomass (VB), and burned area (BA). Note that vegetation carbon is dependent upon vegetation biomass.

Variables
ESMValToolv2  2008) BA - Giglio et al. (2010) al. (2019), reported over-estimated carbon turnover times despite demonstrating improvement from previous models. Another approach to validation that combines high-level variables and re-parameterization efforts is the assessment of functional relationships or emergent constraints, such as the relationship between GPP or turnover times and temperature, moisture, growing season length, and nutrient stoichiometry (Danabasoglu et al., 2020;Swart et al., 2019;Anav et al., 2015;McGroddy et al., 2004). Physically interpretable emergent constraints can aid in identifying model components that are particularly influential for climate projections (Eyring et al., 2019), such as the temperature control on carbon turnover in the top metre of soil in cold climates (Koven et al., 2017), GPP responses to soil moisture availability (Green et al., 2019), or regional carbon-climate feedbacks (Yoshikawa et al., 2008). With the goal of realistically simulating Earth system processes to develop informed predictions of future climate, large-scope variables that inherit uncertainty from an amalgamation of processes are often prioritized for validation. Several participants focused on comparing simulated long-term trends or accumulations in global land carbon fluxes to observation-based estimates from the Global Carbon Project (Friedlingstein et al., 2019;Le Quéré et al., 2018. While this summation approach can signal a large bias (Eyring et al., , 2016bReichler and Kim, 2008) and reduce the effect of sub-scale noise, it does not identify sources of model error or may even obscure model error. For example, if simulated land-atmosphere carbon flux from the pre-industrial era to the 2010s is found to concur with observation-based estimates, this could be due in part to compounding underlying biases which neutralize one another over time Yoshikawa et al., 2008), or alternatively suitable global averages may be susceptible to antagonistic regional biases, such as between the tropics and northern high latitudes. Plant-functional-type-level evaluations, such as that of the maximum rate of RuBisCO carboxylation and canopy height by Lawrence et al. (2019), demonstrate the performance of underlying variables in influencing large-scale carbon fluxes and stocks. Several participants included latitudinal-scale evaluations (Delire et al., 2020;Hajima et al., 2020;Mauritsen et al., 2019), which are both informative and readily comparable to observations. A comprehensive validation should therefore encompass a range of scales and a variety of variables to demonstrate model performance not only for producing suitable averages or accumulations but also for representing processes.

Reference datasets
Satellite-based remote sensing of terrestrial biogeochemical components has been conducted for almost 50 years, since the launch of the Landsat satellite in 1972 (Xiao et al., 2019;Mack, 1990), while field-based experimental and observational data has been available since at least the early 19th century (Holland et al., 2015). Just in terms of satellite-based observational data products, there are currently thousands of examples available (Waliser et al., 2020). Despite this seem-ing wealth of observational data and observation-based data products, the implementation of a variety of observationbased references for validation of terrestrial biogeochemical cycles within ESMs and LSMs is challenging for several reasons. These include the specifications required for direct model output comparison, inconsistent spatial and temporal domains, missing observations, logistical biases, and large uncertainty in global-scale data products (Delire et al., 2020;Collier et al., 2018;Lovenduski and Bonan, 2017). The incomplete coverage of observational datasets in space-time dimensions has led to significant bias in comparisons of model data and observation data previously (de Mora et al., 2013), though this has not been generally discussed in validation exercises by CMIP6 participants. Observational discontinuity has been addressed previously in a LSM validation by Orth et al. (2017), which excluded daily observation reference averages when more than 1 h of data from a 24 h period was missing, and through exclusion criteria in Collier et al. (2018). For example, the compilation of satellite observations to develop a LAI data product with one observation-based estimate every 15 d by Zhu et al. (2013) for monthly average or seasonal extrema comparison would require careful consideration for comparison to model averages computed from more resolved output. In an analysis of how sparse historical measurements compare to continuous model output, Another approach to overcome spatial discontinuity may be to compare broad gradients or trends in a given variable with reference datasets, such as regional and functional-type trends in forest carbon stocks rather than a global summation or average , to investigate whether or not the model captures enduring spatial patterns. In addition, some observational methods may invoke inherent bias, such as satellite-based observation estimates of LAI in mid to high latitudes seasonally under-estimating LAI due to snow cover, leading to ambiguous model performance assessment (Ziehn et al., 2020;Liu et al., 2018). Observational uncertainty can be addressed by applying a weighting to reference datasets as in ILAMBv2.1, as well as by using more than one observational reference when available Sellar et al., 2019;Collier et al., 2018). Careful consideration of spatiotemporal discontinuity in observations and inherent bias is warranted in future validations, which can be achieved through filtered exclusions, site-level comparisons, pattern comparison, certainty weighting of datasets, and the use of more than one reference dataset.
The globally gridded 1982-2008 GPP data product frequently used for GPP validation by CMIP6 participants was developed from machine learning upscaling of site-level eddy-covariance Fluxnet observations with model tree ensembles based on remote sensing vegetation indices, meteorological data, and land use (Jung et al., 2011). Observationbased estimates of GPP can be obtained through satellitederived vegetation indices such as the normalized difference vegetation index (NDVI;Phillips et al., 2008) and solarinduced chlorophyll fluorescence , in addition to ground-based monitoring of turbulent CO 2 fluxes with the eddy covariance technique (Jung et al., 2009). Logistical challenges with eddy covariance-based techniques of estimating GPP can result in potentially extensive data gaps and systematic omission of diel cycle observations (Rodda et al., 2021;Erkkilä et al., 2018;Jung et al., 2011;Lasslop et al., 2010Lasslop et al., , 2008Desai et al., 2008). For example, in a study of eddy-covariance monitoring of CO 2 flux, Jonsson et al. (2008) report only 34 % data coverage of a growing season period, of which 54 % was discarded as it did not demonstrate energy balance closure. To address these challenges Jung et al. (2011) employ Bowen ratio corrections of energy imbalance (Twine et al., 2000), quality control criteria to exclude sites with more than 20 % missing observations, and monthly averages to alleviate noise. Where NEE observations are missing in space, over time driver relationships can be utilized for multi-decadal extrapolation, though only 38 % and 60 % of Fluxnet sites with less than 15 years of observations capture mean conditions and interannual variability of drivers sufficiently well for this extrapolation as of 2015, and most have been operating for less than 5 years (Chu et al., 2017). While the site-level observations from Jung et al. (2011) originate from 212 sites, presenting a globally extensive network, regions with an important contribution to overall carbon stocks and fluxes are underrepresented (Jung et al., 2020), and even the recent global Fluxnet GPP data product by Jung et al. (2016) has just 14 tropical and 5 Arctic sites. GPP observations from Fluxnet products currently do not account for fire and waterbody emissions, which prompts regional and interannual bias (Jung et al., 2020). Despite these caveats, such global-scale data products provide a critical resource to the CMIP community in conducting model validation , and the relatively common use of Jung et al. (2011) for validations by CMIP6 participants coincidentally reduces the influence of observational contradiction (Xie et al., 2020;Anav et al., 2015). Site-level GPP evaluation with observations from the tropics by Delire et al. (2020) and Vuichard et al. (2019) demonstrates a strategic approach to addressing the representation bias in GPP validations. Site-level evaluations often benefit from a wealth of available information, including spatially consistent meteorological forcing, and avoid the influence of spatial extrapolation error. While Jung et al. (2011) do not provide uncertainty measures, several forms of uncertainty are explicitly presented for the Fluxnet2015 dataset by Pastorello et al. (2020). Therefore, the utility of Fluxnet GPP data products could be improved with standardized use by participants in conjunction with other independent data products, select site-level evaluations, explicit uncertainty quantifications, and improved ecological representation in underlying site-level data.

Statistical metrics and validation approaches
Several participants relied primarily on residual-based metrics, such as bias (simulated-observed), for validation of terrestrial biogeochemical cycle model components. On a spatial basis, bias can identify significant regional over-or under-estimations of a given variable. However, the attribution of model error from global maps of bias can be ambiguous, as the displayed bias is the combined result of different forms of uncertainty, including model structural representations, unforced variability, and spatial disagreement Lovenduski and Bonan, 2017;Koch et al., 2016). Such residual-based metrics may not indicate how well the model would perform in simulating future conditions beyond the current contextual envelope of observations (Gulden et al., 2008) and neglect the contribution of uncertainty from observations. These limitations are considerable in the context of ESMs and LSMs as tools for predicting terrestrial biogeochemical function. A more contextualized bias assessment is the Wilcoxon test as applied by Swart et al. (2019) to filter insignificant bias. In a LSM evaluation, Orth et al. (2017) provides an observationally robust bias assessment by subtracting mean seasonal cycles from each grid cell and correlating the resulting anomalies between observation-based datasets and model output. In addition, RMSE normalized by the mean or standard deviation of the observed quantity (NRMSE) contextualizes the difference between simulated and observed variable quantities in terms of the magnitude or inherent variability of the variable of interest (Swart et al., 2019;Fan et al., 2018), which is advantageous for variables such as GPP with large interannual variability.
Beyond these, a variety of targeted model skill metrics have been published for process-based modelling that provide detailed assessments of different forms of model uncertainty Orth et al., 2017;Eyring et al., 2016b;Koch et al., 2016;Law et al., 2015;Kumar et al., 2012;Taylor, 2001;Kobayashi and Salam, 2000). Mean squared deviation, the sum of squared bias, squared difference between standard deviations, and lack of correlation weighted by standard deviations, presented by Kobayashi and Salam (2000), was used by Vuichard et al. (2019). This metric is readily applicable to the objective validation and improvement of mechanistic models, as its dissection allows for the accurate attribution of different sources of model errors. Additionally, a Taylor diagram (Fig. 4, Taylor, 2001) conveys several dimensions of model error, allows for the concise simultaneous display of variables and models, was utilized in the evaluation of BCC-AVIM2 , NORESM2 (Seland et al., 2020), and several LSMs and ESMs by Anav et al. (2015), and is incorporated into IL-AMBv2.1 . The Taylor diagram was designed for simultaneous performance comparison of several simulated variables and serves as a concise and informative validation tool. Caution is warranted however in the evaluation of fully coupled model output due to the inability of fully coupled models to reproduce the timing of internal climate variability phenomena such as El Niño-Southern Oscillation (ENSO; Flato et al., 2013). While the magnitude of observed and simulated internal climate variability may be statistically consistent, bias, RMSE, and NRMSE assessments of fully coupled model output should encompass decadal or longer periods to address the influence of temporal mismatches in simulated internal climate variability relative to observational records. Alternatively, as offline simulations can be directly forced with historical observation data, the output of offline simulations can be validated on a finer temporal scale.
For example, Taylor diagrams of global and regional NPP by Anav et al. (2015) demonstrate a consistent low correlation and high standard deviation for model estimates in the tropics that is substantially reduced in the extratropics and globally, warranting focus on tropical NPP. The validation process of terrestrial biogeochemical cycles and dissection of model uncertainty may also be enhanced through offline simulations or models with intermediate complexity as these allow for a greater replication of simulations with different initializations, forcing datasets, and model configurations due to their computational affordability Umair et al., 2018;Orth et al., 2017). Offline simulations also reduce the potential for incidental compounding error from coupling components, though this leads to an under-estimation in uncertainty for equivalent fully coupled simulations. Replicate simulations with different initial conditions, such as those performed by Danabasoglu et al. (2020), allow for the attribution of uncertainty from unforced variability, which accounted for half of the inter-model spread in key variables previously Eyring et al., 2019). In addition, replicate simulations with different forcing datasets can indicate the role of forcing uncertainty (Wei et al., 2018), which Lawrence et al. (2019) found to be significant. Further, sensitivity analyses or perturbed parameter analyses involving replicated simulations with one or more variables fixed as performed by Hajima et al. (2020) and  illuminate structural uncertainty. The use of wellestablished statistical and model performance metrics in addition to strategic simulations facilitates a detailed analysis of model uncertainty.

Moving forward
A model can only be expected to perform well in simulating past, present, and future conditions if it is provided with  Taylor (2001). The standard deviation of model fields is displayed as the radial distance from the origin and can be visually compared to the observed (reference) point, which is indicated by a circle on the abscissa. The correlation between the model and observed fields decreases with azimuthal angle (dotted lines), and the root-mean-square difference between the model and observed fields is proportional to the distance from the reference point (quantified by dashed contours).
high-quality observational constraints. Lovenduski and Bonan (2017) suggest that obtaining accurate observations and improving process understanding should take precedence over reducing model spread, as constraining models to uncertain observations does not improve their predictive capacity, and even models that agree well with observations can prompt divergent projections. Several of the challenges inherent in implementing observations in model validation and development are now a key focus of the Observations for Model Intercomparison Project (obs4MIPs; Waliser et al., 2020), which strives to deliver long-term, high-quality observations from international efforts. An obs4MIPs meeting held in preparation for CMIP6 with more than 50 satellite data and global climate modelling experts identified underutilized observation products and recommended new efforts to address knowledge gaps, including an expanded inventory of datasets, higher-frequency datasets and model output, more reliable uncertainty measures, more datasets tailored to offline simulations, and more explicit metadata for modellers (Waliser et al., 2020). Further, recent satellite missions such as the Sentinel2A twin satellite launched in 2015 have unprecedented spectral, spatial, and temporal resolution combinations, which can be used alone or in combination with other satellite-based observations to provide higher-fidelity references for validation (Vafaei et al., 2018). Field experimental data provide unique insight as to the functional responses of vegetation to elevated CO 2 concentration (Goll et al., 2017), temperature change (Richardson et al., 2018), moisture availability (Williams et al., 2019;Hovenden and Newton, 2018), and nutrient limitations (Fleischer et al., 2019) outside the current context of observations. The integration of experimental findings in evaluations is challeng-ing given the environmentally rapid application of treatments and limited ecological representation (Nowak et al., 2004), though sophisticated relationship-based techniques such as used by Goll et al. (2017) alleviate some of these issues. Increased collaboration between field and model researchers in designing experiments could improve the applicability of future experiments. In addition, enhanced field and remote sensing collaboration would allow for higher-fidelity calibrated global data products (Orth et al., 2017;Verger et al., 2016). Thus future CMIPs will benefit from forthcoming collaborations and reference data products tailored for validation.
A standard protocol for the validation of terrestrial biogeochemical variables would facilitate a thorough and objective assessment of model performance within and among participants. Further, the collective merits and limitations of the current variety of approaches utilized by participants could be consolidated and addressed in a comprehensive protocol. In the interest of model improvement and weighting for predictions, validation with an exhaustive assessment of variables across a range of spatiotemporal scales against all available peer-recommended observation-based references is optimal. Dataset-specific expertise is also warranted to correctly implement reference datasets in these evaluations (Waliser et al., 2020;Liu et al., 2018). The procurement and application of reference datasets within validations is demanding for participants, considering their presiding obligation to continuously refine model components and participate in CMIP with computationally expensive ESM simulations. Additionally, the universal inclusion of often overlooked processes such as moisture limitation, nitrogen and phosphorus cycles, dynamic vegetation, prognostic leaf phenology, and natural disturbance regimes should be a priority focus for participants in developing diagnostic models as these processes are highly influential on terrestrial biogeochemistry and physics Fleischer et al., 2019;Piao et al., 2019;Wieder et al., 2015;Achard et al., 2014;Richardson et al., 2013;Heimann and Reichstein, 2008;Tucker et al., 1986), and the omission of these processes contributes to widespread bias (Green et al., 2019;Anav et al., 2015). While outside the focus of this review, equal attention should be applied to the physical components of terrestrial biogeochemical cycles, including explicit representation of permafrost and riverine carbon transport dynamics. In fact, a study including four CMIP5 ESMs found that soil moisture variability prompted variability in terrestrial NBP on the order of gigatonnes, with non-linear responses to both moisture scarcity and excess (Green et al., 2019). Further, many of the merits and limitations of the validation approaches discussed herein apply to the validation of these physical components as well.
The communal use of software packages such as ESM-ValToolv2.0 and ILAMBv2.1  could liberate time and computational resources for modellers. In addition, this would standardize validation protocols, address long-overlooked model uncertainty distinctions , and avoid terminology confusion (Lovett et al., 2006). While these packages include extensive suites of peer-verified observational reference datasets and performance metrics, these packages do not yet include evaluation of nitrogen and phosphorus cycles, which may be due to the combined scarcity of observations, upscaling approaches, and model representations Zhu et al., 2018;Wieder et al., 2015;Zaehle and Dalmonech, 2011). The strategic situation of nitrogen, phosphorus, and soil moisture monitoring, which coincides with current Fluxnet sites (Jung et al., 2020), could provide highfidelity insight into nutrient and environmental limitations on GPP, coherent turnover time assessments, and broadly applicable functional relationships to facilitate upscaling. The co-situation of multiple observational monitoring objectives at Fluxnet sites would enhance the utility of each site-level dataset and alleviate errors due to spatiotemporal inconsistencies between datasets in both performing evaluations and developing large-scale data products. Following increased collaboration between empirical and modelling communities to strategically expand observations and their inclusion in a comprehensive evaluation software, the CMIP-designated use of such software would standardize, conserve, and augment validation efforts.

Conclusion
The current generation of ESMs that participated in the sixth phase of the Coupled Model Intercomparison Project adopted a broad assortment of approaches to validate historically simulated terrestrial biogeochemical cycles. Validations which encompassed a large suite of variables over a range of spatiotemporal scales in conjunction with informative model performance metrics demonstrated relatively comprehensive assessments of model performance. Across CMIP6 participants, the variety of variables, reference datasets, evaluation dimensions, and statistical metrics utilized make general assessments of model performance in simulating terrestrial biogeochemistry challenging. To address this inconsistency and alleviate the immense responsibilities of participants, we recommend the designation of a standard validation protocol for CMIP participants, which is consolidated in an opensource software (such as the Earth System Model Evaluation Tool version 2 (ESMValToolv2.0) or the International Land Model Benchmarking version 2.1 (ILAMBv2.1)). This protocol should utilize a comprehensive suite of certaintyweighted observational reference datasets, targeted model performance metrics, and comparisons across a range of spatiotemporal dimensions. The insights from a universally adopted validation protocol would precisely attribute model uncertainty and aid in directing future observational efforts to improve crucial process understanding within terrestrial biogeochemical cycles.

L. Spafford and A. H. MacDougall: Validation of terrestrial biogeochemistry in CMIP6
Appendix A: Technical summary of validation activities by participants

A1 CSIRO
The Australian Community Climate and Earth System Simulator (ACCESS-ESM1.5) was developed by the Australian modelling group Commonwealth Scientific and Industrial Research Organization (CSIRO) for participation in CMIP6 (Ziehn et al., 2020). The land surface model used in ACCESS-ESM1.5 is the Community Atmosphere Biosphere Land Exchange (CABLE) model (Kowalczyk et al., 2013(Kowalczyk et al., , 2006 Ziehn et al. (2011). Simulated LAI magnitude and seasonality was compared to global and regional estimates based on Moderate Resolution Imaging Spectroradiometer (MODIS) and Advanced Very High-Resolution Radiometer (AVHRR) data from Zhu et al. (2013). Simulated surface CO 2 concentrations in terms of mean seasonal cycle amplitude and timing were compared to four NOAA/Earth System Research Laboratory station flask samples provided in the GLOBAL VIEW data product (GLOBAL VIEW-CO 2 , 2013).

A2 BCC
The Beijing Climate Centre (BCC) participated in CMIP6 with the BCC Climate System Model version 2 with medium resolution (BCC-CSM2-MR; Wu et al., 2019). Land biogeochemistry in BCC-CSM2-MR was simulated through the BCC Atmosphere and Vegetation Interactive Model version 2.0 (BCC-AVIM2; Li et al., 2019). While Wu et al. (2019) did not provide validation results for terrestrial biogeochemistry from BCC-CSM2-MR, a detailed validation with offline simulations of BCC-AVIM2 was provided by Li et al. (2019) using the Princeton global forcing dataset (Sheffield et al., 2006). Li et al. (2019) compared the annual peak month, seasonal average, and global average of LAI to satellite observations from 1982 to 2010 by the AVHRR (Myneni et al., 1997). Surface carbon fluxes including GPP and ER were compared to upscaled Fluxnet observations from Jung et al. (2011). Aboveground biomass was compared to Avitabile et al. (2016), while global total biomass carbon from 1990 to 2010 was compared to Saatchi et al. (2011). The performance of BCC-AVIM2 in estimating each of these variables was assessed through bias, RMSE, and Taylor diagram metrics (Taylor, 2001).

A3 CCCma
The Canadian Centre for Climate Modelling and Analysis (CCCma) participated in CMIP6 with the CCCma fifthgeneration Earth System model (CanESM5; Swart et al., 2019). The land biogeochemistry component of CanESM5 is the Canadian Terrestrial Ecosystem Model (CTEM; Boer, 2010, 2005). Swart et al. (2019) compared CanESM5 simulated GPP from 1982 to 2009 with observation-based estimates from Jung et al. (2009) in terms of geographical distribution, zonal averages, and functional relationships with air temperature and precipitation. Several metrics were used to illustrate CanESM5's performance in simulating GPP, including the correlation coefficient (r) between simulated and observed spatial patterns in GPP, bias (simulated − observed), and root-mean-squared error (RMSE) normalized (NRMSE) by observed spatial standard deviation. Global average decadal land-atmosphere CO 2 flux and net cumulative atmosphere-land CO 2 flux from 1850 to 2014 were compared to observation-based estimates from the Global Carbon Project (GCP; Le Quéré et al., 2018), the latter by subtracting cumulative land use emissions from cumulative land carbon uptake.

A4 Climate and Global Dynamics Laboratory (NCAR)
The Community Earth System Model version 2 (CESM2) was developed by the Climate and Global Dynamics Laboratory at the American National Centre for Atmospheric Research (NCAR) for participation in CMIP6 (Danabasoglu et al., 2020). The land component of CESM2 is the Community Land Model Version 5 (CLM5; Lawrence et al., 2019). Danabasoglu et al. (2020) and Lawrence et al. (2019) comprehensively assessed terrestrial biogeochemical cycle variable outputs from simulations of CESM2 and CLM5, respectively, with the International Land Model Benchmarking package (ILAMBv2.1; Collier et al., 2018), including an explicit analysis of interannual variability with a threemember ensemble from different pre-industrial control initialization years (CESM2), the influence of forcing through the use of three forcing datasets (CLM5), and the influence of prescribed versus prognostic vegetation phenology (CLM5). ILAMBv2.1 utilizes a suite of data products weighted by certainty. These included vegetation biomass (tropical: Saatchi et al., 2011;global: Kellndorfer et al., 2013;Blackard et al., 2008), burned area (Giglio et al., 2010), CO 2 concentrations, GPP (Fluxnet: Lasslop et al., 2010;Global biosphereatmosphere flux: Jung et al., 2010), LAI (AVHRR: Myneni et al., 1997;MODIS: de Kauwe et al., 2011), global net ecosystem carbon balance (GCP: Le Quéré et al., 2014;Hoffman et al., 2014), net ecosystem exchange (Fluxnet: Lasslop et al., 2010;GBAF: Jung et al., 2010), NBP, ER, NEP (equivalent to GPP-ER), soil carbon (Harmonized World Soil Database (HWSD): Todd-Brown et al., 2013;Northern Circumpolar Soil Carbon Database (NCSCDV22): Hugelius et al., 2013), and 10 functional relationships. Lawrence et al. (2019) also compared the relationship between apparent soil carbon turnover times versus air temperature to observation-based estimates developed from HWSD, NCSDV22, and MODIS. Lawrence et al. (2019) additionally compared maximum monthly LAI and average V cmax25 (maximum RuBisCO carboxylation rate at 25 • C and high irradiance per unit leaf area in µmol m −2 s −1 ) at the PFT-level for the year 2010 to Zhao et al. (2005) and Kattge et al. (2009), respectively, as well as canopy height for the year 2005 for tree PFTs to Simard et al. (2011). Nitrogen cycle variables evaluated by Lawrence et al. (2019) with observational references included nitrogen deposition (Fowler et al., 2013), symbiotic fixed nitrogen , soy fixed nitrogen (Herridge et al., 2008), crop nitrogen fertilization (Fowler et al., 2013), denitrification (Fowler et al., 2013), hydrologic nitrogen losses (Fowler et al., 2013), fire losses (Lamarque et al., 2010), and N 2 O flux (Fowler et al., 2013). Different climate forcing datasets and anthropogenic forcings were utilized to examine the effect of climate, CO 2 emissions, land use change, and nitrogen additions on carbon cycle variables and three CLM versions to partition total uncertainty into forcing and model contributions using fixed-effect analysis of variance, with additional PFT-level analysis and prognostic versus prescribed vegetation and carbon cycling for CLM5. In addition to the ILAMB validation, Danabasoglu et al. (2019) and Lawrence et al. (2019) compared simulated global net biome production (NBP) and cumulative land carbon sink to observation-based estimates from 1850 to 2014 from the GCP for 1959-2014(Le Quéré et al., 2016, and from Hoffman et al. (2014) for 1850-2010. Observation-based GPP, ER, and NEP (equivalent to GPP-ER) comparison data were obtained from Jung et al. (2011Jung et al. ( , 2010. Vegetation carbon was evaluated relative to observations for the tropics from Saatchi et al. (2011) and the GEOCARBON and Global-Carbon datasets Avitabile et al., 2016;Santoro et al., 2015). ILAMBv2.1 results from these investigations comprised a collection of statistical metrics for annual mean, bias, relative bias, RMSE, seasonal cycle phase, spatial distribution, and interannual variability, in addition to functional relationships. Bonan et al. (2019) provides a detailed analysis of the role of climate forcing uncertainty in influencing CLM5 output.

A5 CNRM and CERFACS
The Centre National de Recherches Météorologiques (CNRM) and Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique (CERFACS) contributed the CNRM-ESM2-1 to CMIP6 . The land component in CNRM-ESM2-1 is the Interaction Soil-Biosphere-Atmosphere with Total Runoff Integrating Pathways with carbon cycling (ISBA-CTRIP; Delire et al., 2020). Séférian et al. (2019) compared CNRM-ESM2-1 simulated annual minimum and maximum LAI to AVHRR observa-tions from 1998 to 2011 . The simulated land carbon sink from 1982 to 2010 was compared to a multimodel estimate by Huntzinger et al. (2013). These validations included spatial bias, global mean bias, RMSE, and spatial error correlation between CNRM ESM versions to distinguish model sources of error. Delire et al. (2020) validated offline ISBA-CTRIP simulated GPP, NPP, autotrophic respiration, and ER from 1980 to 2010 with estimates with the mean of 12 products from the FluxComv1 dataset (Jung et al., 2017, and a satellite product from the Numerical Terradynamic Simulation Group: MODIS17A3 (NASA LP DAAC, 2017; Zhao et al., 2005), with reference autotrophic respiration calculated as the mean of FLUXCOM GPP products minus MODIS17A3 NPP. Simulated crop NPP for the 2000s was compared to the Harvested Area and Yield dataset (Monfreda et al., 2008). Carbon use efficiency (CUE), calculated as the ratio of NPP to GPP, was evaluated with observation and model-based estimates for tropical evergreen forest from Malhi et al. (2009), and tropical deciduous, temperate, and boreal forests from He et al. (2018), Zhang et al. (2014), and theoretical derivations by Amthor (2000). Simulated heterotrophic respiration was evaluated with a data product from Hashimoto et al. (2015) which combines global and Amazonian in situ observations from the Soil Respiration database (Bond-Lamberty et al., 2018) and Malhi et al. (2009), respectively, and global gridded climate data. The simulated burned area and fire CO 2 emissions were compared to Mouillot and Field (2005) and the Global Fire Emissions Database version 4.1 . Simulated dissolved organic carbon yield leached from soil was compared to model results of Mayorga et al. (2010) and observations by Dai et al. (2012). Simulated global aboveground biomass carbon was validated with observation-based estimates from 1993-2012 from Liu et al. (2015), regional datasets for mid-high northern latitudes from Thurner et al. (2014), and tropical datasets from Saatchi et al. (2011) andBaccini et al. (2012). Simulated aboveground litter carbon was compared to site measurements from 1827 to 1997 from the Global Database of Litterfall Mass and Litter Pool Carbon and Nutrients (Holland et al., 2015). Simulated belowground organic carbon was validated with the HWSDv1.2 (FAO, 2012). Vegetation turnover time calculated as biomass divided by NPP and soil turnover time calculated as the combination of litter and soil carbon divided by NPP for 1984-2014 were also computed for validation. Delire et al. (2020) also used local scale Fluxnet data from Joetzjer et al. (2015) to assess ISBA-CTRIP performance. Each variable was validated through comparison of the distribution of simulated and observation-based estimates of annual averages, as well as zonal averages, and the spatial distribution of the bias (simulated minus observed). Average simulated carbon fluxes from 2006-2015 and the trend from 1960-2015 were also compared to observation-based estimates from the GCP (Le Quéré et al., 2018) and Ciais et al. (2019).

A6 IPSL
The Institut Pierre Simon Laplace (IPSL) participated in CMIP6 with IPSL-CM6A-LR, the land component of which was the ORCHIDEE land surface model version 2.0 (Boucher et al., 2020;Hourdin et al., 2020). Boucher et al. (2020) evaluated IPSL-CM6A-LR simulated average annual carbon fluxes from 1990 to 1999 and 2009 to 2018 resulting from land cover change, fossil fuel emissions, the terrestrial sink, and total net land fluxes (the terrestrial sink minus land cover change) with observation-based estimates from the 2019 GCP (Friedlingstein et al., 2019). Vuichard et al. (2019) validated ORCHIDEE simulated GPP in terms of the mean annual, seasonal, and daily simulated GPP on a PFT, spatial, and global basis against observations from 78 Fluxnet sites (Vuichard and Papale, 2015) and the globalscale MTE-GPP product based upon upscaled Fluxnet observations for 1982-2008 (Jung et al., 2011). RMSE and dissected mean-squared deviation (MS; the sum of squared bias, squared difference between standard deviations, and lack of correlation weighted by standard deviations, based on Kobayashi and Salam, 2000) metrics were used to attribute different sources of uncertainty. The relative contribution of drivers of variation in present-day GPP were also assessed, including seasonal variability in NO x and NH x deposition and the leaf carbon to nitrogen ratio. The sensitivity of OR-CHIDEE output to model structure in terms of MSE was also analyzed on a global and PFT-level basis, including fixed and dynamic fully coupled carbon-nitrogen cycles.

A7 GFDL
The American National Oceanic and Atmospheric Administration Geophysical Fluid Dynamics Laboratory (GFDL) participated in CMIP6 with GFDL-ESM4.1 (Dunne et al., 2020), in which land biogeochemistry is simulated with the GFDL Land Model version 4.1 (LM4.1; Shevliakova et al., 2021). Dunne et al. (2020) validated GFDL-ESM 4.1's simulated spatial distribution of seasonal amplitude in CO 2 concentrations and interannual variability of CO 2 concentrations compared to NOAA Global Monitoring Division sites with at least 15-year records (Global Monitoring Laboratory, 2005) using RMSE and the coefficient of determination (r 2 ), as well as the correlation coefficient (r) for individual sites.  Quéré et al., 2018). Observationbased data products used for other comparisons included (1) the spatial pattern, gradient across biomes, magnitude, seasonality, and length of growing season of global gridded GPP from 1986 to 2005 from Fluxnet (Jung et al., 2011); (2) the magnitude and density of forest carbon stock (Kindermann et al., 2008); and (3) global and regional soil organic carbon from the harmonized soil property values for broad-scale modelling (WISE30Sec; Batjes, 2016), the northern high latitudes from the Northern Circumpolar Soil Carbon Database version 2 (NCSCDv2; Hugelius et al., 2013), and an estimate from Todd-Brown et al. (2013) (Gruber and Galloway, 2008), present-day BNF Herridge et al., 2008), annual unperturbed state terrestrial N 2 flux (Gruber and Galloway, 2008), and change in annual soil nitrous oxide emissions from 1850 to 2014 relative to a model comparison study by Tian et al. (2018).

A9 MPI
The Max Planck Institute for Meteorology (MPI) Earth System Model version 1.2 Low Resolution (MPI-ESM1.2-LR) was developed for participation in CMIP6 (Mauritsen et al., 2019) by the MPI, the land component of which is JS-BACH3.2 (Goll et al., 2017). Mauritsen et al. (2019) compared the spatial variability and zonally averaged density of MPI-ESM1.2-LR simulated soil and litter carbon stocks to estimates by Goll et al. (2015) developed from the Harmonized World Soil Database. The simulated evolution in global total land carbon from 1850-2013 was compared to estimates provided by Ciais et al. (2013). Additionally, simulated land use change carbon emissions from 1860 to 2013 were compared to estimates provided by Ciais et al. (2013). In a model description paper of JSBACH version 3.10, which was set to be used in CMIP6, Goll et al. (2017) compare JSBACH3.1 simulated present-day NPP to Ito (2011), while simulated present-day biomass carbon was compared to Saugier and Roy (2001) and Ciais et al. (2013). The simulated response of NPP and GPP to increases in atmospheric CO 2 were compared to experimentally observed estimates from four free-air CO 2 enrichment (FACE) experiments (Norby et al., 2005) and an intramolecular isotope distribution examination of plant metabolic shifts (Ehlers et al., 2015). Simulated present-day biomass nitrogen was compared to Schlesinger (1997), while simulated present-day total nitrogen was compared to Galloway et al. (2013). Simu-lated values of pre-industrial (1850) and present-day leaching and BNF were compared to Galloway et al. (2013Galloway et al. ( , 2004, Vitousek et al. (2013), and short-term experimental results from a meta-analysis by Liang et al. (2016), while simulated present-day denitrification was compared to Galloway et al. (2013). Goll et al. (2017) also verified the simulated spatial variability in reactive nitrogen-loss pathways using a compilation of nitrogen-15 isotopic data (Houlton et al., 2015) with the statistical metrics r, RMSE, and Taylor score (Taylor, 2001).

A10 NCC
The Norwegian Earth System Model (NORESM2) was developed for participation in CMIP6 (Seland et al., 2020) by the Norwegian Climate Consortium (NCC), and is based on CESM2. As in CESM2, the land model in NORESM2 is CLM5 . The performance of NORESM2 was validated through a three-member ensemble of historical simulations from 1850 to 2014 with slightly varying initial conditions. Simulated carbon cycle variables that were compared to observation variables included GPP, soil carbon, and vegetation carbon from Jung et al. (2011) (Jung et al., 2011). The areal land cover of aggregated plant functional types (PFTs) was validated with satellite observation-based datasets from the European Space Agency Climate Change Initiative Land Cover data (Poulter et al., 2015) and the International Geosphere-Biosphere Programme (IGBP) Land Use and Cover Change project (Loveland et al., 2000) using the model year 2005. The coverage of PFTs were validated using these observation-based datasets as references both spatially and as a fraction of biomes based upon regions defined by Olson et al. (2006). The simulated vegetation carbon distribution was validated on a latitudinal basis with observationbased estimates from GEOCAROBON (Avitabile et al., 2016) and Saatchi et al. (2011), while the spatial distribution of soil carbon was validated with observation-based estimates WISE30sec (Batjes, 2016), IGBP-DIS (Global Soil Data Task Group, 2002), and Carvalhais et al. (2014). The magnitude of simulated global total soil carbon was compared to whole soil profile observation-based estimates from Carvalhais et al. (2014) and upper 2 m observation-based estimates from Batjes (2016). Cumulative carbon uptake and land use emissions from 1850 to 2014 was compared to observation-based estimates from the GCP (Le Quéré et al., 2018).
Data availability. The data used to generate Figs. 1 and 2 are openly available from CMIP6 model participant papers (see Table 1).
Author contributions. LS and AHMD both initiated the research and significantly contributed to the writing of the paper. LS conducted the analysis and wrote the original draft. AHMD provided supervisory support.
Competing interests. The contact author has declared that neither they nor their co-author have any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.