Using the International Tree-Ring Data Bank (ITRDB) records as century-long benchmarks for land-surface models

1 Department of Ecological Sciences, VU University, 1081HV Amsterdam, the Netherlands. 5 2 Laboratoire des Sciences du Climat et de l’Environnement, IPSL, CNRS/CEA/UVSQ, 91191 Gif sur Yvette, France. 3 Instituto de Conservación Biodiversidad y Territorio, Universidad Austral de Chile, 5090000 Valdivia, Chile. 4 CSIRO Oceans and Atmosphere, Canberra, 2601, Australia. 5 Department of Geology & ESSIC, University of Maryland, MD 20742-4211, USA. 6 Dendro Sciences Group, Swiss Federal Research Institute WSL, Zürcherstrasse 111, CH-8903 Birmensdorf 10 7 School of Natural Resources and the Environment, University of Arizona, Tucson, USA 8 Laboratory of Tree-Ring Research, University of Arizona, Tucson, USA


2000)
, and account for a decoupling of photosynthesis and growth by the use of a labile carbon pool (Friend et al., 2019;Naudts et al., 2015;Zaehle and Friend, 2010). Plant water availability is accounted for through either simple transfer functions or more recently by accounting for the hydraulic architecture of the simulated trees (Bonan et al., 2014;Naudts et al., 2015). 120 (iii) Endogenous disturbances refer to within-stand resource competition and are being increasingly simulated in landsurface models albeit often by empirical approaches (Haverd et al., 2013;Moorcroft et al., 2001;Naudts et al., 2015). From a benchmarking point of view, simulating individuals of different size or cohorts within a single forest is essential to reproduce the sampling biases present in the ITRDB (see section 3 below). Chronic exogenous disturbances such as increasing atmospheric CO2 concentration (LaMarche et al., 1984) and N-deposition (Magnani et al., 2007) are also well-developed as 125 they are among the main purposes of using land-surface models. The effect of CO2 fertilization on photosynthesis is accounted for in the photosynthetic submodel (see above) whereas nitrogen dynamics are accounted for through static or dynamic stoichiometric approaches (Vuichard et al., 2019;Zaehle and Friend, 2010) . (iv) Although abrupt disturbances such as fires, pests and storms are increasingly being simulated by land-surface models 130 (Chen et al., 2018;Yue et al., 2014) these functionalities are at present of limited use for benchmarking against TRW data.
Abrupt disturbances are often simulated as stand-replacing disturbances and will, therefore, not be reflected in the simulated TRWs. Furthermore, the timing of such events largely depends on the simulated diagnostics, for example, fuel wood build-up, insect population dynamics, and soil moisture, which could strongly deviate from the observed timing in decadal to century long simulation periods. 135 (v) The final term in the aggregate tree growth model constitutes all processes and interactions between processes not previously accounted for in the land-surface model, and will make up the model error.
This aggregate tree growth model provides the conceptual basis for tree-ring standardization and climate signal extraction 140 methods used in dendrochronology (Briffa and Melvin, 2011;Cook and Kairiukstis, 1990), which rely on the assumptions that the sampled trees capture the relevant growth variability of the stand and that the contribution of each major driver can be https://doi.org/10.5194/gmd-2020-29 Preprint. Discussion started: 14 April 2020 c Author(s) 2020. CC BY 4.0 License. statistically identified as either signal or noise. If tree-rings are formed, observed TRW records cannot, however, be fully decomposed in the absence of metadata because several drivers do not leave a unique fingerprint in the tree-ring record.
However, size effect and climate sensitivity have a much larger contribution to TRW than the other processes (Hughes et al., 145 2011). Nevertheless, alternative approaches have been proposed to attribute TRW to its major drivers (Stine, 2019).
In addition to accurate process representation, the model will need to be driven by historical climate, atmospheric CO2 concentrations and N-deposition. In general, commonly-used century-long climate reanalyses such as NCEP (Kalnay et al., 1996), 20CR (Compo et al., 2011), and CERA-20C (Laloyaux et al., 2018) are based on the assimilation of instrumental observations 150 in climate simulations and are thus independent from tree-ring-based observations or other proxy data. Nevertheless, the accuracy of the reanalyses decreases proportional to data availability, particularly in remote areas with a low density and temporal depth of meteorological stations, and given that local climate effects may have contributed to the TRW, it might be desirable to align the reanalysis with present day site-specific observations where they exist (Ols et al., 2018). When LSMs are forced by actual climate observations, reproducing the observed climate sensitivity in tree rings would both facilitate and 155 add credibility to the land-surface simulation -if forcing, LSM and TRW models are all realistic and unbiased.
Given the above, land-surface models that intend to use TRWs as a benchmark should at the minimum simulate: (1) dynamic plant phenology, (2) size-dependent growth, (3) differently-sized trees within a stand, and (4) responses to chronic exogenous environmental changes (Fig. 2). Whereas responses to chronic exogenous environmental changes are the reason LSMs exist 160 and are therefore to some extent accounted for by all current LSMs, size-dependent growth and size differentiation within a stand are at present only accounted for in few land surface models, for example, CLM (ED) (Fisher et al., 2015), ORCHIDEE (Naudts et al., 2015), and LPJ-GUESS (Smith, 2001). Revision 5698 of the ORCHIDEE model meets the aforementioned minimum requirements and therefore will be used in this study. 165 2.2. Challenges of using ITRDB data as a long-term benchmark https://doi.org/10.5194/gmd-2020-29 Preprint. Discussion started: 14 April 2020 c Author(s) 2020. CC BY 4.0 License.
A typical record in the ITRDB consists of TRW measurement of increment cores from tens of individual trees from the same site and species. Each record may have a different starting date, ending date and thus length (Fig. 3 a and b). If a core reaches the centre of the trunk, annual tree diameter can be reconstructed (Bakker, 2005). Even then diameter reconstruction may come 170 with some uncertainty because trunks are not perfectly round. If the core does not contain the centre of the trunk, which is often the case for large trees, rings near the pith will be missed adding uncertainty to the diameter and age reconstruction (Briffa and Melvin, 2011).
Despite of known biases in the ITRDB, it can be used to extract information useful for LSMs. The predominant sampling 175 design in the ITRDB targets the presumably oldest trees, which should give the longest time series and are therefore be most useful to reconstruct the climate variability prior to instrumental records. The ITRDB is thus likely to overrepresent large trees (Brienen et al., 2012;Nehrbass-Ahles et al., 2014) relative to the population demographics at time of sampling. This big-tree bias makes the ITRDB unsuitable to upscale growth of individual trees to larger spatial domains, i.e., stand, forest or the region (Babst et al., 2014a;Nehrbass-Ahles et al., 2014) but does not affect the value of the ITRDB archive for documenting 180 individual tree growth as long as tree size and dominance effects are explicitly considered. Although the model-data comparison cannot -without additional data-correct for the big-tree bias in the ITRDB, models that simulate multiple tree diameter classes may accommodate this bias by comparing the largest simulated diameter class with the observed ITRDB treering records (Fig. 3a).

185
Another bias related to the ITRDB sampling design comes from the fact that the growth rate of trees within a cohort differs between individuals (Melvin, 2004;Zuidema et al., 2011) resulting in slowly and fast-growing trees within the same cohort ( Fig. 3b). Slow-growing trees tend to live longer than fast-growing trees in the same cohort (Mencuccini et al., 2005;Schulman, 1954). Records of TRW are thus likely to underestimate the mean tree growth of a stand in long-passed centuries as fastgrowing trees would have died off before the samples were taken (Brienen et al., 2012).
The other challenge of using ITRDB data is rooted in the difference between the observed and simulated forest structure. Treering datasets are composed by cores of individuals from different cohorts (Fig. 3b). Comparing these data against simulations requires the model to be individual based or to align TRW records by age (Fig. 3a).

195
Given the above, only part of the information contained in TRW records can be used for benchmarking if their sampling protocol is poorly described or not rigorously enforced. A model-data comparison cannot correct for these biases but consistency between modelled and observed tree-rings for a stand under study can be improved by making use of virtual trees.
As the output of LSMs must be coupled to realistic models for TRW, careful post-processing of the ITRDB data may be required to become suitable benchmarks for land surface models. The proposed benchmarks, thus, use three different 200 definitions of virtual trees each of which addresses different aspects: (1) the average virtual tree of a stand aligned by tree age is calculated as the time series for the average ring width after aligning the age of the individual trees (Fig. 3a). Age-aligned tree-ring widths are widely used to calculate a statistic known as the mean regional curve of the sampled stand (Briffa and Melvin, 2011). This assumes that common drivers regardless of time, exceed the signal from local and individual differences in tree growth (see subsection 4.2 (i)); 205 (2) the average virtual tree of a stand aligned by calendar year is calculated by ordering individual tree records by calendar year (Fig. 3b) and for each year the average observed diameter is calculated. Alignment by calendar year thus reflects the real temporal evolution of the stand. This virtual tree can be used to cope with the challenge from difference in forest structure between the simulation and the observation by compiling a representative and comparable tree with the simulated tree (see subsection 4.2 (ii)); 210 (3) the largest virtual tree of a stand is calculated after aligning individual trees by their age (Fig. 3c). The recommendation to remove the age trend from tree-ring records (Cook et al., 1995) confirms the assumption underling the alignment by age, i.e., that size dependent age exceeds the growth trends due to long-term environmental changes. Subsequently, the age-aligned TRWs can be used to compile a virtual fast-growing tree which has the maximum observed diameter of all trees for a given tree age. The virtual fast-growing tree thus gives a better idea of the true mean tree growth in old stands. (see subsection 4.2 ORCHIDEE builds on the concept of meta-classes to describe vegetation distribution. By default, it distinguishes 13 metaclasses (one for bare soil, eight for forests, two for grasslands, and two for croplands). Each meta-class can be subdivided into an unlimited number of plant functional types (PFTs). When simulations make use of species-specific parameters and age classes, several PFTs belonging to a single meta-class will be defined. Biogeochemical and biophysical variables are calculated 245 for each PFT or groups of PFTs (e.g. all tree PFTs in a pixel drawn from the same description of soil hydrology, known as a soil water column).
ORCHIDEE is not an individual-based model but instead represents forest stand complexity and stand dynamics with diameter and age classes. Each class contains a number of individuals that represent the mean state of the class. Therefore, each diameter 250 class contains a single modelled tree that is replicated multiple times and distributed at random throughout the PFT area. At the start of a simulation, each PFT contains a user-defined number of stem diameter classes. This number is held constant throughout the simulation, whereas the diameter boundaries of the classes are adjusted to accommodate for temporal evolution in the stand structure. By using flexible class boundaries with a fixed number of diameter classes, different forest structures can be simulated. An even-aged forest, for example, is simulated with a small diameter range between the smallest and largest 255 classes. All classes will then effectively belong to the same stratum. An uneven-aged forest is simulated by applying a large range between the diameter classes. Different diameter classes will therefore effectively represent different strata. The limitations of this approach become apparent when the tree-ring width data and simulation are compared by calendar year as the model does not track individual trees. Although the model tree itself is well-defined, the amount of radiation it receives (and therefore the amount of carbon produced) is determined by the statistical distribution of all model trees in that grid cell. 260 Vegetation structure is then used for the calculation of the biophysical and biogeochemical processes of the model such as photosynthesis, plant hydraulic stress, and radiative transfer model. The r5698 version of ORCHIDEE, which is the version used in this study, combines the dynamic nitrogen cycle of ORCHIDEE r4999 (Vuichard et al., 2019;Zaehle and Friend, 2010) and the explicit canopy representation of ORCHIDEE r4262 (Chen et al., 2016;Naudts et al., 2015;Ryder et al., 2016). It is 265 one of the branches of the ORCHIDEE model and it was further developed from Naudts et al. (2015) and Vuichard et al. (2019) (Text S1), parameterized, and tested to simulate tree-ring widths, in order to meet the aforementioned minimum requirements of simulating the carbon, water, energy, and nitrogen cycle, while accounting for size-dependent allocation for three diameter classes within a forest stand.

270
In this study we use a climate data product from a merged and homogenized gridded dataset developed for modelling purpose over the 20 th century, i.e., CRU-NCEP (Viovy, 2016), such that observed tree-ring widths for the past century can be used to evaluate the skill of the land-surface model ORCHIDEE r5698 to simulate radial tree growth. A detailed overview of earlier developments (Krinner et al., 2005;Naudts et al., 2015;Vuichard et al., 2019) that resulted in the emerging capability of ORCHIDEE r5698 to match the aggregate tree growth model ( Fig. 2) is given in the supplementary material (Text S1). 275

Simulation set-up for the data-model comparison
We selected 10 forest sites from the ITRDB for comparison with simulations, based on the following criteria: (1) located in Europe; (2) composed of Pinus sylvestris L.; (3) between 100 to 150 years old; and (4) ranging from Spain to Finland and therefore thought to largely cover the climatic conditions encountered across the species range of P. sylvestris within Europe. 280 The location of the selected forests is detailed in Table S2. ORCHIDEE r5698 was run for 10 individual pixels, each containing one of the selected sites. An observed time series of atmospheric CO2 concentrations was used (Keeling et al., 1996) and all forest were considered to be unmanaged. Every simulation started from a 300-year long spinup required to bring the simulation to equilibrium with respect to the slow carbon and nitrogen pools in the soil. The start year and the length of each simulation was set to match the site observations. A more detailed description of the test case and the ORCHIDEE model is given in the 285 Supplementary Information (Text S2).
The model run was repeated four times for every site to obtain simulated tree ring widths for four different model configurations. The first configuration is the most basic configuration in this test (hence its label 'basic'): sapling recruitment is not accounted for, the nitrogen cycle is open and the parameter quantifying resource competition within a stand (f_power, 290 for more details see Text S1 Eq. 15 and 16) was fixed at 2. The second configuration (labelled 'power') was a copy of the first but used a modified expression for resource competition (Eq. 16 in Supplementary Information) was used ('power'). The third configuration built on the second but also accounted for recruitment ('recru'). Finally, the fourth configuration used a closed and dynamic nitrogen cycle, recruitment, and the modified within-stand competition f_power ('Ndyn').

295
The configuration with an open nitrogen cycle prescribed the leaf carbon-to-nitrogen-ratios with the average leaf carbon-tonitrogen-ratio obtained from the 'Ndyn' simulation following the method proposed by Vuichard et al. (2019). This ensured that the differences came from the C-N feedbacks rather than from differences in leaf nitrogen. In the absence of bias corrected climate forcing, the simulations cycled through the climate forcing from 1901 to 1910 for the years prior to 1901. From 1901 onwards, climate forcing matching the simulation years was used. The four configurations with increasing model functionality 300 are summarized in Table 2.

Reference data of ecologically sampled TRW data
A "fixed-plot approach" TRW database was established under the BACI project (http://www.baci-h2020.eu/) by archiving 48 datasets from multiple research efforts in Europe (Klesse et al., 2018). To be retained and archived, all trees larger than 5.6 305 cm in diameter had to be sampled in a 10 to 40 m radius plot of which the exact radius depended on stand density . The BACI archive is, therefore, considered to be free from the big-tree selection bias which is present in the ITRDB.
The records from BACI were used to evaluate the validity of using virtual trees constructed from ITRDB records to combat the aforementioned sampling biases.

310
For all four benchmarks each reflecting a different usage of the information contained in the ITRDB, the virtual trees were constructed by making use of only the top 15% of the trees based on their diameter. The 15% threshold was taken to match the diameter distribution in ORCHIDEE where by definition the largest diameter class contains 15% of the trees. Subsequently, the ratio of the diameter of the virtual trees over the diameter of the average stand diameter was calculated to estimate the error introduced by using a virtual tree. Only the 27 coniferous forests contained in the BACI archive were selected to enhance 315 consistency within our test case.

Evaluating the concept of virtual trees
The bias introduced by constructing a virtual tree based on the diameters of the largest 15% of the trees from the 320 representatively-sampled BACI data sets is thought to be a proxy of the bias inherent to the ITRDB (Fig. 4). No statistically significant differences were found between the simulation-based and data-based virtual trees except for the virtual tree used in the benchmark for young trees (t-test, p<0.01). In the latter case the statistical difference seems to be caused the lack of variation in the simulation-based tree rather than a large difference between the means of the virtual trees (Fig. 4). This lack of variation for the simulation-based tree is expected as the method uses the maximum simulated diameter. Hence, the test 325 supports the use of virtual trees constructed from the ITRDB as a benchmark for the largest diameter class of a forest simulated by ORCHIDEE.

Benchmarks for comparing observed and simulated tree-ring widths
If a land-surface model explicitly accounts for the main factors contributing to TRW, i.e., size effects and climate sensitivity 330 (Hughes et al., 2011), meaningful benchmarking against specific aspects of the observations becomes feasible in spite of the aforementioned biases in the ITRDB. Our technical framework considers four complementary aspects of the observations: (i) the size-related trend in tree-rings; (ii) diameter increment of mature trees; (iii) diameter increment of young trees; and (iv) extreme growth events. Each of these aspects formed the basis of a benchmark (Table 1): The size-related trend in diameter increment can be assessed by calculating the average virtual tree for a stand aligned by tree age (Fig. 5 a, b) and subtracting its TRWs from the simulated TRWs of the largest diameter class (Fig. 5c).
Subsequently, a linear regression is used to quantify the temporal trend in the residuals (Fig. 5d). If the simulations and observations have similar size-related trends, the temporal trend in the residuals will be close to zero. Furthermore, the root mean square error (RMSE) between the simulations and observations is calculated and normalized by the length of time series used to calculate the difference in observed and simulated growth trends. A skilled model is expected to simultaneously show no trend in the residuals and a low RMSE across many sites.
(ii) Diameter increments of mature trees. In land-surface models that account for within-stand competition, larger trees will consistently grow faster than smaller trees due to the way competition is formalized (Bellassen et al., 2010;Haverd et al., 345 2013). In reality, growing conditions can suddenly become favourable for trees that have previously been suppressed, resulting in fluctuating growth rates. This discrepancy between simulated and observed competition can be accounted for in the benchmark by using the observations to compile a virtual tree of the stand aligned by calendar year, taking the average tree diameter of all samples to construct the virtual tree ( Fig. 6 a and b). Under the assumption that the observed trees are the bigger trees from a given site, the virtual tree can be compared with the biggest diameter class from the model (see section 4.1). Given 350 that for the last decades both the quick and slowly growing trees are still alive and could have been sampled, only the growth in recent decades of the virtual tree are compared to the simulations (Fig. 6 c and d). The RMSE and trend of the residuals between the virtual tree and the largest diameter class simulated are calculated (Fig. 6d). A skilled model is expected to simultaneously show no trend in the residuals and a low RMSE across many sites.

(iii)
Diameter increments of young trees. As mentioned above, the size-related trend in diameter increment can be assessed by calculating the largest virtual tree of the stand. The maximum age of a virtual tree equals the shortest observed individual TRW record for the stand, as it represents the age intersection between the TRW records for all individuals in the stand. The largest virtual tree is thus clearly biased towards higher observed diameters, compensating for the loss of observed high diameters in the sampling approach due to the fact that the old fast-growing trees died well before sampling took place (Fig. 7  360   a and b). The first decades of growth of the virtual tree are then compared to the simulated growth of the largest diameter class ( Fig. 7 c and d) by calculating the RMSE and trend of the residuals (Fig. 7d). A skilled model is expected to simultaneously show no trend in the residuals and a low RMSE across many sites. By using different approaches to evaluate the growth of young (this benchmark) and mature trees (the previous benchmark) the comparison accounts for the observation that the drivers of ring growth change when the trees grow taller (Cook, 1985).
(iv) Extreme growth events. Even a perfect land-surface model cannot be expected to reproduce all year-to-year variation due to uncertainties in forcing data, such as the reconstructed climate and N-deposition drivers. Nevertheless, well-constrained climate reconstructions can be expected to contain extreme events, and hence a skilled model driven by well-constrained reconstructions should reproduce the statistics of the most extreme events. In this benchmark, extreme growth is defined as 370 the first and last quartiles in TRW ordered by calendar year (i.e., not aligning establishment years) and averaged over the individual trees records (Fig. 8a). The model skill to reproduce the absolute ring-width amplitude regardless of timing was tested by comparing the observed and simulated 25th and 75th percentiles of TRWs for the largest diameter class, which is the diameter class showing the strongest climate sensitivity (Fig. 8 e and f). Additionally, model skill for reproducing the timing of individual extreme growth events was tested by comparing the simulated TRW for the exact years during which extreme 375 growth was observed (Rammig et al., 2015; Fig.8 a-d). For both the amplitude and timing of growth extremes, the similarity between simulations and observations was calculated as the RMSE with the error being the distance from the 1:1 line (Fig. 8 c-f). A skilled model is expected to simultaneously show low RMSE for both the amplitude and timing of extreme years across many sites.

Test case
As the test case aimed at illustrating the four benchmarks rather than evaluating the ORCHIDEE model itself, the main objective of the test case is confirming that each benchmark indeed tests different aspects of the model's performance and that the proposed benchmarks can, therefore, be considered complementary to each other. The four benchmarks presented above were applied to a test case consisting of ten Pinus sylvestris L. sites across Europe. For each site the dataset consisted of 11 to 385 304 cores (Table S2); a leave-one-out approach was used to evaluate the sensitivity of the benchmark to individual tree records.
For the simulations, the land-surface model ORCHIDEE r5698 was used. This model versions accounts for the aforementioned minimum requirements for land surface models to mechanistically simulate TRW. The benchmarks were used on four configurations with increasing model functionality (Table 2). A more detailed description of the test case, the ORCHIDEE model and the different configurations is given in the Supplementary Information (Text S1 and S2).
The model reproduced the observed age-trends across the four configurations tested, i.e., the 'basic' set-up, the 'recru' set-up with an increasing number of individuals through recruitment (Eqs. 28 and 29 in Text S1), the 'power' set-up with a decreasing share of stand-level photosynthesis allocated to the largest size classes (Eq. 15 in Text S1), and the 'Ndyn' set-up where growth may be limited by additional nitrogen requirements (Eq. 32 in Text S1). Increasing the functionality of the model did not help 395 to reduce the RMSE of the overall age trend but did improve the match (i.e. the slope) between the simulated and observed age trends at 8 of 10 sites (Fig. 9a). For mature trees, the additional functionality acted as an offset, and hence growth rate estimates of the largest trees containing the biggest share of stand biomass improved at sites where diameter growth was underestimated (indicated by the trend in the residuals). Diameter growth rates were, however, further overestimated at sites where this was already the case (Fig. 9b). The RMSE of the diameter of mature trees (Fig. 9b, x-axis) were negatively correlated 400 with their growth rate (Fig. 9b, y-axis) (ρ = -0.5, p < 0.01), suggesting that little new insight is to be expected from using both metrics from this benchmarks (Table 1) compared to simply selecting one.
Increasing model functionality had a much smaller effect on early stand developmental stages, as shown by the much smaller absolute changes in both RMSE and trends in the residuals for the young trees (Fig. 9c) compared to the old trees (Fig. 9b). 405 Interestingly, considering feedbacks between the nitrogen and carbon cycles (configuration 'Ndyn') increased the RSME for forests located in regions with relatively low nitrogen deposition loads (Fig 9c, x-axis). When the dynamic nitrogen cycle is used, the carbon-to-nitrogen ratio in the leaves increases with increasing age whereas it remains constant for the open nitrogen cycle used in the other three configurations. This simulated age-related dynamic for leaf nitrogen in combination with its initial value contributes to the overestimation of the TRW for young trees. Finally, adding functionality to the model did not help to 410 enhance the model skill in terms of simulating extreme growths which is not a surprise because the added functionalities are not expected to improve the climate sensitivities of ORCHIDEE (Fig. 9d).
The wealth of approaches available for modelling tree ring growth (see the introduction for a summary) has been largely 415 overlooked by the global land-surface community and until now, benchmarking land-surface models against ring-width records still relies mostly on interannual variation in the simulated net primary productivity as a proxy for TRW (Klesse et al., 2018;Kolus et al., 2019;Rammig et al., 2015;Zhang et al., 2018). Although such an approach is valid to benchmark the capability of land-surface models to simulate interannual variability, the observations will need to be detrended to remove the size-related growth signal, adding considerable uncertainty to the benchmark (Bunde et al., 2013;Cedro, 2016;Nicklen et al., 2019;Stine, 420 2019). Moving beyond the net primary production proxy by explicitly simulating TRW enriches the benchmark since potentially confounding factors including climate responses, forest structure, age and size trends (Alexander et al., 2018;Nickless et al., 2011), as well as sampling biases (Babst et al., 2014a), can be better accounted for. Whereas previous studies had to rely on a single qualitative benchmark, i.e., interannual variability, our method shows that at least four benchmarks, each of them defined by two metrics, are available for models that meet the minimal requirements to simulate TRW determined 425 by Cook's aggregate tree growth model (Fig. 2). Irrespective of the model approach, the largest archive of tree-ring records that is freely available to the land-surface community, i.e., the ITRDB, is notorious for its sample biases (Klesse et al., 2018;Zhao et al., 2019). Although it may be difficult to correct the data for these biases, we propose two solutions for comparing LSM output to biased observations. 430 Simulating a size-structured population of trees enables comparing the observations relative to a benchmark tall simulated tree, which compensates for the tendency of dendroclimatic samplers to select the oldest trees in a stand (which often turn out to be the larger trees). Although the ITRDB does not contain the site metadata that would be required to make this comparison exact, i.e., the diameter and true age distribution of the sampled stand, it protects against comparing extreme samples to mean simulations. The second solution relies on the observation that the variation due to size-related growth by far exceeds the 435 variation due to environmental changes and helps to constrain the survivor bias which is derived from the growth of young fast-growing trees that died a long time ago and are therefore absent from records made from present day sampling of old growth forests. The benchmarks proposed here provide a tool to start using TRWs as a much-needed large-scale constraint on the maximum tree diameter and annual growth for the transition from pre-industrial to present-day environmental conditions.

440
Combining these two solutions with targeted processes resulted in four benchmarks, each of them defined by two metrics (Table 1). The test case demonstrated that these benchmarks were largely independent in terms of their information content and thus resulted in a rich and rather refined description of the remaining model deficiencies (Fig. 9). The novel benchmarks proposed here provide new targets for evaluating land-surface models' performance as each of the eight metrics could be used in the objective function of any data assimilation technique to rigorously account for the information contained in TRW 445 records.
Tree-ring records thus should complement well-established but short-term benchmarks for land-surface models (Randerson et al., 2009) activity (Chen et al., 2011;Demarty et al., 2007). The value of tree-ring records can be further enriched by developing new and unbiased networks to complement the ITRDB, adding their stable isotope ratios (Levesque et al., 2019;Barichivich et al. in prep,) and combining their use with high-frequency but short-term eddy covariance measurements (Curtis et al., 2002), experimental data from plant growth under pre-industrial CO2 concentrations (Temme et al., 2015), and proxies of atmospheric composition (Campbell et al., 2017). 455

Code and data availability
In line with GMD requirements, the model code has been archived and made accessible: https://doi.org/10.14768/20200228001.1. The scripts required for reproducing the figures, the ORCHIDEE simulations and intermediate results are available at https://github.com/j-jeong/J.Jeong_GMD_2020. BACI dataset is freely available in online: 460 http://www.baci-h202i0.eu/ but requires registration by email.

Competing interest
The authors declare that they have no conflict of interest.

author contribution
Proposed benchmarks are the outcome of discussions between JJ, JB, PP, VH, and SL. JJ ran the model, analysed the output and prepared figures. FB collected and shared BACI data. JJ and SL wrote the first version of the manuscript, all authors contributed to revising and editing the final manuscript.  Table 1. Characteristics of four benchmarks making use of ITRDB records. These benchmarks were designed to better 480 constrain physiological and ecological processes in land-surface models. Given the use of the ITRDB data, the benchmarks had to propose solutions for well-known issues of the ITRDB (Table S2).

Figure 4.
Comparison of bias for each of the benchmarks introduced by sampling the 15% biggest trees rather than the average tree (red; observational based) and the bias introduced by reporting the largest simulated diameter class rather than the average tree (blue; model based). Error bars are the standard deviation of the 27 coniferous sites for the BACI data set (red) or the 40 model configurations (4 model 520 different set-ups for 10 sites each).

Figure 5. Illustration of the major steps in calculating the metrics of the benchmark for the size-related trend in diameter increment.
The size-related trend in diameter increment can be assessed by calculating a time series for the average ring width after aligning the age of 525 the individual trees (a, b). Observations are shown as grey lines and simulation as blue lines. The biggest class is presented by the bold line.
The black dotted lines represent the virtual tree based on the observations. The TRWs of this virtual tree are then subtracted from the simulated TRWs of the largest diameter class (c). Subsequently, a linear regression is used to quantify the temporal trend in the residuals (d). The green line denotes the model residuals and the green dotted line is the linear regression of the model residuals. Furthermore, the root mean square error (RMSE) between the simulations and observations is calculated (not shown) and normalized by the length of time series 530 to calculate the difference in observed and simulated growth trends. Observations and simulation are from site finl052 (NOAA, 2020a) (Table S2).
https://doi.org/10.5194/gmd-2020-29 Preprint. Discussion started: 14 April 2020 c Author(s) 2020. CC BY 4.0 License. Figure 6. Illustration of the major steps in calculating the metrics of the benchmark for the diameter increment in mature trees.
Individual tree records are ordered by calendar year and for each year the average observed diameter is calculated (a). Observations are 535 shown as grey lines and simulation as blue lines. The biggest class is presented by the bold line. Black dotted lines represent the yearly average of observations. Note that X-axis in Fig. 6 is different from Fig. 5. Under the assumption that the observed trees are the biggest trees from a given site, the virtual tree is compared with the biggest diameter class from the model (b) and the RMSE is calculated (c). Given that for the most recent decades both the fast and slow growing trees are still alive and could have been sampled, only the recent decades of the virtual tree growth are compared to the simulations. The RMSE (grey arrows) and trend (not shown) of the residuals between the virtual tree 540 and the largest diameter class simulated are calculated (d). Observations and simulation are from site finl052 (NOAA, 2020a) (Table S2).

545
Black dotted lines represent the yearly maximum of the observations. The growth of the virtual tree is then compared to the simulated growth of the largest diameter class (b) by calculating the RMSE (c) and trend of the residuals (not shown). These calculations are limited to the first decades of the time series (d) to compensate for the bias caused by the fact that the old fast-growing trees died well before sampling took place. By using different approaches to evaluate the growth of young (this benchmark) and mature trees (the previous benchmark) the comparison accounts for the observation that the drivers of ring growth change when the trees grow taller (Cook, 1985). Observations and 550 simulation are from site finl052 (NOAA, 2020a) (Table S2).
https://doi.org/10.5194/gmd-2020-29 Preprint. Discussion started: 14 April 2020 c Author(s) 2020. CC BY 4.0 License. Figure 8. Illustration of the major steps in calculating the metrics of the benchmark for extreme growth events. In this benchmark, extreme growth is defined as the first and last quartiles in TRW ordered by calendar year and averaged over the individual trees records (a).
Red shaded area and ticks represent observations exceeding the 75 percentile and blue shaded area and ticks represents observation below 555 the 25 percentile (a). The TRW simulated for the largest diameter class are then extracted for the years identified in a (b). Both observations https://doi.org/10.5194/gmd-2020-29 Preprint. Discussion started: 14 April 2020 c Author(s) 2020. CC BY 4.0 License. and simulations were normalized to remove the difference in the range of values between configurations. These normalized values correspond to the X and Y axis in (c) and (d) for observation and simulation, respectively. Subsequently, the similarity between simulations and observations was tested by calculating the distance from the 1:1 line (shown in green in d), which is equivalent to the RMSE for years with extreme growth (d). An additional metric is calculated in a similar way but by using both the 25% and 75% extreme values of the 560 simulation and observation regardless of the year (e, f). This test identifies if the simulation can reproduce the amplitude of TRW. To assess the absolute amplitude, the observation and simulation were not normalized. To avoid possible uncertainties from using reconstructed climate forcing, both metrics are limited to the past five decades for which climate observations are available. Observations and simulation are from site spai006 (NOAA, 2020c) (Table S2). Figure 9. The proposed four benchmarks (a-d) applied to a European test case of 10 sites (Table S2 to link numbers to site names and locations). Each colour represents a configuration of the land-surface model ( Table 2 details the configurations), where black denotes configuration Ndyn, blue Recru, red Power and green the configuration labelled Basic. The green-X marks indicate the desired outcomes.
(a) Benchmark for the trend in tree-ring width. The X-axis shows the RMSE of the difference between simulated and observed trend (Fig.   570 5b) and the Y-axis shows using the slope of the temporal trend in the residuals (Fig. 5d). (b) Benchmark for diameter increment of mature trees. The X-axis shows the RMSE of the difference between simulation and averaged observation aligned by calendar year for matured trees (Fig. 6d), and the Y-axis shows the slope of the temporal trend in the residuals. (c) Benchmark for diameter increment of young trees.
The X-axis shows the differences between simulation and averaged observation aligned by age of trees for young trees (Fig. 7d), and the Y- axis shows the slope of the temporal trend in the residuals. (d) Benchmark for climate sensitivity. The X-axis shows the RMSE of the 575 difference between the observed and simulated tree ring widths for years in which the observed tree ring width was extreme. The Y-axis shows the RMSE of the difference between the observed and simulated extreme tree ring widths irrespective of the year they occur.