CARDAMOM-FluxVal version 1.0: a FLUXNET-based validation system for CARDAMOM carbon and water flux estimates

Land–atmosphere carbon and water exchanges have large uncertainty in terrestrial biosphere models (TBMs). Using observations to reduce TBM structural and parametric errors and uncertainty is a critical priority for both understanding and accurately predicting carbon and water fluxes. Recent implementations of the Bayesian CARbon DAta–MOdel fraMework (CARDAMOM) have yielded key insights into ecosystem carbon and water cycling. CARDAMOM estimates parameters for an associated TBM of intermediate complexity (Data Assimilation Linked Ecosystem Carbon – DALEC). These CARDAMOM analyses – informed by co-located C and H2O flux observations – have exhibited considerable skill in both representing the variability of assimilated observations and predicting withheld observations. CARDAMOM and DALEC have been continuously developed to accommodate new scientific challenges and an expanding variety of observational constraints. However, so far there has been no concerted effort to globally and systematically validate CARDAMOM performance across individual model–data fusion configurations. Here we use the FLUXNET 2015 dataset – an ensemble of 200+ eddy covariance flux tower sites – to formulate a concerted benchmarking framework for CARDAMOM carbon (photosynthesis and net C exchange) and water (evapotranspiration) flux estimates (CARDAMOM-FluxVal version 1.0). We present a concise set of skill metrics to evaluate CARDAMOM performance against both assimilated and withheld FLUXNET 2015 photosynthesis, net CO2 exchange, and evapotranspiration estimates. We further demonstrate the potential for tailored CARDAMOM evaluations by categorizing performance in terms of (i) individual land-cover types, (ii) monthly, annual, and mean fluxes, and (iii) length of assimilation data. The CARDAMOM benchmarking system – along with the CARDAMOM driver files provided – can be readily repeated to support both the intercomparison between existing CARDAMOM model configurations and the formulation, development, and testing of new CARDAMOM model structures.


Introduction
Terrestrial biosphere models (TBMs) are a key tool to understanding and resolving the state of terrestrial ecosystems and their sensitivity to climate. Of particular importance are land-atmosphere CO 2 fluxes, as the land biosphere is currently a net sink absorbing nearly a third of anthropogenically emitted CO 2 . However, despite the importance of TBMs in understanding the role of terrestrial ecosystems in the Earth system, model structural uncertainty and parametric uncertainty remain major sources of error and bias impacting terrestrial carbon cycle modeling (Bonan et al., 2019;Quetin et al., 2020), presenting a major challenge to robust prediction of the magni-of the ecosystem processes of carbon, water, and energy exchanges from and to the atmosphere can improve empirical modeling or data-driven predictions of the key components of the land surface and Earth system and reduce uncertainties (Jung et al., 2020(Jung et al., , 2019Reich, 2010;Tramontana et al., 2016). Model-data fusion (MDF) approaches merging terrestrial biosphere models with observations (Fox et al., 2009;Hill et al., 2012;Keenan et al., 2012;MacBean et al., 2016;Xiao et al., 2014) improve biogeochemical model accuracy and skill by incorporating data from field-based measurements and satellite-based remote sensing observations as well as their associated uncertainties into model calibration. MDF hence offers a much-needed capability to reconcile uncertain model processes with the ever-increasing volume of Earth observation datasets (Caldararu et al., 2012;Quetin et al., 2020;Richardson et al., 2011;Rowland et al., 2014;Smallman et al., 2017). Specifically, data-constrained processes should improve the accuracy of estimates of global plant and soil C dynamics, as well as their exchanges with each other and with the atmosphere, and enable quantification of their uncertainty (Bloom et al., 2016). MDF representations of terrestrial ecosystem C cycling combines the advantage of having a process-based, mathematically refined expression of the ecosystem C budget and parameter estimation that takes external constraints with their uncertainties into consideration. Contingent on the accuracy of a particular model's C cycle mechanisms, MDF can improve simulation results -relative to both assimilated datasets and withheld data from validation -due to improved parameter estimates of biogeochemical processes that may be introduced or influenced by external forcing .
The CARbon DAta-MOdel fraMework (CARDAMOM) MDF system approach has been applied to a range of scales and with a wide range of in situ and satellite datasets to (i) constrain terrestrial C cycle states and processes within a Bayesian model-data fusion framework and (ii) diagnose these analyses to address questions or test hypotheses on the current and evolving state of the terrestrial C balance (Bloom et al., 2016;Smallman et al., 2017;Yin et al., 2020;Exbrayat et al., 2019;Quetin et al., 2020;Bloom et al., 2020, amongst others). The Data Assimilation Linked Ecosystem Carbon (DALEC; Williams et al., 2005) model is a key component of the CARDAMOM framework describing the ecosystem carbon and water cycles. The DALEC model has multiple versions varying in structural complexity and process representation (Famiglietti et al., 2021), including alternate forms of climate sensitive phenology (Smallman et al., 2017), timedependent autotrophic respiration processes (Rowland et al., 2014), an array of hydrological representations (Bloom et al., 2016;Bloom and Williams, 2015;Exbrayat et al., 2019;Fox et al., 2009;Quetin et al., 2020;Rowland et al., 2014;Smallman and Williams, 2019;Spadavecchia et al., 2011), expanded representation of heterotrophic respiration sensitivity to climate, and explicit representations of ecosystem-level water use efficiency  among other model structures.
Invariably, observations play a critical role in (i) informing uncertain processes and reducing model error, (ii) providing a quantitative metric for validating model performance, and (iii) motivating subsequent model process representations. In particular, FLUXNET -an ensemble of C and H 2 O flux estimates from 200+ eddy covariance flux tower sites -has been instrumental in the calibration and validation of land surface models . As one of the most complete and sophisticated field-based databases of land surface fluxes, FLUXNET provides gap-filled measurements of tower-based micrometeorology and eddy covariance estimates of exchanges of carbon dioxide, water vapor, and energy between the biosphere and atmosphere (Schwalm et al., 2010;Pastorello et al., 2020). With the increasing availability (in terms of both spatial coverage and record length) of eddy covariance measurements over participating FLUXNET sites, data-driven methods, or data assimilation models, have become popular and delivered progressively more accurate retrieval results with the aid of remote sensing data for large-scale studies (Anderson et al., 2007;Gonsamo et al., 2012;Velpuri et al., 2013). Gross primary productivity (GPP) and net ecosystem exchange (NEE) are two of the key fluxes in the terrestrial C cycle related to plant growth and the net C sink through vegetation, but they are difficult to measure due to the complications between processes in the biosphere (Gilmanov et al., 2003;Wang et al., 2006). Evapotranspiration (ET) is another key measure related to water, energy, and carbon fluxes quantifying the combined process of transpiration, soil evaporation, and canopyintercepted rainfall evaporation. The FLUXNET dataset in its entirety is particularly well suited for benchmarking and validating CARDAMOM C and H 2 O flux estimates, and a number of CARDAMOM-DALEC implementations across FLUXNET sites have demonstrated the scientific and technical merits of assimilating and predicting withheld observations (Bloom and Williams, 2015;Famiglietti et al., 2021;Smallman et al., 2017).
Overall, systemically challenging existing CARDAMOM model structures against observations -and using these outcomes to formulate new model structures -is a necessary process for advancing understanding and prediction of terrestrial C and H 2 O fluxes. Among some of the key questions motivating CARDAMOM model-data fusion development decisions are the following: when trained with observations, do CARDAMOM models improve representation of principal carbon and water dynamics across terrestrial ecosystems? Which CARDAMOM model structures or model-data fusion configurations exhibit optimal predictive skill against withheld flux observations? For a given CARDAMOM model structure, is the predictive skill constant, regardless of the training or prediction window, or the length of calibration period correlated with prediction error? Which model parameters or processes are key to the improvement of predictive skill? These questions have continually motivated -and will continue to motivate -the development of CARDAMOM model structures and associated model-data fusion configurations. Consequently, systematic and easily repeatable evaluations of CARDAMOM outputs against a broad set of C and H 2 O flux observations would amount to an indispensable strategy for supporting CARDAMOM model developments.
Here, we present CARDAMOM-FluxVal version 1.0, a concerted FLUXNET-based validation framework to support a global evaluation of CARDAMOM model-data fusion approaches. CARDAMOM-FluxVal provides a validation test bed for benchmarking CARDAMOM model structures against FLUXNET 2015 GPP, NEE, and ET datasets. To demonstrate the operation of the validation framework, we present quantitative assessments of the performance of two example CARDAMOM model configurations -one solely trained by satellite and inventory datasets and the other trained with an additional constraint using observations from FLUXNET sites. The methodology is described in Sect. 2. In Sect. 3, we present a concise set of validation metrics (against assimilated and withheld FLUXNET observations) and further evaluate performance sensitivity to the choice of constraining variables, temporal length of data assimilation, and particular land-cover types. Finally, in Sect. 4 we summarize the strengths and limitations of our CARDAMOM validation approach and outline its potential applications for (i) benchmarking and intercomparing current and future CAR-DAMOM configurations, and (ii) we provide recommendations and guidance to conduct scientific investigations.

Methods
The method section includes descriptions of the CAR-DAMOM implementation across FLUXNET 2015 sites (Sect. 2.1), satellite and inventory-based observations used for assimilation (Sect. 2.2), and the statistical measures used in model validation and extended evaluations (Sect. 2.3).

CARDAMOM implementation across FLUXNET 2015 sites
The components needed to configure CARDAMOM at each FLUXNET site namely include (a) time series of meteorological forcing variables for the DALEC model, (b) a collection of observational constraints on DALEC states and fluxes, and (c) additional attributes relating to CARDAMOM prior probability and likelihood functions . At each site, we built stand-alone CARDAMOM "driver" files, which consist of (i)  (Table S5). The aforementioned datasets amount to baseline datasets for the entire CARDAMOM-FluxVal (version 1.0) system. The CARDAMOM-FluxVal driver files are available in the Supplement (Table S6). At each FLUXNET site, we used CARDAMOM Bayesian model-data fusion methodology  to calibrate the DALEC model parameters and initial conditions and to validate DALEC model simulations against a subset of withheld data. In particular, the observations assimilated into CARDAMOM were used to optimize DALEC model parameters and initial conditions in order to statistically minimize model-data mismatches. The observations withheld from CARDAMOM were used to validate DALEC carbon and water fluxes outside the training window, i.e., in the absence of data constraints. Depending on the scientific or technical objectives, the CARDAMOM-FluxVal analyses can be configured to exclude any subset of FLUXNET or ancillary data for validation purposes. To exemplify both the assimilation and validation aspects of CARDAMOM-FluxVal, we opted for two distinct CARDAMOM configurations (Fig. 1).
-CARDAMOM analysis A1. The CARDAMOM DALEC model is constrained by the first 50 % of FLUXNET data at each site; 50 % of FLUXNET data are withheld for validation.
-CARDAMOM analysis A2. The CARDAMOM DALEC model is constrained by 0 % of FLUXNET data at each site; 100 % of FLUXNET data are withheld for validation.
In both A1 and A2, we used the same ancillary data (satellitebased leaf area index, biomass), cost function configurations, and DALEC model version. For the sake of brevity, the cost function and DALEC model version are described in the Supplement. To configure the A1 scenario, we split the FLUXNET data from each of the site into two periods based on data acquisition time for tower sites with valid observations for the study period from 2001 to 2015.

Observations
A common set of observations is assimilated into both the A1 and A2 analyses; these consist of (1) Table 1. Monthly-based residuals in assimilation and prediction windows (Fig. 1). (ABGB) in 2015 produced from a combination of field plots, airborne lidar, and satellite data using the machine learning approach (Yu, 2013). To find corresponding mapped values that match FLUXNET data measurements, we aggregated the mapping products (MODIS LAI and ABGB) from their original resolutions to 1 km spatial resolution and extract LAI and ABGB values at all FLUXNET locations. For the A1, we also included the gap-filled monthly flux measurements from the FLUXNET 2015 dataset (Pastorello et al., 2020) that includes ecosystem-scale data on CO 2 , water, and energy exchange between the biosphere and the atmosphere, as well as other meteorological and biological measurements collected at sites from the multiple regional flux networks (https://fluxnet.fluxdata.org/data/ fluxnet2015-dataset/, last access: February 2020). We used all 204 CC-BY-4.0 (Tier-1) sites to study the data assimilation using GPP, NEE, and ET together as inputs (Table S1). The pre-processing of FLUXNET tower measurements includes a quality check to filter out bad-quality monthly data and the removal of data points where the recorded measurements show constant values throughout the observational period.

Summary metrics and extended validation
Our summary metrics consist of GPP, ET, and NEE evaluated on a monthly basis, annual basis, and at site level. We selected four statistical metrics to evaluate the model accuracy, parameter correlations, and residuals (Table S2). The Pearson's linear correlation coefficient (R) is the ratio of co-variance between the modeled simulations and observations to the product of standard deviations from model simulations and observations (0<R<1 represents a positive correlation between model output and observed values, while −1<R<0 means the model outputs have a negative correlation between model output and observed values). The Nash and Sutcliffe model efficiency (MEF) quantifies the model's predictive capacity (Nash and Sutcliffe, 1970;Tramontana et al., 2016). A 0<MEF<1 indicates that the model's predictive capacity is better than the mean of observations, with a value of 1 meaning perfect predictions, while MEF<0 means the mean values of the observations are better than the model predictions. Bias is defined as the mean of the residual values for model predictions and observed data. A value of bias near zero indicates an unbiased estimation for model predictions. The root mean square error (RMSE) is the square root of the average over squared residuals (prediction errors), and the model predictions are more accurate when RMSE is closer to 0.
For the extended evaluation, we grouped the FLUXNET 2015 sites within six time window categories: data with 1 : 1 assimilation-to-prediction time ranges spanning <1 year, 1-2 years, 2-3 years, 3-4 years, 4-5 years, and >5 years (all time ranges are either assimilation or prediction lengths). The number of sites varies from 17 to 67 for different categories (Table S3), with the most sites (67) having the range of 1-2 years and the fewest sites (17) having the range of 4-5 years. We evaluated CARDAMOM performance across 12 land-cover types that comprise the FLUXNET 2015 sites included in this study (Table S3). In summary, ENF (evergreen needleleaf forest) and GRA (grasslands) have more than 30 tower sites, while SNO (snow) and CSH (closed shrublands) have only one and two sites globally. Assuming that the CARDAMOM model has valid outputs for GPP, NEE, and ET across different land-cover types, we evaluated the influence of land-cover types on the prediction accuracies.
We tested the importance of model parameters for the retrievals of GPP, NEE, and ET by calculating each parameter's correlations with the model residuals. A total of 36 model parameters (model description in the Supplement) were tested and attributed to six groups based on their relative contributions to different biophysical processes (Table S4). We tested the correlations between model parameters and retrieval residuals using the R metric for independent validation datasets.

Summary metrics for CARDAMOM FLUXNET validation
We found good agreements between median model outputs from CARDAMOM-DALEC and site-based FLUXNET ob-servations (GPP, NEE, and ET; Fig. 2) for the A1 scenario. Generally, data samples used in the assimilation window show better agreements between observations and simulations (i.e., higher MEF and lower RMSE) than the data in the prediction window. Monthly-based comparisons, due to the seasonal variation in each variable, have a wider data range than the range of site-level data. The MEF metrics show that GPP has the best simulation results in both the assimilation and prediction windows relative to NEE and ET. Furthermore, NEE presents a better MEF in the assimilation window than ET but is worse than ET in the prediction window. The same pattern is clearer in the site-level scatter plots when we only compare the long-term average observations for each FLUXNET site. In the A1 scenario, we obtained the highest MEF in the site-level comparison during the assimilation window (e.g., NEE; Fig. 2) but the lowest MEF during the prediction window, indicating that the assimilation procedure may be overfitting to the observations. The model-data residual analysis shows that it is possible to improve the cross-validated model outputs and reduce biases and structure errors with assimilation of FLUXNET observations (Figs. 3-4, S2-S4). Histograms of monthlybased residuals at the monthly timescale over all sites (Fig. 3) show that A1 gives less-biased model residuals than the outputs of A2. In general, A1 shows a positive NEE bias of 0.36 gC m −2 d −1 and negative GPP and ET biases of 0.36 gC m −2 d −1 and 0.09 gC mm d −1 , respectively, while A2 shows much larger biases (NEE bias: +1.03 gC m −2 d −1 , GPP bias: −1.34 gC m −2 d −1 , ET biases: −0.55 mm d −1 ). Annual-based distributions (Fig. S2) of model retrieval residuals show patterns similar to monthly residuals, except that A1 shows tighter distributions around zero due to the average of seasonal variations. The temporal average of site-level histograms (Fig. 4) preserves spatial characteristics of the model retrieval residuals. Unsurprisingly, A2 has more outliers than A1 at the site-level scale. Predicted absolute values (GPP, NEE, and ET) instead of residuals show a wider range of distributions (Fig. S3) for A1 than A2, suggesting that A1 runs capture more spatial and temporal variability with higher accuracies and lower biases. The comparisons of second-order distribution (standard deviation of distribution) provide additional evidence that A1 has ranges closer to the observed distributions (Fig. S4).
The constrained runs of the CARDAMOM model (A1) show substantial improvements in both matching the FLUXNET observations and reducing the model output uncertainties (Fig. 1). In other words, the added value of data in A1 -relative to A2 -leads to more accurate predictions of GPP and ET, as well as reasonable NEE. Two well-studied long-term research sites (US-Ha and US-UMB) in the United States show that the model outputs of A1 capture the stronger seasonality of NEE compared to the outputs of A2 (Fig. 1b  and c), which shows weaker seasonality patterns. Especially during the peak of growing seasons, NEE has a strong land C sink observed from tower sites, but model outputs of A2 are Figure 2. Scatter plots of CARDAMOM outputs (GPP, NEE, and ET) versus observations from FLUXNET data (A1 scenario). Scatter plots in red are results from the assimilation window, and the scatter plots in blue are for the prediction window. We plotted the data both from a monthly basis (top two panels) and at site level using the long-term averages (bottom two panels) for comparison. systematically lower in terms of C sink magnitudes. Both A1 and A2 can capture seasonal changes in GPP and ET within the model-estimated confidence intervals (CIs). However, the CI bounds are significantly reduced for A1 (e.g., the 90 % CI bound of ET from A2 is ∼ ±2.5 mm d −1 during the peak growing seasons, and it is reduced to ∼ ±1.5 mm d −1 for A1 at the selected US tower sites) due to the data assimilation process using site-level observations.

Extended assessment of CARDAMOM performance
The CARDAMOM-simulated fluxes are more sensitive to certain ecosystem parameters than others (Fig. 5). Results show that the modeled GPP is mostly correlated with the model parameters C 1 (canopy efficiency), A 1 (autotrophic respiration), and W 1 (underlying water use efficiency; see the Supplement for parameter details); these three parameters stand out as they are positively related to GPP variation with Pearson's R greater than 0.1, while the R values Table 2. Annual-based residuals in assimilation and prediction windows (Fig. 2  for all other parameters are near zero. For the NEE output, parameter I 6 (soil organic carbon -SOC) is the most negatively correlated factor with NEE, and parameter T 6 (soil organic matter -SOM -turnover rate) is the most positively correlated. However, none of the R values for NEE have a magnitude >0.1. The output of ET is also correlated with three parameters: W 1 (underlying water use efficiency), W 2 (runoff coefficient), and W 5 (radiation coefficient), with W 1 being negatively correlated with ET and the other two positively correlated. All three parameters stand out as substantially different from all other model parameters, indicating the crucial impact of these parameters on the ET output. As expected, the A1 experiment shows reduced uncertainty in a few estimated parameters when compared to the A2 experiment, indicating that the additional use of observational data imposes constraints on model parameters as well (Fig. S5). Based on the major land-cover types classified at the FLUXNET tower sites, we investigated the effects of land cover on the performance of CARDAMOM model retrieval. Results show that the forest types, except the evergreen broadleaf forest, generally have more accurate predictions than non-forest types (Fig. 6). The three major types of forests -deciduous broadleaf forest (DBF), evergreen needleleaf forest (ENF), and mixed forest (MF) -all have high R (>0.8) and MEF (>0.6) values. The relatively small uncertainty ranges (<0.1 for R) also indicate the stable per-  Table 3. Site-level residuals in assimilation and prediction windows (Fig. 3).

Residuals
GPP ( formance of these forest types. The evergreen broadleaf forest (EBF) in the tropics, though fewer sites are available (half of DBF and one-third of ENF), exhibits the difficulties in retrievals with lower performance values and higher uncertainty ranges. For non-forest sites, the retrieval accuracy varies from site to site (Fig. 6) and has large uncertainties. In particular, savannas, woody savannas, and closed shrublands are the three land-cover types showing the least accuracy and highest uncertainty, significantly in the NEE and ET retrievals (with R ∼ 0.6 and MEF being negative). Other herbaceous vegetation types, including grasslands and crops, have generally better retrievals than spatially heterogeneous land-cover types, such as savannas, but are not as good as retrievals over extratropical forests (Smallman and Williams, 2019).
The FLUXNET dataset has various lengths of observations in time (Table S3). Separating the results by the length of assimilations, we show that the CARDAMOM model has slightly better predictions of GPP, NEE, and ET when the assimilation period is longer (Fig. 7). The metric MEF for GPP and NEE increases from values below zeros to the maximum positive when the assimilation period reaches 4-5 years. The median of MEF of ET always stays positive, but also has a maximum value at the length of 4-5 years for data assimilation. Meanwhile, the R values show relatively small changes for different lengths of data assimilation, and most values are above 0.8, indicating reasonable assimilations for GPP, NEE, and ET in general. There is a slightly degraded performance in R (a decrease by <0.1) and MEF (a decrease by 0.2-0.3) for the longest assimilation period (>5 years), probably due Table 4. The bias, MEF, R, and RMSE of GPP (unit: gC m −2 d −1 ), NEE (unit: gC m −2 d −1 ), and ET (unit: mm d −1 ) assimilation versus the flux tower data for different land-cover types. Text in bold shows the land-cover types that have the most accurate predictions.

LC
Bias MEF  R  RMSE   GPP  NEE  ET  GPP  NEE  ET  GPP  NEE  ET  GPP  NEE Table S2.
to the increased size of FLUXNET sites, resulting in the inclusion of certain sites (e.g., tropical forests and/or woody savannas) with known bad performances compared to others. For the sites with record lengths of 2-3 years, the percentage of the non-forest plant functional type (PFT; grassland) is higher than other year ranges. The lack of non-forest sites could possibly be the cause of the worst performance for this length of observations. With long assimilation windows, there is also a general trend of reduced uncertainty for both NEE and ET predictions. GPP has a reduction in uncertainty for longer training windows until 4-5 years and increases for the longest assimilation period (>5 years).

Assessing CARDAMOM performance
The FLUXNET-based validation approach has provided some key insights on the skill of CARDAMOM-based C and H 2 O flux estimates.
(1) The data assimilation using FLUXNET inputs (A1) captures missing seasonal variations in the original model with lower biases and less uncertainty compared to the model solely constrained by satellite and inventory datasets (A2).
(2) The increased lengths of data assimilation can progressively improve the model performance and reduce the predictive uncertainties in all tested flux variables.
(3) Land-cover types still exhibit influences on Figure 6. Box plots of correlation metrics (R and MEF) for CARDAMOM outputs (GPP, NEE, or ET) versus FLUXNET tower measurements with different land-cover types (A1 scenario, prediction window). The full names of land-cover types can be found in Table S3. The number in parentheses (x axis) indicates the total available tower sites for each land-cover type. the model prediction accuracy, even though the parameters were locally adjusted in the assimilation process, consistent with earlier studies using global parametrization (Smallman and Williams, 2019). (4) Certain parameters (i.e., C 1 , A 1 , and W 1 ) show more distinct correlations with model outputs, suggesting that improved prior constraints on a subset of parameters could further improve the retrieval accuracies of the corresponding outputs. (5) The validation results also highlight the fact that more work should be focused on tropical vegetation, with both the humid forests and savanna regions exhibiting the worst performance; the lack of regular seasonal cycles may also hamper accurate retrievals for CAR-DAMOM and other models (Quetin et al., 2020). The aforementioned insights are key for identifying seasonal and interannual limitations in CARDAMOM model performance, limitations (or lack thereof) in the ability of CARDAMOM model structures to predict C and H 2 O fluxes on a range of timescales, and limitations of CARDAMOM across specific biomes or land-cover types. The results can be further used to target future CARDAMOM model devel-opments towards identifying weaknesses in improving predictive skill. With more spatially explicit products becoming available for assimilation into CARDAMOM -such as satellite-based constraints on GPP and NEE (Quetin et al., 2020) -this study based on FLUXNET sites can also provide a quantitative characterization of CARDAMOM model structure.

Limitations of FLUXNET validation approach
One noteworthy caveat is the spatial resolution representation errors in the DALEC meteorological forcing. Specifically, the meteorological data used in this study are from the ECMWF ERA-Interim dataset projected at a 0.5 • resolution. The disagreement in spatial resolution may be a confounding factor for CARDAMOM FLUXNET predictions. Implementing CARDAMOM using a finer-resolution meteorological forcing will help to reduce the uncertainty caused by spatial ambiguities (see Supplement Sect. S2 for replacing meteorological forcing data). Potential approaches for future versions of CARDAMOM-FluxVal include (i) using gap-filled products from FLUXNET sites to configure CARDAMOM simulations and/or (ii) transitioning to ERA5 meteorological forcing. However, the current version has not rigorously tested the new meteorological forcing datasets. And the improvement of all drivers to a finer resolution requires modification of other ancillary datasets that are used to determine variables such as CO 2 concentration, burned area, and vapor pressure deficit (VPD) (Table S6), which is an ongoing effort for the new CARDAMOM version.
We also note that scarcity of tropical tower sites across the FLUXNET 2015 dataset (Schimel et al., 2015) may ultimately lead to biased assessments of CARDAMOM model structures. The possible heterogeneity for non-forest tower sites also causes more uncertainty in observed variables as well as the meteorological forcing due to resolution issues. On the other hand, our PFT-level analysis could also reveal potential model structure limitations in simulating certain PFTs with reasonable assumptions, which needs further attention when the caveat due to observational uncertainty is ruled out. While we advocate for the use of global summary metrics to assess model structure, we also recommend that users of this validation approach recognize the variable representation of biomes and vegetation classes in the available observational datasets. In addition to extended analyses (Sect. 3.2), we also recommend projecting validation assessments into climate space (Reichstein et al., 2003).

Applications
The summary metrics (Sect. 3.1) provide an easily reproducible set of statistics for the validation framework for monthly and interannual CARDAMOM carbon and water flux estimates. While our results show the importance of observational constraints (in this study, FLUXNET data), the CARDAMOM validation system can be readily applied to test additional configurations (alternative models, cost function parameters, datasets assimilated, and assimilationprediction configurations). With a number of parametric and structural variations in existing CARDAMOM framework model structures (Famiglietti et al., 2021) -as well as anticipated variations among ongoing CARDAMOM developments -we highlight the need for a concerted and easily repeatable validation system. In particular, we recommend the use of the CARDAMOM-FluxVal validation approach for three categories of CARDAMOM developments. 1. DALEC model structures. The growing diversity of DALEC models (Famiglietti et al., 2021) provides a unique opportunity for determining which model structures and process representations best predict assimilated or withheld carbon and water fluxes. Further investigations can also be conducted with the exclusion and/or adaptation of ecological and dynamic constraints (Bloom and Williams, 2015;Smallman et al., 2021). Models of similar complexity as DALEC can also be used.
2. CARDAMOM cost function. Model-data error characterization in the CARDAMOM multi-objective optimization approach discussed in Bloom et al. (2020) is inherently limited. The FLUXNET validation approach can be used (i) for quantitative characterization of DALEC (or alternate model) accuracy and precision based on error characterization choices and (ii) to test potential improvements in error characterizations, such as optimizable uncertainty coefficients and the error models (Norton and Uryasev, 2019; Schoups and Vrugt, 2010). These analyses can be further extended to quantify the added value of individual data streams (e.g., by sequential removal of individual observation types).
3. CARDAMOM MDF algorithms. CARDAMOM employs an adaptive Metropolis-Hastings Markov chain Monte Carlo. The validation framework can be used to quantify the effectiveness of DALEC predictions using faster methods (e.g., optimal estimation; Rodgers, 2000) or previously established optimization algorithms (Fox et al., 2009). Experiments could be expanded to include dedicated studies for comparing the effectiveness of CARDAMOM analyses against non-CARDAMOM model-data fusion efforts (Bacour et al., 2019;Liu et al., 2021;MacBean et al., 2016) and machine learning methodologies (Jung et al., 2020(Jung et al., , 2019(Jung et al., , 2017Tramontana et al., 2016).
We anticipate that the CARDAMOM FLUXNET validation framework will provide a much-needed quantitative benchmark to support and inform future CARDAMOM framework developments. Specifically, validation and inter-comparison experiments can span well beyond the two CAR-DAMOM configurations presented in this study (A1 and A2) and can be adapted to suit individual needs for CAR-DAMOM developments or scientific investigations.
Code and data availability. The CARDAMOM code used in this paper is available at https://github.com/CARDAMOM-framework/ CARDAMOM_v2.2 (last access: June 2021). CARDAMOM-FluxVal version 1.0 code and driver datasets (including the CAR-DAMOM version used in this analysis) are tagged in the GitHub link. The code, along with the full output datasets, is permanently stored in Yang et al. (2021) (DOI: https://doi.org/10.5281/zenodo. 4904195). Instructions on the code implementation are provided in the Supplement (Sect. S2).
Author contributions. AB and YY designed the research framework and performed the model validation using FLUXNET data. YY and SM tested the integrity and validity of the code and model. LX and SaS provided the global biomass data. PL, AN, NCP, JTR, JW, GRQ, TLS, and MW provided expertise on modeling and assisted with paper revision. All authors contributed to the writing of the final paper.
Competing interests. The contact author has declared that neither they nor their co-authors have any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Acknowledgements. This work was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. We acknowledge Alexandra Konings from Stanford University for thorough paper feedback and review.
Financial support. This work was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.
Review statement. This paper was edited by Carlos Sierra and reviewed by two anonymous referees.