A novel model–data fusion approach to terrestrial carbon cycle reanalysis across the contiguous U.S. using SIPNET and the PEcAn state data assimilation system v. 1.7.2

The ability to monitor, understand, and predict the dynamics of the terrestrial carbon cycle requires the capacity to robustly and coherently synthesize multiple streams of information that each provide partial information about different pools and fluxes. In this study, we introduce a new terrestrial carbon cycle data assimilation system, built on the PEcAn model–data eco-informatics system, and its application for the development of a proof-of-concept carbon "reanalysis" product that harmonizes carbon pools (leaf, wood, soil) and fluxes (GPP, Ra, Rh, NEE) across the contiguous United States from 1986 to 2019. We first calibrated this system against plant trait and flux tower Net Ecosystem Exchange (NEE) data using a novel emulated hierarchical Bayesian approach. Next, we extended the Tobit–Wishart Ensemble Filter (TWEnF) State Data Assimilation (SDA) framework, a generalization of the common Ensemble Kalman Filter which accounts for censored data and provides a fully Bayesian estimate of model process error, to a regional-scale system with a calibrated localization. Combined with additional workflows for propagating parameter, initial condition, and driver uncertainty, this represents the most complete and robust uncertainty accounting available for terrestrial carbon models. Our initial reanalysis was run on an irregular grid of ∼500 points selected using a stratified sampling method to efficiently capture environmental heterogeneity. Remotely sensed observations of aboveground biomass (Landsat LandTrendr) and LAI (MODIS MOD15) were sequentially assimilated into the SIPNET model. Reanalysis soil carbon, which was indirectly constrained based on modeled covariances, showed general agreement with SoilGrids, an independent soil carbon data product. Reanalysis NEE, which was constrained based on posterior ensemble weights, also showed good agreement with eddy flux tower NEE and reduced RMSE compared to the calibrated forecast.
Ultimately, PEcAn’s carbon cycle reanalysis provides a scalable framework for harmonizing multiple data constraints and providing a uniform synthetic platform for carbon monitoring, reporting, and verification (MRV) and accelerating terrestrial carbon cycle research.

Data assimilation studies have relied on meteorological driver uncertainty to generate ensemble spread (Fox et al., 2018) or on spinups to represent initial condition uncertainty, and frequently these studies ignore parameter uncertainty or model process error.
Here, we build on a decade of work on uncertainty propagation and analysis within the PEcAn model–data informatics system (LeBauer et al., 2013; Fer et al., 2020) to generate the most complete and robust uncertainty accounting available to date. At a high level, we use an ensemble-based approach to propagate data-driven uncertainty in model drivers, initial conditions, and parameters, with parameter distributions derived from an across-site hierarchical Bayesian calibration against Ameriflux NEE and plant trait data. In addition, within the SDA we rely on the new Tobit–Wishart Ensemble Filter (TWEnF) to dynamically update and propagate a fully Bayesian estimate of the entire process error covariance matrix. The TWEnF also relaxes the strong Gaussian assumption behind most data assimilation methods, allowing for both zero-truncation (no negative values) and zero-inflation (more zeros than expected under the Gaussian), both of which are common features of ecological data.
In this study, we focus on scaling up our previous site-level SDA to a national-scale terrestrial carbon reanalysis. In doing so, we developed, tested, and calibrated a spatial localization algorithm for our covariance matrix (Petrie and Dance, 2010), estimating a 500 km distance threshold beyond which covariances are set to zero to avoid spurious correlations. This system can run on an irregular spatial grid that uses cluster analyses to optimally distribute points in a multivariate environmental space (i.e., temperature, precipitation, elevation, AET, climatic water deficit), which reduces computational costs and better captures complex terrain than a regular grid. We also extended the TWEnF to calculate and save the spatiotemporally varying posterior weights of each ensemble member, which allows us to infer sub-daily fluxes (e.g., GPP, NEE) across the simulation region even though our assimilation was run at an annual timescale. Having developed this system, we then report on our initial proof-of-concept reanalysis, which was constrained by MODIS leaf area index (LAI) and LandTrendr aboveground biomass (AGB). Finally, we perform an initial validation against Ameriflux NEE and SoilGrids soil organic carbon.

PEcAn Platform and SIPNET model
We developed our regional state data assimilation framework within an ecological model–data informatics system, the Predictive Ecosystem Analyzer (PEcAn) v. 1.7.2, under a module named assim.sequential. PEcAn is a toolbox that provides a unified format and workflow for processing inputs and outputs, running models, and performing analyses for a wide range of ecological models. PEcAn tools include sensitivity and uncertainty analysis, benchmarking, and parameter and state data assimilation (LeBauer et al., 2013; Fer et al., 2018; Raiho et al., 2020; Fer et al., 2020). PEcAn also provides a robust procedure for sampling and propagating uncertainties by generating ensembles across initial conditions, model parameters, and model drivers (soil and meteorological forcing). All methodological developments and tool improvements introduced in this paper have been added to PEcAn, which is available at https://doi.org/10.5281/zenodo.5557914 and on GitHub (https://github.com/pecanproject/pecan/), as a virtual machine (https://opensource.ncsa.illinois.edu/projects/artifacts.php?key=PECAN), or as a series of Docker containers (https://hub.docker.com/u/pecan). PEcAn documentation is available through the project webpage (https://pecanproject.org). Specifically, pull requests #2045, #2481, #2293, #2233, and #2066 on the PEcAn GitHub repository (https://github.com/pecanproject/pecan/) contain the largest contributions of this work to the PEcAn project.
We used the Simplified Photosynthesis and Evapotranspiration (SIPNET) model (Braswell et al., 2005) within the PEcAn platform v. 1.7.2 as the process model for our regional data assimilation exercise. SIPNET was chosen for this initial proof-of-concept because it is computationally efficient yet provides a non-trivial representation of carbon pools and fluxes. SIPNET represents the carbon cycle as a series of pools and fluxes, accounting for two vegetation carbon pools (leaf and wood) and a single aggregated soil carbon pool (Fig 1). To reduce the number of parameters, SIPNET assumes that carbon stored in the leaves remains constant throughout the growing season; therefore, all fluxes in the carbon cycle (Fig 1) only affect the plant wood carbon pool. In addition, rather than a complex carbon allocation and phenology model, SIPNET uses a simple time-based phenology function whereby, on a specific day of the year, all leaves appear or fall off in a single time step. This simple representation of the carbon cycle in SIPNET provides the opportunity to fully constrain both model parameters and state variables.
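This pool-and-flux bookkeeping can be illustrated with a toy sketch (all function names, parameter values, and units here are invented for illustration; this is not the actual SIPNET code): leaf carbon is held fixed within the growing season, so net carbon gains and losses act on the wood pool, with litter and turnover transferring carbon to the single soil pool.

```python
def step_pools(wood, soil, leaf, gpp, ra, leaf_litter, wood_turnover, rh):
    """One toy time step of a SIPNET-like three-pool carbon budget.

    Leaf C is held constant within the growing season, so all net
    carbon gains/losses are applied to the wood pool; litter and
    turnover transfer carbon to the single soil pool, which loses
    carbon via heterotrophic respiration (rh)."""
    npp = gpp - ra                                  # net primary productivity
    wood = wood + npp - wood_turnover - leaf_litter
    soil = soil + wood_turnover + leaf_litter - rh
    return wood, soil, leaf

# toy illustration (kg C m-2 per step; all values invented)
w, s, l = step_pools(wood=8.0, soil=12.0, leaf=0.3,
                     gpp=1.2, ra=0.5, leaf_litter=0.1,
                     wood_turnover=0.2, rh=0.4)
```

The key structural feature, mirrored in the reanalysis, is that the soil pool is only reachable through fluxes from the vegetation pools, which is why covariances with AGB and LAI can constrain it.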

Model Calibration
To efficiently capture variability in vegetation properties across the contiguous US (CONUS), we aggregated National Land Cover Database (NLCD) land cover (Homer et al., 2012) into four plant functional types (PFTs): deciduous forest, evergreen forest, mesic grassland, and arid grassland/shrubland. For deciduous forest, we rely on the previous calibration reported by Fer et al. (2021). The same methods were used to calibrate the remaining PFTs against a combination of plant trait data and eddy covariance, as described below.

Priors and Trait Meta-analysis
PEcAn takes a fully Bayesian approach to model calibration. Calibration begins with defining prior distributions for 35 parameters for the grass PFT and 32 parameters for the evergreen PFT. These priors were then updated using plant trait data in the BETY-db (LeBauer et al., 2018) by performing a hierarchical Bayesian meta-analysis following the methods described in LeBauer et al. (2013). Trait data for 7 parameters (SLA, Amax, leafC, leaf respiration rate, leaf turnover rate, root respiration rate, root turnover rate) were available, from 6 to 213 studies depending on the parameter. Meta-analysis posteriors were then used as informative priors in the subsequent sensitivity analysis and calibration, where they help ensure the selection of biologically plausible parameter estimates.
Total sensitivity indices were calculated from the sums of squares (SSQ), e.g., for the first parameter:

S_T1 = (SS_1 + SS_12 + …) / SS_T

where SS_1 is the SSQ for the first parameter, SS_12 is the SSQ of the interaction between the first and second parameters, and SS_T is the total SSQ. Based on the results of these analyses, parameters with a contribution of at least 5% to the total variability of NEE were selected for further investigation and calibration for both the conifer and grass PFTs.
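Under this sum-of-squares partition, a parameter's total sensitivity index collects its main-effect SSQ plus every interaction SSQ involving it, normalized by the total SSQ. A minimal sketch (the dictionary layout and names are illustrative, not the actual PEcAn data structures):

```python
def total_sensitivity(ssq_terms, total_ssq, param):
    """Total sensitivity index: the fraction of total SSQ attributable
    to a parameter, counting its main effect and every interaction
    term that involves it. ssq_terms maps tuples of parameter names
    to their SSQ contributions."""
    contrib = sum(v for names, v in ssq_terms.items() if param in names)
    return contrib / total_ssq

# toy decomposition: SS_1, SS_2, and the SS_12 interaction (values invented)
ssq = {("p1",): 6.0, ("p2",): 2.0, ("p1", "p2"): 2.0}
st_p1 = total_sensitivity(ssq, total_ssq=10.0, param="p1")
```

With these toy values, p1 accounts for (6 + 2)/10 = 0.8 of the variance, well above the 5% screening threshold described above.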

Hierarchical parameter data assimilation
The hierarchical Bayesian parameter data assimilation (HPDA) framework within the PEcAn platform v. 1.7.2 was employed to constrain the most sensitive parameters of the SIPNET model identified during the sensitivity analysis. We used HPDA to calibrate model parameters for a combined mesic and semiarid grassland PFT and an evergreen forest PFT against NEE observations from 6 eddy-covariance towers (Table 1). In contrast to a single-site, single-parameter calibration, hierarchical calibration accounts for site-to-site variability while also borrowing strength across sites during the parameter estimation procedure. Consequently, parameters are calibrated for all sites at the same time, and the estimated variability across sites generally identifies site-sensitive parameters or missing processes in models (Fer et al., 2018).
Our calibration procedure closely followed Fer et al. (2018) by developing emulators of the likelihood surface at the site level. To develop the emulators, we used an adaptive sampling approach (Fer et al., 2018) to generate an n-dimensional grid over the parameter space (where n is the number of parameters) and then ran the SIPNET model at the n-dimensional points (i.e., knots) in this parameter set. Model runs were driven by gap-filled tower meteorology and varied in length from site to site depending on available data (Table 1). For each run, we then compared the model's predicted NEE to the observed NEE (30-min timestep, no gap-filling) and calculated a likelihood score for each knot. Observations were filtered based on u* and the effective sample size was corrected for autocorrelation following Fer et al. (2018). In keeping with previous work (Fer et al., 2018; Reich and Cotter, 2015), we used an asymmetric heteroskedastic Laplace likelihood that accounts for the increase in observation error with the magnitude of the flux and for error bias; in our case, these errors are larger in the positive direction (night-time respiration) than the negative. Site-level emulators were then developed by fitting a Gaussian Process (GP) model to the knots to construct a smooth n-dimensional likelihood surface in parameter space. After developing the site-level emulators, the hierarchical calibration was performed by MCMC. In the hierarchical MCMC, site-level parameters are accepted or rejected using the across-site mean and covariances as priors and using the emulator to predict the likelihood, while the across-site mean and covariances are updated through Gibbs sampling (Fer et al., 2018):

µ_s,i ~ MVN(µ_g, τ_g), i = 1, …, n

where µ_g and τ_g represent the global mean parameter vector and the site-to-site covariance matrix, and µ_s,i is the site-level mean parameter vector for each of the n sites. µ_g and τ_g were assigned uninformative multivariate Normal and Wishart priors, respectively.
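The flux likelihood can be sketched as follows (a minimal illustration of an asymmetric, heteroskedastic Laplace log-likelihood; the scale parameterization, default values, and the factor-of-two asymmetry are all assumptions for illustration, not PEcAn's exact form):

```python
import math

def laplace_loglik(obs, pred, scale0=0.5, scale_slope=0.1):
    """Sketch of an asymmetric heteroskedastic Laplace log-likelihood:
    the observation-error scale grows with the magnitude of the
    predicted flux, and positive residuals (night-time respiration)
    are assigned a larger scale than negative ones."""
    ll = 0.0
    for y, m in zip(obs, pred):
        resid = y - m
        b = scale0 + scale_slope * abs(m)   # heteroskedastic scale
        if resid > 0:
            b *= 2.0                        # assumed asymmetry factor
        ll += -math.log(2.0 * b) - abs(resid) / b
    return ll
```

In the emulator workflow, a score like this would be computed at each knot and the Gaussian process fit to the resulting likelihood surface.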
Each calibration was run in three iterative rounds with adaptive sampling following Fer et al. (2018), where the total number of proposed knots per round was set to p × 20 (where p is the number of parameters). In each round we used posterior mean

Regional Site Selection
As previously noted, our data assimilation system runs on an irregular grid optimized to efficiently capture environmental variability. For the current proof-of-concept analysis, we identified a total of 517 sites across CONUS by first selecting 46 Ameriflux sites that would be used to validate the assimilated fluxes, followed by 471 additional potential sites based on a cluster analysis of land cover class (NLCD) and climate variables. The cluster analysis used PRISM 800 m precipitation, maximum and minimum temperature, and elevation, in addition to estimated actual evapotranspiration (AET), climatic water deficit (WD), total surface radiation, rain, and snow. Actual evapotranspiration and water deficit are biologically meaningful parameters that are well correlated with the distribution of vegetation types across multiple spatial scales (Stephenson, 1998).
We began by using PRISM 800 m elevation and produced monthly global total irradiance for CONUS using GRASS GIS (v7.8) (Šúri and Hofierka, 2004). Total surface radiation was assumed to remain consistent throughout the study period, as physical topography and solar distance change little at the decadal time scale. To estimate AET, WD, rain, and snow, we collected 1° × 1° historical monthly wind vector data from CCSM4 (Danabasoglu et al., 2020) and estimated 1981–2010 monthly normals using the methods reported in Morrison et al. (2019). Using the radiation and wind speed, along with maximum and minimum temperature and precipitation, we derived climatic (not biological) AET and WD, resulting in monthly climate normals for AET, WD, rain, and snow following the methods described in Morrison et al. (2019). Next, a k-means sampling approach was used to determine the number of unique bioenvironmental clusters within each USGS Level 1 Ecoregion of CONUS (Omernik, 1987). Maximum and minimum temperature, radiation, AET, WD, rain, snow, and land cover class pixels for a given ecoregion were supplied to a k-means algorithm. We performed multiple cluster analyses, increasing the value of k by one until at least two clusters were no longer distinct (i.e., overlapped one another). The number of unique clusters for an ecoregion was then set to the maximum k value among the distinct/non-overlapping cluster runs.
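The increase-k-until-overlap logic can be sketched in one dimension (the real analysis ran on the multivariate climate/land-cover space, and the "distinctness" criterion below, centers farther apart than the sum of cluster radii, is an illustrative stand-in for whatever overlap test was actually used):

```python
import random

def kmeans_1d(xs, k, iters=50, seed=0):
    """Minimal 1-D k-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

def clusters_distinct(centers, groups):
    """Illustrative overlap check: every pair of centers must be farther
    apart than the sum of the two clusters' radii."""
    radii = [max((abs(x - c) for x in g), default=0.0)
             for c, g in zip(centers, groups)]
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if abs(centers[i] - centers[j]) <= radii[i] + radii[j]:
                return False
    return True

def max_distinct_k(xs, kmax=10):
    """Increase k by one until clusters stop being distinct; return the
    last k for which all clusters were non-overlapping."""
    best = 1
    for k in range(2, min(kmax, len(xs)) + 1):
        centers, groups = kmeans_1d(xs, k)
        if clusters_distinct(centers, groups):
            best = k
        else:
            break
    return best
```

Each ecoregion's resulting k then serves as the sampling weight when allocating sites, as described below.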

Lastly, each ecoregion's k value was used as a weight to determine the number of sites per ecoregion, generating the additional 471 randomly sampled sites to run in the SDA workflow, for a total of 517 potential sites. Only sites with acceptable MODIS LAI QC values were used in the SDA (000, best result possible; 001, good, very usable (Myneni and Park, 2015)), resulting in a total of 493 sites included in the SDA simulations.

The primary premise of sequential State Data Assimilation (SDA) methods is that neither models nor observations are perfect, but we can reduce uncertainties by finding common ground between model simulations and multiple observations. SDA relies on an iterative cycle between forecasts and updates/analyses that leverages Bayes' theorem to update the forecast (prior) based on new observations (likelihood) (Fig 3). For example, for a given site and time, prior to collecting data, the model forecast is the best estimate of the state of the system. The model's forecast can be summarized in terms of a mean vector of state variables, µ_f, and an error covariance matrix, P_f. Once observations are made, we can update this prior using the newly observed data Y (with associated observation error R) as a component of Bayes' theorem to calculate a corrected estimate of the state of the system, µ_a and P_a. This update then serves as the initial condition for further projections in time (Fig 4) (Dietze, 2017).
Classical data assimilation algorithms such as the Ensemble Kalman Filter (Evensen, 2009) have been used extensively in weather and ecological forecasting (Viskari et al., 2020). However, most of these methods make strong assumptions about the probability distribution of the forecasts and observations and lack the ability to estimate the model process error. For example, the Gaussian assumption in the EnKF algorithm can generate negative soil carbon estimates, and the omission of process error may lead to overconfidence in model forecasts.
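For reference, the classical Gaussian analysis step that these filters share can be sketched for a two-state system with a single scalar observation (an illustrative helper in plain Python, not PEcAn code):

```python
def kalman_analysis(mu_f, P_f, y, H, R):
    """Kalman analysis step for a 2-state system with one scalar
    observation y = H·x + noise(R). Returns the posterior mean and
    covariance (mu_a, P_a)."""
    # innovation variance: S = H P_f H^T + R (scalar here)
    Hx = sum(H[j] * mu_f[j] for j in range(2))
    PHt = [sum(P_f[i][j] * H[j] for j in range(2)) for i in range(2)]
    S = sum(H[i] * PHt[i] for i in range(2)) + R
    K = [PHt[i] / S for i in range(2)]                 # Kalman gain
    mu_a = [mu_f[i] + K[i] * (y - Hx) for i in range(2)]
    P_a = [[P_f[i][j] - K[i] * PHt[j] for j in range(2)] for i in range(2)]
    return mu_a, P_a

# observe only state 0 (e.g., AGB); the forecast covariance propagates
# the innovation into the unobserved state 1 (e.g., soil C)
mu_a, P_a = kalman_analysis(mu_f=[5.0, 10.0],
                            P_f=[[1.0, 0.5], [0.5, 2.0]],
                            y=6.0, H=[1.0, 0.0], R=1.0)
```

Note how the off-diagonal forecast covariance carries the AGB innovation into the unobserved second state: this is the same mechanism by which the reanalysis constrains soil C without direct observations.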
To account for censored state variables in our process model, we used the Tobit–Wishart Ensemble Filter (TWEnF) to perform the analysis step within our data assimilation workflow (Fig 4). The forecast mean, covariance, and the likelihood function were transformed to the Tobit space to account for left- or right-censored state variables (e.g., carbon pools cannot be negative). In addition, the prior for our process covariance matrix was assumed to be Wishart, and the prior shape parameters were stored and updated at each time step. This gave us the flexibility not only to relax the Gaussian assumption behind the Kalman filter family of data assimilation methods but also to estimate and propagate model process error over the assimilation period.

The TWEnF algorithm was expanded in this study to allow for regional data assimilation and multiple types of model process error, Q:

X_i = x_i^f + ε_i,  ε_i ~ MVN(0, Q)

where x_i^f and X_i are the model estimates for ensemble member i before and after accounting for process error, Q, respectively. When assimilating across many sites, a finite ensemble can produce spurious long-distance covariances; addressing this in the data assimilation literature is called the localization problem (Petrie and Dance, 2010). In this study, we set the localization threshold to 500 km, with the covariances between sites adjusted using an exponential decay function:

P_loc = P ∘ exp(−d/ν)   (Eq 9)

where d is the matrix of distances between sites, ν is the cutoff distance beyond which covariances are pushed toward zero, and ∘ denotes element-wise multiplication.
More details on the localization setup can be found in Appendix A.
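The localization adjustment can be sketched as an element-wise taper on the between-site covariance matrix (the exponential form below follows the decay function described above; the exact taper implemented in PEcAn may differ in detail):

```python
import math

def localize(P, d, nu=500.0):
    """Element-wise exponential-decay localization: covariances between
    sites are damped toward zero as the inter-site distance d (km)
    grows relative to the cutoff distance nu (km)."""
    n = len(P)
    return [[P[i][j] * math.exp(-d[i][j] / nu) for j in range(n)]
            for i in range(n)]

P = [[1.0, 0.8], [0.8, 1.0]]
d = [[0.0, 1000.0], [1000.0, 0.0]]   # two sites 1000 km apart
P_loc = localize(P, d)               # off-diagonals damped by e^-2
```

Diagonal (within-site) variances are untouched since d_ii = 0, while covariances between sites well beyond ν are effectively zeroed, suppressing spurious long-range correlations.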
The new regional-scale generalization of the TWEnF analysis step was implemented in PEcAn v. 1.7.2 and fit in the R statistical software (R Core Team, 2013) using the NIMBLE package (de Valpine et al., 2017). At every timestep, we ran three MCMC chains for a total of 100,000 steps and discarded the first 10,000 as burn-in. The primary variables of interest were the posterior estimates of µ_a and P_a and the parameters controlling the covariance and shape of the process error (Q), while assuming R is known. We reduced the total number of unknown parameters by assuming a shared process error Q among all sites.
Aboveground biomass (AGB) and leaf area index (LAI) estimates were extracted from satellite imagery to act as the observational data assimilated in the SDA algorithm. Estimates of annual AGB and its pixel-wise standard deviation were derived from the Landsat LandTrendr product, while LAI estimates were taken from MODIS MOD15; LAI values with quality flags worse than "good, very usable" (001) were removed from the LAI observation dataset and not used for assimilation (Myneni and Park, 2015). Since the assimilation was run on an annual basis, with the analysis occurring on July 15th, peak LAI was used as our annual LAI metric. Peak LAI was defined as the maximum LAI value within the 95th percentile for a single site and year. Similarly, the SD of peak LAI was defined as the corresponding SD value associated with the peak LAI value for each site/year. Any LAI SD value < 0.6 was reassigned a value of 0.6 following Viskari et al. (2015).
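One possible reading of the peak-LAI rule can be sketched as follows (an illustrative helper under the assumption that "maximum LAI within the 95th percentile" means the value at roughly the 95th percentile of a site-year's series, which guards against single-pixel spikes; the SD floor follows the text):

```python
def peak_lai(lai, lai_sd, sd_floor=0.6, pct=0.95):
    """Annual peak LAI sketch: take the LAI value at ~the 95th
    percentile of a site-year's time series, and floor its paired SD
    at 0.6 (Viskari et al., 2015)."""
    order = sorted(range(len(lai)), key=lambda i: lai[i])
    idx = order[int(pct * (len(lai) - 1))]   # index of ~95th-percentile value
    return lai[idx], max(lai_sd[idx], sd_floor)

peak, sd = peak_lai([0.5, 3.2, 4.1, 4.0, 1.0],
                    [0.2, 0.3, 0.4, 0.5, 0.2])
```

The pairing matters: the SD reported to the filter is the one observed alongside the selected LAI value, not an average over the year.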
At the end of each assimilation cycle, we adjusted the updated state variables using the ensemble adjustment technique. Our reanalysis was run with 20 ensemble members, where each is associated with a different set of species parameters, meteorological drivers, and initial conditions. Parameter vectors for each ensemble member were drawn from the HPDA distribution for the across-site mean, µ_g. The meteorological drivers covered the period from 1986–2019 and were resampled from the 10-member ERA5 ensemble reanalysis product, which has a 3-hr time step and a resolution of 0.5625 degrees (∼62 km).
Initial condition distributions were generated by resampling leaf, stem, and soil carbon pool data extracted from the Ameriflux Biological, Ancillary, Disturbance, and Management (BADM) database on an EPA L1 ecoregion basis.

Post hoc ensemble weight estimation
To focus our SDA on the model's state variables (carbon pools) and minimize the dimensions of X for more efficient numerical sampling in Eq 5, we only included the main carbon cycle state variables (aboveground biomass, leaf area index, and soil carbon pool) in the X matrix. Therefore, the model output variables that were not included in X, such as carbon and water fluxes, were not adjusted during the analysis step itself. Instead, these additional outputs were updated post hoc for each site and time step by estimating the posterior probability of each ensemble member:

w_i ∝ N(X_i | µ_a, P_a)   (Eq 10)

Equation 10 estimates the likelihood of each ensemble member's simulation given the posterior (analysis) state of the system.
Eq 10 provides a relative weight for each ensemble member that varies by year and site. A weighted mean and variance were then estimated for NEE, GPP, autotrophic respiration, and heterotrophic respiration using these weights. The weights are applied not just to the cumulative sums of the fluxes (annual totals) but to the high-frequency time series between analysis time points. This approach was validated by comparing the reanalysis NEE to the Ameriflux observed NEE at a subset of points, with a more detailed validation occurring in a follow-up paper (Morrison et al., in prep).
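The post hoc weighting can be sketched for a single scalar state (a minimal Gaussian-density version of the idea; the full system evaluates a multivariate density over all assimilated pools):

```python
import math

def ensemble_weights(x_f, mu_a, var_a):
    """Relative weight of each ensemble member: its density under the
    analysis posterior N(mu_a, var_a), normalized to sum to one."""
    dens = [math.exp(-0.5 * (x - mu_a) ** 2 / var_a) for x in x_f]
    total = sum(dens)
    return [d / total for d in dens]

def weighted_flux(flux, w):
    """Weighted mean and variance of an unassimilated flux (e.g., NEE)
    using the state-based ensemble weights."""
    mean = sum(wi * fi for wi, fi in zip(w, flux))
    var = sum(wi * (fi - mean) ** 2 for wi, fi in zip(w, flux))
    return mean, var

# members whose states sit near the analysis mean get the most weight;
# those weights then re-average the members' flux time series
w = ensemble_weights([9.0, 10.0, 11.0], mu_a=10.0, var_a=1.0)
nee_mean, nee_var = weighted_flux([-1.0, -2.0, -3.0], w)
```

Because the weights come from the assimilated pools, no flux observations are consumed at this stage, which is what makes the later comparison against Ameriflux NEE an out-of-sample check.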

Soil Carbon Validation
To assess our ability to indirectly infer soil carbon from assimilation of LAI and AGB into the SIPNET model, we compared our SOC estimates at all sites against the SoilGrids dataset (Hengl et al., 2017). SoilGrids is an interpolated data product at 250 m resolution produced using machine learning models that take advantage of global soil profile information and a series of covariate data. We extracted the average soil carbon estimates from the SoilGrids database down to 2 m depth to compare against our reanalysis estimates.
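Aggregating a layered soil product to a single stock down to a target depth amounts to a depth-weighted sum; a sketch (illustrative aggregation with invented layer boundaries and densities; SoilGrids' actual layer handling and units may differ):

```python
def soc_to_depth(layer_tops, layer_bottoms, soc_density, max_depth=200.0):
    """Depth-weighted soil organic carbon stock down to max_depth (cm),
    from per-layer SOC densities (kg C m-3); boundaries in cm. Layers
    extending past max_depth are truncated."""
    total = 0.0
    for top, bot, soc in zip(layer_tops, layer_bottoms, soc_density):
        thickness_m = (min(bot, max_depth) - min(top, max_depth)) / 100.0
        total += soc * thickness_m
    return total   # kg C m-2
```

A site-level stock computed this way is then directly comparable to the model's single aggregated soil C pool.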

Global sensitivity analysis and HPDA
The main purpose of the sensitivity analysis was to identify model parameters with a large influence on the behaviour of SIPNET output variables such as NEE. The estimated variability presented in Fig 5 averages the sensitivity of NEE to each parameter across multiple years and different sites, presenting just the parameters that met our 5% variance criterion. Because NEE is the difference between Gross Primary Productivity (GPP) and total ecosystem respiration, we expected that parameters that primarily regulate modeled GPP and respiration would be the most sensitive with respect to NEE. Soil organic matter (SOM) decomposition rate, which controls heterotrophic respiration, was the most sensitive parameter for both plant functional types. Following SOM respiration rate, photosynthetic optimum temperature (psnTOpt), photosynthetic maximum capacity (Amax), growth respiration factor, and soil respiration Q10 were found to be sensitive in both the conifer and grass PFTs.
Amax and psnTOpt contribute directly to the GPP calculation, while the growth respiration factor contributes to autotrophic respiration and determines the fraction of GPP that is available for growth. The soil respiration Q10 and psnTOpt both demonstrate SIPNET's high sensitivity to temperature. In addition to shared parameters, the conifer PFT also showed a high sensitivity of GPP to vapor pressure deficit (VPD), reflected in the importance of the slope (dVPDSlope) and exponent (dVpdExp) in that equation, and to specific leaf area (SLA), which reflects how much new leaf area a plant can 'buy' for a given investment in leaf carbon. The grass PFT, on the other hand, showed a high sensitivity to immedEvapFrac, the proportion of incoming precipitation that is intercepted and re-evaporated rather than being allowed to enter the soil and become available for transpiration. The sensitive parameters found in this study are similar to those of Fer et al. (2018), who estimated the sensitivity of SIPNET model parameters to NEE and latent heat flux for a temperate deciduous PFT. Even though Fer et al. (2018) used a local sensitivity analysis approach, compared to the GSA used in this study, SOM respiration rate, psnTOpt, soil respiration Q10, and the slope of the VPD–photosynthesis relationship (dVPDSlope) were found to be sensitive in both studies. In particular, the relative importance of parameters in simulating NEE by SIPNET for the grass PFT is similar to that found by Fer et al. (2018), confirming the general behaviour of the SIPNET model.
SIPNET is representative of a larger class of models with simple representations of carbon pools and fluxes, which are efficient and thus useful for computationally intensive large-scale data assimilation problems. After constraining the model against half-hourly flux tower NEE, SIPNET showed considerable improvement (Fig 6). Nash–Sutcliffe model efficiency showed an improvement for all years (2006–2011) and PFTs (a total of 134,000 combined observations for the grass PFT and 251,000 observations for the conifer PFT) except for conifers in 2011 (Table 1), comparable to improvements previously reported for the temperate deciduous PFT over two years of simulation at a single site. Both model assessment indices suggest that the SIPNET model is more constrained after calibration and ready for use within the state data assimilation routine.
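Nash–Sutcliffe efficiency (NSE) compares model error against the variance of the observations, so a value of 1 is perfect, 0 is no better than predicting the observed mean, and negative values are worse than the mean. A quick sketch:

```python
def nash_sutcliffe(obs, pred):
    """NSE = 1 - SSE/SST: 1 is perfect, 0 matches the observed mean,
    negative is worse than predicting the mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst

nse = nash_sutcliffe([1.0, 2.0, 3.0], [1.1, 2.0, 2.9])
```

An increase in NSE after calibration, as reported above, means the calibrated model explains a larger fraction of the observed NEE variance.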

State data assimilation
As mentioned before, SIPNET represents the terrestrial carbon budget by tracking three carbon pools (plant wood carbon, plant leaf carbon, and soil carbon) and simulates the major carbon fluxes among these pools (allocation, turnover) and between these pools and the atmosphere (photosynthesis, autotrophic and heterotrophic respiration) (Fig 1). In our data assimilation procedure, we sequentially assimilated aboveground biomass into the plant wood carbon and leaf area index into the plant leaf carbon on an annual time step, allowing soil carbon to be constrained based on the covariance estimated through the process model. Data assimilation was performed at all sites at the same time, with off-diagonal values in the process-model error covariance adjusted according to Eq 9.

As an example of the general behaviour of the SDA, Fig 7 shows the reanalysis for a representative conifer site. The initial forecast starting from the BADM initial conditions has a wide posterior distribution and a relatively unconstrained mean forecast. The analysis, on the other hand, converges quickly towards the observed AGB and thereafter remains a compromise between the observation and the model forecast given their uncertainties. If no observation was available, for instance in the case of LAI from the beginning of the simulation until the year 2000 (Fig 7), the analysis step relied on the covariances estimated among the state variables, both within a site and with adjacent sites. The ability to take advantage of these covariances for updating a state variable with no observation allows the SDA to constrain variables, such as soil C, that lack direct, site-level observations at the large scales required for data assimilation. At the end of each time step, ensemble members were adjusted given the estimated µ_a and P_a and used as initial conditions for the next time step.
Starting in year 2000, we added a second observational constraint, MODIS LAI, after which the analysis LAI contracted considerably around the observations. The soil C pool reanalysis varied depending on the observation error and the uncertainty in the driving forces at different sites and did not show an obvious response to the addition of LAI as a constraint (Fig 7).

Flux validation and ensemble weighting
The knowledge gained through data assimilation about the state of the carbon pools at each site and year was then transferred post hoc to the flux time series over the preceding time interval by estimating the posterior probability of each ensemble member and using these probabilities to weight each ensemble member. At this stage, no additional observations were used to adjust the model estimates of carbon fluxes such as NEE (Fig 8). For example, at the Black Hills Ameriflux site (Fig 8), the NEE forecast by SIPNET already accounts for the initial condition constraint from the previous SDA step. Examining the weights assigned to each ensemble member and the flux forecasts associated with them (Fig 8, top panels) shows two useful patterns. First, there are ensemble members that get consistently low weight, and these include most of the ensemble members that predict unusual flux time series (e.g., ensemble members 20, 16, and 11). These ensemble members are likely associated with parameter combinations that are not representative of this site (and potentially of other sites), pointing to a future opportunity to combine simultaneous state estimation and hierarchical parameter constraints (Dokoohaki et al., 2021).

This could be done by including the parameters in the analysis and ensemble adjustment, effectively nudging parameters toward more supported values, or through parameter filtering and resampling, as is done in a particle filter (Dietze, 2017). The second pattern apparent in the ensemble weights is that the same ensemble members do not perform the same every year.
For example, an ensemble member may perform well one year, only to get very low weight the very next year, and then perform well again a few years later (e.g., ensemble members 2 and 5). To the extent that this reflects variations in parameters, not just initial condition and driver variability, it suggests that the 'best' parameters may vary through time. An important future direction is to perform a more in-depth analysis of how model parameter weights vary in both space and time to better understand which parameters are most variable, as this points to processes that are unaccounted for in the underlying models (e.g., if SOM respiration rate varies spatiotemporally, it suggests the need to add additional terms to account for this variability). Furthermore, examining the patterns of this variability in space and time can provide important clues about the underlying factors driving it. Finally, we would need to extend our hierarchical parameter constraint to account for spatiotemporal variability in parameters, as opposed to the current approach, which only captures spatial variability.
It is important to note that the patterns of ensemble weights in Fig 8 reflect only the assimilated pool states; an alternative would be to calculate the likelihood of the observed fluxes for each ensemble member and include this as a weight when calculating the forecast mean and covariance (i.e., µ_f and P_f). This approach is similar to a particle filter and to ensemble variational data assimilation (Pinnington et al., 2020). For either approach, there are important questions to be explored about the spatial range at which point-level flux observations provide meaningful constraint on flux reanalysis across landscapes and regions.

The degree to which our data assimilation algorithm is able to constrain the soil C pool, which is not observed directly but inferred from the other observations, is related to the quality of the AGB and LAI observations at each site, the strength of the covariances between the plant and soil pools, and the overall accuracy of the model. A comparison between the distribution of soil C estimated in this study and the SoilGrids database shows slightly higher mean soil C estimates in this study (13.11 vs. 8.77 kg C m−2) but comparable variability across all plant functional types (Fig 9). The median uncertainty in soil C reanalysis over all sites is between 15–20 kg C m−2, and this uncertainty did not change with the addition of LAI as a new data constraint.
Our reanalysis did reasonably well at predicting SOC for the conifer PFT but was consistently higher than the corresponding SoilGrids estimates for all other PFTs. Arid grassland showed the largest overestimation, with a 153% bias relative to the SoilGrids dataset, whereas the deciduous PFT showed the smallest deviation, with only a 14% bias. The largest agreement between the two datasets (Index of Agreement (Willmott, 1981) ∼ 0.7) was likewise found for the conifer and deciduous PFTs, and the lowest agreement (0.28) for arid grassland.
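The two comparison metrics used above can be computed as in the sketch below. The function names are our own; the percent bias is expressed relative to the mean of the reference (SoilGrids) values, which is an assumption about the exact convention used.

```python
import numpy as np

def percent_bias(pred, obs):
    """Mean bias of predictions relative to the reference mean, in percent."""
    return 100.0 * (np.mean(pred) - np.mean(obs)) / np.mean(obs)

def index_of_agreement(pred, obs):
    """Willmott (1981) index of agreement, d in [0, 1] (1 = perfect)."""
    obs_mean = np.mean(obs)
    num = np.sum((pred - obs) ** 2)
    den = np.sum((np.abs(pred - obs_mean) + np.abs(obs - obs_mean)) ** 2)
    return 1.0 - num / den
```

Unlike a simple correlation, Willmott's d penalizes both additive and proportional differences, which is why a PFT can show modest bias but still have a low index of agreement.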
On rare occasions in this study (i.e., 2 sites in arid and 3 sites in mesic grassland), we produced estimates of soil C that are not compatible with estimates from other data products such as SoilGrids. This divergence might be attributed to instability in our statistical model at sites with sparse observations. Alternatively, mismatches in how the soil C pool is defined (i.e., labile versus recalcitrant) between different machine learning and process-based models may complicate the comparison between these data products. It is important to note that this assessment of the performance of our reanalysis does not constitute a true validation, because the SoilGrids product is itself a model. However, given that these two studies produced comparable mean soil C pools across ∼500 sites through two methods with different philosophies, our soil carbon data product can be regarded as a new instrument that enhances the observational information and enables the derivation of higher-level products.
In the future, as the spatial resolution of our reanalysis increases, it will be important to perform a more detailed assessment against a wider array of site-level soil data, such as the International Soil Carbon Network's database (Lawrence et al., 2020), the global soil respiration database (Blackard et al., 2008), and the National Ecological Observatory Network.

Forecast Uncertainty
One of the notable patterns in our proof-of-concept reanalysis was the tendency for the forecast uncertainty, P_f, to be fairly large even though our reanalysis posteriors had fairly high precision. In part, this occurs because this system propagates a much broader set of uncertainties than previous studies: variability in model forecasts can be attributed to uncertainties in meteorological drivers, model parameters, initial conditions (i.e., analysis posteriors), and the unexplained process error (Dietze, 2017).
We did not find substantial shrinkage in average forecast variability for the soil carbon and LAI pools after including the LAI observations in our data assimilation routine (Fig 10). In other words, adding a second data constraint did not result in a proportional reduction in forecast uncertainty for LAI and soil carbon. However, we noticed a slight reduction in forecast uncertainty for aboveground woody biomass after the addition of the second data constraint. The large observation error in MODIS LAI and the mismatch in spatial resolution between Landsat and MODIS may explain this result.
Similar findings were reported by (sch), whose multi-stream carbon DA exercises found that a single data stream fit almost as well as multi-data-stream assimilation, and by (Castro-Morales et al., 2019), who found that the largest portion of information came from the first decade of data assimilation, similar to what was found in this study. Altogether, this suggests that initial condition uncertainty is not the dominant driver of forecast uncertainty in the one-step-ahead predictions that occur within the forecast-analysis cycle. Previous site-scale assimilation within the same framework, but with a different model (LINKAGES), found process error, Q, estimated from the residuals after data assimilation, to be a dominant source of forecast uncertainty.

Given our simulation setup, we found that increasing the localization scaling factor from ν = 0 to 900 km initially decreased the spatial autocorrelation of residuals (Fig A1) up to ν = 500 km, after which the autocorrelation plateaued and then slightly increased for ν greater than 500 km. Using ν = 500 km, which yielded the lowest spatial autocorrelation, ensures, on the one hand, that our data assimilation algorithm takes advantage of local correlations among similar sites while, on the other hand, ignoring the spurious correlations estimated at larger distances.
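The localization step described above can be sketched as a distance-dependent tapering of the cross-site forecast covariance. This is a minimal illustration under stated assumptions: the exponential taper, the function name, and the elementwise (Schur product) application are our own illustrative choices, not necessarily the exact localization function used in the PEcAn SDA system.

```python
import numpy as np

def localize_covariance(P_f, dist_km, nu=500.0):
    """Taper cross-site forecast covariances by distance.

    P_f:     (n, n) forecast covariance matrix across sites
    dist_km: (n, n) pairwise site distances in km
    nu:      localization scaling factor in km (500 km minimized
             the spatial autocorrelation of residuals in our setup)
    """
    # Schur (elementwise) product with an exponential taper:
    # same-site variances (dist = 0) are untouched, while
    # long-range covariances are damped toward zero.
    taper = np.exp(-dist_km / nu)
    return P_f * taper
```

Because the taper equals 1 at zero distance, local covariances among nearby, similar sites are retained, while spuriously estimated long-range covariances are progressively suppressed, mirroring the trade-off described above.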