A model-independent data assimilation (MIDA) module and its applications in ecology

Models are an important tool to predict Earth system dynamics. An accurate prediction of future states of ecosystems depends on not only model structures but also parameterizations. Model parameters can be constrained by data assimilation. However, applications of data assimilation to ecology are restricted by highly technical requirements such as model-dependent coding. To alleviate this technical burden, we developed a model-independent data assimilation (MIDA) module. MIDA works in three steps including data preparation, execution of data assimilation, and visualization. The first step prepares prior ranges of parameter values, a defined number of iterations, and directory paths to access files of observations and models. The execution step calibrates parameter values to best fit the observations and estimates the parameter posterior distributions. The final step automatically visualizes the calibration performance and posterior distributions. MIDA is model independent, and modelers can use MIDA for an accurate and efficient data assimilation in a simple and interactive way without modification of their original models. We applied MIDA to four types of ecological models: the data assimilation linked ecosystem carbon (DALEC) model, a surrogate-based energy exascale earth system model: the land component (ELM), nine phenological models and a stand-alone biome ecological strategy simulator (BiomeE). The applications indicate that MIDA can effectively solve data assimilation problems for different ecological models. Additionally, the easy implementation and model-independent feature of MIDA breaks the technical barrier of applications of data–model fusion in ecology. MIDA facilitates the assimilation of various observations into models for uncertainty reduction in ecological modeling and forecasting.

ing to considerable biases in predictions (Tao et al., 2020). The situation becomes even worse when more detailed processes are incorporated into models (De Kauwe et al., 2017;Lawrence et al., 2019). Data assimilation (DA), a statistically rigorous method to integrate observations and models, is gaining increasing attention for parameter estimation and uncertainty evaluation. It has been successfully applied to many ecological models (Fox et al., 2009;Keenan et al., 2012;Richardson et al., 2010;Safta et al., 2015;Wang et al., 2009;Williams et al., 2005;Zobitz et al., 2011). However, almost all those DA studies require model-dependent, invasive coding (Walls et al., 2005). This requires a DA algorithm to be programmed for a specific model. Such model-dependent coding creates a large technical barrier for ecologists to use DA to solve prediction and uncertainty quantification problems in ecology. Thus a model-independent DA toolkit is required to facilitate the use of DA technique in ecology.
DA is a powerful approach to combine models with observations and can be used to improve ecological research in several ways . First, DA can be used for parameter estimation (Bloom et al., 2016;Hararuk et al., 2015;Hou et al., 2019;Ise and Moorcroft, 2006;Ma et al., 2017;Ricciuto et al., 2011;Scholze et al., 2007). It enables the optimization of parameter values across sites, time and treatments (Li et al., 2018;Luo and Schuur, 2020). For example, Hararuk and his colleagues applied DA to a global land model and substantially improved the explainability of the global variation in soil organic carbon (SOC) from 27 % to 41 % (Hararuk et al., 2014). When DA was combined with deep learning to improve spatial distributions of estimated parameter values, for example, the Community Land Model version 5 (CLM5) predicted the SOC distribution in the US continent with much higher R 2 of 0.62 than CLM5 with default parameters (R 2 = 0.32) (Tao et al., 2020). Second, DA can be used to select alternative model structures to better represent ecological processes (Liang et al., 2018;Van Oijen et al., 2011;Shi et al., 2018;Smith et al., 2013;). In the study by Liang et al. (2018), DA was used to evaluate four models. And a two-pool interactive model was selected after DA to best represent SOC decomposition with priming. Additionally, DA can be applied to locate the most informative data to reduce uncertainty, thus guiding the sensor network design (Keenan et al., 2013;Raupach et al., 2005;Shi et al., 2018;Williams et al., 2005). One DA study at Harvard Forest (Keenan et al., 2013) indicated that only a few data sources contributed to the significant reduction in parameter uncertainty. In spite of powerful applications of DA to ecological research, computational cost is a major hurdle, especially with complex models. Fer et al. (2018) developed a Bayesian model emulation to reduce the time cost of DA from 112 to 6 h with the simplified Photosynthesis and Evapotranspiration model. Overall, DA is essential for ecological modeling and forecasting (Jiang et al., 2018) and is helpful for evaluation of different inversion methods (Fox et al., 2009).
Applications of traditional DA to ecological research require highly technical skills of users. A successful DA application usually involves model-dependent coding to integrate observations into models. This requires users to have knowledge about model programming. For example, if a complex model (e.g., the community land model) is used in DA, users need to know the programming language (e.g., Fortran) of the model and its internal content to write DA algorithm into the model source code before DA can be conducted. The learning curve for model programming is steep for general ecologists. Furthermore, users often need to update the programming knowledge when a different model is used in DA. For example, scientists who implemented the DA algorithm coded in MATLAB (Xu et al., 2006) to an ecosystem carbon cycle model programmed in Fortran (e.g., TECO) need to understand both MATLAB and Fortran (Ma et al., 2017). Moreover, DA often involves reading observation files about a specific study site. As a result, users usually have to update the codes of model-dependent DA to read new observations from every new study site.
A number of tools have been developed to facilitate DA applications ( Table 1) but many of them are model dependent, such as the Carbon Cycle Data Assimilation Systems (CCDAS) (Rayner et al., 2005;Scholze et al., 2007), the Carbon Data Model Framework (CARDAMOM) (Bloom et al., 2016), the Ecological Platform for Assimilating Data (Eco-PAD) into model (Huang et al. 2019) and Predictive Ecosystem Analyzer (PEcAn) (LeBauer et al., 2013). These tools combine DA algorithms with a specific model. For example, CCDAS specified the DA algorithm to the Biosphere Energy Transfer Hydrology (BETHY) model (Rayner et al., 2005). The hardcoding feature of aforementioned tools make them inflexible to be applied to different models. There are some model independent DA tools that are not tailored to a specific model, such as Data Assimilation Research Testbed (DART) (Anderson et al., 2009), the open Data Assimilation library (openDA) (Ridler et al., 2014), the Parallel Data Assimilation Framework (PDAF) (Nerger and Hiller, 2013) and Parameter Estimation & Uncertainty Analysis software suit (PEST) (Doherty, 2004). However, these model-independent tools suffer from some limitations for a general and flexible DA application. For example, openDA requires users to code three functions to initialize a Java class (Ridler et al., 2014) (Table 1). DART enables incorporating a new model through a range of interfaces (Anderson et al., 2009). It has been successfully applied to atmospheric and oceanic models with currently available interfaces (Anderson et al., 2009;Raeder et al., 2012) and recently to the community land model (Fox et al., 2018). It is likely that users may need to prepare new interfaces for new ecological models to use DART. DART and PDAF adopted the Ensemble Kalman Filter (EnKF) method (Evensen, 2003), which may makes it difficult to obey mass conservation for biogeochemical models. This is because the parameter values estimated by EnKF change each time when  (Allen et al., 2003;Gao et al., 2011;Trudinger et al., 2007). The sudden changes in estimated parameter values at time points when data are assimilated by EnKF usually do not reflect reality of biogeochemical cycles in the real world. PEST utilizes the Levenberg-Marquardt method (Levenberg, 1944), which is a local optimization method for parameter estimation. If the relationship between simulation outputs and parameters is highly nonlinear, which is common in ecological models, this method may trap into a locally optimization solution (Doherty, 2004).
In this work, we developed a model-independent DA module (MIDA) to enable a general and flexible application of DA in ecology. MIDA was designed as a highly modular tool, independent of specific models, and friendly to users with limited programming skills and/or technical knowledge of DA algorithms. Additionally, MIDA implemented advanced Markov chain Monte Carlo (MCMC) algorithms for DA analysis which can accurately quantify the parameter uncertainty with informative posterior distribution. The anticipated user community in this initial phase of MIDA development is the biogeochemical modelers who are looking for appropriate parameter estimation methods. In the following Sect. 2, we first introduce the development details of MIDA and its usage. In Sect. 3, we demonstrate the application of MIDA to four different types of ecological models. In Sect. 4, we discuss the strengths and weaknesses of MIDA in ecological modeling, and lastly we give our concluding remarks in Sect. 5.

Bayes' theorem and DA
Based on Bayes' theorem, DA is a statistical approach to constrain parameter values and estimate their posterior density distributions through assimilating observations into a model. The posterior density distributions p(C|Z) of parameters C for a given observation Z can be obtained from prior density distributions p(C) and the likelihood function p(Z|C): (1) The prior density distribution p(C) is assumed as a uniform distribution over the parameter range. And the likelihood function is negatively proportional to a cost function, J , as The cost function measures the misfit between simulation outputs and observations and is described in more detail in Sect.  The workflow includes data preparation, execution of data assimilation (DA), and visualization. The data preparation step is to provide all the formatted essential data for DA via user input. The execution step is to calibrate parameter values towards a constrained posterior distribution with the fusion of observations. The visualization step is to diagnose the effects of DA. The rhombus in orange represents user-input data. The rectangle represents procedures, and document/multidocument shape is for data files in computers. Dashed lines indicate locations of data. Solids lines indicate data flow pathways. With the three-step workflow, DA is agnostic to specific models, and users will be released from technical burdens.

An overview of MIDA
MIDA is a module that allows for automatic implementation of data assimilation without intrusive modification or coding of the original model (https://doi.org/10.5281/zenodo.4762725, Huang, 2021). Its workflow includes three steps: data preparation, execution of data assimilation, and visualization ( Fig. 1).
Step 1 (data preparation) is to establish the standardized data exchange between the DA algorithm and the model.
Step 2 (execution of data assimilation) is to run DA as a black box independent of the model. Step 3 (visualization) is to diagnose parameter uncertainty after DA. The modularity of the three-step workflow is designed to enable MIDA for a rapid DA application and adaption to a new model. In the following, we introduce the three-step workflows of MIDA, its technical implementation, and its usage in detail.

Step 1: data preparation
Step 1 is designed to initialize data exchange to transfer parameter values, model outputs, observations, and their variances between the DA algorithm and the model to be used. Four types of information are required either from interactive input or by modifying the "namelist.txt" file ( Fig. 1). The first type is about DA configuration, including the number of sampling series in DA and the working path where the outputs of DA will be saved. The number of a sampling series is essential in a DA task to define how many times parameter values are sampled to run the model. The second type of information is about parameter ranges and their covariance. The third is the model executable file. Finally, the fourth type is an output configuration file which contains the file paths of model outputs, observations, and their variance. This file also instructs how to read model outputs and compare each output with corresponding observations. Traditional DA requires users to modify the code of the model to incorporate the process of data exchange between the DA algorithm and the model. Therefore, the program of data exchange in traditional DA is model-specific, and users need to repeat such a program when a new model comes. In MIDA, the process of data exchange calls a model executable file which hides the details of the model code. When applied to a new model, MIDA only requires users to provide a different model executable file in the namelist.txt file and does not involve any additional coding in either the model or MIDA. Thus, MIDA lowers the technical barrier for general ecologists to conduct DA.
Traditional DA usually presets the number of parameters and the model outputs according to a specific model before initializing the data exchange. This is because data exchange between the DA algorithm and model uses memory to transfer items such as parameter values. Instead, MIDA organizes items in data exchange using different files. Items in data exchange are decided by the data file loaded when MIDA is running. The number of parameter values, for example, will be decided after the file of parameter range is read in MIDA. Through modifying files, MIDA allows efficient choices about the model-related items in data exchange to be made. Thus, MIDA is highly flexible and modular for DA with different models.
Traditional DA also presets observation types in the data exchange according to a specific study before the data ex-change. For example, if the traditional DA uses carbon flux observation, it cannot switch to satellite remote sensing products without additional coding. MIDA uses the concepts of object-orient programming (Mitchell and Apt, 2003) and dynamic initialization (Cline et al., 1998) in computer science to provide a homogenous way to create various observation types from a unified prototype class. A prototype class includes variables to store observations and their variance and functions (e.g., read from observation files). The values in variables are dynamically decided after the observation files are loaded when MIDA is running. Different observation types derive from the prototype class with a high degree of reusability of most functions. In such a way, MIDA only requires users to provide different filenames of the observations to be integrated in DA. Therefore, MIDA is highly flexible and modular for DA to assimilate various observations.

Step 2: execution of data assimilation
After the establishment of the standardized data exchange (step 1), step 2 is to run DA as a black box for users without knowledge of DA itself. Notwithstanding the black-box goal, this section provides a general description of DA below.
Data assimilation as a process integrates observations into a model to constrain parameters and estimate parameter uncertainties. Data assimilation usually uses some types of sampling algorithms, such as Markov chain Monte Carlo (MCMC), to generate posterior parameter distribution under a Bayesian inference framework (Box and Tiao, 1992). As mentioned in Sect. 2.1, DA with a MCMC algorithm estimates the posterior density distributions through sampling to maximize likelihood function p(Z|C) or minimize the misfit J between simulation outputs and observations. This version of MIDA uses the MCMC algorithm implemented by the Metropolis-Hastings (MH) sampling method (Hastings, 1970;Metropolis et al., 1953). The future version of MIDA could incorporate other data assimilation algorithms. Each iteration in the Metropolis-Hastings sampling includes a proposing phase and a moving phase. The proposing phase generates a new set of parameter values based on the starting point for the first iteration or current accepted parameter values in the following iterations. If parameter covariance (cov param ) is specified in step 1 on data preparation, this proposing phase will draw new parameter values (C new ) within the prior ranges from a Gaussian distribution N(C old , cov param ), where C old is the predecessor set of parameter values. Without parameter covariance, a new set of parameter values will be generated from a uniform distribution within the prior ranges (Xu et al., 2006).
The moving phase first calculates mismatches between observations and the model simulation with the new set of parameter values as a cost function (J new in Eq. 3) (Xu et al., 2006): where n is the number of observations, Z i (t) is the ith observation at time t, X i (t) is the corresponding simulation, and σ 2 i is the variance of the observation. The error is assumed to independently follow a Gaussian distribution. This new set of parameter values will be accepted if J new is smaller than J old , the cost function with the previous set of accepted parameter values, or the value, exp − J new J old , is larger than a random number selected from a uniform distribution from 0 to 1 according to the Metropolis criterion (Liang et al., 2018;Shi et al., 2018;Xu et al., 2006). Once the new set of parameter values is accepted, J new becomes J old . Those two phases of sampling will be iteratively executed until the number of sampling series set in step 1 on preparation of DA is reached. Finally, the posterior density distributions can be generated from all the accepted parameter values.
MIDA realizes the execution of data assimilation according to the procedure described above. First, MIDA uses a "call" function to execute model simulations to get values of X i (t). Observations Z i (t) and their variance σ 2 i are already provided via the standardized data exchange as described in step 1. Then, MIDA calculates J new according to Eq. (3) to decide the acceptance of the current parameter values used in this simulation. If accepted, MIDA saves this set of parameter values and associated J new values in C accepted and J accepted array, respectively, and triggers a new proposing phrase based on this set of accepted parameter values. If not, MIDA discards this set of parameter values and generates another new set of parameter values. MIDA saves the new parameter values generated in the proposing phrase to "Param-eterValue.txt", from which the model reads before execution of the next model simulation. MIDA repeats the proposing and moving phases until the number of sampling series is reached. At the end, MIDA selects the best parameter values through maximum likelihood estimation and runs the model again using this set of values to get optimized simulation outputs X i (t). Then MIDA saves the arrays of accepted parameters, associated errors, maximum likelihood estimates (MLEs), and optimized state variables X i (t) to four files, "parameter_accepted.txt", "J_accepted.txt", "MLE.txt", and "OptimizedSimu.txt", respectively.
This execution of the DA algorithm in MIDA enables users to conduct DA as a black box and is independent of any particular model.

Step 3: visualization
Step 3 is to visualize the results of DA in step 2. The end products of DA are accepted parameter values, their associated J new values, the maximum likelihood estimates, and optimized simulation results as saved in the output files. MIDA enables visualization of parameter posterior density distributions with a Python script. In the script, MIDA first read accepted parameter values from the parameter_accepted.txt file. Then, MIDA generates a posterior probabilistic density function (PPDF) for each parameter via the "kdeplot" function in the "seaborn" package. The maximum likelihood estimates of parameters correspond to the peaks of PPDF. The distinctive mode of PPDF indicates how well the parameter uncertainty is constrained. Finally, MIDA visualizes the PPDF for all parameters in a figure using the "matplotlib" package.

Implementation and architecture of MIDA
MIDA is equipped with a graphical user interface (GUI), and users can easily execute it through an interactive window. Users can also run MIDA as a script program without the GUI. MIDA is written in Python (version 3.7). For the GUIversion, all relevant Python packages used in MIDA are compiled together; thus users do not need to install them by themselves. For the non-GUI version, users need to install Python 3.7 and relevant packages (i.e., numpy, pandas, shutil, subprocess, matplotlib, math, os, and seaborn). MIDA is compatible with model source codes written in multiple programming languages (e.g., Fortran, C/C++, C#, MATLAB, R, or Python). It is also independent of multiple operation systems (e.g., Windows, Linux, MacOS). In addition, MIDA is also able to run on high-performance computing (HPC) platforms via task management systems (e.g., Slurm).
The architecture of MIDA is class-based, and each class is designed to describe an object (e.g., parameter, observations) with variables and operations. Five classes are defined in MIDA: parameter, observation, initialization, MCMC algorithm, and the main program. The main program is the start of MIDA execution. It calls functions from all other classes to finish a three-step workflow. As described in Sect. 2.2, parameter and observation classes contain variables to be transferred in data exchanges via file I/O operations. These operations are implemented using the "numpy" package. The initialization class is to read namelist.txt in step 1 on data preparation and to assign values for the variables in all other classes. Then the class of MCMC algorithm conducts DA as described in step 2. In this step, the simulation operation uses a call function in the "subprocess" package to call the model executable file. At the start of model simulation, MIDA writes new parameter values to the "ParameterValue.txt" file in the "working path" directory specified in step 1 on data preparation. Then model executable reads parameter values from the ParameterValue.txt file and run. After model simulation, the DA algorithm can read the model outputs by the output filenames indicated in the output configuration file. After DA, step 3 executes an additional Python script to read accepted parameter values and plot the posterior density distributions of parameters. The plotting operations use the matplotlib and seaborn packages. The implementation of GUI uses the pyQt5 toolkit to support interactive usage of MIDA. Users can also run MIDA in a non-interactive way with a "main.py" script to trigger the three-step workflows.

User information of MIDA
In order to use MIDA, users need to prepare data and a model. The data to be used in MIDA are prior ranges and default values of parameters, parameter covariances, output configuration file, observations, and their variances. They are organized in different files. Before running MIDA, users need to specify their filenames as suggested in step 1. When users want to use different data sets in DA, they can simply change filenames with the new data sets via GUI or in the namelist.txt file. Figure C1 is an example of the namelist.txt file for a data assimilation study with the DALEC model. The model to be used in MIDA should have those to-be-estimated parameter values not fixed in model source code rather than changeable through ParameterValue.txt file. MIDA writes new parameter values in each proposing phase during DA to the ParameterValue.txt file, from which the model reads the parameter values to run the simulation.
To calculate the cost function, J , we have to have a oneto-one match between observations and model outputs. For example, phenology models in one of the application cases of MIDA below generate discrete dates of leaf onset, which is a one-to-one match to the observations of spring leaf onset. In this case, observation Z i (t) and model output X i (t) to be used in calculation of J are straightforward. In the application case for dynamic vegetation, the data to be used are leaf area in six layers in a forest 302 years old, whereas the model simulates leaf areas in eight layers from 0 to 800 years. To match observation, the model generates outputs of leaf areas in six layers when simulated forest age reaches 302 years. This requires users to prepare an output configuration file to instruct MIDA to read model outputs and re-organize their outputs to match observation. The output configuration file starts with a single line listing an observation filename and its corresponding output filenames. Content after the directories in the output configuration file are instructions to map model outputs with the observation signified in the first line. Each instruction is to match one or continuous elements in observation with elements in outputs with the same length. A blank line means there are no further instructions. Then a new matching between another observation and model outputs starts. An example of output configure file is available in Appendix B.
Once MIDA finishes the execution of data assimilation, users may need basic knowledge to assess the performance of DA. For example, the acceptance rate, which is given by MIDA, is the fraction of proposed parameter values that is accepted. Ideally, the acceptance rate should be about 20 %-50 % (Xu et al., 2006). A very low acceptance rate indicates that many new proposed parameter values (C new ) are rejected because C new jumps too far away from the previously ac-cepted parameter values (Robert and Casella, 2013;Roberts et al., 1997). In this case, users are suggested to reduce a jump scale in the proposing phase. On the other hand, a very high acceptance rate is likely because C new moves slowly from the previously accepted parameter values. Users may increase the jump scale.
In addition, DA usually requires a convergence test to examine whether posterior distributions from different sampling series converge or not. A convergence test requires running DA parallelly or multiple times with different initial parameter values. MIDA provides a Gelman-Rubin (G-R) test (Gelman and Rubin, 1992) for this purpose. To use the G-R test, users need to prepare a file containing initial parameters values in different sampling series and indicate its filename in the namelist.txt file as described in step 1. If the G-R statistics approaches 1, the sampling series in DA is converged. When the sampling series is converged, all accepted parameter values are used to generate the posterior distributions.
There are three types of posterior distributions: bell shape, edge hitting, and flat. The bell-shaped posterior distributions indicate that these parameters are well constrained. Their peak values are the maximum likelihood estimates of parameter values. The flat posterior distributions suggest that the parameters are not constrained due to the lack of relevant information in data. The edge-hitting posterior distributions result from complex reasons, such as improper prior parameter range. Users may change the prior ranges to examine whether those posterior distributions can be improved or examine correlations among estimated parameters.

Applications of MIDA
We applied MIDA to four groups of models, which are an ecosystem carbon cycle model, a surrogate-based land surface model, nine phenology models, and a dynamic vegetation model. These four cases demonstrate that MIDA is effective for stand-alone DA, flexible to be applied to different models, and efficient for multiple model comparison.

Case 1: independent data assimilation with DALEC
The first case study is to demonstrate that MIDA can be effective for independent data assimilation with the data assimilation linked ecosystem carbon (DALEC) model . DALEC has been used for data assimilation in several studies (Bloom et al., 2016;Lu et al., 2017;Richardson et al., 2010;Safta et al., 2015;Williams et al., 2005). Previous studies all incorporated data assimilation algorithms into DALEC, which requires invasive coding. This case study is focused on reproducing the data assimilation results as in the study by Lu et al. (2017) but with MIDA.
The version of DALEC used in this study is composed of six submodels (i.e., photosynthesis, phenology, autotrophic respiration, allocation, litterfall, and decomposition) to sim-ulate the carbon exchanges among five carbon pools (i.e., leaf, stem, root, soil organic matter, and litter) (Ricciuto et al., 2011). There are 21 parameters in DALEC, of which 17 parameters are derived from the six submodels and four parameters serve to initialize the carbon pools. Table 2 summarizes the names, prior ranges, and nominal values of these 21 parameters. The observation is the Harvard Forest daily net ecosystem exchange (NEE) from the years 1992 to 2006. DALEC is coded in Fortran. In a Windows system, a gfortran compiler converts the model code to an executable file (i.e., DALEC.exe). Figure 2 is the GUI window of MIDA. We first set up a DA task as described in step 1 using the upper panel.
In this application, the number of sampling series is set as 20 000. Once users click the "choose a directory" or "choose a file" button, a new dialog window will pop up and users are able to choose the directory or load files interactively. As describe in step 1 on preparation of DA, the working path is where the outputs of DA and ParameterValue.txt are saved (e.g., C:/workingPath). After the output configuration file is loaded, the filenames of model outputs, observations, and their variance will be displayed in the window automatically. This application only uses a "NEE.txt" observation file. Similarly, after users load the parameter range file (e.g., a file named "ParamRange.txt" contains three rows which are minimum, maximum, and default values of parameters), the content in this file is displayed as well. To replace the current parameter range file loaded, users can simply upload another file. In this application, the executive model file is "DALEC.exe" with a Fortran compiler in a Windows system. Because we do not have parameter covariance information, this input is left blank. After "save to namelist file" is clicked, a namelist.txt file containing all the inputs will be generated in the working path.
After the DA task setup, we load the namelist.txt file and click the "run data assimilation" button in the lower panel to trigger step 2 on execution of DA. A new dialog will pop up to show the acceptance rate information and notify the termination of DA. Then we will click the "generate plots" button to visualize the posterior distributions of 21 parameters as described in step 3. Figure 3 shows that the simulation outputs using the optimized parameter values from MIDA better fit with the observations than those using default parameter values. Figure 4 depicts posterior distributions of the 21 parameters estimated from MIDA. More than half of the parameters are constrained well with a unimodal shape. X stem init and X root init have a wide occupation of the prior range, indicating that the observation data do not provide useful information for them. The constrained posterior distributions in this study are similar to those from Lu et al. (2017). Note that MCMC estimates have a large variance and a low convergence rate, especially in high-dimensional problems; with a finite number of samples it is not expected that two simulations would give exactly the same results.    . As ELM is computationally expensive (one forward model simulation takes more than 1 d), a sparse-grid (SG) surrogate system was developed to reduce the computational time . The forcing data for the surrogate model is half-hourly meteorological measurements at the Missouri Ozark flux site from 2006 to 2014. The observations that were used for optimization are annual sums of net ecosystem exchange (NEE), annual averages of total leaf area index, and latent heat fluxes from 2006 to 2010. The eight parameters selected (Table 3) are the most important parameters for the variations in outputs . The model is written in Python. A "pyinstaller" library packages the model code into an executable file. The iteration number in MIDA is 20 000. Figure 5 shows posterior distributions of calibrated parameters. c root , SLA top , t leaffall , and GDD onset are constrained  well with a unimodal distribution. However, the distribution of the other four parameters (i.e., N leaf , CN root , A r2l , and Res m ) clusters near the edge. These results match well with the study by Lu et al. (2018). As shown in Fig. 6, the calibrated parameters induce a performance improvement in simulating total leaf area index and NEE. For latent heat, both the default and optimized simulation obtain good agreement with the observation. These conclusions are also similar to those in Lu et al. (2018). MIDA hides the detailed differences between models. For example, the DALEC model in case 1 is a processbased model to simulate the ecosystem carbon cycle while surrogate-based ELM in case 2 is an approximation of a land surface model. They are also different in programming language, simulation time, forcing data, etc. MIDA is able to deal with models with so many different characteristics and hides these differences from users. Users only need to indicate the filenames of the model to be used, its parameter range, the output configuration file, etc. in the namelist.txt file. Thus, MIDA simplified the DA applications using different models.

Case 3: evaluation of multiple phenological models
This study case uses nine phenological models (Yun et al., 2017) to demonstrate the applicability of MIDA in model comparison. Five out of the nine models predict phenological events, such as the day of leaf onset, using growing degree days, which are calculated as temperature accumulation above a base temperature. The other four models consider  two processes: chilling effects of cold temperature on dormancy before budburst and forcing effects of warm temperature on plant development. Each model uses different response functions to represent chilling and forcing effects. The detailed model descriptions and associated parameter information are in the Supplement table.
Data are from the Spruce and Peatland Responses Under Climatic and Environmental Change experiment (SPRUCE)  located in northern Minnesota, USA. The experiment consists of five-level whole-ecosystem warming (i.e., +0, +2.25, +4.5, +6.75, +9 • C) and twolevel elevated CO 2 concentrations (i.e., +0, +500 ppm). Dates of leaf onset were observed with PhenoCam  for tree species Picea mariana and Larix laricina. For the sake of demonstration of MIDA application, we only show DA results for Larix laricina with +9 • C warming treatment and +0 ppm CO 2 treatment from 2016 to 2018.
MIDA was used to compare performances of the nine models in reference to the same observations of leaf onset dates after DA. We as users changed filenames of model executable files (i.e., PhenoModels.exe), defined parameter ranges, and assigned the directory of working path for each model. MIDA then estimated the optimized parameters and saved the corresponding best simulation outputs to the working path for each of the nine models. Figure 7 shows the best simulation output of these nine models. The simulation output of the sixth, seventh, eighth, and ninth models better fits the observation than the other models. It demonstrates that models that consider both chilling and heating effects can achieve good simulations of the leaf onset dates.

Case 4: supporting data assimilation with a dynamic vegetation model
This case study is to examine the efficiency of MIDA to integrate remote sensing data into a dynamic vegetation model. The model used in this study is Biome Ecological strategy simulator (BiomeE) (Weng et al., 2019). This model simulates vegetation demographic processes with individualbased competition for light, soil water, and nutrients. Individual trees in BiomeE model are represented by cohorts of trees with similar sizes. The light competition among cohorts is based on their heights and crown areas according to the rule of perfect plasticity approximation (PPA) model (Strigul et al., 2008). Each cohort has seven pools: leaves, roots, sapwood, heartwood, seeds, nonstructural carbon, and nitrogen. After carbon is assimilated into plants via photosynthesis, the assimilated carbon enters the nonstructural carbon pool and  is used for plant growth (i.e., diameter, height, crown area) and reproduction according to empirical allomeric equations (Weng et al., 2019). In this application, two parameters to be constrained (Table 4) are annual productivity rate and annual mortality rate of trees.
Observations to be used in DA are leaf area indexes in six vertical heights (i.e., 0-5, 6-10, 11-15, 16-20, 21-25, and 26-30 m) at the Willow Creek study site, Wisconsin, USA. The forest at the site is an upland deciduous broadleaf forest around 302 years old. The observations were from Global Ecosystem Dynamics Investigation (GEDI) acquired by a light detection and ranging (lidar) laser system, which is deployed on the International Space Station (ISS) by NASA in (Dubayah et al., 2020. The observations were first averaged from three footprints, and then leaf area indexes in the six canopy layers were standardized to be summed up as 1. To use MIDA, we reorganized the simulation outputs to match observations as suggested in Sect. 2.6. The BiomeE model simulates leaf areas in eight layers (i.e., 0-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, and 36-40 m) from 0 to 800 years. An output configuration file was provided to post-process model outputs of leaf area indexes in six layers to match observations at the forest age of 302 years. These simulated leaf area indexes in the six canopy layers were also standardized to match standardized observations of leaf area indexes. The observations and post-processed simulation outputs were saved to "LAI.txt" and "simu_LAI.txt" files, respectively. The two files are used in MIDA for data assimilation to generate posterior distributions of the two estimated parameters as shown in Fig. 8. The optimized parameter values through maximum likelihood estimation are different from their default values. Figure 9 compares the simulation outputs with optimized parameters estimated by MIDA to those with default parameter values. After DA with GEDI data in MIDA, the simulation accuracy of leaf area index is substantially improved, especially in middle (16-20 m) and highest (26-30 m) layers.

Discussion
This study introduced MIDA as a model-independent tool to facilitate the application of data assimilation in ecology and biogeochemistry. The potential user community is ecologists with limited knowledge of model programming and technical implementation of DA algorithms. Several modelindependent DA tools have already been developed, such as DART (Anderson et al., 2009), openDA (Ridler et al., 2014), PDAF (Nerger and Hiller, 2013), and PEST (Doherty, 2004), mainly for applications in the research areas of hydrology, atmosphere, and remote sensing. These DA tools either use the gradient descent method, such as the Levenberg-Marquardt algorithm in PEST, or Kalman filter methods, such as EnKF in DART, openDA, and PDAF. The Levenberg-Marquardt algorithm is a local search method, for which it is hard to find a global optimization solution for highly nonlinear models. EnKF updates state variables and parameter values each time when observations are sequentially assimilated, resulting in discrete values of estimated parameters. Jumps in estimated parameter values by EnKF make it very difficult to obey mass conservation in biogeochemical models . In this study, we used the MCMC method in MIDA to generate  parameter values and their posterior distributions. MCMC is a widely used method in many DA studies with biogeochemical models but has been applied to individual models with invasive coding (Bloom et al., 2016;Hararuk et al., 2015;Liang et al., 2018;Luo and Schuur, 2020;Ricciuto et al., 2011). Compared to the other model-independent DA tools mentioned above, MIDA is the first tool that uses the MCMC method for DA. Biogeochemical models are incorporating more detailed processes related to carbon and nitrogen cycles (Lawrence et al., 2020). Complex biogeochemical models yield predictions with great uncertainty (Frienlingstein et al., 2009(Frienlingstein et al., , 2014. Data assimilation has been increasingly used to estimate parameter values against observations and reduce uncertainty in model prediction (Luo et al., 2016;Luo and Schuur, 2020). However, current applications of DA are almost all model dependent. This requires ecologists to write code to integrate the DA algorithm into models. The coding practice is a big technical challenge for ecologists with limited programming ability. The distinct advantage of MIDA is to enable ecologists to conduct model-independent DA. MIDA streamlines workflow of the three-step procedure for DA to enable users to conduct DA without extensive coding. Users mainly need to provide numerical and character values for data exchanges to transfer data (i.e., parameter values, simulation outputs, observations) between the model and MIDA by a file named namelist.txt or by interactive inputs via a GUI window (Fig. 2). Simulations of LAI are substantially improved after data assimilation in comparison with those before data assimilation.
We tested MIDA in four cases for its applicability to ecological models. The first case is applied to the DALEC model, which has been used in several data assimilation studies (Bloom et al., 2016;Lu et al., 2017;Safta et al., 2015;Williams et al., 2005). The previous DA studies all used invasive coding to incorporate the DA algorithm into models. As demonstrated in this study, MIDA was applied to DALEC without invasive coding but by providing the directory to save DA results and filenames of DALEC model executable file, parameter prior range, and output configuration files through the namelist.txt file or interactive inputs in the first preparation step of the workflow. Then, MIDA runs DA as a black box with DALEC before visualizing the DA results. Next, we tested the applicability of MIDA, a surrogatebased ELM model, and a dynamic vegetation model BiomeE.
To switch the test case from DALEC to the surrogate-based ELM model and the BiomeE model, we changed the filenames of the model executable file, parameter prior range, and output configuration file in the namelist.txt file for MIDA. This flexibility of MIDA in switching models for DA makes it much easier for model comparisons. We tested this capability of MIDA with nine phenological models to compare alternative model structures. Similarly, MIDA enables efficient switches of observations to be assimilated into models. Users only need to change filenames of observations in the output configuration file. This feature of MIDA makes it easier to utilize abundant trait databases such as TRY (Kattge et al., 2020), FRED (Iversen et al., 2017), etc. Moreover, this feature of MIDA also helps evaluate the relative information content of different observations for constraining model parameters and prediction . Consequently, MIDA can facilitate selection of the most informative observations and then better guide data collections in field experiments. Ultimately, MIDA can aid ecological forecasting and help reduce uncertainty in model predictions Jiang et al., 2018).
Although MIDA helps users to get rid of model detail, users may still need basic knowledge about the model outputs to prepare the output configuration file which is to match model outputs to observations one by one (see Sect. 2.6). This effort of preparing the correspondence between model outputs and observations for MIDA is not that difficult because users are reading or writing a text file, and most model developers will provide reference to help understand observations or model output files.
Generally, MIDA requires longer time to run DA than the embedded DA algorithm, because MIDA calls model simulation as an external executable file rather than a function embedded. Thus, we recommend MIDA for beginners of DA users with models that are less complex. Besides, the current version of MIDA only incorporates the Metropolis-Hastings sampling approach. More MCMC methods (e.g., Hamiltonian Monte Carlo) may be incorporated into MIDA in the future.

Conclusions
We developed MIDA to facilitate data assimilation for biogeochemical models. Traditional DA studies require ecologists to program codes to integrate DA algorithms into model source codes. The easy-to-use MIDA module enables ecologists to conduct model-independent DA without extensive coding, thus advancing the application of DA for ecological modeling and forecasting. We demonstrated the capability of MIDA in four cases with a total of 12 ecological models. These cases showed that MIDA is easy to perform for a variety of models and can efficiently produce accurate parameter posterior distributions. Moreover, MIDA supports flexible usage of different models and different observations in the DA analysis and allows a quick switch from one model to another. This capability enables MIDA to serve as an efficient tool for model intercomparison projects and enhancing ecological forecasting. The growing degree (GD) model is one of the most widespread phenological models to simulate the date of leaf onset (D). In this study, the timescale is limited to daily based on observation records. The kernel of GD is to calculate the growing degree days (GDD,D −1 d = D s d), which is the heat accumulation above a base temperature (T b ). For simplicity, the daily temperature (T d ) can be approximated by the average of daily maximum and minimum temperatures. The heat accumulation starts at day D s , which is empirically estimated, and ends when GDD reaches a forcing requirement threshold (R d ). Two parameters to be constrained are base temperature (T b ) and the forcing requirement (R d ). Their default values and prior range are listed in Table A1.

A2 Sigmoid function (SF)
Compared to the linear response function of GDD in the GD model, the sigmoid function (SF) model provides a nonlinear function to better represent the non-linearity of the growth response to heat accumulation. Three parameters to be constrained in DA are base temperature (T b ), the forcing requirement (R d ), and temperature sensitivity (S t ). Their default values and prior range are listed in Table A1.

A3 Beta function (BF)
In reality, the plant growth rate, as described with d, gradually increases up to a specific temperature and then rapidly declines to a supra-optimal level. Such a response can be well described by a beta function with uni-modality and nonsymmetrical shape. Three parameters are involved in DA: minimum temperature (T n ), optimal temperature (T o ), and forcing requirement (R d ). The other parameter values are fixed with empirical values. For example, maximum growth rate (R x ) is set to 1, and maximum temperature (T x ) is as-sumed to be 45.
A4 Days transferred to standard temperature (DTS) According to Arrhenius law, the relationship between growth rate and daily temperature T d can be interpolated by Eq. (A8) (Ono and Konno, 1999). With a factor weighted by standard temperature, the equation for DTS (Eq. A9) can better represent growth rate dependent on temperatures. Three parameters considered in DA are temperature sensitivity rate (E a ), standard temperature (T s ), and forcing requirement (R d ).
A5 Thermal period fixed model (TP) The difference between GD and TP models is that heat accumulation occurs in a fixed time period (D n ). The day of leaf onset is the last day (D s + D n ) when the accumulated heat reaches the forcing requirement. The start day (D s ) of heat accumulation begins on day one and moves 1 d forward each time to estimate Eq. (A12). Three parameters are involved in DA: the base temperature (T b ), the period length (D n ), and the forcing requirement (R d ).

A6 Chilling and forcing (CF)
Compared to GD, there is another distinctive chilling period for dormancy. The CF model sequentially calculates two accumulations in opposite directions: chilling accumulation and anti-chilling accumulation. The start day of chilling accumulation (D s ) is implicitly set as 273.0, which is 1 October. The end day of chilling accumulation (D 0 ) is the beginning of anti-chilling accumulation. Three parameters are considered in DA: the chilling requirement (R C d ), the forcing requirement (R F d ), and the temperature threshold (T c ).
The difference between CF and SM models is that SM used a beta function (Eq. A18) for the calculation of chilling accumulation and adopted a sigmoid function (Eq. A20) for antichilling accumulation. The detailed descriptions of these two functions can be referred to in the introductions of the BF model and CF model. The maximum temperature is empirically set as 13.7695. Six parameters are constrained in DA: minimum temperature (T n ), optimal temperature (T o ), temperature sensitivity (S t ), forcing base temperature (T b ), chilling requirement (R C d ), and forcing requirement (R F d ).

A8 Parallel model (PM)
The critical difference between PM and the above two-step models is that the chilling and anti-chilling accumulations happen simultaneously (Fu et al., 2012). In the earlier dates during the chilling period, only a small fraction (K d ) of forcing (Eq. A25) will be accumulated. The maximum temperature is empirically set as 15.3. Seven parameters will be considered in DA: minimum temperature (T n ), optimal temperature (T o ), temperature sensitivity (S t ), forcing base temperature (T b ), chilling requirement (R C d ), forcing requirement (R F d ), and a forcing weight coefficient (K m ).

A9 Alternating model (AM)
The AM fixes the start date of the chilling period (D C s ) as 1 November and the start date of anti-chilling period (D F s ) as 1 January. The difference between the AM and the other models above is that the forcing requirement is not a parameter value but is decided by the length of chilling days (Fu et al., 2012). Five parameters to be constrained in DA are chilling temperature (T c ), forcing base temperature (T b ), and three coefficients (a, b, c) in calculation of the forcing requirement. The output configuration file (e.g., config.txt) is to indicate the directories of observations and simulation output files as well as how they map to each other. Figure B1 is an example of the output configuration file. There are three blocks of functions to map simulation outputs to observed gross primary production (GPP), respiration (RE), and net ecosystem exchange (NEE). The blocks of mapping functions are separated by a blank line. Each mapping block starts with the directories of one observation, its observation variance, and model outputs, which are separated by a hash key. If there is no observation variance available, users can ignore this directory. If multiple simulation outputs are used to correspond to one observation, the directories of simulation outputs are separated by a comma. The rest of the mapping block describes how to map simulation outputs to observations. The simu_map variable is simulation output after mapping. The simuList variable saves the simulation outputs specified in the first line. Taking the third mapping block in Fig. B1 Figure C1 shows an example of the namelist.txt for the first study case with the DALEC model. Users need to prepare the namelist.txt before execution of data assimilation (DA) either manually or via GUI. Below describes the content in the namelist.txt. Detailed explanation and tutorials are available in the Zenodo repositories at the end of the appendixes. "workpath" is the directory where the MIDA executable files are saved. "nsimu" is the number of iterations in execution of data assimilation. "J_default" is the default mismatch (i.e., cost function) to be compared in the first moving phase of data assimilation. "ProposingStepSize" controls the jump scale in the proposing phase of data assimilation. Users can increase or decrease this value to adjust the acceptance rate to be in a range from 0.2 to 0.5. "paramFile" is the directory of a csv file saving parameter-related information such as parameter range. "obsList" saves the directories of observations. Multiple observations are separated by semicolon. Similarly, "obsVarList" saves the directories of observation variance in the same order as that of obsList. "simuList" saves the directories of simulation outputs corresponding to the observations. With GUI, MIDA reads directories in the output configuration file (e.g., config.txt), which users provide and assign values for obsList, obsVarList, and simuList in the namelist.txt automatically. In this case, if the directories of observations change, users only need to modify the output configuration file and generate the namelist.txt again with GUI-based MIDA.
"paramValue" is the directory of a txt file where MIDA writes out a new set of parameter values for model execution in each iteration of data assimilation. Its default value is ParameterValue.txt under the workpath specified in the first line of the namelist.txt. "model" saves the directory of model executable files. "nChains_convergeTest" indicates whether to conduct a Gelman-Rubin (G-R) convergence test or not. If the G-R test is used, its values are the number of multiple MCMC chains. If not, its value is zero. "convergeTest_startsFile" is the directory of a csv file saving default parameter values as the start points in multiple MCMC chains. "outConvergenceTest" saves the results of the G-R test. If "nChains_ConvergeTest" is zero, both values of "convergeTest_startsFile" and "outConvergenceTest" are empty. "DAresultsPath" is the directory saving the results of DA whose directories are also listed in the following six lines: "outJ" for the accepted mismatches, "outC" for the accepted parameter values, "outRecordNum" for the number of accepted parameter values, "outBestSimu" for the best simulation outputs with the optimal parameter values, and "outBestC" for the optimal parameter values. For MIDA without GUI, "display_plot" indicates whether or not to visualize the posterior distributions after DA. Figure C1. An example of the namelist.txt file. In order to use MIDA, users need to prepare data and a model and specify their file names and directories in the namelist.txt file.
Author contributions. XH, IS, and YL designed the study. XH built the workflow of MIDA and tested its capability in four cases. DL, DMR, and PJH provided data and models for the first and second test cases. XL prepared models, and ADR provided observations for the third case. EW and SN helped to prepare data and models for the fourth case. XH, LJ, EH, and YL analyzed the results. All authors contributed to the preparation of the manuscript.
Competing interests. The authors declare that they have no conflict of interest.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Review statement. This paper was edited by Hisashi Sato and reviewed by two anonymous referees.