Uncertainties in estimating regional methane emissions from rice paddies due to data scarcity in modeling approach

Rice paddy is a major anthropogenic source of atmospheric methane. However, because of the high spatial heterogeneity, accurately estimating methane emissions from rice paddies remains a major challenge, even with complicated models. Data scarcity is a substantial cause of the uncertainties in estimating methane emissions on regional scales. In the present study, we discussed how data scarcity affected the uncertainties in model estimations of rice paddy methane emissions, from the site scale up to the regional/national scale. The uncertainties in methane emissions from rice paddies of China were calculated with a local-scale model and Monte Carlo simulation. The data scarcities in five of the most sensitive model variables (field irrigation, organic matter application, soil properties, rice variety and production) were included in the analysis. The result showed that in each individual county, the within-cell standard deviation of methane flux, as calculated via Monte Carlo methods, was


Introduction
Methane is not only an important greenhouse gas in the atmosphere but also an active reactant in many atmospheric chemistry processes. Rice cultivation has been recognized as a major anthropogenic activity accounting for the rapid increase in the atmospheric methane concentration. However, because of the high spatial heterogeneity in methane emissions from rice paddies, large uncertainty has long hindered reliable estimation, even after complicated models were developed and applied (Li et al., 2002; Zhang et al., 2011; Harvey, 2000). The models used in regional or global studies differ widely in terms of their spatial scales. Many of these models are site-specific, describing processes at local scales. Extrapolating a site-specific model to a regional or global scale is usually referred to as "model upscaling" (King, 1991; van Bodegom et al., 2000). A common framework for this upscaling involves partitioning a large region into smaller, individual areas and running the model for each area (Matthews et al., 2000; Li et al., 2004; Yu et al., 2012).
In model upscaling, the first problem modelers face is how to make the spatial divisions (each division is called a cell hereafter). It is preferable to partition the region so that the model inputs in the cells are as statistically independent of each other as possible (King, 1991; Ogle et al., 2003, 2010). When data are scarce, however, the criterion of inter-cell independence may result in partitioning into large cells, leading to a reduced level of spatial detail. An additional challenge is the great variability in the availability of data for the model inputs, which complicates the selection of an appropriate cell size. A properly partitioned subject region should balance the differences in spatial data abundance among model inputs. If the cell size is too large, substantial spatial variation in the model input variables will be lost after within-cell averaging (van Bodegom et al., 2002; Verburg et al., 2006). Scientists tend to use the finest spatial resolution possible to express details of spatial variation in their modeling results. However, a finer spatial resolution requires sufficient model input data; otherwise, data must be shared among cells for at least some, if not all, of the model inputs. This type of inter-cell non-independence (resulting from data scarcity and requiring data sharing) complicates the uncertainty analysis (Ogle et al., 2003) when finer spatial resolutions are adopted. To estimate regional/national methane emissions from rice paddies, it is critical to obtain detailed information on organic matter amendments, soil properties, rice varieties and field irrigation in rice cultivation (Khalil et al., 2008; Peng et al., 2007; van Bodegom et al., 2000; Wassmann et al., 1996). Such data, however, are seldom available at a regional scale (Zhang et al., 2011).
To analyze the uncertainty due to errors in model inputs in each cell, Monte Carlo simulation has been recognized as an effective method (IPCC, 2000), and it has been applied in many studies (Ogle et al., 2003, 2010; Yu et al., 2012). Based on the probability distribution functions (PDFs) derived from measurements and/or a priori knowledge of the model inputs, the Monte Carlo method involves randomly and repeatedly drawing values from the PDFs to drive the model and produce varying model estimates. After the Monte Carlo simulation is performed for a within-cell uncertainty analysis in each division, we face the problem of uncertainty upscaling. In the case of "independent" partitioning of the entire subject region, in which an independent random variable is assigned to depict variations in the model estimate for each division (IPCC, 2000; Ogle et al., 2010), the uncertainty upscaling can be quite simple, as explained by the statistical "Law of Large Numbers". As previously noted, however, a paucity of data for some of the model variables and a small cell size may result in data sharing among divisions, which is problematic for the model variables that lack sufficient data to support fine-resolution partitioning. Upscaling the uncertainties in the model outputs must deal appropriately with this type of "dependency".
The objective of the present study is to evaluate the impacts of data scarcity on the uncertainty in regional estimations of rice paddy methane emissions and to discuss how different spatial resolutions affect the regional estimation uncertainties, given the same data availability for different spatial division schemes. Although many studies have demonstrated how to upscale a model to make regional estimations from various baseline scenarios (Matthews et al., 2000; Li et al., 2004; Ogle et al., 2010), the primary focus of the present study is the aggregation of the uncertainties in model estimations due to data scarcity.

Within-cell variation in model estimates
When partitioning the large region under consideration into spatially adjacent divisions, the within-cell variation must be accounted for first (King, 1991; van Bodegom et al., 2000; Ogle et al., 2003, 2010). The baseline model estimate is usually established by running the model once in a cell. Each model input variable will have one datum or one time series of data, e.g., daily weather observations. If there are multiple data available for a model input variable in a cell, they are averaged before modeling. The within-cell heterogeneity of the model estimate will therefore be lost after averaging, which will cause errors in the model's estimation. This type of error is referred to as the "fallacy of average" (Verburg et al., 2006). Alternatively, the within-cell PDF of the variation in the model variable can be established by statistical analysis of the data and/or expert estimation (Ogle et al., 2010; IPCC, 2000). Monte Carlo simulation is considered an effective approach to evaluate within-cell variation or uncertainty in model estimates due to errors in model input variables and their interactions, and it is thus used in the present study (Fig. 1).
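The within-cell Monte Carlo procedure can be sketched in a few lines. The example below is only an illustration under stated assumptions: `toy_flux` is a hypothetical stand-in for a flux model (it is not CH4MOD), the SAND and OM distributions are invented, and the VI spread is derived from the 95% interval of 0.5 to 1.5 that is assumed for VI later in the text. The 500 iterations match the iteration count used per cell in this study.

```python
import random
import statistics


def toy_flux(sand, om, vi):
    """Hypothetical stand-in for a flux model; this is NOT CH4MOD."""
    return 20.0 * vi * (1.0 + 0.002 * om) * (1.0 - 0.005 * sand)


def within_cell_monte_carlo(pdfs, n_iter=500, seed=0):
    """Draw every input from its normal PDF, run the model, and
    return the mean and standard deviation of the simulated fluxes."""
    rng = random.Random(seed)
    fluxes = []
    for _ in range(n_iter):
        draw = {name: rng.gauss(mu, sd) for name, (mu, sd) in pdfs.items()}
        fluxes.append(toy_flux(**draw))
    return statistics.mean(fluxes), statistics.stdev(fluxes)


# (mean, standard deviation) per input; the SAND and OM values are made up.
# The VI spread follows from a 95% interval of 0.5-1.5: sd = 0.5 / 1.96.
cell_pdfs = {"sand": (30.0, 8.0), "om": (150.0, 40.0), "vi": (1.0, 0.5 / 1.96)}
m_i, sigma_i = within_cell_monte_carlo(cell_pdfs)
print(f"within-cell mean = {m_i:.2f}, sd = {sigma_i:.2f}")
```

The returned pair plays the role of the within-cell mean and standard deviation of one cell; with a real model, only `toy_flux` and the PDFs would change.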

Spatial uncertainty aggregation in the case of data scarcity
In each cell, the model estimation via Monte Carlo iteration produces a numeric depiction of a random variable $V_i(m_i, \sigma_i)$, where $m_i$ and $\sigma_i$ are the statistical mean and standard deviation, respectively, of the random variable $V_i$. Thereafter, the model upscaling involves the summation of the random variables, $V_0 = V_1 + V_2 + \dots + V_N$. The aggregation of uncertainty, represented by the statistical variance or standard deviation, is generalized as
$$\sigma_0^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} \mathrm{Cov}(V_i, V_j)$$
(Ross, 2006), and it can also be transformed into a quadratic summation of the elementary variances via the standardized variance-covariance matrix:
$$\sigma_0^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} C_{ij}\,\sigma_i\,\sigma_j \qquad (1)$$
where $\sigma_0^2$ is the aggregated variance of the regional estimation and $\sigma_i$ and $\sigma_j$ are the standard deviations of the within-cell variations in cells $i$ and $j$, respectively. The matrix $C$ comprises the coefficients $C_{ij}$, which stand for "correlations" between individual cells. Here, the "correlation" is a measure of how the model outputs in two cells vary coincidently because they share common data and modeled processes for the model inputs. If the estimation in cell $i$ is over- or under-estimated, the estimation in cell $j$ will most likely be over- or under-estimated as well because they share common data, and vice versa. The aggregation of the model outputs can be quite simple if the model estimate is made with independent data in each cell. In this case, the matrix $C$ will be an identity matrix in which the diagonal elements are 1 and all the off-diagonal elements are 0. The aggregation in equation (1) then reduces to the arithmetic sum of the within-cell variances, as addressed by the Law of Large Numbers. However, when there are not sufficient data to support independent calculation among cells, the off-diagonal elements, $C_{ij}$, of the matrix $C$ will no longer be zero.
In the present study, $C_{ij}$ was empirically calculated via numerical experiments. For different levels of data sharing between two cells (Table 1), the model estimations for the two cells were iteratively calculated with CH4MOD. The model inputs were randomly selected from the ranges of the variables (Table B1). When there was data sharing between the two cells for a variable in Table 1, the value of the variable was selected once for both cells. For variables with no data sharing, the value of the variable was selected separately for the two cells. The correlation coefficient ($C_{ij}$) of the model estimations in the two cells was statistically calculated from a large number of paired model estimations for the two cells (1000 iterations in the present study).
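The effect of the off-diagonal elements on the aggregated uncertainty can be illustrated numerically. The sketch below assumes the aggregated variance is the double sum of $C_{ij}\sigma_i\sigma_j$ over all pairs of cells, as described above; the cell count, the unit standard deviations and the uniform correlation values are hypothetical.

```python
import math


def aggregate_sd(sigmas, corr):
    """Standard deviation of a sum of cell estimates:
    sigma_0^2 = sum_i sum_j C_ij * sigma_i * sigma_j."""
    n = len(sigmas)
    var0 = sum(corr[i][j] * sigmas[i] * sigmas[j]
               for i in range(n) for j in range(n))
    return math.sqrt(var0)


def uniform_corr(n, c):
    """Correlation matrix with 1 on the diagonal and c elsewhere."""
    return [[1.0 if i == j else c for j in range(n)] for i in range(n)]


# Hypothetical region: 100 cells, each with a within-cell sd of 1.
sigmas = [1.0] * 100
sd_independent = aggregate_sd(sigmas, uniform_corr(100, 0.0))  # identity matrix
sd_shared = aggregate_sd(sigmas, uniform_corr(100, 0.5))       # partial sharing
sd_identical = aggregate_sd(sigmas, uniform_corr(100, 1.0))    # complete sharing
print(sd_independent, sd_shared, sd_identical)
```

With independent cells the aggregated standard deviation grows with the square root of the number of cells, whereas with complete sharing it grows linearly, which is why ignoring data sharing understates the regional uncertainty.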
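The paired numerical experiment can be sketched as follows. This is a minimal illustration, not the study's implementation: `toy_flux` is a hypothetical stand-in for CH4MOD, and the variable names and ranges in `RANGES` are invented rather than taken from Table B1. Shared variables are drawn once for both cells and unshared variables are drawn separately, as described above.

```python
import random


def toy_flux(sand, om, vi):
    """Hypothetical stand-in for a flux model; this is NOT CH4MOD."""
    return 20.0 * vi * (1.0 + 0.002 * om) * (1.0 - 0.005 * sand)


# Hypothetical value ranges of the inputs (not the ranges of Table B1).
RANGES = {"sand": (5.0, 80.0), "om": (0.0, 300.0), "vi": (0.5, 1.5)}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def empirical_cij(shared, n_iter=1000, seed=1):
    """Correlation of paired model outputs when the two cells share
    the inputs named in `shared` and draw the others separately."""
    rng = random.Random(seed)
    out_i, out_j = [], []
    for _ in range(n_iter):
        in_i = {k: rng.uniform(*r) for k, r in RANGES.items()}
        in_j = {k: in_i[k] if k in shared else rng.uniform(*RANGES[k])
                for k in RANGES}
        out_i.append(toy_flux(**in_i))
        out_j.append(toy_flux(**in_j))
    return pearson(out_i, out_j)


c_none = empirical_cij(set())
c_vi = empirical_cij({"vi"})
c_all = empirical_cij({"sand", "om", "vi"})
print(f"no sharing: {c_none:.2f}, VI shared: {c_vi:.2f}, all shared: {c_all:.2f}")
```

In this toy setting, the correlation is close to 0 without sharing, equals 1 when every input is shared, and takes an intermediate value that depends on how strongly the shared variable drives the output.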

Indicators of data scarcity in model estimation
A common problem in making a model estimation for a large region is that the available data for the model input variables differ greatly. To evaluate the overall data scarcity of the model input variables, two indicators, $I_{ds}$ and $I_R$, are defined in equations (2) and (3), where $C_{ij}$ is an element of the DS (data sharing) matrix defined in equation (1) and $n$ is the total number of off-diagonal, non-zero elements of the DS matrix. Data scarcity refers to the abundance of data relative to the spatial resolution, i.e., the spatial details we intend to depict via the model simulation. With all the model input data on hand, we may expect more data scarcity, and a larger $I_{ds}$, when we choose a smaller cell size, and vice versa. An $I_{ds}$ of 0 indicates a "perfect" data abundance for the chosen spatial resolution. However, this "perfection" may, conversely, imply that we have chosen too large a cell size and that some spatially varying details in the model inputs were lost, a severe "fallacy of average." The regional partitioning should, in this case, adopt a finer spatial resolution to show more heterogeneous details in the model estimation.

CH4MOD and input variables
In this case study, we used the model CH4MOD to estimate methane emissions from rice paddies in China. CH4MOD is a semi-empirical model that simulates methane production and emissions from rice paddies under various environmental conditions and agricultural practices (Huang et al., 1998a, 2004; Xie et al., 2010).
The CH4MOD model runs with a daily step and is driven by air temperature. The main input variables include the soil sand percentage (SAND), organic matter amendments (OM), rice grain yield (GY), water management pattern ($W_{ptn}$) and rice cultivar index (VI). Appendix A describes CH4MOD and the compilation of the model inputs. More detailed information regarding the model development, validation and application has been provided elsewhere by the authors (Huang et al., 2004, 2006; Zhang et al., 2011).

PDFs of the model input variables
Many studies (Khalil and Butenhoff, 2008; Li et al., 2004; Matthews et al., 2000; van Bodegom et al., 2002) have suggested that a significant proportion of the uncertainty in regional rice paddy methane emissions arises from data scarcity, especially with regard to the soil sand content (SAND), organic matter amendments (OM), rice grain yield (GY), water management ($W_{ptn}$) and rice cultivar index (VI).
The CH4MOD sensitivity analysis similarly indicates the importance of these five factors in methane emissions (Table B1 in Appendix B). Fig. 2 illustrates the data abundance of the five model variables. The data for soil sand content are a 10 km by 10 km raster dataset constructed from soil profiles via spatial interpolation (Oberthür et al., 1999; Shi et al., 2004, 2006). Although a certain proportion of the immense spatial variation in soil properties may be lost after spatial interpolation (Goovaerts, 2001; van Bodegom et al., 2002), the gridded soil data are still the most detailed of the five model inputs. In descending order of data abundance, the other four factors are GY, OM, $W_{ptn}$ and VI. Assuming a normal distribution, the PDFs of four factors (all except $W_{ptn}$) were parameterized by statistical analysis of their data.
With a specific spatial resolution, e.g., using administrative counties as divisions, the PDF of SAND in a division was calculated with the grid data within the division.
Because every county has only one datum for GY, no PDF was assumed for GY when counties were adopted as divisions.Although the yield of rice grain is not the same at every location throughout a county, we have no more detailed data on grain yield that would allow us to make PDFs of the GY variable.
The data on the other two variables, OM and $W_{ptn}$, were collected and statistically analyzed to produce PDFs (Table 2 and Table 3) at provincial and grand region scales (Fig. 2b). Rice paddy methane emissions vary notably with rice variety (Singh et al., 1997). The variety index (VI), which accounts for the methane emission differences between rice varieties (Huang et al., 1998a, 2004), ranges from 0.5 to 1.5, and it typically has a value close to 1.0 for most rice varieties (Huang et al., 1997, 2004). We assumed that the 95% confidence interval (CI) for VI was 0.5 to 1.5 and that it exhibited a normal distribution. In the case of partitioning the entire nation into counties, the counties included within a province and/or grand region must share data and PDFs for the variables OM, $W_{ptn}$ and VI.
The PDFs in the case study of rice paddy methane emissions did not encompass all sources of uncertainties for the five variables.Careful planning in building PDFs of the model variables will improve the reliability of the uncertainty assessment.At present, we are focused on uncertainty aggregation in model upscaling when facing data scarcity.

Uncertainty calculation and aggregation
To evaluate how the choice of cell size influences the uncertainty of regional estimations, we used three partitioning schemes (S1, S2 and S3) to estimate the methane emissions in China with the same previously described datasets. The counties, provinces and grand regions of China were used as the spatial divisions in the three scenarios, respectively. In S2 and S3, PDFs of the rice grain yield were calculated based on a statistical analysis of census data. The Monte Carlo iteration was performed 500 times in each cell to calculate the within-cell uncertainty.
For each of the three scenarios, the elements of the DS matrix were valued by referencing the correlation coefficients ($C_{ij}$) in Table 1 based on the state of data sharing illustrated in Fig. 2b. With the within-cell variations in methane emissions calculated via the Monte Carlo approach, the aggregation of the model estimates was then performed via equation (1) for early, late and middle rice. When combining the estimation results for the three rice ecosystems, equation (1) was utilized again because the OM and VI data are shared by the three rice ecosystems.
After aggregation, the confidence interval, e.g., 95% CI of the national methane emission, was derived via the parameterized PDF of the aggregated estimate.
Assuming a Gamma distribution (Fig. B1 in Appendix B), the two parameters of the PDF, shape (α) and scale (β), were calculated by the method of moments, where β = variance/mean and α = mean/β (Ross, 2006).
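The moment matching and the resulting interval can be sketched as follows. The shape and scale relations are those stated above; everything else is an assumption made for illustration. In particular, the text does not say how the interval was read from the fitted PDF, so the sketch simply samples from the Gamma distribution and takes empirical percentiles, and the mean of 6.5 Tg is a hypothetical value combined with a relative standard deviation of 16.3%.

```python
import random


def gamma_params(mean, sd):
    """Moment matching: scale = variance / mean, shape = mean / scale."""
    beta = sd ** 2 / mean
    alpha = mean / beta
    return alpha, beta


def gamma_interval(mean, sd, level=0.95, n=100_000, seed=2):
    """Approximate central interval of the fitted Gamma PDF, obtained
    from sorted random draws (an assumed, sampling-based shortcut)."""
    alpha, beta = gamma_params(mean, sd)
    rng = random.Random(seed)
    # random.gammavariate takes the shape first and the scale second.
    draws = sorted(rng.gammavariate(alpha, beta) for _ in range(n))
    lower = draws[int((1.0 - level) / 2.0 * n)]
    upper = draws[int((1.0 + level) / 2.0 * n) - 1]
    return lower, upper


mean_tg = 6.5                # hypothetical aggregated mean (Tg)
sd_tg = 0.163 * mean_tg      # relative sd of 16.3%
alpha, beta = gamma_params(mean_tg, sd_tg)
lower, upper = gamma_interval(mean_tg, sd_tg)
print(f"shape = {alpha:.1f}, scale = {beta:.3f}, 95% CI = {lower:.1f}-{upper:.1f} Tg")
```

Because the Gamma distribution is right-skewed, the upper bound lies farther from the mean than the lower bound, unlike an interval based on a normal distribution.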

Methane emissions from rice paddies in China and their uncertainties
In 2010, the total rice harvest area of China was 29.9 M ha. The national total methane emissions were 6.44-7.32 Tg, depending on the spatial resolution used for modeling (Table 4). In each individual county, the within-cell standard deviation of methane flux (seasonal methane emissions per unit area), as calculated via Monte Carlo methods, was 13.5%-89.3% of the statistical mean. Because no errors were considered in the area from which rice was harvested, the relative uncertainty for methane emissions was the same as in the methane flux estimation. In the case of errors being present in the rice harvest area, the uncertainty of methane emissions in each cell can be calculated with Rule B of IPCC (2000) before aggregation.
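If area errors were to be included, the combination for one cell might look like the sketch below. It assumes that Rule B is the root-sum-square rule for the relative uncertainties of independent quantities combined by multiplication (here, flux times harvest area); this reading should be checked against the IPCC guidance, and both input values are hypothetical.

```python
import math


def combine_multiplicative(*relative_uncertainties):
    """Relative uncertainty of a product of independent quantities,
    taken as the root sum of squares of the individual terms."""
    return math.sqrt(sum(u ** 2 for u in relative_uncertainties))


flux_rel_sd = 0.30   # hypothetical relative sd of the methane flux in a cell
area_rel_sd = 0.05   # hypothetical relative sd of the rice harvest area
emission_rel_sd = combine_multiplicative(flux_rel_sd, area_rel_sd)
print(f"relative sd of the cell emission: {emission_rel_sd:.3f}")
```

With no error in the area, the function returns the flux uncertainty unchanged, which corresponds to the case treated in this study.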
When data sharing between counties was not accounted for, the falsely aggregated standard deviation was approximately 1.7%-2.2% of the national emissions according to the Law of Large Numbers. However, when the correlation of the model estimations for cells was considered (Table 1), the overall aggregated standard deviation was 16.3% of the total emissions, and it ranged from 18.3% to 28.0% for the early, late and middle rice ecosystems (Table 4). This finding implies that increasing data abundance significantly reduces uncertainties in regional estimations by reducing data sharing and the correlations in the DS matrix. Assuming a Gamma distribution (Fig. B1 in Appendix B), the 95% confidence interval (CI) of the national total methane emissions, calculated via the moment-matching approach with $m_0$ and $\sigma_0$, was 4.5-8.7 Tg at the S1 spatial resolution (Table 4).
The national methane emissions from rice paddies in China have been estimated in many previous studies. Table 5 lists those studies that included uncertainty assessments. With the exception of the results from Huang et al. (1998), in which higher emissions were produced because of the continuous flooding used for rice cultivation in the study, the uncertainties in all other studies largely overlapped with those of the present study, although significance levels for the uncertainties were not explicitly provided. The results of other studies (not listed in Table 5), e.g., Ren et al. (2010), Li et al. (2002) and Yao et al. (1996), also fell within the ranges listed in Table 4. Most of these previous studies focused on organic matter application and water regimes in their estimations of uncertainty (Table 5) because of data scarcity in these two factors. Taking into consideration the tremendous spatial heterogeneity of soil characteristics, Li et al. (2004) believed that these were the most sensitive factors accounting for uncertainties, and the uncertainty was 2.3-10.5 Tg yr$^{-1}$ (1.7-7.9 Tg C yr$^{-1}$) for mid-season drainage irrigation and 8.5-16.0 Tg yr$^{-1}$ (6.4-12.0 Tg C yr$^{-1}$) when continuous flooding was applied.
Uncertainties of regional estimations come from many sources, including model imperfection due to inaccurate parameters and structural fallacies of the model (e.g., Kennedy and O'Hagan, 2001), as well as data errors and poor availability of the model inputs. A comprehensive uncertainty analysis should include all major uncertainty sources (IPCC, 2000; van Bodegom et al., 2002). In the present study, the within-cell variances of the five most sensitive factors, i.e., SAND, GY, OM, $W_{ptn}$ and VI, were parameterized and included in the Monte Carlo simulations, but there are also other factors that may contribute to uncertainties (van Bodegom et al., 2002). Moreover, there may be covariance between the input parameters. For example, the rice variety (VI) and/or soil texture (SAND) may have impacts on the irrigation applied ($W_{ptn}$). With sufficient data, we may quantify the correlations between the input parameters and then build a joint/Bayesian PDF of the input parameters (Kennedy and O'Hagan, 2001). Incorporation of correlations between the input parameters will improve the estimation of the within-cell variances.
However, given the difficulty of data scarcity, it is at present necessary to parameterize the within-cell variance of each input parameter separately. Apart from data scarcity, model imperfections due to a poor understanding of the complexity of the ecosystem are also a primary source of estimation bias. A model comprises functions and equations that describe the physical processes of interest, but it cannot include every detail. Model inaccuracies may bias the estimation away from the true value, which is usually evaluated by model validation (Huang et al., 2004). In the present study, however, we did not incorporate the error of model inaccuracy in the uncertainty assessment.

Data scarcity, spatial resolution and the uncertainties in regional estimation
The uncertainty in regional methane emissions in Table 4 is primarily caused by errors in and a scarcity of model input data (Fig. 2). Even if the data abundance of the model variables differs significantly (Fig. 2), modeling at a finer spatial resolution does help to reduce the estimation uncertainty (Table 4). We made the model estimations at three scales (S1, S2 and S3 in Table 4). At each scale, S1 for instance, the finer input (the SAND data, a 10 km × 10 km raster dataset) was aggregated to create the SAND input at the scale of S1. However, to run the model at a specific scale, the data of the other model variables, i.e., OM, $W_{ptn}$ and VI, must be shared between neighboring grid cells because they are coarser than the specific grid size of S1. Table 4 shows the scale effects of the model estimations, i.e., the impacts of decreased variability of the input on the model output. At each of the specific scales (S1, S2 or S3), the direct model output is the variation in each of the grid cells (a county at S1, a province at S2 or a grand region at S3). In Table 4, the 95% CI was 3.4-12.3 Tg when modeling was performed at a coarser resolution (S3). At the provincial scale (Scenario S2), however, the 95% CI narrowed to 4.8-10.4 Tg, and the aggregated standard deviation was 19.5% of the national total emissions. However, without sufficient data support (Fig. 2), upscaling a model at an over-fine resolution makes no substantial difference, as in Table 4 for S1.
Although the uncertainty was reduced further when the spatial resolution was at the county level, this approach is not cost-effective, and the indicator $I_R$ rises rapidly from up to 3 at the provincial scale to more than 27 at the county scale (Table 4). The $I_R$ indicates the redundant cost; a higher $I_R$ indicates more redundant processing.
In Table 1, sharing data for the higher-sensitivity variable, e.g., SAND vs. Yield in Table B1, may result in a larger correlation coefficient $C_{ij}$. Because calculating $C_{ij}$ as in Table 1 is computationally intensive, requiring a large number of modeling iterations, a rough estimation of $C_{ij}$ (Eqn. 4) may be useful for finding the proper spatial resolution before the model upscaling is conducted. In Eqn. 4, $s_k$ is the sensitivity index of the model parameter $k$ (e.g., Table B1 in the Appendix) and $m$ is the number of model input variables under consideration. $I_{ij,k}$ is a binary variable taking a value of 1 or 0: if cells $i$ and $j$ share data for the model input variable $k$, $I_{ij,k}$ is assigned a value of 1; otherwise, it is 0. The sensitivity index $s_k$ reflects the difference in the importance of the model input variables to the model output. Fig. 3 presents the comparison of the correlation coefficients calculated in the two ways. Although the rough estimates of $C_{ij}$ via Eqn. 4 differ to some extent from those in Table 1, the values exhibit the same trend in reflecting the impacts of data sharing on correlations of the model outputs between cells.
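A rough estimate of this kind might be sketched as below. Eqn. 4 itself is not reproduced in this text, so the form used here, the sensitivity-weighted share of the variables for which two cells share data, is only an assumption that is consistent with the definitions of $s_k$, $m$ and $I_{ij,k}$; it should not be read as the equation used in the study. The sensitivity values are the four indices reported in Appendix B (field irrigation, soil texture, rice variety and organic matter application); the dictionary keys are illustrative labels.

```python
def rough_cij(sensitivity, shared):
    """ASSUMED form: sum_k s_k * I_ij,k / sum_k s_k, where I_ij,k is 1
    if the two cells share data for variable k and 0 otherwise."""
    total = sum(sensitivity.values())
    shared_part = sum(s for name, s in sensitivity.items() if name in shared)
    return shared_part / total


# Sensitivity indices reported in Appendix B; key names are illustrative.
SENSITIVITY = {"W_ptn": 0.67, "SAND": 0.63, "VI": 0.51, "OM": 0.47}

c_no_sharing = rough_cij(SENSITIVITY, set())
c_coarse_inputs = rough_cij(SENSITIVITY, {"W_ptn", "VI", "OM"})
c_all_shared = rough_cij(SENSITIVITY, set(SENSITIVITY))
print(c_no_sharing, round(c_coarse_inputs, 2), c_all_shared)
```

Under this assumed form, sharing a more sensitive variable raises the estimate more than sharing a less sensitive one, which matches the qualitative behaviour described for Table 1.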

Conclusions
Data scarcity is a significant challenge in making regional estimates of greenhouse gas emissions. We developed a data sharing matrix to estimate the aggregated uncertainties in China's rice paddy methane emissions introduced by data scarcity. Based on the data sharing matrix, we estimated that data scarcity in the five most sensitive factors introduced an aggregated uncertainty to the estimates, with a 95% confidence interval of 4.5 to 8.7 Tg. Aggregated uncertainty may vary with the spatial resolution for a given dataset, and the indicator $I_{ds}$ is useful for identifying an appropriate spatial resolution. An appropriate spatial resolution corresponds to an $I_{ds}$ value between 0 and 1, which represents a compromise between the data scarcity of different model variables. Improving the data abundance of model inputs is expected to reduce the uncertainties in estimating terrestrial greenhouse gas emissions, in which the sensitivity of the model inputs also plays a key role.
variables in the model. The Monte Carlo method is commonly applied to simultaneously produce variations of model inputs.
To scale the model input variation, the ratio e/M is adopted for each of the variables to make them comparable to each other, and all the CH4MOD input parameters have positive values. In differential form, the expression e/M can be expressed generally as dx/x or d(ln x). The purpose of the model sensitivity analysis in the present study is to explore the variability of the modeled methane flux in response to variations in the model input parameters, as in formula (b1), where $k$ identifies each model parameter and $y$ represents the seasonal methane emission flux (g CH$_4$ m$^{-2}$) calculated by CH4MOD with $x_k$ as input. $S_k$ is the sensitivity index of the model variable $k$, and it is defined as the linear coefficient for the relationship between methane flux and the model input variables in terms of fractional variation.
The Monte Carlo approach was adopted as the first step to randomly select values of the model input parameters from their value domains (Table B1), at which point the methane flux was calculated with CH4MOD. This picking-and-calculating procedure was iterated for 20,000 cycles. After logarithmic transformation of the model inputs and outputs, a simple (single-variable) linear regression was performed, and the sensitivity index was defined as the slope coefficient of the regression equation.
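The regression step can be illustrated with a toy model. The sketch below is an assumption-laden example rather than the study's code: `toy_flux` is a hypothetical power-law function (not CH4MOD) with known exponents, the input ranges are invented, and the slope of ln(y) against ln(x) is obtained by ordinary least squares, one input at a time.

```python
import math
import random


def toy_flux(x1, x2):
    """Hypothetical power-law model (NOT CH4MOD) with known exponents."""
    return 12.0 * x1 ** 0.6 * x2 ** 0.3


def sensitivity_index(x, y):
    """Slope of the least-squares line of ln(y) against ln(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


rng = random.Random(3)
n_cycles = 20_000  # number of picking-and-calculating cycles
x1 = [rng.uniform(0.5, 1.5) for _ in range(n_cycles)]     # invented range
x2 = [rng.uniform(10.0, 80.0) for _ in range(n_cycles)]   # invented range
y = [toy_flux(a, b) for a, b in zip(x1, x2)]

s1 = sensitivity_index(x1, y)
s2 = sensitivity_index(x2, y)
print(f"S_1 = {s1:.2f}, S_2 = {s2:.2f}")
```

For a power-law model the recovered slopes approximate the exponents, so the index can be read as the fractional change in flux per fractional change in the input.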
Water management in rice cultivation is a key factor that impacts methane emissions from rice paddies. In CH4MOD, the diverse water management strategies in Chinese rice cultivation are grouped into five irrigation patterns and include flooding, drainage and intermittent irrigation (Huang et al., 2004). In the case of this nominal variable, the sensitivity index was calculated with formula (b2), in which the code set of the irrigation water patterns is given in Table B1, $N$ is the total number of $(j, k)$ pairs, and the flux terms are the methane fluxes for irrigation water patterns $j$ and $k$ and for all water patterns, respectively. To run the CH4MOD simulation, daily air temperatures must be available for the duration of rice growth, from the date of transplanting to the harvest. In the model sensitivity analysis, the temperature data are virtually created by a formula in which the function $R(v_1, v_2)$ returns a random number between $v_1$ and $v_2$. $S_s$ and $S_e$ represent the transplanting and harvesting dates, respectively, and $S_{max}$ is the day on which the air temperature reaches its maximum for the rice season. The time variable $t$ ($S_s \le t \le S_e$) represents days after transplanting.
The results indicated that methane emissions are most sensitive to field irrigation, with a sensitivity index of 0.67 (Table B1). The soil texture, rice variety and organic matter application rank lower, with sensitivity indices of 0.63, 0.51 and 0.47, respectively.

Fig. 1
Fig. 1 presents a flowchart of model upscaling in the case study. The solid arrows

In equation (3), $N$ is the total number of cells (divisions) that partition the entire region under consideration and $N_k$ is the number of data points for the model variable $k$. When the off-diagonal elements of the DS matrix are all 0, indicating abundant data (no sharing) among the cells for all the model input variables, $I_{ds} = 0$ and $I_R = 1$. At the other extreme, when the off-diagonal elements of the DS matrix are all 1, indicating severe data scarcity and complete data sharing among the cells for every model input variable, $I_{ds} = 1$ and $I_R = N$.

Research Development Foundation of China (Grant No. 2010CB950603). We thank the Resources and Environmental Scientific Data Center of the Chinese Academy of Sciences and the National Meteorological Information Center of the Chinese Meteorological Administration for their support in providing the data.