SBDM v1.0: A scaling-based discretization method for the Geographical Detector Model



Introduction
Spatial data analyses are useful for analysing continuous data, such as temperature or altitude. Because of the complexity of continuous data, it is often inefficient to try to understand all the information they contain (Catlett, 1991; Kerber, 1992; Richeldi and Rossotto, 1995; Frank and Witten, 1999; Xu and Ii, 2005; Cheng and Lu, 2017). Discretization is a data processing procedure that partitions continuous data into a set of strata, which can be used to understand the geographical relationships among environmental processes and support decision-making.

Discretization methods for continuous data
Both one-dimensional and multi-dimensional discretization methods are used in geoscientific research. Here we focus on one-dimensional methods and briefly summarize those most frequently used.
(1) Equal Interval (EI) divides the data into a specified number of equal-width ranges, without taking the data distribution into account (Ren et al., 2014; Hu et al., 2015). (2) Natural Breaks (NB) (Jenks) and one-dimensional K-means (MacQueen, 1967) both seek intervals that have the smallest in-class variance and the highest variance between classes (Wu et al., 2016; Hong et al., 2017; Shrestha and Luo, 2017; Du et al., 2017). (3) Quantile (QU) classifies data into a specified number of classes that contain an equal number of elements or units. As a result, similar elements may be placed in adjacent classes, while elements within the same class may differ considerably. This method is suitable for data with a linear distribution (Luo et al., 2016; Liang and Yang, 2016). (4) Geometrical Interval (GI) is based on the principle of minimizing the sum of squares of the number of elements in each class. The biggest advantage of this method is that it can handle non-normal data (Cao et al., 2013; Tian et al., 2017). The four methods above are integrated into spatial analysis software (e.g., ArcGIS®) and are the most commonly used methods for univariate discretization. (5) Systematical Clustering sorts the data by a distance criterion (e.g., Euclidean distance, Mahalanobis distance). The number of categories is reduced until the classification requirements are met (Liao et al., 2016). Some studies also use prior knowledge to categorize continuous data (Xu et al., 2017; Zhou et al., 2018).
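As a rough illustration of how two of these methods differ on skewed data, the sketch below computes Equal Interval and Quantile breaks with NumPy. The function names and the toy array are assumptions made for this example; they are not taken from ArcGIS or from the SBDM software.

```python
import numpy as np

def equal_interval_breaks(x, k):
    """Interior break values that split [min, max] into k equal-width classes."""
    return np.linspace(np.min(x), np.max(x), k + 1)[1:-1]

def quantile_breaks(x, k):
    """Interior break values that give k classes with roughly equal counts."""
    return np.quantile(x, np.arange(1, k) / k)

def assign_strata(x, breaks):
    """Label each value with the index (0..k-1) of the class it falls in."""
    return np.digitize(x, breaks)

# Hypothetical skewed sample: eight small values and one outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)
ei = assign_strata(x, equal_interval_breaks(x, 3))
qu = assign_strata(x, quantile_breaks(x, 3))

print(np.bincount(ei, minlength=3))  # [8 0 1]: EI ignores the distribution
print(np.bincount(qu, minlength=3))  # [3 3 3]: QU balances class sizes
```

With the outlier present, Equal Interval leaves one class empty, whereas Quantile groups the outlier with much smaller values, which is the trade-off described above.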
Although there are several discretization methods for data mining, two fundamental principles should be satisfied (Cao et al., 2013). Firstly, a good discretization method always attempts to minimize information loss (maximizing the within-group similarity and minimizing the between-group similarity when only a single variable is considered), as the Natural Breaks and K-means methods do. Secondly, the results of discretization should be as suitable as possible for the subsequent models. If we arbitrarily discretize continuous data before modelling, the subsequent analysis and application may miss valuable information or lead to misspecification of models, e.g., Boolean networks (Kauffman, 1969), generalized logical networks (Song et al., 2009), or the Sandwich Interpolation model (Wang et al., 2002, 2013; Liu et al., 2018). Therefore, the effectiveness of a discretization algorithm is related not only to the distribution of the discretized data but also to the subsequent application models. If there are no subsequent models, the degree of information loss is the best criterion for evaluating discretization methods. Quinlan (1986) proposed the concept of information entropy to measure information loss before and after discretization of continuous data. Cang and Luo (2018) defined a concept to assess the effect of discretization on data loss based on spatial data association, and suggested that the Quantile discretization method can minimize information loss.
However, when discretized data are used as input to a subsequent model, the unifying criterion for evaluating a discretization is how far it maximizes the accuracy of the model results.

Introduction of the Geographical Detector Model
In geographical research, discretization is mostly used to divide continuous data for efficient cognition, without establishing a geographical data mining model. Based on spatial variation theory, Wang et al. (2010) presented a new technique, the Geographical Detector Model (GDM), which was developed in medical geography and has since been applied to many geographical fields. GDM is designed to measure the spatially stratified heterogeneity of a response variable and reveal the impact of driving factors. For example, the geographic elements (X and Y) of area A can be represented using a grid in Geographic Information Systems (GIS) (Fig. 1). In this case, X is an environmental factor, Y is a response variable, and the two have a potential spatial relationship. The hypothesis of GDM is that if an environmental factor (X) contributes to a response variable (Y), the two may have similar spatial distributions, and this similarity or spatial association can be measured by the power of determinant (PD) value (Wang et al., 2010) or q-statistic (Wang et al., 2016). If X is continuous, the first step of GDM is to discretize it into categorical data or strata, i.e., X to X_h (blue arrow in Fig. 1). Then, based on the spatial pattern of the strata of X, GDM calculates the ratio between the sum of the within-stratum variances of Y and the population variance (X_h ∼ Y, red arrows in Fig. 1). The PD value (q value) can be expressed by Eq. (1) (Wang et al., 2010):

\[ q = 1 - \frac{1}{N\sigma^2}\sum_{h=1}^{L} N_h \sigma_h^2 \qquad (1) \]

where N denotes the number of units in the study area (N = 32 in Fig. 1), stratified into h = 1, 2, ..., L strata (L = 5 in Fig. 1); N_h is the number of units in stratum h; and σ_h² and σ² are the variance of Y within stratum h and over the whole study area, respectively. The value of q lies within [0, 1] (q = 0 when there is no stratified heterogeneity in Y, and q = 1 when Y is completely stratified). If X and Y are two different variables, the q value (X_h ∼ Y) is defined as the driving force of the environmental factor (X) on the response variable (Y). However, if we replace Y with X, the q value (X_h ∼ X, green arrow in Fig. 1) becomes a measure of the spatial stratified heterogeneity of X. Spatial stratified heterogeneity refers to the phenomenon that the within-strata variance is less than the between-strata variance. It is a common phenomenon and an essential feature of nature across the geosciences (Wang et al., 2016). Based on the q-statistic, four detectors have been derived from GDM: Factor detector, Ecological detector, Risk detector, and Interaction detector.
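The q-statistic described above can be computed in a few lines. The following is a minimal sketch, assuming population (not sample) variances and a one-dimensional array of stratum labels; the function name `q_statistic` is chosen for this illustration and is not the SBDM software's interface.

```python
import numpy as np

def q_statistic(y, strata):
    """q = 1 - sum_h(N_h * var_h) / (N * var), using population variances."""
    y = np.asarray(y, dtype=float)
    strata = np.asarray(strata)
    total = y.size * y.var()          # N * sigma^2 over the whole study area
    if total == 0:
        return 0.0                    # constant Y: no heterogeneity to explain
    within = 0.0
    for h in np.unique(strata):
        y_h = y[strata == h]
        within += y_h.size * y_h.var()  # N_h * sigma_h^2
    return 1.0 - within / total

# Toy data (hypothetical): Y is perfectly separated by the first stratification
# and not at all by the second.
y = [1, 1, 5, 5]
print(q_statistic(y, [0, 0, 1, 1]))  # 1.0
print(q_statistic(y, [0, 1, 0, 1]))  # 0.0
```

Passing X itself as `y` gives the X_h ∼ X case, i.e., the stratified heterogeneity of X under its own discretization.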

Imperfections of the Geographical Detector Model
GDM is based on spatial variance analysis of strata; hence, the q value (Eq. 1) depends on the way continuous variables are discretized. Therefore, the key factor affecting the accuracy of GDM is the discretization method (X to X_h, the blue arrow step in Fig. 1).
For X_h ∼ X, the two most common algorithms for discretizing data with the highest spatial stratified heterogeneity are Natural Breaks (Fisher-Jenks algorithm) (D. Fisher, 1958) and K-means (Hartigan and Wong, 1979).
These two methods have criterion functions equivalent to the q-statistic in the geographical detector. In Natural Breaks, the criterion function is the goodness of variance fit (GVF) (Dent et al., 1999), which has the same meaning and formula as the q-statistic. In K-means, the criterion is to minimize the within-cluster sum of squares (i.e., variance), which is equivalent to the numerator ($\sum_{h=1}^{L} N_h \sigma_h^2$) in the q-statistic. A faster algorithm based on dynamic programming was developed for optimal one-dimensional K-means and is available as "Ckmeans.1d.dp" in R (Wang and Song, 2011).
For X_h ∼ Y, the power of determinant (q value) can be used as a quantitative index of how effective a discretization of continuous data is. A larger q value indicates that the feature contributes more under that discretization (Liao et al., 2016). When one discretization method divides the spatial data into strata that yield a relatively higher q value, this suggests that the resulting spatial pattern of X_h has a stronger influence on Y. In this application, discretization is the key step and has a great impact on the model results.
In geographical detector applications, different discretization methods may lead to disparate q values (Cao et al., 2013; Huang, 2014; Ju et al., 2016; Zhao et al., 2017). A number of studies (Cao et al., 2013; Ju et al., 2016; Zhao et al., 2017) have looked for a larger q value using the aforementioned discretization methods. When assessing the driving force of environmental factors (X) on the response variable (Y), those traditional discretization methods are mainly based on the distribution of the data X to be stratified. However, q is a comprehensive value that is related not only to the distribution of X itself but also to the distribution of Y. Therefore, the conventional discretization methods might not be suitable for GDM. An appropriate discretization method is needed that maximizes the q value and ensures an optimal stratification of continuous data.

Aims of this study
In GDM, the q value can serve as a criterion to assess the spatial stratified heterogeneity and measure the power of determinant of X on Y (Wang et al., 2016). Following this line of reasoning, we provide a new discretization method, based on scale transformation theory, that takes the magnitude of the q-statistic of GDM as the criterion function to discretize continuous environmental factors into categorical strata for subsequent application in GDM. Details of how our method obtains the q value for each spatial pattern and speeds up the calculation are given in Sect. 2. The case studies and software instructions are presented in Sect. 3, followed by the results, discussion, and conclusion in Sects. 4 and 5.

SBDM software development
Based on the principle that data carry different amounts of information at different scales (Burt and Adelson, 1987; Portilla et al., 2003), we propose an efficient discretization algorithm, the Scaling-Based Discretization Method (SBDM). Taking the q-statistic of GDM as the criterion function, the SBDM algorithm has three parts: (1) listing all cut points that can classify the continuous data into discrete strata, and upscaling the data to reduce the number of cut points; (2) performing an exhaustive search over the current cut points to find the optimal result (the maximum q value and the corresponding cut points at the current scale); and (3) setting a buffer scale around each optimal cut point from the last scale to obtain the cut points to be evaluated at the next scale, and then running the exhaustive search again to find the most accurate result at that scale in a short time. Step 3 is called downscaling. If the cut points have then reached the highest precision, the algorithm ends; otherwise, steps 2 and 3 are repeated. Figure 2 shows the conceptual architecture of the algorithm.
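Part (2), the exhaustive search with the q-statistic as the criterion, might be sketched as follows. This is an illustrative sketch rather than the SBDM implementation: the function names are hypothetical, population variances are assumed, and a cut point c is taken to open a new stratum [c, ...).

```python
from itertools import combinations
import numpy as np

def q_statistic(y, labels):
    """q = 1 - sum_h(N_h * var_h) / (N * var), using population variances."""
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    total = y.size * y.var()
    if total == 0:
        return 0.0
    within = sum(y[labels == h].size * y[labels == h].var()
                 for h in np.unique(labels))
    return 1.0 - within / total

def exhaustive_search(x, y, n_strata, candidates=None):
    """Try every combination of cut points and keep the one with the largest q."""
    x = np.asarray(x)
    if candidates is None:
        # Every distinct value except the smallest can start a new stratum.
        candidates = np.unique(x)[1:]
    best_q, best_cuts = -1.0, None
    for cuts in combinations(candidates, n_strata - 1):
        labels = np.digitize(x, cuts)      # stratum index of each unit
        q = q_statistic(y, labels)
        if q > best_q:
            best_q, best_cuts = q, tuple(cuts)
    return best_q, best_cuts

# Toy example (hypothetical data): Y jumps where X reaches 4.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 10, 10, 10]
q, cuts = exhaustive_search(x, y, n_strata=2)
print(q, cuts)
```

The optional `candidates` argument is where a restricted set of cut points, such as the buffer around a previous optimum, could be supplied during downscaling.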

Upscaling
An exhaustive search, which enumerates all possible combinations of cut points and selects the one that best meets the criterion, is an effective method for obtaining the globally optimal strata. For the optimal discretization, it must take all stratified patterns (combinations of different cut points) into account and select the one that corresponds to the maximum q value. Here, we illustrate the algorithm briefly with an example (Fig. 3). Starting from a raw dataset of 9 × 9 cells (Fig. 3a), we first transform the data into a one-dimensional array and remove the duplicates. We then obtain 16 cut points (represented by red lines between different numbers) and, when discretizing the raw dataset into two categories (strata), 16 combinations. For an exhaustive search, the number Q of all combinations is given by the combination formula (Eq. 2):

\[ Q = \binom{N-1}{P-1} = \frac{(N-1)!}{(P-1)!\,(N-P)!} \qquad (2) \]

where N and P refer to the number of different values of the response variables and the number of categories specified by the user, respectively (so that N − 1 is the number of cut points).
Supposing that we divide the raw dataset into two strata (i.e., select one cut point), Eq. (2) gives 16 combinations of cut points (Q = 16).
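The count can be checked numerically. Reading N as the number of distinct values and P as the number of strata, choosing P − 1 of the N − 1 cut points reproduces both the 16 combinations quoted here and the 593,775 combinations quoted for the Line-mode example (31 distinct upscaled values, seven strata). The helper name below is hypothetical.

```python
from math import comb

def n_combinations(n_distinct, n_strata):
    """Number of ways to choose (n_strata - 1) cut points out of (n_distinct - 1)."""
    return comb(n_distinct - 1, n_strata - 1)

print(n_combinations(17, 2))   # 16: 17 distinct values, two strata
print(n_combinations(31, 7))   # 593775: 31 distinct values, seven strata
```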
The main disadvantage of exhaustive search is that the number of combinations is very large and increases exponentially with the number of cut points. Most geographic elements have properties with a wide range of values (e.g., in complex topography, elevation may range over thousands of metres). The calculation then has high time complexity and cannot be completed on an ordinary computer. Therefore, we designed an upscaling method that divides the raw data by a pre-specified scale number to decrease the number of cut points. Figure 3b is obtained by dividing the raw data (Fig. 3a) by three. Fig. 3b then has five cut points and five combinations when two strata are assumed. Upscaling dulls the boundaries of the strata (fewer red lines in Fig. 3b). The optimal cut points in the raw data and in the upscaled data are adjacent in position (e.g., when the optimal cut point is 19 in Fig. 3a, the optimal one could be 6 in Fig. 3b, and the positions of 19 and 6 are close). Upscaling is a powerful form of data compression that greatly reduces the amount of calculation, but it has two defects:
1. Cut points drift. After upscaling, each of the five cut points in Fig. 3b arises from merging raw data values (e.g., cut point 6 in Fig. 3b corresponds to [18, 19, 20] in Fig. 3a). This merging may cause cut points to drift, although the drifted cut points remain near the original optimal ones. Suppose 19 is the optimal cut point in Fig. 3a; in that case, 19 divides Fig. 3a into two strata, [10, 19) and [19, 26], which correspond to the maximum q value. The value 19 corresponds to 6 or 7 in Fig. 3b. When the exhaustive search lists all cut points in Fig. 3b, there are two possibilities: (1) if 6 is the optimal cut point (cut point 18 in the raw data), the q value is lower than when 19 is the cut point, as in the previous hypothesis; (2) if 7 is the optimal cut point, corresponding to 21 in the raw data, the q value is also lower. Finally, the optimal cut point in Fig. 3b is either 6 or 7, whichever gives the larger q value.
2. Low accuracy. Upscaling may cause a loss of precision, so that we cannot find the exact location of the optimal cut points in the raw data. For instance, the optimal cut point in Fig. 3b might be either 6 or 7, so the corresponding optimal cut point in the raw data might be any of the values [18, 19, 20, 21, 22, 23], and we cannot find the most accurate cut point at the current upscaling scale number, three.
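The effect of upscaling on the number of cut points can be reproduced with integer division. The sketch below assumes, for illustration only, that the distinct raw values are the integers 10 to 26 (consistent with the strata [10, 19) and [19, 26] and the 16 cut points mentioned in the text); the function names are hypothetical.

```python
import numpy as np

def upscale(x, scale):
    """Divide by the scale number and round the quotient down."""
    return np.asarray(x) // scale

def cut_points(x):
    """Distinct values that can open a new stratum (all but the smallest)."""
    return np.unique(x)[1:]

raw = np.arange(10, 27)            # assumed distinct raw values 10..26
up = upscale(raw, 3)               # upscaled values 3..8

print(len(cut_points(raw)))        # 16 cut points at the raw scale
print(len(cut_points(up)))         # 5 cut points at scale 3
print(raw[up == 6])                # [18 19 20]: raw values merged into 6
```

The last line shows why precision is lost: three raw values become indistinguishable once they share one upscaled value.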

Downscaling
We designed a downscaling method that addresses the above problems while extracting the exact cut points in a short time.
The downscaling procedure sets multiple buffer scales for the raw data. We obtain the data in Fig. 3c by dividing the raw data in Fig. 3a by two and rounding the quotient down. Instead of calculating all combinations in Fig. 3c, we restrict the cut points with the following rule. Let P_i (i = 1, 2, ...) be the optimal cut points found at the last scale S_last (i.e., S_last = 3 for upscaling). Based on P_i and taking Δσ as the neighbourhood (the sliding number in SBDM), we obtain the range of cut points at the next scale (S_next). In Fig. 3b, when the optimal cut point is 6 and the sliding number Δσ is set to 1, the range [5, 7] in Fig. 3b corresponds to [15, 21] in Fig. 3a. Next, we divide the range [15, 21] by two (the next scale number); the resulting range [7, 10] in Fig. 3c contains all the cut points to be evaluated. The other cut points, [5, 6] and [11, 13], are not considered. Supposing the optimal result is 10 in Fig. 3c, the next calculation range in the raw data is [18, 22] using Eq. (3). Finally, only five combinations need to be calculated to obtain the maximum q value. During downscaling, the q value increases and the dissimilarity of the data within each stratum declines, because more accurate cut points are available to fit the spatial distribution of Y.
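The range rule can be illustrated with the numbers from the worked example. Eq. (3) itself is not reproduced in this text, so the formula below is an assumption inferred from the example: the neighbourhood P_i ± Δσ is mapped back to raw units by multiplying by S_last, then divided by S_next and rounded down. The function name is hypothetical.

```python
def next_scale_range(p_opt, delta, s_last, s_next):
    """Candidate cut-point range at the next scale around one optimal cut point.

    Assumed rule (inferred from the worked example, not taken from Eq. 3):
    raw range = [(p_opt - delta) * s_last, (p_opt + delta) * s_last],
    next-scale range = raw range // s_next.
    """
    lo_raw = (p_opt - delta) * s_last
    hi_raw = (p_opt + delta) * s_last
    return lo_raw // s_next, hi_raw // s_next

# Scale 3 -> scale 2: optimum 6 with sliding number 1 gives raw range [15, 21].
print(next_scale_range(6, 1, s_last=3, s_next=2))    # (7, 10)
# Scale 2 -> scale 1 (raw data): optimum 10 gives [18, 22], five candidates.
print(next_scale_range(10, 1, s_last=2, s_next=1))   # (18, 22)
```

Only the candidates inside the returned range are passed to the exhaustive search, which is what keeps each downscaling pass short.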
When cut points are separated by a small interval, however, they need to be treated as a group rather than individually in the exhaustive search. For example, assume that the raw data in Fig. 3a need to be divided into three strata and the data in Fig. 3b have two cut points, 5 and 6 ([3, 5), [5, 6), [6, 8]). The candidate sets for the first and second cut points in Fig. 3c are then [6, 7, 8, 9] and [7, 8, 9, 10], respectively, and it is unclear to which set the shared cut points [7, 8, 9] belong. In this case, we treat the two sets as one category for the exhaustive search; that is, [6, 7, 8, 9] and [7, 8, 9, 10] are merged into [6, 7, 8, 9, 10]. In short, when the optimal cut points are close together, their candidate ranges may conflict during downscaling, and the solution is to merge all the affected cut points into a single exhaustive search. In SBDM, the option that implements this feature is labelled Set Cut Points.
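Merging conflicting candidate ranges is a standard interval-merging step. The sketch below is one possible way to do it, with inclusive integer ranges and a hypothetical function name; it is not the code behind the Set Cut Points option.

```python
def merge_candidate_ranges(ranges):
    """Merge overlapping inclusive (lo, hi) ranges of candidate cut points."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [tuple(r) for r in merged]

# The two candidate sets from the example overlap and become one group.
print(merge_candidate_ranges([(6, 9), (7, 10)]))   # [(6, 10)]
# Well-separated ranges are left as independent groups.
print(merge_candidate_ranges([(7, 10), (1, 3)]))   # [(1, 3), (7, 10)]
```

Within a merged group, all required cut points are then chosen jointly from the combined candidates, so no candidate has to be assigned to one cut point in advance.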

Case study
We designed two application modes, Line and Surface, to apply SBDM to different data types. The Line mode is designed for situations in which the environmental factor X is a linear element, such as a highway, river, or pipeline. With the Line mode, users can detect how the impact on the response variable varies with distance from the line. The Surface mode is used to quantify the power of determinant of surface data on the response variable. We implement one case study for each of the two application modes.
(1) Using the Line mode, we detect the actual influence of rivers on the Sand Cover Ratio (SCR) in the Maowusu (Mu Us) Sandy Land, northern China. We then compare the results from SBDM with the results from Liang and Yang (2016).

Datasets
Mu Us Sandy Land (34,500 km²) is in the centre of the Loess Plateau, China, and is surrounded by the Yellow River (Fig. 4a). The study area (6,201 km²) in our case is a part of Mu Us Sandy Land, which contains some tributaries of the Yellow River (perennial rivers) and some intermittent streams (Fig. 4c). In this case study, for each non-river point in the study area, we take the shortest (Euclidean) distance to the river as the X variable, and SCR as the response variable, Y. The river location is determined by interpreting multi-period remote sensing images and corrected according to high-resolution Google Earth images. The SCR is a quantitative measure of landscape spatial patterns, obtained via Eq. (4) after the sand dune and vegetation zone landscape types were converted to binary values. We applied supervised classification with the maximum likelihood method provided by ENVI® 5.3 to Landsat 8 OLI images to obtain the landscape types in Mu Us Sandy Land; the classification result is shown in Fig. 4c. The SCR can then be obtained by Eq. (4) (Liang and Yang, 2016), where the denominator is the sampling area and the numerator is the area of sand dunes in each 1 km². Figure 4d shows the distribution of SCR. For more details of SCR, readers can refer to Liang and Yang (2016).
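As a rough illustration of the ratio described above (sand dune area divided by sampling area), the sketch below aggregates a binary sand mask into square blocks. Eq. (4) is not reproduced in this text, so this is an assumption-laden sketch: it assumes equal-area pixels, square sampling blocks of `block` × `block` pixels, and a hypothetical function name.

```python
import numpy as np

def sand_cover_ratio(sand_mask, block):
    """Fraction of sand pixels (value 1) in each block x block sampling window."""
    m = np.asarray(sand_mask, dtype=float)
    rows, cols = m.shape[0] // block, m.shape[1] // block
    m = m[:rows * block, :cols * block]          # drop incomplete edge blocks
    # Mean of a 0/1 mask over a block equals sand area / sampling area.
    return m.reshape(rows, block, cols, block).mean(axis=(1, 3))

# Hypothetical 4 x 4 binary map (1 = sand dune, 0 = vegetation).
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
print(sand_cover_ratio(mask, 2))
```

For 30 m Landsat pixels, a block of roughly 33 pixels per side would approximate the 1 km² sampling unit, although the exact zonal statistics used in the original study may differ.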
The second case is undertaken in the Xinjiang Uygur Autonomous Region (1,660,000 km²), located on the border of north-western China. Figure 4b shows the NDVI distribution in 2013 after processing with the maximum value composite method (Holben, 1986)

Steps of SBDM operation
SBDM accepts Image (*.tif) and Text (*.txt) files as input, to support the analysis of spatial and non-spatial data using GDM. Here, we take the image format as an example. All of the above data are converted into GeoTIFF (*.tif) files, one of the most widely used raster file formats in GIS (Bernard and Ostlander, 2008). The steps for using the software are as follows.
Step 1. We constrain all images to the same size and spatial extent, which is the basic condition for raster calculation. Then, all digital numbers of the environmental factors (X) are converted into integer format, which is required for input into the SBDM software. For example, the range of WV in the second case study, [2.2, 5.9], is multiplied by 10 to give [22, 59] and avoid loss of precision.
Step 2. Mode selection. Firstly, we define the type of the environmental factor, X. Then, we select the appropriate mode: Line mode (investigating the strata of influence of line-type data such as a river) or Surface mode (detecting the power of surface-type environmental factors on the response variable).
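The integer conversion in Step 1 can be sketched as a scale-and-round operation. This is an illustrative helper, not part of the SBDM software; rounding to the nearest integer (rather than truncating) is an assumption made here to guard against floating-point error such as 2.2 × 10 evaluating to 22.000000000000004.

```python
import numpy as np

def to_integer_grid(x, factor):
    """Scale values by `factor` and round to the nearest integer."""
    return np.rint(np.asarray(x, dtype=float) * factor).astype(np.int64)

# Range of WV in the second case study, one decimal place of precision.
print(to_integer_grid([2.2, 5.9], 10))   # [22 59]
```

The factor should match the precision of the data (10 for one decimal place, 100 for two), since a larger factor also increases the number of cut points to be searched.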
Generally, the computing procedure consists of one upscaling pass followed by several downscaling passes. In this part, we first show an example of the Line mode on a general desktop PC (i7-2600 CPU, 4 cores, 3.4 GHz, 8 GB RAM). Before using the Line mode, two pre-treatments are needed: (1) If the line element data are in shapefile format (i.e., Polyline [*.shp]), we calculate the minimum distances between non-river points and river points in raster format. This step can be completed using the Euclidean Distance tool in the toolbox of ArcGIS® 10.3 (ESRI, 2015). The pixel size of the resultant raster is the same as that of the data Y (SCR). (2) In geographic science, the influence of a linear geographical element generally decreases as the distance increases, and there is no effect beyond a certain distance. Therefore, we set a parameter in SBDM to reduce the cost of calculation: the Extremum Number, which refers to the greatest distance at which the linear geographical element influences Y. For example, Liang and Yang (2016) show that an Extremum Distance of 21 km indicates that beyond 21 km the rivers have little or no influence on SCR. Here, we also take 21 km as the extremum number.
Step 3. Upscale processing. We set the upscaling number according to the range and size of X; the number of strata is specified by the user. Here, the distance range of the river is [0, 27730] m, and the size is 117 × 53 units. The scale number of upscaling is set to 900, meaning that there are approximately 31 cut points (27730/900 ≈ 31) and 593,775 combinations when the number of strata is set to seven. After pushing the run button, it takes four seconds to finish the calculation (Fig. 5). The q value and cut points shown in the Upscaling Results box are the optimal values at the current scale (i.e., 900).
Step 4. Downscale processing. The purpose of this step is to obtain more accurate q values and cut points. In Step 3, the optimal cut points are at the scale of 900 (defined as the Last Scale number in this step). The downscaling path can be 900-1 directly or step by step (e.g., 900-300-1) to obtain the most accurate results. In this example, we compare the downscaling paths 900-1 and 900-300-1. The results show that the path 900-300-1 (19 seconds) was less time consuming but produced the same results as 900-1 (603 seconds), as shown in Fig. 5. Finally, the results shown in the Downscaling Results box are the maximum q value (0.1545) and the corresponding, most accurate optimal cut points (0, 1000, 1414, 2236, 3000, 4242, 10049, 27730).
The above steps describe how to use the Line mode in SBDM to find the best strata for the influence of a line element (river) on a response variable (e.g., Sand Cover Ratio). For the Surface mode, the only difference is that an extremum number is not required as input. For the detailed meaning of each button and I/O box in the software, and for how to use Text files (*.txt) as input data, readers can refer to SBDM_Software_Instruction_Manual.docx in the Supplementary files.
The discretization results (Strata), the average value of SCR in each stratum, and the q values from SBDM and Prior Knowledge are shown in Table 1. The two discretization methods yield very different results. In SBDM, the stratum [1414, 2236) has the highest value of SCR, suggesting that this range of distance from the river is the most likely to form sandy land. From the Prior Knowledge results, by contrast, we only know that the stratum [1000, 3000) has the highest contribution rate to sandy land formation. Furthermore, the stratum [3000, 4242) also has a relatively high driving force on sandy land formation, but this information cannot be found with the Prior Knowledge method. In addition, the q value is 0.154 in SBDM, which is higher than the value of 0.136 from Prior Knowledge. This is because field investigation has certain limitations and the mechanism by which rivers affect vegetation is complicated. The levels of influence distance defined by Prior Knowledge are commonly not accurate enough, which may result in an imprecise (lower) q value in the GDM. Therefore, by stratifying the environmental factors according to the response variable itself, SBDM can reveal a more precise biotope range of the geographical variables.
4.2 Impacts of seven environmental factors on NDVI spatial pattern in Xinjiang, north-western China (Surface Mode)
Figures 7 and 8 show the results of the factor detector for the impacts of seven environmental factors on the spatial pattern of NDVI in Xinjiang, north-western China, with different discretization methods. For each factor, the q value is highest when SBDM is applied. The q value measures the degree of impact of a factor on the response variable: the larger the q value, the greater the impact of the environmental factor on the geophysical variable. The impact power can be quantified as q × 100 % (Wang and Xu, 2017). The results show that the order of the determinant power of the seven factors on NDVI is roughly the same for all discretization methods (i.e., the q value of PREC > WV > TEMP > SD > SH > SA, except for TEMP > DEM < SD in the GI method; see Factor_Detector.xlsx in the Supporting Information). Note that the q values of DEM differ considerably between discretization methods. This might be caused by the large range of DEM, from -157 m to 7913 m, which can generate thousands of cut points and result in obviously different q values. In addition, the factor detector results from SBDM reveal that the TEMP (q = 0.175) and DEM (q = 0.174) factors have roughly the same impact power, while in the GI method TEMP (q = 0.159) is more dominant than DEM (q = 0.046).
The ecological detector assesses the relative importance of two factors for the spatial distribution of the response variable (Wang et al., 2010). The ecological detector also shows that TEMP and DEM have no significant difference with SBDM (p = 0.00, see Ecological_Detector.xlsx of the Supporting Information). Therefore, the SBDM software allows us to obtain a realistic discretization and confirm the ratio of the power of determinant between different environmental factors.
The risk detector calculates the mean value of each stratum for one factor and tests whether it differs significantly from that of another stratum of the same factor, revealing where the risk areas are. Bigger differences suggest more risk to the response variable within the stratum (Cao et al., 2013). Once again, the results are closely associated with the cut points determined by the discretization methods. In the case study, the risk detector results include the average value of each stratum and the p value of Student's t test for five different discretization methods (see Risk_Detector.xlsx of the Supporting Information). Using SA as an example, we find that the average value of the first stratum using SBDM is about twice that obtained with the other discretization methods, indicating that the first level of SA has a strong influence on vegetation formation. Therefore, by increasing the differentiation between strata and improving the consistency of data within each stratum, SBDM can effectively detect the relationship between the different strata of environmental factors and the response variable.
The interaction detector quantifies the interactive effect of two factors on the response variable. For example, by comparison with the effects of each factor alone, we can examine whether temperature and elevation interact or contribute independently to the spatial pattern of NDVI. The interaction detector can solve the collinearity problem of independent variables. We expand the number of interacting factors from the traditional two to seven to make a comprehensive comparison (Wu et al., 2017; Shrestha and Luo, 2017; Zou et al., 2017). Our results show that the interactive effect of the combination of seven factors can only account for 65.9 % of the influence from all underlying factors (see Interaction_Detector.xlsx of the Supporting Information).
This suggests that about 34.1 % of the influence, which is due to other factors, is neglected when the current seven environmental factors are selected. However, when the QU method is applied, the interactive q value of the seven factors is 0.67, which is greater than the q value from SBDM, implying that the optimal discretization of single factors is not necessarily optimal when the interaction detector is applied. Because interaction involves more factors, a complete evaluation may require a more complicated combination. Hence, the optimal discretization of a single factor with the maximum q value might not produce the maximum interacted q. The q values derived from a single factor and from the interaction might be independent. At present, we cannot judge the effect of discretization methods by the interactive q value (Cao et al., 2013). Future work should explore mechanisms that explain the interactive effects between factors.
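The interactive q value can be understood as the q-statistic of the strata formed by overlaying two stratifications. The sketch below illustrates that idea on toy data; it assumes population variances, and the function names are hypothetical rather than taken from GDM or SBDM software.

```python
import numpy as np

def q_statistic(y, labels):
    """q = 1 - sum_h(N_h * var_h) / (N * var), using population variances."""
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    total = y.size * y.var()
    if total == 0:
        return 0.0
    within = sum(y[labels == h].size * y[labels == h].var()
                 for h in np.unique(labels))
    return 1.0 - within / total

def interaction_q(y, labels_a, labels_b):
    """q of the overlay: each distinct (a, b) pair of strata is one new stratum."""
    a = np.unique(labels_a, return_inverse=True)[1]
    b = np.unique(labels_b, return_inverse=True)[1]
    combined = a * (b.max() + 1) + b     # unique integer code per (a, b) pair
    return q_statistic(y, combined)

# Toy example (hypothetical): two factors that each explain part of Y.
y = [1, 2, 3, 4]
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
print(q_statistic(y, a), q_statistic(y, b), interaction_q(y, a, b))
```

Because the overlay depends on how the cut points of both factors intersect, cut points that maximize each single-factor q need not maximize the interactive q, which is consistent with the QU result reported above.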

Report of calculation
The SBDM software is able to run on a desktop PC with no special hardware (e.g., Graphics Processing Unit). In this paper, we have made a test report for the results of the SBDM software based on the second case study. The report is given in Report.xlsx in the Supporting Information. Using SH as an example (Table 2), it takes the SBDM software about 1.5 minutes to process the data, compared with 7.6 days when using exhaustive search directly to acquire the most accurate cut points. Moreover, when an exhaustive search processes a factor with a large range, such as DEM in this study, the computing time without downscaling could be 1.4 million years on a desktop PC, whereas the SBDM software takes only 1.5 hours. Collectively, we demonstrate that using exhaustive search to list all possible combinations and using SBDM yield the same q value and cut points, but the SBDM software obtains the optimal discretization results in an extremely short time. In addition, according to the time cost report, more computing time is required when there is a larger interval between successive scales, particularly when downscaling is close to the highest resolution (i.e., scale 1).

[Table 1 residue. SBDM strata: [0, 1000), [1000, 1414), [1414, 2236), [2236, 3000), [3000, 4242), [4242, 10049).]

[Table 2 footnotes.]
b The data without * markers are the results from downscaling based on the last scale, e.g., for the SH factor, the results (cut points, time cost, and q value) of scale 4 are based on the results of scale 6 when downscaling is applied (downscaling path: 6-4), and those of scale 2 are based on scale 4 (downscaling path: 4-2).
c USTC and DSTC denote the upscaling and downscaling time costs, respectively, for different sliding numbers (Δσ).
d The letters "s", "m", "h" and "d" represent seconds, minutes, hours and days, respectively.
(2) Using the Surface mode, we quantify the dominant controlling factors of the Normalized Difference Vegetation Index (NDVI) and the significance of the rank order of its environmental factors. The seven environmental factors are Elevation (ELEV), Slope Degree (SD), Slope Aspect (SA), Precipitation (PREC), Wind Velocity (WV), Temperature (TEMP), and Specific Humidity (SH). We compare NDVI to the environmental factors in the Xinjiang Uygur Autonomous Region, north-western China, and compare the results from SBDM with the results from four conventional discretization methods (EI, NB, QU, GI) as described in Sect. 1. We then test the four widely used detectors of GDM (Risk, Factor, Ecological, and Interaction) with SBDM.
Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2018-274. Manuscript under review for journal Geosci. Model Dev. Discussion started: 30 November 2018. © Author(s) 2018. CC BY 4.0 License.

4.1 River influence on SCR in the Maowusu (Mu Us) Sandy Land, northern China (Line Mode)
Based on field investigation or prior knowledge, Liang and Yang (2016) stratified the buffer region of the river into seven strata. Figures 6a and 6b display the discretization results of the river distance from SBDM and from Liang and Yang (2016), respectively.

Figure 2. Conceptual architecture of the algorithm with the upscaling and downscaling processing steps in the SBDM software.

Figure 3. Diagram illustrating the steps of upscaling and downscaling processing for the key algorithm of the SBDM software. (a) Raw data consisting of 9 × 9 cells. (b) Upscaling result from the raw data divided by the scale number three. (c) Downscaling result from the raw data divided by the buffer scale number two. All data are rounded down to integer format. The red lines represent the cut points that divide the data into discrete strata.

Figure 4. Two case studies. (a) Location of the two study areas (Maowusu (Mu Us) Sandy Land and Xinjiang) in China. (b) The spatial pattern of NDVI (1 km × 1 km resolution) in Xinjiang, north-western China in 2013 processed by the MVC method. (c) Binary map (sand dune and vegetation zone) of the landscape and river distribution in the Maowusu (Mu Us) Sandy Land, northern China. (d) Sand Cover Ratio map derived from Fig. 4c via the zonal statistics method (Liang and Yang, 2016).

Figure 5. The processing window of the SBDM software for obtaining the optimal strata of river distance to SCR in the Maowusu (Mu Us) Sandy Land, northern China.

Figure 6. Stratified map of river distance in the Maowusu (Mu Us) Sandy Land, northern China. (a) Stratified map of river distance obtained by SBDM. (b) Stratified map of river distance determined by Liang and Yang (2016).

Figure 7. Stratified map of seven factors in Xinjiang, north-western China, with different discretization methods. The first column shows the raw continuous data distribution. Note the distinct differences between the stratified maps from SBDM and from the other discretization methods.

Figure 8. Comparison of the q values of seven factors for the NDVI spatial pattern in Xinjiang, north-western China, with different discretization methods (see Factor_Detector.xlsx in the Supporting Information for detailed q values and cut points). Note that the red dots are above the other symbols, indicating the larger q values from SBDM.

Table 1. Stratified results from SBDM and Prior Knowledge for the river impact distance to SCR in the Maowusu (Mu Us) Sandy Land, northern China.

Table 2. A comparative analysis of the results for the effects of the Specific Humidity (SH) factor on NDVI spatial pattern in Xinjiang, north-western China, with and without downscaling processing of the SBDM software.