An integrated user-friendly ArcMAP tool for bivariate statistical modelling in geoscience applications

Modelling and classification difficulties are fundamental issues in natural hazard assessment. A geographic information system (GIS) is a domain that requires users to use various tools to perform different types of spatial modelling. Bivariate statistical analysis (BSA) assists in hazard modelling. To perform this analysis, several calculations are required and the user has to transfer data from one format to another. Most researchers perform these calculations manually by using Microsoft Excel or other programs. This process is time-consuming and carries a degree of uncertainty. The lack of proper tools to implement BSA in a GIS environment prompted this study. In this paper, a user-friendly tool, the bivariate statistical modeler (BSM), is proposed for the BSA technique. Three popular BSA techniques, namely the frequency ratio (FR), weight-of-evidence (WoE), and evidential belief function (EBF) models, are applied in the newly proposed ArcMAP tool. This tool is programmed in Python and provided with a simple graphical user interface (GUI), which facilitates the improvement of model performance. The proposed tool implements BSA automatically, thus allowing numerous variables to be examined. To validate the capability and accuracy of this program, a pilot test area in Malaysia is selected and all three models are tested by using the proposed program. The area under the curve (AUC) is used to measure the success rate and prediction rate. Results demonstrate that the proposed program executes BSA with reasonable accuracy. The proposed BSA tool can be used in numerous applications, such as natural hazard, mineral potential, hydrological, and other engineering and environmental applications.


Introduction
Techniques to predict a response variable given a set of characteristics are required in several scientific disciplines. Numerous applications have been implemented in various areas of geosciences. Bivariate analysis is one of the simplest methods of statistical analysis and is popular in numerous fields of study. Mathematicians, statisticians, biologists, and hydrologists use this method to perform their analysis. Different types of bivariate statistical analysis (BSA) have been established, for example, frequency ratio (FR), weight of evidence (WoE), and evidential belief function (EBF) (Yalcin, 2008). Although each of these methods requires specific mechanisms for calculation, all of them operate on the same concept. Environmental scientists model various natural conditions by using BSA. For instance, Ozdemir (2011) employed this technique for the same purpose; the results of the analysis were plotted in ArcGIS after computation in other programs. Mineral potential mapping is also aided by BSA techniques. Carranza (2004) used WoE modelling to map the mineral potential in the administrative province of Abra in northwestern Philippines. The results indicate the plausibility of WoE in the mineral potential mapping of large areas with a small number of mineral prospects. Researchers have applied WoE in mapping mineral potential (Bonham-Carter et al., 1989) and it remains popular in this area of research (Carranza et al., 2008).
BSA is in demand in hazard studies because its procedure is simple and efficient. This technique has been used in natural hazard applications by researchers to predict the spatial distribution of events. Extensive literature on different BSA techniques and their proficiency assessment is also available. BSA techniques can be used as a simple geospatial analysis tool to determine the probabilistic correlation between dependent variables (produced by using the inventory map of a hazard incidence) and independent variables (conditioning factors) containing multi-categorized maps (Oh et al., 2011). In BSA, through the overlay of conditioning factors and the computation of hazard densities, the significance of each factor, or of a particular combination of factors, can be investigated individually. Bivariate statistical analysis functions by using a dependent variable and one conditioning factor. Hence, the significance of each factor is investigated separately (Porwal et al., 2006).
In BSA, each conditioning factor is overlaid with the dependent variable. On the basis of the event density, weights are measured for each class of each factor. By using normalized weights (the correlation between the event density in each class of conditioning factor and the event density of the entire region), each conditioning factor is reclassified and the hazard map is produced. By using the acquired weights, decision rules can be produced on the basis of the knowledge of experts. Conditioning factors can also be combined to generate a map with uniform units, which is then overlaid with the inventory map to provide the density per class. The BSA approach has been used in landslide mapping (Constantin et al., 2011), earthquake studies (Xu et al., 2012b), flood susceptibility mapping (Tehrany et al., 2013), land subsidence (Kim et al., 2006; Lee and Park, 2013), and risk analysis (Hu et al., 2009). Numerous studies have been conducted to exploit the potential application of BSA in the hazard domain.
One study examined the efficiency of statistical analysis, particularly bivariate analysis, in landslide studies in the Cuyahoga River watershed (Nandi and Shakoor, 2010). In another study, FR and WoE were applied in the Sultan Mountains of southwestern Turkey to map areas that are susceptible to landslides (Ozdemir and Altural, 2013). According to Nandi and Shakoor (2010) and Ozdemir and Altural (2013), the BSA model is simple and its input, computation, and outcome procedures are easily understood. The application of EBF in landslide studies has also been investigated (Lee et al., 2013). Four functions, namely degree of belief (Bel), degree of disbelief (Dis), degree of uncertainty (Unc), and degree of plausibility (Pls), are calculated separately to determine EBF.
Each of these functions produces valuable information. However, each function requires individual computations with specific formulas. Tien Bui et al. (2012) used EBF and fuzzy logic methods in their research and found that the landslide susceptibility map derived from EBF has the highest prediction ability. They also established the efficiency of BSA in landslide mapping.
BSA is also popular in hydrological research. Flood susceptibility maps assist in mitigation strategies. Lee et al. (2012) used the statistical method of FR to produce a map of flood-prone regions in Busan, Korea, in a geographic information system (GIS). Tehrany et al. (2013) proposed an ensemble method of FR and logistic regression (LR) to detect regions with high flood probability in Kelantan, Malaysia. The conditioning factors were reclassified on the basis of the weights acquired from the FR technique. These factors were entered in LR processing to obtain the multivariate statistical analysis (MSA) result. If the calculation time for these statistics can be reduced, the efficiency of the developed ensemble method will be enhanced. Hence, producing a tool that is capable of performing BSA calculations will help reduce the calculation time of ensemble methods.
The BSA model has been widely used in land subsidence susceptibility mapping. In a study by Lee and Park (2013), the FR model was applied and compared with the decision tree (DT) machine learning method. BSA is commonly used in natural hazard investigations. Although this method is not novel, the use of BSA has increased in recent years. Remote sensing (RS) and GIS have revolutionized the domain of natural hazards (Jebur et al., 2013a, b). A spatial database consists of different data types that have to be transferred from one format to another because specific programs accept only specific data formats. Scientists have started to develop new programs in hazard studies because of the vital role of early warning systems in such applications (Osna et al., 2014; Pradhan et al., 2014). GIS is capable of storing, analyzing, and showing geographic information. It makes it possible to collect, organize, explore, model, and view spatial data for solving complex problems (Barreca et al., 2013). Spatial data analysis ranges from the simple overlaying of various thematic layers to identify a region to the more complex use of mathematical equations or combined statistical models for the prediction of natural hazards. The importance of GIS in catastrophe evaluation has been demonstrated by many studies on the use of GIS tools in the exploration of various types of data (Steiniger and Hunter, 2013).
For example, existing hydrological GIS-based tools, such as Mike SHE and ArcSWAT, have shown considerable power in enhancing the accuracy of soil and water evaluations (Lei et al., 2011). These tools facilitate the modelling and calibration procedure, decrease the number of stages in implementing the models, and increase the precision of the outcomes (Hörmann et al., 2009). Tools that automatically implement susceptibility mapping have also been created: Akgun et al. (2012) proposed MamLand, a program in MATLAB, to create landslide susceptibility maps by using a fuzzy inference system. ArcGIS allows users to produce specific tools for spatial analysis (Stevens et al., 2007). For instance, Pradhan et al. (2014) developed a tool in ArcGIS to apply texture analysis to high-resolution radar data. Recently, a GIS-based system has been developed by Barreca et al. (2013) to evaluate and process the hazard associated with active faults influencing the eastern and southern flanks of Mt. Etna. The proposed tool was created in ArcGIS and contains various thematic data sets. It includes spatially referenced arc features and associated databases. In another paper, Lei et al. (2011) integrated the hydrological code EasyDHM and proposed an open-source MapWindow GIS tool called MWEasyDHM. Their aim was to create the tool by combining modules for preprocessing, modelling, viewing, and analysis. The MWEasyDHM tool is user-friendly, free, and proficient, and provides selectable multi-functional hydrological analysis. Similarly, a number of GIS tools were programmed by Etherington (2011) in the Python environment for landscape genetics research. These tools are capable of transforming files, viewing genetic relatedness, and calculating landscape associations through a least-cost path procedure. The tools are free and available in ArcToolbox. In a separate paper, Roberts et al. (2010) conducted research to facilitate advanced analytic methods. A Marine Geospatial Ecology Tool (MGET) was created in a GIS environment; it is free, easy to use, and efficient for ecologists. The tool was built by integrating the programming environments of Python, R, MATLAB, and C++.
The current research aims to reduce the processing time of BSA by introducing an easy-to-use ArcMap tool. Given the processing time and the difficulties of BSA described above, a program that is capable of calculating BSA automatically should be developed. Hence, a tool programmed in Python and based on the BSA technique is proposed. This tool automatically extracts the correlation between each class of conditioning factor and event occurrence, reclassifies the factors on the basis of the acquired weights in a GIS environment, and saves each result in a separate folder. A simple graphical user interface (GUI) improves the model operation because Python knowledge is not required. The entire process can be performed in ArcGIS without any requirement for another program. The proposed tool was tested to generate a landslide susceptibility map of Bukit Antarabangsa, Ulu Klang, Malaysia.

Methodology
The procedural and theoretical perspectives of BSA applied in this research include several steps (Fig. 1). As shown in the methodology flowchart, the BSA tool was developed and integrated into ArcGIS. To apply BSA, the conditioning factors should be provided in raster format and classified with a proper scheme by the user. BSA recognizes the effects of each class of conditioning factor on event occurrence. Hence, this step cannot be eliminated from the BSA process. As a second stage, a dependent variable (training layer) should be constructed by using the inventory map and other resources. This layer should contain a pixel value of one to represent the existence of an event. Once the conditioning factors are classified and the training layers are prepared, FR, WoE, and EBF can be applied automatically. The developed program reclassifies each conditioning factor by using the attained weights and saves the results in a separate folder. The conditioning factors that have been assessed by BSA are then ready to be entered in the raster calculator to derive the corresponding hazard map. The following subsection presents the overall information on the scheme and functionality of the developed tool.
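The final raster-calculator step can be illustrated with a minimal sketch. It assumes that the hazard index is the per-pixel sum of the reclassified factor weights, which is a common choice for FR-type models but is not stated explicitly here. Rasters are represented as flat Python lists, and the function name `hazard_index` and the sample values are hypothetical.

```python
def hazard_index(reclassified_factors):
    """Sum the per-pixel weights of all reclassified factor rasters.

    Each raster is a flat list of weights; all rasters must cover the
    same pixels in the same order (an assumption of this sketch).
    """
    n_pixels = len(reclassified_factors[0])
    for raster in reclassified_factors:
        if len(raster) != n_pixels:
            raise ValueError("all rasters must cover the same pixels")
    # zip(*...) walks the rasters pixel by pixel
    return [sum(values) for values in zip(*reclassified_factors)]


# Hypothetical reclassified weights for two conditioning factors
slope_weights = [2, 5, 1, 3]
geology_weights = [1, 1, 4, 2]
index = hazard_index([slope_weights, geology_weights])
```

In ArcGIS the same operation would be a map-algebra sum of the reclassified rasters; the list-based version only makes the arithmetic explicit.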

Overall information on scheme and functionality
The program is developed by using ArcGIS and Python for BSA. The tool can be used in the ArcGIS 9 and 10 versions. Figure 2 displays the interface of the tool in the GIS toolbox.
The ArcToolbox provided in this research is used to enter the proposed tool in ArcMap. The user defines the source of the Python files of each model from the properties menu of the script (Fig. 3).
The program is partitioned into three sections: FR, WoE, and EBF. The theoretical concept and graphic interface of each tool are discussed in the following sections.

Frequency ratio
The theoretical expression of FR, as well as its usage in landslide susceptibility and flood mapping, has been reported in the studies conducted by Yilmaz (2009) and Tehrany et al. (2013). The FR method has a simple and understandable structure compared with other probabilistic methods. FR is described as the ratio of the area where an event occurred to the entire area; it is also defined as the ratio of the likelihood of an event occurrence to a non-occurrence for a particular attribute. FR can be calculated by using the following equation:

$$FR = \frac{N_{pix}(SX_i) \big/ \sum_{i=1}^{m} N_{pix}(SX_i)}{N_{pix}(X_j) \big/ \sum_{j=1}^{n} N_{pix}(X_j)}$$

where $N_{pix}(SX_i)$ is the number of pixels that contain an event in class $i$ of the independent variable $X$; $N_{pix}(X_j)$ is the number of pixels within the independent variable $X_j$; $m$ is the number of classes of the independent variable $X_i$; and $n$ is the total number of independent variables in the whole area (Yilmaz, 2009). Most researchers have performed these calculations manually by using Microsoft Excel or other programs. Once the weights were obtained, these values were used to reclassify the independent variables by using the Spatial Analyst tool in ArcGIS. The raster calculator in ArcGIS was then used to obtain the final susceptibility map. The proposed tool in ArcMap can apply FR automatically and reclassify the independent variables on the basis of the obtained weights.
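As an illustration of the FR calculation for a single conditioning factor, the following minimal sketch computes, for each class, the share of event pixels falling in that class divided by the share of the area occupied by that class. It is not the code of the proposed tool: the rasters are flat lists, the training layer uses the value one for event pixels as described in the methodology, and the function name `frequency_ratio` and the sample data are assumptions.

```python
from collections import Counter


def frequency_ratio(factor_classes, training):
    """Return {class: FR} for one classified conditioning factor.

    factor_classes: class label of every pixel
    training: 1 where an event is recorded, 0 elsewhere (same pixel order)
    """
    class_pixels = Counter(factor_classes)
    event_pixels = Counter(
        cls for cls, flag in zip(factor_classes, training) if flag == 1
    )
    total_pixels = len(factor_classes)
    total_events = sum(event_pixels.values())
    if total_events == 0:
        raise ValueError("the training layer contains no event pixels")

    fr = {}
    for cls, n_class in class_pixels.items():
        event_share = event_pixels.get(cls, 0) / float(total_events)
        area_share = n_class / float(total_pixels)
        fr[cls] = event_share / area_share
    return fr


# Hypothetical factor with two classes: class 1 has 4 pixels (1 event),
# class 2 has 6 pixels (3 events)
factor = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
events = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
fr_values = frequency_ratio(factor, events)
```

A value above one indicates that events are over-represented in the class relative to its area, and a value below one indicates the opposite.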
The graphic interface of the FR tool consists of one window containing four fields (Fig. 4). Each field is user-defined in ArcGIS. The first field is the input raster, which is related to the desired conditioning factor. The training layer or dependent variable, which is predefined and saved prior to analysis, is selected for the second field. The cell size of the output and its location are specified by the user in the third and fourth fields, respectively. The developed tool has a simple structure, thus providing BSA for each conditioning factor within a few seconds. In manual calculations, this procedure usually requires a considerable amount of time. The proposed tool reclassifies the analyzed conditioning factor based on the attained weights and saves it in the folder selected by the user.

Weight of evidence
The WoE method is a data-driven technique based on the Bayesian probability framework (Beynon et al., 2000; Neuhäuser and Terhorst, 2007; Porwal et al., 2006). This characteristic provides additional advantages to the proposed tool compared with other statistical methods. To implement WoE, two important parameters, the positive weight ($W^{+}$) and the negative weight ($W^{-}$), are computed (Bonham-Carter et al., 1989). This technique calculates the weight for each independent variable ($B$) on the basis of the existence or non-existence of the event ($A$) within the study area (Xu et al., 2012a) by using the following equations:

$$W^{+} = \ln\frac{P(B \mid A)}{P(B \mid \bar{A})}, \qquad W^{-} = \ln\frac{P(\bar{B} \mid A)}{P(\bar{B} \mid \bar{A})}$$

where $P$ represents the probability and $\ln$ is the natural logarithm. $B$ and $\bar{B}$ denote the existence and non-existence of the independent variable, and $A$ and $\bar{A}$ denote the existence and non-existence of the event. A positive weight ($W^{+}$) indicates the presence of the specific independent variable at the event, and its magnitude represents the positive correlation between the presence of the independent variable and the event. A negative weight ($W^{-}$) indicates the non-existence of the independent variable and shows the amount of negative correlation. The weight contrast is the difference between the two weights:

$$C = W^{+} - W^{-}$$

The size of the weight contrast reflects the spatial relationship between the independent variable and the event. The $C$ value is positive in the case of a positive relationship and negative in the case of a negative relationship.
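The WoE quantities discussed in this section can be sketched from four pixel counts for one class of a conditioning factor. This is an illustrative sketch rather than the tool's code: probabilities are estimated as pixel-count ratios, the studentized contrast uses the usual count-based variance estimates (assumed here to be $1/N$ terms, following the standard WoE formulation), and the function and argument names are hypothetical. All four counts must be non-zero.

```python
import math


def weights_of_evidence(n_b_a, n_b_not_a, n_not_b_a, n_not_b_not_a):
    """Return (W+, W-, C, studentized C) from four pixel counts.

    n_b_a:         class present, event present
    n_b_not_a:     class present, event absent
    n_not_b_a:     class absent,  event present
    n_not_b_not_a: class absent,  event absent
    """
    n_a = float(n_b_a + n_not_b_a)              # all event pixels
    n_not_a = float(n_b_not_a + n_not_b_not_a)  # all non-event pixels

    w_plus = math.log((n_b_a / n_a) / (n_b_not_a / n_not_a))
    w_minus = math.log((n_not_b_a / n_a) / (n_not_b_not_a / n_not_a))
    contrast = w_plus - w_minus

    # Assumed variance estimates: S^2(W+) and S^2(W-) as sums of 1/count
    var_plus = 1.0 / n_b_a + 1.0 / n_b_not_a
    var_minus = 1.0 / n_not_b_a + 1.0 / n_not_b_not_a
    studentized = contrast / math.sqrt(var_plus + var_minus)
    return w_plus, w_minus, contrast, studentized


# Hypothetical counts for one class of one conditioning factor
w_plus, w_minus, contrast, studentized = weights_of_evidence(30, 170, 10, 790)
```

With these counts the class is positively associated with the event, so $W^{+}$ is positive, $W^{-}$ is negative, and the contrast equals the logarithm of the odds ratio of the 2 x 2 table.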
The standard deviation of the contrast $C$ is calculated as follows:

$$S(C) = \sqrt{S^{2}(W^{+}) + S^{2}(W^{-})}$$

where $S^{2}(W^{+})$ and $S^{2}(W^{-})$ are the variances of the positive and negative weights, respectively. These variances can be calculated by using the following equations:

$$S^{2}(W^{+}) = \frac{1}{N\{B \cap A\}} + \frac{1}{N\{B \cap \bar{A}\}}, \qquad S^{2}(W^{-}) = \frac{1}{N\{\bar{B} \cap A\}} + \frac{1}{N\{\bar{B} \cap \bar{A}\}}$$

where $N\{\cdot\}$ denotes the number of pixels in the corresponding overlap. The studentized contrast is obtained by dividing the contrast by its standard deviation, $C / S(C)$. The studentized contrast is the final weight and serves as an informal test of whether $C$ is considerably different from zero, that is, whether the contrast is likely to be real. A complete explanation of the mathematical formulation of this method is available in Xu et al. (2012b). Figure 5 illustrates the user interface of the WoE tool. Each field should be defined in the same way as for FR.

Evidential belief function
The Dempster-Shafer theory of evidence, which underlies EBF, has been used in several fields of study, including environmental and hazard studies (Awasthi and Chauhan, 2011). The advantages of this theory are its relative flexibility, its acceptance of uncertainty, and its capability to combine beliefs from different sources of evidence. EBF estimates the probability that a hypothesis is true and evaluates how close the evidence is to the truth of that hypothesis.
Compared with FR, a more complex procedure is required to calculate EBF. To compute the EBF, four functions (Bel, Dis, Unc, and Pls) should be measured separately (Lee et al., 2013). Individual computation by using specific formulas is required to provide this information.
Assume that a set of independent variables $C = \{C_i \mid i = 1, 2, 3, \ldots, n\}$, which contains mutually exclusive and exhaustive factors $C_i$, is used in the current research. The function $m: P(C) \rightarrow [0, 1]$ is the basic probability assignment, where $C$ is the frame of discernment and $P(C)$ is the set of all subsets of $C$, including the empty set ($\emptyset$) and $C$ itself. This function is also called the mass function and satisfies $m(\emptyset) = 0$ and $\sum_{A \subseteq C} m(A) = 1$, where $A$ is any subset of $C$. The degree to which the evidence supports $A$ is measured by $m(A)$, which is represented by a belief function ($\mathrm{Bel}(A)$). Suppose that $N(L)$ and $N(C)$ are the total number of pixels affected by the event and the total number of pixels in the study area, respectively; $C_{ij}$ is the $j$th class of the independent variable $C_i$ ($i = 1, 2, 3, \ldots, n$); $N(C_{ij})$ is the total number of pixels in class $C_{ij}$; and $N(L \cap C_{ij})$ is the number of pixels affected by the event in $C_{ij}$. The data-driven measurements of EBF can then be calculated using Eq. (6) and the following equations (Tien Bui et al., 2012), where the weight of $C_{ij}$ is denoted by $W_{C_{ij}(\mathrm{event})}$ and supports the belief that the presence of the event is more likely than its non-existence.
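For reference, the data-driven EBF estimators can be written as follows. This block is a reconstruction under stated assumptions, not a copy of the numbered equations of this paper: it uses the form commonly given in the EBF literature, expressed with the symbols defined above, and the symbol $W_{C_{ij}(\text{no event})}$ is introduced here only for illustration.

```latex
\begin{align*}
W_{C_{ij}(\text{event})} &=
  \frac{N(L \cap C_{ij}) / N(L)}
       {\left[ N(C_{ij}) - N(L \cap C_{ij}) \right] / \left[ N(C) - N(L) \right]},
&
\mathrm{Bel}(C_{ij}) &=
  \frac{W_{C_{ij}(\text{event})}}{\sum_{j} W_{C_{ij}(\text{event})}}, \\[6pt]
W_{C_{ij}(\text{no event})} &=
  \frac{\left[ N(L) - N(L \cap C_{ij}) \right] / N(L)}
       {\left[ N(C) - N(L) - N(C_{ij}) + N(L \cap C_{ij}) \right] / \left[ N(C) - N(L) \right]},
&
\mathrm{Dis}(C_{ij}) &=
  \frac{W_{C_{ij}(\text{no event})}}{\sum_{j} W_{C_{ij}(\text{no event})}}, \\[6pt]
\mathrm{Unc}(C_{ij}) &= 1 - \mathrm{Bel}(C_{ij}) - \mathrm{Dis}(C_{ij}),
&
\mathrm{Pls}(C_{ij}) &= 1 - \mathrm{Dis}(C_{ij}).
\end{align*}
```

Under this formulation, $\mathrm{Bel} + \mathrm{Unc} + \mathrm{Dis} = 1$ for every class, and $\mathrm{Pls} = \mathrm{Bel} + \mathrm{Unc}$.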
The detailed mathematical calculation of each function has been discussed in several studies such as Lee et al. (2013). Figure 6 represents the interface of the EBF tool, which contains three more fields than the two other methods because each EBF function should be applied and saved in a separate folder. Hence, after the selection of the conditioning factor, training layer, and output cell size, the location to save each function should be defined.
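The four EBF functions can be sketched for one conditioning factor from per-class pixel counts. The sketch assumes the commonly used data-driven estimators (class weights for event and non-event evidence, each normalized over the classes of the factor); it is not the code of the proposed tool, and the function name, dictionary layout, and sample counts are hypothetical.

```python
def evidential_belief(class_pixels, class_events):
    """Return {class: {"Bel", "Dis", "Unc", "Pls"}} for one factor.

    class_pixels: {class: N(C_ij)}, total pixels per class
    class_events: {class: N(L and C_ij)}, event pixels per class
    Assumes every class contains both event-free pixels and fewer
    event pixels than the total, so that no denominator is zero.
    """
    n_c = float(sum(class_pixels.values()))   # N(C), all pixels
    n_l = float(sum(class_events.values()))   # N(L), all event pixels

    w_event, w_no_event = {}, {}
    for cls, n_cij in class_pixels.items():
        n_l_cij = class_events.get(cls, 0)
        w_event[cls] = (n_l_cij / n_l) / ((n_cij - n_l_cij) / (n_c - n_l))
        w_no_event[cls] = ((n_l - n_l_cij) / n_l) / (
            (n_c - n_l - n_cij + n_l_cij) / (n_c - n_l)
        )

    sum_event = sum(w_event.values())
    sum_no_event = sum(w_no_event.values())
    result = {}
    for cls in class_pixels:
        bel = w_event[cls] / sum_event
        dis = w_no_event[cls] / sum_no_event
        result[cls] = {"Bel": bel, "Dis": dis,
                       "Unc": 1.0 - bel - dis, "Pls": 1.0 - dis}
    return result


# Hypothetical factor with three classes (1000 pixels, 40 event pixels)
ebf = evidential_belief({1: 400, 2: 350, 3: 250}, {1: 5, 2: 10, 3: 25})
```

In this example class 3 holds most of the event pixels in the smallest area and therefore receives the highest degree of belief.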

Code description
The code was written in Python 2.7, which is installed together with ArcGIS for Desktop (version 10.1 and later) rather than with Windows itself. At the start, the arcpy library is imported and the availability of the spatial extension is checked so that the process can continue. When the user defines the raster, the code reads the raster as text by using the GetParameterAsText command, which is part of the arcpy library. The same command is used to define the output layer for the chosen model. The default path for all sub-processes is set to the C drive because it is the default drive on most Windows systems. The code then creates a folder called FR_modeler, WOE_modeler, or EBF_modeler, depending on the selected process.
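The set-up stage described above can be sketched as follows. This is a hypothetical outline, not the published script: the parameter order follows the four fields of the FR interface, the helper names are invented, and the base directory is passed in explicitly instead of being fixed to the C drive so that the sketch also runs on systems without that drive. The arcpy calls used (`CheckOutExtension`, `env.overwriteOutput`, `GetParameterAsText`) only work inside an ArcGIS Python installation, so the import is guarded.

```python
import os
import tempfile

try:
    import arcpy  # only available inside an ArcGIS Python installation
except ImportError:
    arcpy = None

# Folder names taken from the text; the dictionary keys are illustrative
MODEL_FOLDERS = {"FR": "FR_modeler", "WOE": "WOE_modeler", "EBF": "EBF_modeler"}


def prepare_workspace(model, base_dir):
    """Create (if needed) and return the working folder of one model."""
    folder = os.path.join(base_dir, MODEL_FOLDERS[model])
    if not os.path.isdir(folder):
        os.makedirs(folder)
    return folder


def read_fr_parameters():
    """Read the four FR tool fields when run as an ArcGIS script tool."""
    if arcpy is None:
        raise RuntimeError("arcpy is not available outside ArcGIS")
    arcpy.CheckOutExtension("Spatial")  # Spatial Analyst licence
    arcpy.env.overwriteOutput = True    # allow the tool to be rerun
    return {
        "input_raster": arcpy.GetParameterAsText(0),
        "training_layer": arcpy.GetParameterAsText(1),
        "cell_size": arcpy.GetParameterAsText(2),
        "output_location": arcpy.GetParameterAsText(3),
    }


# Demonstration in a temporary directory instead of the C drive
demo_root = tempfile.mkdtemp()
fr_folder = prepare_workspace("FR", demo_root)
```

Keeping the base directory as an argument is a design choice of this sketch; a fixed drive letter would reproduce the behaviour described in the text.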
The next stage is to analyze the input layer (e.g. slope); the lookup command is applied to prepare the layer for the zonal geometry process. The zonal geometry output is the table on which the statistics of the output are computed. A field is added to the attribute table created in the previous step (i.e. the zonal table) and is used to calculate the percentage of each class of the input layer. A statistical analysis is applied to calculate the sum of all the pixels of the selected layer. Then, a join is defined to link the created table with the input layer. Subsequently, a tabulate area process is run to calculate the percentage of occurrence of the dependent variable (i.e. landslide) in each class of the input layer. The last step in calculating FR applies Eq. (1). The resulting value is then converted to an integer and used to reclassify the input layer. The code includes a delete command that removes all intermediate layers and tables.
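The last two steps, converting the computed weights to integers and reclassifying the input layer, can be sketched in plain Python. The scaling factor of 100 is an assumption of this sketch (the text states only that the value is converted to an integer), and the function names and sample values are hypothetical.

```python
def integer_weights(weight_by_class, scale=100):
    """Scale each class weight and round it to an integer.

    The scale factor is an assumption; it keeps two decimals of the
    original weight when the result is stored in an integer raster.
    """
    return {cls: int(round(w * scale)) for cls, w in weight_by_class.items()}


def reclassify(factor_classes, weight_by_class):
    """Replace every class label by the weight of its class."""
    return [weight_by_class[cls] for cls in factor_classes]


# Hypothetical FR values for a factor with two classes
fr_by_class = {1: 0.5, 2: 1.25}
weights = integer_weights(fr_by_class)
reclassified = reclassify([1, 2, 2, 1], weights)
```

In ArcGIS the second step corresponds to a reclassification of the raster with a remap table built from the class weights.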
The WoE and EBF processes begin with the same steps as FR. However, more statistical analyses and more fields are added to calculate the parameters of WoE and EBF, which are listed in Eqs. (2)-(8). For each selected model, a different folder is created. The user may overwrite and repeat the process as often as required because the overwriteOutput command is defined in each script. The flowcharts of the three algorithms are shown in Figs. 7, 8, and 9.

Test area and data
Although the developed program can be used in any application that employs BSA, the proficiency of the tool was tested in the hazard domain. To examine the capability and efficiency of the developed program, landslide susceptibility analyses were performed by using the developed ArcMAP tool with three BSA models, namely FR, WoE, and EBF. The program was tested for the landslide susceptibility mapping of Bukit Antarabangsa, Ulu Klang, Malaysia (Fig. 10).
A spatial database was constructed and analyzed on the basis of the altitude, aspect, curvature, slope, stream power index, topographic wetness index, distance from the river, distance from the road, and geological layers. A comprehensive overview of the usage of BSA for landslide susceptibility mapping has been reported in numerous studies (Yalcin et al., 2011). A study conducted by Mohammady et al. (2012) provided additional knowledge on the capabilities of these three BSA methods. This previous research compared the three methods of FR, WoE, and EBF and determined the pros and cons of each statistical approach. A total of 47 landslide locations were recorded and a landslide inventory map was prepared. The allocation of the landslide inventory for training and testing was 70 and 30 %, respectively (Fig. 10).
The training data set (31 landslide locations) was chosen randomly and a dependent layer (landslide layer) was created.
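The random selection of the training locations can be sketched as follows. The identifiers, the fixed seed, and the function name are assumptions made for reproducibility of the example; the counts (47 locations, 31 used for training) are those stated in the text.

```python
import random


def split_inventory(locations, n_train, seed=0):
    """Randomly split inventory locations into training and testing sets."""
    rng = random.Random(seed)   # fixed seed so the split can be repeated
    shuffled = list(locations)  # copy, so the input is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers for the 47 recorded landslide locations
landslide_ids = list(range(1, 48))
training_ids, testing_ids = split_inventory(landslide_ids, n_train=31)
```

The training locations would then be rasterized into the dependent layer, and the remaining locations kept aside for the prediction-rate assessment.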

Experimental results and discussion
To examine the efficiency of the developed bivariate statistical modeler (BSM) tool, landslide susceptibility maps were derived by using all three methods. The correlation between the conditioning factors and landslide occurrence was extracted. The landslide probability index was measured and classified by using a proper scheme. To produce a susceptibility map, the probability index should be partitioned into various classes. The quantile method was applied in the current research because it is widely used for classification. In the quantile classification method, each class has the same number of features. The method provided appropriate results when the created landslide susceptibility map was compared with the spatial distribution of landslide events. The acquired landslide conditioning factors are shown in Fig. 11.
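Quantile classification of the probability index can be sketched with a rank-based assignment, so that each class receives the same number of pixels. The number of classes (five) is an assumption of this sketch, and tied index values may be split between neighbouring classes; the function name and sample values are hypothetical.

```python
def quantile_classes(values, n_classes=5):
    """Assign each value to one of n_classes equal-count classes.

    Class 1 holds the lowest values and class n_classes the highest.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, idx in enumerate(order):
        classes[idx] = rank * n_classes // len(values) + 1
    return classes


# Hypothetical probability index for ten pixels
index_values = [0.9, 0.1, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6, 0.5, 1.0]
susceptibility = quantile_classes(index_values)
```

With ten pixels and five classes, each susceptibility class contains exactly two pixels.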
The landslide susceptibility map derived from WoE differs in appearance from the two other maps. Validation should be performed to determine which map is reliable. The area under the curve (AUC) was applied to examine the precision of the derived susceptibility maps (Pérez-Vega et al., 2012). The success rate values were 68, 63, and 76 % for FR, WoE, and EBF, respectively. Moreover, the prediction rates were 71, 75, and 80 % for FR, WoE, and EBF, respectively. EBF showed the highest accuracy among the three methods in terms of both success and prediction rates. The prediction rate for WoE was high but not as high as that for EBF. This result may be attributed to the greater proficiency and capability of EBF compared with WoE. Identifying a single best modelling method is not possible because any comparative study is limited and the best method for a specific data set depends strongly on the characteristics of that data set. Figure 12 illustrates the computed accuracies.
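The AUC calculation behind the success and prediction rates can be sketched as follows: pixels are ranked from the highest to the lowest probability index, the cumulative share of landslide pixels is accumulated against the cumulative share of the area, and the area under the resulting curve is obtained with the trapezoidal rule. The function names and sample data are hypothetical; the success rate would use the training landslides and the prediction rate the testing landslides.

```python
def rate_curve(index_values, event_flags):
    """Cumulative landslide share (y) against cumulative area share (x)."""
    order = sorted(range(len(index_values)),
                   key=lambda i: index_values[i], reverse=True)
    total_events = float(sum(event_flags))
    xs, ys = [0.0], [0.0]
    cumulative = 0
    for rank, idx in enumerate(order, start=1):
        cumulative += event_flags[idx]
        xs.append(rank / float(len(order)))
        ys.append(cumulative / total_events)
    return xs, ys


def area_under_curve(xs, ys):
    """Trapezoidal-rule area under a curve given by its points."""
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2.0
               for i in range(1, len(xs)))


# Hypothetical index values and landslide flags for ten pixels
index_values = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
landslides = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
xs, ys = rate_curve(index_values, landslides)
auc = area_under_curve(xs, ys)
```

An AUC of 0.5 corresponds to a map with no discriminating power, and values closer to 1 indicate that landslides are concentrated in the highest-ranked pixels.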
The design and interface of the developed tool show that BSA is simpler to execute with the proposed program than with manual calculation. The derived susceptibility maps and their AUC values suggest that the tool is precise and reliable. Previous research has established that, because of the nature of BSA, its results are less precise than those of machine learning and rule-based methods. Therefore, the measured accuracies are acceptable for these simple statistical methods.

Conclusions
To perform hazard studies, several requirements, such as constructing a precise spatial database, obtaining high-resolution imagery, and providing a reliable inventory map, should be fulfilled. Users can be confronted with a lack of appropriate and free tools to perform various analyses. This condition makes such studies complex and, in some cases, time-consuming. BSA is one of the fundamental methods in hazard mapping. Hence, developing a tool that manages a large number of factors with automatic statistical analysis and classification is necessary. Currently, the calculations are performed within separate software, and the results have to be entered in a GIS environment and used to reclassify each conditioning factor one after another. The proposed BSM tool can be used to automate the BSA procedure and to facilitate the generation of the probability index. BSM is developed as a tool in ArcGIS and is capable of performing the three BSA models of FR, WoE, and EBF. This tool can also manage a large number of conditioning factors with reduced calculation time, thus allowing the replication of various trials. A significant characteristic of BSM is the reclassification of the conditioning factors on the basis of the weights acquired from BSA. The GUI also allows the application of FR, WoE, and EBF without entering any Python code, thus helping the user in model operation. The tool was applied to landslide susceptibility mapping in Bukit Antarabangsa, Ulu Klang, Malaysia. All three methods were applied and landslide susceptibility maps were created. FR, WoE, and EBF acquired success rates of 68, 63, and 76 %, respectively. The AUC values for prediction rates are 71, 75, and 80 % for FR, WoE, and EBF, respectively.
In conclusion, the proposed tool can transform the BSA procedure into a simple and fast technique. This tool can assist scientists in performing statistical analyses for environmental and mathematical applications.

Figure 1. General design of the methodology and BSA tool.

Figure 3. Procedure to add the BSM tool in ArcGIS.

Figure 4. Graphic user interface of the FR tool.

Figure 5. Graphic interface of the WoE tool.

Figure 6. Graphic interface of the EBF tool.

Figure 7. Code flowchart of the FR model.

Figure 8. Code flowchart of the WoE model.

Figure 10. Location of the pilot study area for testing the proposed ArcMAP tool.

Figure 12. Graphic representation of the cumulative frequency diagram presenting the cumulative landslide occurrence (%; y axis) in landslide probability index rank (%; x axis): (a) success rate, and (b) prediction rate.