In this study, a probabilistic model, named BayGmmKda, is proposed for flood susceptibility assessment in a study area in central Vietnam. The new model is a Bayesian framework constructed from a combination of a Gaussian mixture model (GMM), radial-basis-function Fisher discriminant analysis (RBFDA), and a geographic information system (GIS) database. In the Bayesian framework, the GMM is used for modeling the data distribution of flood-influencing factors in the GIS database, whereas RBFDA is utilized to construct a latent variable that aims at enhancing the model performance. As a result, the posterior probabilistic output of the BayGmmKda model is used as the flood susceptibility index. Experimental results showed that the proposed hybrid framework is superior to other benchmark models, including the adaptive neuro-fuzzy inference system and the support vector machine. To facilitate the model implementation, a software program for BayGmmKda has been developed in MATLAB. The BayGmmKda program can accurately establish a flood susceptibility map for the study region. Accordingly, local authorities can overlay this susceptibility map onto various land-use maps for the purpose of land-use planning or management.

Flooding is one of the most destructive natural hazards, causing heavy losses of human life and property over immense spatial extents (Dottori et al., 2016; Komi et al., 2017). Recent statistics on flood damage for the period 1995–2015 show that flooding affected 109 million people around the globe per year (Alfieri et al., 2017) and killed more than 220 000 people (Winsemius et al., 2015). Although the frequency of flooding has decreased in several regions (i.e., in central Asia and America), flood occurrences have increased globally by 42 % (Hirabayashi et al., 2013).

Notably, Southeast Asia is one of the most heavily flood-damaged regions in the world due to monsoonal rainfalls and tropical hurricane patterns (Loo et al., 2015). Located in this region, Vietnam is a storm center of the western Pacific, and the nation has faced the destructive consequences of flooding in many of its provinces. In Vietnam, floods are often triggered by tropical cyclones. More than 71 % of Vietnam's population and 59 % of its total land area are susceptible to the impacts of these natural hazards (Tien Bui et al., 2016c). According to a report by Kreft et al. (2014), from 1994 to 2013 Vietnam endured an annual economic loss equivalent to USD 2.9 billion.

Additionally, flood occurrences in Vietnam are expected to rise rapidly in the near future due to increases in poorly planned infrastructure development and urbanization near watercourses, as well as increased deforestation and climate change. Hence, an accurate model for evaluating flood hazards is a crucial need for land-use planning as well as for establishing disaster mitigation strategies. Based on flood prediction models, flood-prone areas can be identified and mapped (Tien Bui et al., 2016c).

Needless to say, the identification of susceptible areas can significantly reduce flood damage to the national economy and human lives by avoiding infrastructure developments and densely populated settlements in highly flood-susceptible areas (Zhou et al., 2016). This identification also helps government agencies to issue appropriate flood management policies and to focus its limited financial resources on constructing large-scale flood defense infrastructure in areas that have great economic value but are highly susceptible to flood (Bubeck et al., 2012; Mason et al., 2010). Therefore, a tool for spatial flood modeling is of great usefulness.

To predict flood occurrence, conventional approaches require time series of meteorological and streamflow data at gauging stations (Machado et al., 2015). However, this is difficult for many areas in developing countries where no gauging stations are available. Therefore, new modeling approaches should be explored and investigated. Given these motivations, this study proposes a novel methodology designed for achieving a high prediction accuracy as well as deriving probabilistic evaluations of flood susceptibility on a regional scale. Accordingly, spatial prediction of flooding is carried out based on a statistical assumption that flooding in the future will occur under the same conditions that triggered them in the past (Tien Bui et al., 2016b). In this way, the flood prediction problem boils down to an on–off supervised classification task, where flood inventories are used to define the class of flood occurrence. Moreover, the class nonflood occurrence is derived from areas that have not yet been damaged by flooding. Consequently, spatial prediction of flooding within the study area is achieved based on the probability of pixels belonging to the class of flood occurrences. To yield probabilistic outputs of flood susceptibility, this study proposes a Bayesian framework established on the basis of an integration of a Gaussian mixture model (GMM) and the kernel Fisher discriminant analysis (KFDA). GMM is employed for density approximation to calculate the posterior probability of flood (flood susceptibility index); in addition, KFDA constructs a latent variable based on the geoenvironmental conditions to enhance the performance of the Bayesian model.

In essence, the proposed integrated framework contains two phases of analysis. RBFDA (i.e., KFDA equipped with a radial basis function kernel) is first employed for latent variable construction. The Bayesian approach assisted by GMM is then used to perform probabilistic pattern recognition. The first level performs pattern discriminant analysis tasks, and the second level carries out the prediction process to derive the model output of flood evaluation. Based on previous studies indicating that hierarchical model structures can produce improved prediction accuracy, the proposed framework could potentially bring about desirable flood assessment results. The subsequent parts of this study are organized in the following order: related works on flood prediction are summarized in Sect. 2. The next section introduces the research method of the current paper, followed by Sect. 4, which describes the proposed Bayesian model for flood susceptibility forecasting. Section 5 reports the model prediction accuracy and comparison. The last section presents some conclusions of this work.

Because of the criticality of flood prediction, this problem has gained increasing attention from the academic community. Following this trend, various flood analysis tools have been developed (Winsemius et al., 2013; Papaioannou et al., 2015; Gao et al., 2017; Alfieri et al., 2014). Basically, these tools can be classified into statistical analysis, rainfall–runoff models, and classification models. Statistical analysis uses long-term recorded time series data at gauged stations to establish regression models; accordingly, the constructed regression models are used to transfer flood information to ungauged basins (Yue et al., 1999; Cunnane, 1988; McCuen, 2016). Thus, these models are capable of providing discharge predictions both in space and time. However, long-term data are not always available; in many cases, they are too short for reliable estimation of extreme quantiles (Seckin et al., 2013b; Nguyen et al., 2014).

Rainfall–runoff models, which deal with the estimation of runoff from rainfall, are considered to be the most extensively used approach for flood prediction and management (Nayak et al., 2013; Ciabatta et al., 2016; Bennett et al., 2016). Various types of rainfall–runoff models can be found in the literature, ranging from empirical models to highly sophisticated physical-process models. Empirical models can be established based on statistical techniques (Brocca et al., 2011; Neal et al., 2013) or advanced machine learning algorithms (Lohani et al., 2011); such models can be effectively employed to analyze rainfall and runoff on the basis of historical time series data. In addition, physical-process models focus on simulating the hydrological processes in a basin based on a set of mathematical equations governing the physical processes of water flow and surfaces (Aronica et al., 2012; Chiew et al., 1993; Beven et al., 1984; Birkel et al., 2010; Grimaldi et al., 2013). In general, rainfall–runoff models require relatively long-term time series data at gauging stations. However, the density of gauging stations in developing countries is very low, and this fact creates a great obstacle to the establishment of accurate hydrological models (Fenicia et al., 2008). In addition, large-scale field work and deployments of measuring equipment are necessary for collecting data.

In recent years, a new flood modeling approach called “on–off” classification of flood occurrence has been proposed and successfully applied for spatial prediction of flood (alternatively called a flood susceptibility index; Tien Bui et al., 2016d; Tehrany et al., 2014, 2015b). Accordingly, no time series data are required for model calibration, and the establishment of flood models is based on flood inventories (flood class) and nonflood areas (nonflood class). The probability of a pixel in the study area belonging to the flood class is then used as the flood susceptibility index. It is noted that the results of the model depend on the collection of sufficient training data. Although the flood susceptibility map provides no temporal prediction or return period of flood, the map is capable of delineating highly susceptible areas. Thus, it is a powerful flood analysis tool for decision-makers that can be used in land-use planning and flood management.

The literature review shows that data-driven methods integrated with GIS databases have demonstrated their effectiveness and accuracy in large-scale flood susceptibility prediction. A fuzzy-logic-based algorithm, established by Pulvirenti et al. (2011), has been used to develop a map of flooded areas from synthetic aperture radar imagery; this algorithm is used in the operational flood management system in Italy. A model based on the frequency ratio approach and GIS for spatial prediction of flooded regions was first introduced by Lee et al. (2012); the spatial database was constructed from field surveys and maps of the topography, geology, land cover, and infrastructure.

Prediction models with artificial neural networks (ANNs) have been employed for flood susceptibility evaluation by various scholars (Kia et al., 2012; Seckin et al., 2013a; Rezaeianzadeh et al., 2014; Radmehr and Araghinejad, 2014); previous works have shown that an ANN is a capable nonlinear modeling tool. Nevertheless, ANN learning is prone to overfitting, and its performance has been shown to be inferior to that of support vector machines (SVMs; Hoang and Pham, 2016). Kazakis et al. (2015) introduced a multicriteria index to assess flood hazard areas that relies on GIS and analytical hierarchy processes (AHPs); in this methodology, the relative importance of each flood-influencing factor for the occurrence and severity of flood was determined via AHP. More recently, support-vector-machine-based flood susceptibility analysis approaches have been proposed by Tehrany et al. (2015a, b); the research finding is that SVM is more accurate than other benchmark models, including the decision tree classifier and the conventional frequency ratio model.

Mukerji et al. (2009) constructed flood forecasting models based on an adaptive neuro-fuzzy inference system (ANFIS) and a genetic-algorithm-optimized ANFIS; experiments demonstrated that ANFIS attained the most desirable accuracy. Recently, a metaheuristic-optimized neuro-fuzzy inference system, named MONF, was introduced by Tien Bui et al. (2016c); this research pointed out that MONF is more capable than decision tree, ANN, SVM, and conventional ANFIS methods.

As can be seen from the literature review, various data-driven and advanced soft-computing approaches have been proposed to construct different flood forecasting models. In most previous studies, the flood prediction was formulated as a binary pattern recognition problem in which the model output is either flood or no flood. Probabilistic models have rarely been examined to cope with the complexity as well as uncertainty of the problem under concern. Therefore, our research aims to enrich the body of knowledge by proposing a novel Bayesian probabilistic model to estimate the flood vulnerability with the use of a GIS database.

In this research, Tuong Duong district (central Vietnam) is selected as the study area (see Fig. 1). This is by far one of the most heavily flood-affected regions in the country (Reynaud and Nguyen, 2016). The area of the district is approximately 2803 km².

Location of the Tuong Duong district (central Vietnam).

The district has two distinct seasons, namely a cold season (from November to March) and a hot season (from April to October). The yearly rainfall of the district is within the range of 1679–3259 mm. Rainfall is concentrated in the rainy period, which contributes roughly 90 % of the total annual rainfall. Due to the district's location as well as its topographic and climatic features, the study area is highly susceptible to flood events with immense effects on the rate of human casualties and economic loss. An examination carried out by Reynaud and Nguyen (2016) reported that approximately 40 % of families have been affected by floods and roughly 20 % of families have had to be relocated away from the flooded areas; the average loss from flooding is up to 24 % of the family income each year.

Prediction of flood zones can be based on an assumption that future flood events are governed by the very similar conditions of flooded zones in the past. Therefore, flood inventories and the geoenvironmental conditions (e.g., topological and hydrological features) that produced them must be extensively determined and collected (Tien Bui et al., 2016c; Tehrany et al., 2015b). The first step of this analysis is to establish a flood inventory map for the region under investigation. In this study, the flood inventory map established by Tien Bui et al. (2016c) was used to analyze the relationships between flood occurrences and influencing factors.

The flood inventory map stores documentation of past flood events (see Fig. 1). It is noted that the floods in this study area are flash floods; this is the main flood type in the region due to the characteristics of the terrain. The map was constructed by gathering information about the study area, field work at flooded areas, and analyses of Landsat-8 Operational Land Imager (OLI) imagery (from 2010 to 2014) with a resolution of 30 m (retrieved from

Although the data for this study were collected from 2010 to 2014, recurrent flash floods occurred during tropical typhoons in this period. Thus, it is reasonable to conclude that all significant flash flood locations in the study area have been revealed and determined. It should be noted that, due to the statistical assumption used in this study, the inclusion of flood locations from the distant past (i.e., before 2009) in the flood susceptibility analysis may cause bias. This is because the construction of new hydropower dams such as Ban Ve (from 2010) and Nam Non (from 2011), as well as deforestation and forestation, have changed the geoenvironmental conditions in the study area (Dao, 2017; Manley et al., 2013). In other words, the geoenvironmental conditions of the distant past are very different from those of the present time; therefore, flood locations from the distant past should not be included in the current analysis.

Flood-influencing factors and their categories.

To construct a flood prediction model, besides the flood inventory map, it is crucial to determine the flood-influencing factors (Tehrany et al., 2015a). It is worth noting that the selection of flood-governing factors varies with the characteristics of the study area and the availability of data (Papaioannou et al., 2015). Based on the previous work of Tien Bui et al. (2016c), the physical relationships between influencing factors and flood processes have been analyzed. Accordingly, a total of 10 influencing factors were selected in this study; they include slope (IF

Flood-influencing factors:

The flood prediction in this study is considered a pattern classification problem within which “flood” and “nonflood” are the two class labels of interest. As a result, the probability (posterior probability) of pixels belonging to the flood class, which is derived from the model, will be used as the susceptibility index. These susceptibility indices of the pixels are then used to generate the flood susceptibility map. To cope with the complexity as well as the uncertainty of the problem of interest, a Bayesian framework is employed in this study to evaluate the flood susceptibility of each data sample. Figure 3 demonstrates the general concept of the Bayesian framework used for classification.

General concept of the Bayesian framework for flood classification.

The Bayesian framework provides a flexible way for probabilistic modeling.
This method features a strong ability for dealing with uncertainty and noisy
data (Theodoridis, 2015; Cheng and Hoang, 2016). Nevertheless, previous
studies have rarely examined the capability of this approach for inferring
flood susceptibility. Basically, pattern classification aims at assigning a
pattern to one of

Generally, the prior probabilities

It is noted that the posterior probability value (Eq. 1) for each pixel of the study area is used as the flood susceptibility index. To obtain the posterior probability, the class-conditional PDF must be estimated. This section presents how the PDF is estimated by a Gaussian mixture model. A GMM is selected in this research because it has been shown to be an effective parametric method for modeling data distributions, especially in high-dimensional spaces (McLachlan and Peel, 2000; Theodoridis and Koutroumbas, 2009). Previous studies (Paalanen, 2004; Figueiredo and Jain, 2002; Gómez-Losada et al., 2014; Arellano and Dahyot, 2016) point out that any continuous distribution can be approximated arbitrarily well by a finite mixture of Gaussian distributions. Due to their usefulness as a flexible modeling tool, GMMs have received an increasing amount of attention from the academic community (Zhang et al., 2016; Khanmohammadi and Chou, 2016; Ju and Liu, 2012).
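As a minimal illustration of how the posterior in Eq. (1) turns class-conditional densities into a susceptibility index, the sketch below uses two univariate Gaussians as placeholder flood and nonflood densities; the means, standard deviations, and priors are invented for illustration and are not the fitted GMMs of this study.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density, used as a stand-in class-conditional PDF."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def flood_posterior(x, prior_flood=0.5):
    """Posterior P(flood | x) via Bayes' rule: likelihood times prior,
    normalized over the flood and nonflood classes."""
    p_flood = gaussian_pdf(x, mu=0.7, sigma=0.15) * prior_flood
    p_nonflood = gaussian_pdf(x, mu=0.3, sigma=0.15) * (1 - prior_flood)
    return p_flood / (p_flood + p_nonflood)
```

A pixel whose factor value is better explained by the flood density receives a posterior close to 1, and this value serves directly as its susceptibility index.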

In a

A GMM is, in essence, an aggregation of several multivariate normal
distributions; hence, its PDF for each data sample is computed as a weighted
summation of Gaussian distributions (see Fig. 4):

Structure of a Gaussian mixture model.

Accordingly, the PDF for all data samples can be expressed as follows (Ju and
Liu, 2012):
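Since the equation itself is not reproduced above, the weighted-sum mixture density can be sketched numerically as follows; this is a generic multivariate GMM density, and the parameter values used in any call are illustrative rather than fitted.

```python
import numpy as np

def gmm_pdf(x, weights, means, covs):
    """Mixture density p(x) = sum_k w_k N(x; mu_k, Sigma_k) for one
    d-dimensional sample x, evaluated component by component."""
    x = np.asarray(x, dtype=float)
    d = x.size
    total = 0.0
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        inv = np.linalg.inv(cov)
        norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
        total += w * norm * np.exp(-0.5 * diff @ inv @ diff)
    return total
```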

Identifying a GMM's parameters

Practically, instead of dealing with the log-likelihood function, an
equivalent objective function

In order to compute

The expectation maximization (EM) method is a statistical approach to fit a
GMM based on historical data; this method converges to a maximum likelihood
estimate of model parameters (McLachlan and Krishnan, 2008). It can be
recapitulated as follows (McLachlan and Peel, 2000). Commencing from an
initial parameter

These two steps of the EM procedure are stated as follows: (i) E step:
estimating the expected classes of all data samples for each
class

The EM algorithm increases the log-likelihood iteratively until convergence is detected, and this approach can generally derive a good set of estimated parameters. Nonetheless, EM suffers from slow convergence on some datasets, high sensitivity to the initialization conditions, and suboptimal estimated solutions (Biernacki et al., 2003). Moreover, additional effort is required to determine an appropriate number of Gaussian distributions within the mixture.
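The E and M steps described above can be sketched for a univariate mixture as follows; this is a didactic re-implementation under simplifying assumptions (quantile-based initialization and a fixed iteration count), not the MATLAB routine used in this study.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100):
    """Plain EM for a univariate GMM with k components."""
    w = np.full(k, 1.0 / k)                            # mixing weights
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)      # spread initial means over the data
    var = np.full(k, np.var(x) / k)                    # initial variances
    for _ in range(n_iter):
        # E step: responsibilities r[n, j] = P(component j | x_n)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * dens
        r /= r.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```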

As an attempt to alleviate such drawbacks of EM, Figueiredo and Jain (2002) put forward an unsupervised algorithm for learning a GMM from multivariate data. The algorithm features the capability of identifying a suitable number of Gaussian components autonomously, and through experiments the authors show that the algorithm is not sensitive to initialization. In other words, this unsupervised approach incorporates the tasks of model estimation and model selection into a unified algorithm. Generally, this method can start with a large number of components. The initial values for the component means can be assigned to all data points in the training set; in an extreme case, it is possible to set the number of components equal to the number of data points. The algorithm gradually fine-tunes the number of mixture components by casting out Gaussian components that are irrelevant to the data modeling process (Paalanen, 2004).
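The component-annihilation idea can be sketched as below: start with many components and discard any whose mixing weight decays below a threshold. This only mimics the spirit of the Figueiredo–Jain algorithm; the true method annihilates components using an MML criterion rather than the fixed weight threshold `tau` assumed here.

```python
import numpy as np

def em_gmm_prune(x, k_max=10, tau=0.02, n_iter=200):
    """EM for a univariate GMM, starting with k_max components and pruning
    components whose mixing weight falls below tau (a simplified proxy for
    MML-based component annihilation)."""
    mu = np.quantile(x, (np.arange(k_max) + 0.5) / k_max)
    var = np.full(k_max, np.var(x) / k_max)
    w = np.full(k_max, 1.0 / k_max)
    for _ in range(n_iter):
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * dens
        r /= r.sum(axis=1, keepdims=True)
        nk = r.sum(axis=0)
        w, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
        keep = w > tau                      # annihilate near-empty components
        w, mu, var = w[keep] / w[keep].sum(), mu[keep], var[keep]
    return w, mu, var
```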

Furthermore, Figueiredo and Jain (2002) employed the minimum message length
(MML) criterion (Wallace and Dowe, 1999) as an index for model selection; the
application of this criterion for the case of GMM learning leads to the
following objective function (Figueiredo and Jain, 2002):

In detail, the EM algorithm is employed to estimate

Accordingly, the parameters

The established GIS database.

In machine learning, the performance of a model may be enhanced if latent variables are used (Yu, 2011). Therefore, a latent variable approach is employed in this research. Accordingly, radial-basis-function Fisher discriminant analysis (RBFDA), proposed by Mika et al. (1999) as an extension of Fisher discriminant analysis for dealing with data nonlinearity, is used to generate a latent factor for flood analysis. Thus, RBFDA is utilized to project the features from the original learning space to a projected space that expresses a high degree of class separability (Theodoridis and Koutroumbas, 2009). Using this kernel technique, the data from an input space

Herein,

To obtain

Since a solution of the vector

From Eqs. (17) and (19), we have the following:

Taking into account the formulas of

Based on Eq. (17), which defines

Considering all Eqs. (14), (21), and (22), the solution of RBFDA can
be found by maximizing the following:

The solution of the optimization problem with the objective function expressed in Eq. (23) is found by identifying the principal eigenvector of
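Although the intermediate equations are only partially reproduced above, a generic two-class kernel Fisher discriminant with an RBF kernel can be sketched as follows. Here the regularized within-class matrix is inverted directly against the difference of kernel class means (the shortcut noted by Mika et al., 1999) instead of solving the eigenproblem explicitly; the function names and the `gamma` and `reg` values are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row-sample arrays A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kfda_fit(X, y, gamma=1.0, reg=1e-3):
    """Kernel Fisher discriminant: returns expansion coefficients alpha so
    that the latent score of a point x is sum_i alpha_i k(x_i, x)."""
    K = rbf_kernel(X, X, gamma)
    idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
    m0, m1 = K[:, idx0].mean(1), K[:, idx1].mean(1)   # kernel class means
    N = np.zeros_like(K)                              # within-class scatter
    for idx in (idx0, idx1):
        Kc = K[:, idx]
        l = len(idx)
        N += Kc @ (np.eye(l) - np.full((l, l), 1.0 / l)) @ Kc.T
    # alpha proportional to (N + reg*I)^{-1} (m1 - m0)
    return np.linalg.solve(N + reg * np.eye(len(X)), m1 - m0)

def kfda_score(X_train, alpha, X_new, gamma=1.0):
    """Latent factor for new samples: projection onto the Fisher direction."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```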

The proposed BayGmmKda.

To formulate a flood assessment model, the first stage is to construct a GIS database (see Fig. 5) within which locations of past flood events, maps of topographic features, Landsat-8 imagery, maps of geological features, and precipitation statistical records are acquired and integrated. In this study, the data acquisition, processing, and integration were performed with the ArcGIS (version 10.2) and IDRISI Selva (version 17.01) software packages.

Furthermore, a C

The proposed model for flood susceptibility assessment, which incorporates RBFDA, the Bayesian classification framework, and GMM, is presented in this section of the study. The overall flowchart of the proposed Bayesian framework based on GMM and RBFDA for flood susceptibility prediction, named BayGmmKda, is demonstrated in Fig. 6.

Firstly, the whole dataset, including 152 data samples, was separated into two sets: a training set (90 %, or 137 samples), employed for establishing the model, and a testing set (10 %, or 15 samples), used for model testing. It is noted that the input variables of the dataset were normalized using minimum–maximum normalization; the purpose of data normalization was to guard against unbalanced variable magnitudes.
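The 90/10 split and the minimum–maximum normalization can be sketched as follows; this is a generic re-implementation, and the helper names and random seed are ours, not the paper's.

```python
import numpy as np

def minmax_normalize(X):
    """Rescale each influencing factor (column) to [0, 1] so that no
    variable dominates purely through its magnitude."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def train_test_split(X, y, test_frac=0.1, seed=0):
    """Random split matching the paper's 90 % training / 10 % testing
    proportions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_frac * len(X)))
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```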

Secondly, a latent input factor was generated using RBFDA (explained in Sect. 3.4) and added to the training dataset, with the aim of enhancing the classification performance. Subsequently, feature evaluation was performed to quantify the degree of relevance of each input factor to the flood inventories in the training set. Any nonrelevant factor should be eliminated from the modeling process to reduce noise and enhance the model performance (Tien Bui et al., 2016a, 2017). For this purpose, the mutual information criterion (Kwak and Choi, 2002; Hoang et al., 2016), a widely employed technique for feature selection in machine learning, was selected to express the pertinence of each influencing factor to flooding. It is noted that the larger the mutual information, the stronger the relevance between the influencing factor and flooding.
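A simple histogram-based estimate of the mutual information between a continuous influencing factor and the binary flood label can be sketched as below; the bin count is an assumption of this sketch, and the paper's exact estimator (Kwak and Choi, 2002) may differ.

```python
import numpy as np

def mutual_information(factor, label, bins=10):
    """Histogram estimate of I(factor; label) in nats; larger values
    indicate a stronger relevance of the factor to flood occurrence."""
    joint, _, _ = np.histogram2d(factor, label, bins=(bins, 2))
    p = joint / joint.sum()                       # empirical joint distribution
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (px @ py)[mask])).sum())
```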

In the next step, the BayGmmKda model was trained and established using the
training set. The purpose of the training process was to find the best
parameters for the mixture component (

Using the best

It is noted that the coupling of the GMM with the EM training algorithm is implemented with the MATLAB statistical toolbox (MathWorks, 2012a); meanwhile, BayGmmKda performs the unsupervised algorithm with the program code provided by Mário A. T. Figueiredo (

Main menu of BayGmmKda.

As shown in Fig. 7, the program consists of three modules: data process and visualization, model training, and model prediction. The first module provides basic functions for data inspection and visualization, including data normalization, data viewing, and preliminary feature selection with mutual information. In the second module, the users simply provide model parameters, including the kernel function parameter and the GMM training method. The trained model is employed to carry out prediction tasks in the third module, within which the model prediction performance is reported.

The outcome of the preliminary examination on the pertinence of flood-influencing factors is reported in Fig. 8a. As mentioned earlier, the
relevancies of influencing factors are exhibited by the mutual information
criterion. Based on the outcome, IF

It is worth keeping in mind that the BayGmmKda training phase is executed in two consecutive steps: training RBFDA and training GMM. RBFDA analyzes the data in the training set to establish a latent factor, which is a one-dimensional representation of the original input pattern. Figure 8b shows the resulting latent factor constructed by RBFDA. In the next step of the training phase, the GMM is constructed from the original input patterns (consisting of 10 input factors), the RBFDA-based latent factor, and their corresponding class labels.

The classification accuracy rate (CAR) is employed to exhibit the rate of
correctly classified instances. In addition, a more detailed analysis on the
model capability can be presented by calculating true positive rate (TPR),
false positive rate (FPR), false negative rate (FNR), and true negative rate
(TNR). These four rates are also widely utilized to exhibit the predictive
capability of a prediction model (Hoang and Tien-Bui, 2016).

In addition to the four rates, the receiver operating characteristic (ROC) curve (van Erkel and Pattynama, 1998) is used to summarize the global performance of the model. The ROC curve basically demonstrates the trade-off between the aforementioned TPR and FPR as the threshold for accepting the positive class of flood varies. In addition, the area under the ROC curve (AUC) is employed to quantify the global performance. In general, a better model is characterized by a larger AUC value.
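The AUC can be computed directly from the rank (Mann–Whitney) formulation: it equals the probability that a randomly chosen flood sample receives a higher score than a randomly chosen nonflood sample. A minimal sketch:

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive (flood) and negative
    (nonflood) scores; ties contribute one half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

This O(n²) pairwise form is fine for small validation sets; a sort-based version would be preferred for full susceptibility maps.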

As aforementioned, the dataset is randomly separated into a training set and a testing set, which occupy 90 and 10 % of the data samples, respectively. The training set is employed to train the model; meanwhile, the testing set is used for validating the model capability after training. Since one selection of data for the training set and the testing set may not truly demonstrate the model's predictive capability, this study carries out a repetitive subsampling procedure within which 30 experimental runs are carried out. In each experimental run, 10 % of the dataset is retrieved in a random manner from the database to constitute the testing set; the rest of the database is included in the training set.
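The repeated subsampling procedure can be sketched as follows; `fit_predict` stands for any trained classifier (here supplied by the caller), and the seed and function names are assumptions of this sketch.

```python
import numpy as np

def repeated_subsampling(X, y, fit_predict, n_runs=30, test_frac=0.1, seed=0):
    """Each run holds out a fresh random 10 % of the samples for testing and
    records the classification accuracy rate (CAR, in %); the mean and
    standard deviation over runs summarize the model's capability."""
    rng = np.random.default_rng(seed)
    cars = []
    for _ in range(n_runs):
        idx = rng.permutation(len(X))
        n_test = int(round(test_frac * len(X)))
        te, tr = idx[:n_test], idx[n_test:]
        pred = fit_predict(X[tr], y[tr], X[te])
        cars.append(np.mean(pred == y[te]) * 100.0)
    return np.mean(cars), np.std(cars)
```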

The testing performance of the proposed Bayesian framework for flood
susceptibility is reported in Table 2 and Fig. 9, which provides the
average ROC curves of the proposed model framework, obtained from the random
subsampling process, with two methods of GMM training. Herein, the two
Bayesian models that employ the EM algorithm and the unsupervised learning (UL)
algorithm for training GMM are denoted as BayGmmKda-EM and BayGmmKda-UL,
respectively. It can be seen that the BayGmmKda-UL model demonstrates clearly
better predictive performance (CAR

ROC plots of the proposed BayGmmKda.

Prediction results of BayGmmKda.

Because this is the first time the BayGmmKda model has been proposed for the measurement of flood susceptibility, the validity of the proposed model should be assessed. Hence, benchmarks were used for the comparison, including the support vector machine, the adaptive neuro-fuzzy inference system, and the GMM-based Bayesian classifier. These machine learning techniques were selected because SVM and ANFIS have recently been verified to be effective tools for predicting flood susceptibility (Tien Bui et al., 2016c; Tehrany et al., 2015b). It is noted that the GMM-based Bayesian classifier (BayGmm) is the Bayesian framework for classification which employs GMM for density estimation; however, BayGmm is not integrated with the RBFDA algorithm. BayGmm is used in the performance comparison to confirm the advantage of the newly constructed BayGmmKda and to verify the usefulness of RBFDA in enhancing the discriminative capability of the hybrid framework.

To construct the SVM model, the model's hyperparameters of the regularization
constant (

Performance comparison of the BayGmmKda model with the three benchmarks, the SVM model, the ANFIS model, and the BayGmm model.

It is noted that random subsampling with 30 runs is employed for all models in this experiment. The comparison between the proposed BayGmmKda model and the three benchmark models is shown in Table 3. The results show that the proposed model yields the best results (CAR

Model comparison based on the Wilcoxon signed-rank test.

The flood susceptibility map using the proposed BayGmmKda model for the study area.

To confirm that the performance of the proposed BayGmmKda model is significantly higher than that of the three benchmark models, the Wilcoxon signed-rank test is employed. The Wilcoxon signed-rank test is widely used to evaluate whether the classification outcomes of prediction models are significantly dissimilar (Tien Bui et al., 2016e). Using this test, the
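The paired comparison can be sketched with SciPy's implementation of the signed-rank test; the CAR figures below are synthetic illustrations, not the results reported in this study.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired CAR values (%) from 30 subsampling runs of two hypothetical models.
rng = np.random.default_rng(0)
car_a = 90.0 + rng.normal(0.0, 2.0, 30)
car_b = car_a - 3.0 + rng.normal(0.0, 1.0, 30)   # consistently worse model

stat, p_value = wilcoxon(car_a, car_b)
significant = p_value < 0.05   # reject equality at the 5 % significance level
```

The test is paired per run, so each model is compared on exactly the same random splits.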

Experimental outcomes have indicated that the BayGmmKda model is the best for this study area, and therefore the model was used to compute the posterior probability for all pixels of the study area. The posterior probability values, used as flood susceptibility indices, were further transformed to a raster format and opened in the ArcGIS 10.4 software package. Using these indices, the flood susceptibility map (see Fig. 10) was derived and visualized by means of five classes: very high (10 %), high (10 %), moderate (10 %), low (20 %), and very low (50 %). The threshold values for separating these classes were determined by overlaying the historical flood locations on the map of flood susceptibility indices (Tien Bui et al., 2016c); a graphical curve (see Fig. 10) was then constructed and the threshold values were derived.
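Under the simplifying assumption that the five classes are cut at plain area percentiles (the study instead derives its thresholds from the overlay curve in Fig. 10), the class assignment can be sketched as:

```python
import numpy as np

def susceptibility_classes(indices):
    """Split flood susceptibility indices into five area-based classes:
    very low (bottom 50 %), low (next 20 %), then moderate, high, and
    very high (top 10 % each)."""
    cuts = np.quantile(indices, [0.50, 0.70, 0.80, 0.90])
    names = np.array(["very low", "low", "moderate", "high", "very high"])
    return names[np.searchsorted(cuts, indices, side="right")]
```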

Interpretation of the map shows that 10 % of the Tuong Duong district was classified into the very high class, and this class covers 73.68 % of the total historical flood locations. Meanwhile, the high and moderate classes each cover 10 % of the region but account for only 15.79 and 7.9 % of the total historical flood locations, respectively, whereas the low class covers 20 % of the district but contains only 2.63 % of the total historical flood locations. In particular, the 50 % of the district categorized into the very low class contains no flood locations. These results indicate that the proposed BayGmmKda model has successfully delineated flood-prone areas. In other words, the interpretation results confirm the reliability of the proposed Bayesian framework.

This research has developed a new tool, named BayGmmKda, for flood susceptibility evaluation, with a case study in a high-frequency flood area in central Vietnam. The newly constructed model is a Bayesian framework that combines GMM and RBFDA for spatial prediction of flooding. A GIS database was established to train and test the BayGmmKda method. The training phase of BayGmmKda consists of two steps: (i) discriminant analysis with RBFDA, in which a latent factor is generated, and (ii) density estimation using GMM. After the training phase, the Bayesian framework is employed to compute the posterior probability, which is then used as the flood susceptibility index. Furthermore, a MATLAB program with a GUI has been developed to ease the implementation of the BayGmmKda model in flood susceptibility assessment.
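The two-step training scheme and the posterior computation can be sketched as follows. This is a simplified Python stand-in, not the paper's MATLAB implementation: scikit-learn's `LinearDiscriminantAnalysis` substitutes for RBFDA when generating the latent factor, and `GaussianMixture` estimates the class-conditional densities.

```python
# Sketch of the BayGmmKda training scheme: (i) a discriminant score is
# appended as a latent feature (LDA stands in for the paper's RBFDA);
# (ii) one GMM per class models p(x | class); Bayes' rule gives p(flood | x).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.mixture import GaussianMixture

def fit_baygmm(X, y, n_components=3):
    # Step (i): latent discriminant factor appended as an extra feature.
    da = LinearDiscriminantAnalysis().fit(X, y)
    X_aug = np.column_stack([X, da.decision_function(X)])
    # Step (ii): class-conditional density estimation with GMMs.
    gmms, priors = {}, {}
    for c in np.unique(y):
        gmms[c] = GaussianMixture(n_components=n_components,
                                  random_state=0).fit(X_aug[y == c])
        priors[c] = float(np.mean(y == c))
    return da, gmms, priors

def posterior_flood(X, da, gmms, priors, flood_class=1):
    # Bayes' rule: p(flood | x) proportional to p(x | flood) * p(flood).
    X_aug = np.column_stack([X, da.decision_function(X)])
    likes = {c: np.exp(g.score_samples(X_aug)) * priors[c]
             for c, g in gmms.items()}
    return likes[flood_class] / sum(likes.values())
```

The posterior returned for each pixel plays the role of the flood susceptibility index described above.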

It is noted that in this study the GMM training was performed with two methods: the EM algorithm and an unsupervised learning approach. Furthermore, a repeated subsampling process with 30 experimental runs was carried out to evaluate the models' predictive performance. This subsampling process, verified by a statistical test, confirms that the GMM trained by the unsupervised learning approach attains better prediction accuracy than the GMM trained by the EM algorithm. Therefore, this method of GMM learning is strongly recommended for other studies in the same field.
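The repeated subsampling evaluation can be sketched as below. Two caveats: scikit-learn's `GaussianMixture` is EM-based, so the BIC-driven selection of the number of components used here is only a rough stand-in for the paper's unsupervised learning approach, and the split ratio and run count are the paper's 30-run setup but applied to arbitrary data.

```python
# Sketch: 30-run repeated subsampling evaluation of a GMM classifier.
# BIC-based selection of the component count stands in for the paper's
# unsupervised GMM learning approach; sklearn's GaussianMixture uses EM.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

def gmm_bic_select(X, max_components=5):
    """Fit GMMs with 1..max_components and keep the lowest-BIC model."""
    models = [GaussianMixture(k, random_state=0).fit(X)
              for k in range(1, max_components + 1)]
    return min(models, key=lambda m: m.bic(X))

def subsampling_runs(X, y, n_runs=30, test_size=0.3):
    accs = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=run)
        classes = np.unique(y_tr)
        gmms = [gmm_bic_select(X_tr[y_tr == c]) for c in classes]
        # Classify each test sample by the class with the highest log-density.
        log_liks = np.column_stack([g.score_samples(X_te) for g in gmms])
        preds = classes[np.argmax(log_liks, axis=1)]
        accs.append(np.mean(preds == y_te))
    return np.array(accs)
```

The resulting accuracy array is exactly the kind of paired sample that the Wilcoxon signed-rank test compares across models.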

Furthermore, the experiments demonstrate that the latent factor created by RBFDA substantially boosts the classification accuracy of the BayGmmKda model. This improvement stems from the model's integrated learning structure: as described earlier, the classification task is performed by a hybridization of discriminant analysis and a Bayesian framework. The Bayesian model carries out the classification by considering both the patterns in the original dataset and the additional factor produced by the discriminant analysis. As a result, the performance of the BayGmmKda model surpasses that of the three benchmarks (SVM, ANFIS, and BayGmm).

The main limitation of this work is that BayGmmKda is a data-driven tool; therefore, fieldwork and GIS-based geoenvironmental data are necessary for the model construction phase, and this data collection and analysis can be time-consuming. In addition, the grid search procedure used for hyperparameter setting in the BayGmmKda model requires a high computational cost, especially for large-scale datasets. Furthermore, the outcome of this grid search may not be optimal; more advanced model selection approaches, e.g., metaheuristic optimization algorithms, could be utilized to further improve the model accuracy.
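To illustrate the computational cost involved, the sketch below runs a cross-validated grid search with scikit-learn's `GridSearchCV` on an RBF-kernel SVM (one of the paper's benchmark models); the parameter grid and the synthetic data are illustrative, not the paper's actual settings.

```python
# Sketch: grid-search hyperparameter tuning for an RBF-kernel SVM benchmark.
# The grid and data are illustrative; cost scales as
# (number of grid combinations) x (CV folds) x (training cost per fit).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

param_grid = {
    "C": [0.1, 1, 10, 100],   # regularization strength (illustrative values)
    "gamma": [0.01, 0.1, 1],  # RBF kernel width (illustrative values)
}
# 12 parameter combinations x 5 folds = 60 model fits for this small grid.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
```

Even this modest grid requires 60 fits; denser grids on large raster datasets quickly become expensive, which is what motivates the metaheuristic alternatives mentioned above.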

Despite these limitations, the proposed BayGmmKda model, featuring high predictive accuracy and the capability of delivering probabilistic outputs, is a promising alternative for flood susceptibility prediction. Future extensions of this research may include applying the model to other study areas, investigating additional flood-influencing factors (e.g., streamflow and antecedent soil moisture, which may be relevant for flood analysis), and improving the current model with other soft-computing methods, e.g., feature selection, pattern classification, and dimension reduction, to alleviate the aforementioned drawbacks and enhance the model performance.

The MATLAB code of the BayGmmKda model is given in the Supplement.

The dataset used in this research is given in the Supplement.

The authors declare that they have no conflict of interest.

This research was partially supported by the Department of Business and IT, School of Business, University College of Southeast Norway. Data for this research are from project no. B2014-02-21 and were provided by Quoc-Phi Nguyen (Hanoi University of Mining and Geology, Vietnam).

Edited by: Jeffrey Neal
Reviewed by: two anonymous referees