Complex, software-intensive, technically advanced, and computationally demanding models, presumably with ever-growing realism and fidelity, have been widely used to simulate and predict the dynamics of the Earth and environmental systems. The parameter-induced simulation crash (failure) problem is typical across most of these models despite considerable efforts that modellers have directed at model development and implementation over the last few decades. A simulation failure mainly occurs due to the violation of numerical stability conditions, non-robust numerical implementations, or errors in programming. However, the existing sampling-based analysis techniques such as global sensitivity analysis (GSA) methods, which require running these models under many configurations of parameter values, are ill-equipped to deal effectively with model failures. To tackle this problem, we propose a new approach that allows users to cope with failed designs (samples) when performing GSA without rerunning the entire experiment. This approach treats model crashes as missing data and uses strategies such as median substitution, single nearest-neighbor, or response surface modeling to fill in for model crashes. We test the proposed approach on a 10-parameter HBV-SASK (Hydrologiska Byråns Vattenbalansavdelning modified by the second author for educational purposes) rainfall–runoff model and a 111-parameter Modélisation Environmentale–Surface et Hydrologie (MESH) land surface–hydrology model. Our results show that response surface modeling is the superior strategy among the data-filling strategies tested and scales well with the dimensionality of the model, the sample size, and the ratio of the number of failures to the sample size. Further, we conduct a “failure analysis” and discuss some possible causes of the MESH model failure that can be used for future model improvement.

Since the start of the digital revolution and subsequent increases in computer processing power, the advancement of information technology has led to the significant development of modern software programs for dynamical Earth system models (DESMs). Current-generation DESMs typically span upwards of several thousand lines of code and require huge amounts of data and computer memory. The flip side of this growing complexity is that running these models can pose many software development and implementation issues, such as simulation crashes and failures. The simulation crash problem arises mainly from violation of the numerical stability conditions needed in DESMs. Certain combinations of model parameter values, an improper integration time step, inconsistent grid resolution, or lack of iterative convergence, as well as model thresholds and sharp discontinuities in model response surfaces, all associated with imperfect parameterizations, can cause numerical artifacts and stop DESMs from functioning properly.

When model crashes occur, completing automated sampling-based model analyses such as sensitivity analysis, uncertainty analysis, and optimization becomes challenging. These analyses are often carried out by running DESMs for a large number of parameter configurations randomly sampled from a domain (parameter space) (see, e.g., Raj et al., 2018; Williamson et al., 2017; Metzger et al., 2016; Safta et al., 2015). In such situations, for example, the model's solver may break down because of implausible combinations of parameters (the “unlucky parameter set” as termed by Kavetski et al., 2006), failing to complete the simulation. It is also possible that a model will be stable against the perturbation of a single parameter yet crash when several parameters are perturbed simultaneously. “Failure analysis” is the process of determining the causes that have led to such crashes while running DESMs. Before drawing a conclusion on the most important causes of crashes, it is necessary to check the software code of the DESMs and confirm that it is error-free (e.g., that a proper numerical scheme has been adopted and correctly coded in the software). This often requires investigating both the software documentation and a series of nested modules. However, the existence of numerous nested programming modules in typical DESMs can make the identification and removal of all software defects tedious. In addition, as argued by Clark and Kavetski (2010), the numerical solution schemes implemented in DESMs are sometimes not presented in detail. This is one important reason why detecting the causes of simulation crashes in DESMs is usually troublesome. For example, Singh and Frevert (2002) and Burnash (1995) described the governing equations of their models without explaining the numerical solvers implemented in their codes.

Importantly, the impact of simulation crashes on the validity of global sensitivity analysis (GSA) results has often been overlooked in the literature, wherein simulation crashes have been commonly classified as ignorable (see Sect. 1.2). As such, a surprisingly limited number of studies have reported simulation crashes (examples related to uncertainty analysis include Annan et al., 2005; Edwards and Marsh, 2005; Lucas et al., 2013). This is despite the fact that these crashes can be very computationally costly for GSA algorithms because they can waste the rest of the model runs, prevent the completion of GSA, or inevitably introduce ambiguity into the inferences drawn from GSA. For example, Kavetski and Clark (2010) demonstrated how numerical artifacts could contaminate the assessment of parameter sensitivities. Therefore, it is important to devise solutions that minimize the effect of crashes on GSA. In the next subsection, we critically review the very few strategies for handling simulation crashes that have been proposed in the literature and identify their shortcomings.

We have identified, as outlined below, four types of approaches in the
modeling community to handle simulation crashes. The first two are perhaps
the most common approaches (based on our personal communications with
several modellers); however, we could not identify any publication that
formally reports their application.

After the occurrence of a crash, modellers commonly adopt a conservative strategy to address this problem by altering or reducing the feasible ranges of parameters and restarting the experiment in the hope of preventing a recurrence of the crashes in the new analyses.

Instead of GSA, which requires running the model under many configurations of parameter values, analysts may choose to employ local methods such as local sensitivity analysis (LSA), running the model only near known plausible parameter configurations.

Some modellers may adopt an ignorance-based approach by using only a set of “good” (or behavioral) outcomes and responses in sampling-based analyses and ignoring unreasonable (or non-behavioral) outcomes such as simulation crashes. This can be done in conjunction with defining a performance metric to choose which simulations to exclude from the analysis (see, e.g., Pappenberger et al., 2008; Kelleher et al., 2013).

The most rigorous approach seems to be a non-substitution approach that tries to predict whether or not a set of parameter values will lead to a simulation crash. Webster et al. (2004), Edwards et al. (2011), Lucas et al. (2013), Paja et al. (2016), and Treglown (2018) are among the few studies that aimed at developing statistical methods to predict if a given combination of parameters can cause a failure. For example, Lucas et al. (2013) adopted a machine-learning method to estimate the probability of crash occurrence as a function of model parameters. They further applied this approach to investigate the impact of various model parameters on simulation failures. A similar approach is based on model preemption strategies, in which the simulation performance is monitored while the model is running and the model run is terminated early if it is predicted that the simulation will not be informative (Razavi et al., 2010; Asadzadeh et al., 2014).

Locating the regions of the parameter space responsible for crashes (i.e., “implausible regions”) is difficult and requires analyzing the behavior of the DESMs throughout the often high-dimensional parameter space. Implausible regions usually have irregular, discontinuous, and complex shapes and are thus laborious to identify. Additionally, altering or reducing the parameter space by excluding the implausible regions changes the original problem at hand.

It is well known that local methods (e.g., LSA) can provide inadequate assessments that can often be misleading (see, e.g., Saltelli and Annoni, 2010; Razavi and Gupta, 2015).

Ignoring the crashed runs in GSA may only be seen as relevant when using purely random (and independent) samples (i.e., Monte Carlo method). In such cases, if the model crashes at a given parameter set, one may simply exclude that parameter set or generate another random parameter set (at the expense of increased computational cost) that results in a successful simulation.
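For purely random sampling, this resample-on-crash idea can be sketched as follows (a minimal illustration, not code from any of the cited studies; `model` is a hypothetical function that returns NaN when a run crashes):

```python
import numpy as np

def mc_sample_with_resampling(model, n, k, rng=None, max_tries=100):
    """Pure Monte Carlo sampling in the unit hypercube where a crashed run
    (model returns NaN) is simply replaced by a fresh random draw, at the
    expense of extra model evaluations."""
    rng = np.random.default_rng(rng)
    X, y = [], []
    while len(y) < n:
        for _ in range(max_tries):
            x = rng.random(k)                 # new independent parameter set
            out = model(x)
            if np.isfinite(out):              # keep only successful runs
                X.append(x)
                y.append(out)
                break
        else:
            raise RuntimeError("too many consecutive crashes")
    return np.array(X), np.array(y)
```

Because each draw is independent, discarding a crashed point does not bias the remaining sample; this is exactly what fails for the structured designs discussed next.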

Some efficient sampling techniques follow specific spatial arrangements; examples include the variance-based GSA proposed by Saltelli et al. (2010) or STAR-VARS in Razavi and Gupta (2016b). In GSA enabled with such structured sampling techniques, we cannot ignore crashed simulations because excluding sample points associated with simulation crashes will distort the structure of the sample set, causing inaccurate estimation of sensitivity indices. As a result, the user may have to redo part of or the entire experiment depending on the GSA implementation.

The implementation of the non-substitution procedures necessitates significant prior efforts to identify a number of model crashes based on which a statistical model can be built to predict and avoid simulation failures in the subsequent model runs. Such procedures can easily become infeasible in high-dimensional models, as they would require an extremely large sample size to ensure adequate coverage of the parameter space for characterizing implausible regions and building a reliable statistical model. These strategies can be more challenging when a model is computationally intensive. For example, to determine which parameters or combinations of parameters in a 16-dimensional climate model were predictors of failure, Edwards et al. (2011) used 1000 evaluations (training samples) to construct a statistical model to identify parameter configurations with a high probability of failure in the next 1087 evaluations (2087 model runs in total). As pointed out by Edwards et al. (2011), although 2087 evaluations might impose high computational burdens, a much larger sample size spreading out over the parameter space is required to guarantee reasonable exploration of the 16-dimensional space.

The primary goal of this study is to identify and test practical “substitution” strategies to handle the parameter-induced crash problem in GSA of the DESMs. Here, we treat model crashes as missing data and investigate the effectiveness of three efficient strategies to replace them using available information rather than discarding them. Our approach allows the user to cope with failed simulations in GSA without knowing where they will take place and without rerunning the entire experiment. The overall procedure can be used in conjunction with any GSA technique. In this paper, we assess the performance of the proposed substitution approach on two hydrological models by coupling it with a variogram-based GSA technique (VARS; Razavi and Gupta, 2016a, b).

The rest of the paper is structured as follows. We begin in the next section by introducing our proposed solution methodology for dealing with simulation crashes. In Sect. 3, two real-world hydrological modeling case studies are presented. Next, in Sect. 4, we evaluate the performance of the proposed methods across these real-world problems. The discussion is presented in Sect. 5, before drawing conclusions and summarizing major findings in Sect. 6.

We denote the output of each model run (realization)

We propose and test three techniques adopted from incomplete data analysis for missing-data replacement – a process known as imputation (Little and Rubin, 1987). Our techniques do not account for the mechanisms leading to crashes because identifying such mechanisms can be very challenging (Liu and Gopalakrishnan, 2017). Therefore, only the non-missing responses and the associated sample points are included in our analysis to infill model crashes for GSA, as described in the next subsections.

In sampling-based optimization, one may assign a very poor objective
function value (e.g., a very large objective function in the minimization
case) to a crashed solution, similar to the big M method for handling
optimization constraints (Camm et al., 1990). Our first strategy in the GSA
context adopts such an approach. However, since replacing crashes with a big
value can magnify the effect of the crashed runs in GSA, instead we suggest
choosing a measure of central tendency such as mean or median to minimize
the impact of the implausible parameter configurations on the GSA results.
If the distribution of the model responses is not highly skewed, imputing
the crashes with the mean of the non-missing values may work. However, if
the distribution exhibits skewness, then the median may be a better
replacement because the mean is sensitive to outliers. Therefore, we used
the median substitution technique for the experiments reported in this
paper. In general, this strategy treats each model response as a realization
of a random function and ignores the covariance structure of the model
responses. Also, a shortcoming of this technique is that while it preserves
the measure used for the central tendency of
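As a sketch, the median substitution strategy reduces to a few lines (an illustrative implementation, assuming crashed runs are coded as NaN in the response vector):

```python
import numpy as np

def median_impute(y):
    """Replace NaN-coded crashed runs with the median of the successful runs."""
    y = np.asarray(y, dtype=float)
    filled = y.copy()
    med = np.nanmedian(y)              # median of the non-missing responses
    filled[np.isnan(filled)] = med     # impute every crash with that median
    return filled
```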

The nearest-neighbor (NN) technique (also known as hot deck imputation,
see, e.g., Beretta and Santaniello, 2016) uses observations in the
neighborhood to fill in missing data. Let

In this study, we choose to use the single NN technique with a Euclidean
distance measure. We do so because the single NN technique is very
parsimonious and simple to understand and implement. To substitute the
crashed simulations, the single NN algorithm reads through the whole dataset to
find the nearest neighbor and then imputes the missing value with the model
response of that nearest neighbor. It is noteworthy that some authors have
asserted that covariances among
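The single NN imputation described above can be sketched as follows (an illustrative implementation under the same NaN coding assumption; parameter values are assumed to be scaled to comparable ranges before Euclidean distances are computed):

```python
import numpy as np

def nn_impute(X, y):
    """Single nearest-neighbour imputation with a Euclidean distance measure.
    X: (n, d) array of sample points; y: responses with NaN marking crashes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).copy()
    ok = ~np.isnan(y)                              # successful simulations
    for i in np.where(~ok)[0]:
        d = np.linalg.norm(X[ok] - X[i], axis=1)   # distances to successful runs
        y[i] = y[ok][np.argmin(d)]                 # copy the nearest response
    return y
```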

Model emulation is a strategy that develops statistical, cheap-to-run
surrogates of response surfaces of complex, often computationally intensive
models (Razavi et al., 2012a). Here we develop an emulator

There are various choices for the basis function, such as Gaussian,
thin-plate spline, multi-quadric, and inverse multi-quadric (Jones, 2001).
In the present study, we utilize the well-known Gaussian kernel
function for RBF:

After choosing the form of the basis function, the coefficient vector

Note that, in general, depending on the complexity and dimensionality of the
model response surfaces, other types of emulations can be incorporated into
our proposed framework. However, for the crash-handling problem, it is
beneficial to utilize the function approximation techniques that exactly
pass through all sample points (i.e., the response surface surrogates
categorized as “exact emulators” in Razavi et al., 2012a) such as kriging
and RBF. This is mainly because most DESMs are deterministic and
therefore generate identical outputs and responses given the same set of input
factors. In other words, an exact emulator at any successful sample point

We illustrate the incorporation of the proposed crash-handling methodology into a variogram-based GSA approach called the variogram analysis of response surfaces (VARS; Razavi and Gupta, 2016a) and a variance-based GSA approach adopted from Saltelli et al. (2008). The VARS framework has successfully been applied to several real-world problems of varying dimensionality and complexity (Sheikholeslami et al., 2017; Yassin et al., 2017; Krogh et al., 2017; Leroux and Pomeroy, 2019). VARS is a general GSA framework that utilizes directional variograms and covariograms to quantify the full spectrum of sensitivity-related information, thereby providing a comprehensive set of sensitivity measures called IVARS (integrated variogram across a range of scales) at a range of different “perturbation scales” (Haghnegahdar and Razavi, 2017). Here, we use IVARS-50, referred to as “total-variogram effect”, as a comprehensive sensitivity measure since it contains sensitivity analysis information across a full range of perturbation scales.

We utilize the STAR-VARS implementation of the VARS framework (Razavi and Gupta, 2016b). STAR-VARS is a highly efficient and statistically robust algorithm that provides stable results with a minimal number of model runs compared with other GSA techniques, and thus it is suitable for high-dimensional problems (Razavi and Gupta, 2016b). This algorithm employs a star-based sampling scheme, which consists of two steps: (1) randomly selecting star centers in the parameter space and (2) using a structured sampling technique to identify sample points revolving around the star centers. Due to the structured nature of the samples generated in STAR-VARS, ignorance-based procedures (see Sect. 1.2) are of no use in dealing with simulation crashes because deleting sample points associated with crashed simulations will destroy the structure of the entire sample set. Moreover, to achieve a well-designed computer experiment and sequentially locate star centers in the parameter space, we use the progressive Latin hypercube sampling (PLHS) algorithm. It has been shown that PLHS can capture the maximum amount of information from the output space with a minimum sample size, while outperforming traditional sampling algorithms (for more details, see Sheikholeslami and Razavi, 2017).
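The star-based sampling step can be illustrated as follows (a simplified sketch of the cross-section layout on the unit hypercube, not the authors' implementation; with 10 parameters and a resolution of 0.1 it yields 91 points per star, consistent with the 9100 HBV-SASK runs reported below for 100 star centers):

```python
import numpy as np

def star_points(center, resolution=0.1):
    """Generate the cross-section ('star') points around one star centre:
    for each dimension, sweep that coordinate over a grid of the given
    resolution while holding all other coordinates fixed."""
    center = np.asarray(center, dtype=float)
    pts = [center.copy()]                      # the star centre itself
    for i in range(center.size):
        # section through the centre at intervals of `resolution`
        start = center[i] % resolution
        for v in np.arange(start, 1.0, resolution):
            if not np.isclose(v, center[i]):   # avoid duplicating the centre
                p = center.copy()
                p[i] = v
                pts.append(p)
    return np.array(pts)
```

Deleting any of these points (e.g., a crashed run) breaks the equally spaced cross sections from which the directional variograms are computed, which is why crashed runs must be infilled rather than discarded.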

For the variance-based GSA, we calculate the total-effect index (Sobol-TO),
which accounts for the impact of any individual parameter and its
interaction with all other parameters, according to the widely used
algorithm proposed by Saltelli et al. (2008). This algorithm follows a
specific arrangement of randomly generated samples to calculate the
sensitivity indices as follows: first, an
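A compact sketch of a total-effect estimator in this style of sampling layout is shown below (illustrative only; it uses a Jansen-type formula with plain random A and B matrices on the unit hypercube, and is not necessarily the exact estimator of Saltelli et al., 2008):

```python
import numpy as np

def sobol_total_effect(f, k, N, rng=None):
    """Estimate the Sobol total-effect index ST_i for each of k parameters.
    `f` is any vectorised model taking an (n, k) array of unit-cube samples."""
    rng = np.random.default_rng(rng)
    A = rng.random((N, k))                # base sample matrix
    B = rng.random((N, k))                # auxiliary sample matrix
    fA = f(A)
    var = np.var(fA)                      # total output variance estimate
    ST = np.empty(k)
    for i in range(k):
        ABi = A.copy()
        ABi[:, i] = B[:, i]               # replace column i of A with that of B
        ST[i] = np.mean((fA - f(ABi)) ** 2) / (2.0 * var)
    return ST
```

Note that each index needs the paired runs f(A) and f(AB_i); a crash in either member of a pair corrupts that term of the estimator, which again motivates imputation over deletion.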

As an illustrative example, we applied the HBV-SASK conceptual hydrologic
model to assess the performance of the proposed crash-handling strategies.
HBV-SASK is based on the Hydrologiska Byråns Vattenbalansavdelning model
(Lindström et al., 1997) and was developed by the second author for
educational purposes (see Razavi et al., 2019; Gupta and Razavi, 2018).
Here, we used HBV-SASK to simulate daily streamflows in the Oldman River
basin in western Canada (Fig. 1) with a watershed area of 1434.73 km

The Oldman River basin

The Nottawasaga River basin in southern Ontario, Canada (adapted from Sheikholeslami et al., 2019, with permission from Elsevier; license number: 4664891206213).

HBV-SASK model parameters and their feasible ranges used in this study. For information on the full parameter set, refer to Razavi et al. (2019).

In the second case study, we demonstrate the utility of imputation-based
methods in crash handling via their application to the GSA of a
high-dimensional and much more complex problem. We used the Modélisation
Environmentale–Surface et Hydrologie (MESH; Pietroniro et al., 2007),
which is a semi-distributed, highly parameterized land surface–hydrology
modeling framework developed by Environment and Climate Change Canada
(ECCC), mainly for large-scale watershed modeling with the consideration of cold
region processes in Canada. MESH combines the vertical energy and water
balance of the Canadian Land Surface Scheme (CLASS; Verseghy, 1991; Verseghy
et al., 1993) with the horizontal routing scheme of the WATFLOOD (Kouwen et
al., 1993). We encountered a series of simulation failures while assessing
the impact of uncertainties in 111 model parameters (see Table A1 in
Appendix A) on simulated daily streamflows in the Nottawasaga River basin,
Ontario, Canada (Fig. 2). For this case study, the drainage basin of nearly
2700 km

In the first case study, for STAR-VARS, we chose to sample 100 star centers (with a resolution of 0.1) from the feasible ranges of parameters (Table 1) using the PLHS algorithm, resulting in 9100 evaluations of the HBV-SASK model. For the variance-based method, the base sample size was chosen to be 5000, and thus the model was run 60 000 times. The larger base sample size was selected for the variance-based method to ensure the stability of the algorithm. The Nash–Sutcliffe (NS) efficiency criterion on streamflows was used as the model output for sensitivity analysis.

After calculating the NS values, we performed a series of experiments, each with a different assumed “ratio of failure” (from 1 % to 20 %), defined as the percentage of failed parameter sets to the total number of parameter sets. In each experiment, we randomly selected a number of sampled points based on the associated ratio of failure and considered them to be simulation failures. Then, we evaluated the performance of the crash-handling strategies in replacing simulation failures during GSA of the HBV-SASK model and compared the results with the case when there are no failures. In addition, we accounted for the randomness in the comparisons by carrying out 50 replicates of each experiment with different random seeds. This allowed us to see a range of possible performances for each strategy and to assess their robustness when crashes occurred at different locations in the parameter space.
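The random masking step of these experiments can be sketched as follows (a hypothetical helper for illustration; each of the 50 replicates would call it with a different seed):

```python
import numpy as np

def flag_failures(n_runs, ratio, rng):
    """Randomly mark `ratio` of the runs as crashed; returns a boolean mask
    that can be used to set the corresponding responses to NaN."""
    n_fail = int(round(ratio * n_runs))
    idx = rng.choice(n_runs, size=n_fail, replace=False)   # distinct runs
    mask = np.zeros(n_runs, dtype=bool)
    mask[idx] = True
    return mask
```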

In the second case study, with 111 parameters, we only tested STAR-VARS with 100 star centers randomly generated using the PLHS algorithm (with a resolution of 0.1), resulting in 100 000 MESH runs. The NS performance metric was used to measure daily model streamflow performance, calculated over a 3-year period (October 2004–September 2007) following a 1-year model warm-up period (October 2003–September 2004).

Due to various physical and/or numerical
constraints inside MESH (or more precisely in CLASS), some combinations of
the 111 parameters caused model crashes. Here, approximately 3 % of our
simulations failed (3084 out of 100 000 runs). We applied the proposed
crash-handling strategies to infill the missing model outcomes in the GSA of
the MESH model. The entire set of 100 000 function evaluations of the MESH
model would take more than 6 months if we used a single standard CPU core.
However, we used the University of Saskatchewan's high-performance computing
system to run the GSA experiment in parallel on 160 cores. Therefore,
completing all model runs required approximately 32 h. For this case
study, using an Intel® Core™ i7-4790 CPU at 3.6 GHz
desktop PC, the RBF technique took only 65 s to substitute 3084
crashed runs, while the single NN technique required about 97 s to
complete the task.

According to both the IVARS-50 and Sobol-TO sensitivity indices, the
parameters of the HBV-SASK (when there were no model crashes) were ranked as
follows from the most important to the least important one:

Grouping of the 10 parameters of the HBV-SASK model when applied on the Oldman River basin. The parameters are sorted from the most influential (to the left) to the least influential (to the right).

Figures 4, 5, and 6 show the cumulative distribution functions (CDFs) for the 50 independent estimates of IVARS-50 obtained when 1 %, 3 %, 5 %, 8 %, 10 %, 12 %, 15 %, and 20 % of model runs were deemed to be simulation failures. Overall, the RBF and single NN techniques outperformed the median substitution in terms of closeness to the true GSA results and robustness when crashes happened at different locations of the parameter space.

Comparison of the proposed crash-handling strategies in
a sensitivity analysis of the HBV-SASK model using the STAR-VARS algorithm for
different ratios of failure. The CDFs of the sensitivity indices for
strongly influential parameters

Comparison of the proposed crash-handling strategies in
a sensitivity analysis of the HBV-SASK model using the STAR-VARS algorithm for
different ratios of failure. The CDFs of the sensitivity indices for
moderately influential parameters

Comparison of the proposed crash-handling strategies in
a sensitivity analysis of the HBV-SASK model using the STAR-VARS algorithm for
different ratios of failure. The CDFs of the sensitivity indices for weakly
influential parameters (

As the ratio of failure increased, the performance of the crash-handling strategies, particularly that of median substitution, degraded progressively. Note that the median substitution technique resulted in
a significant bias manifested through the overestimation of the sensitivity
indices for all the parameters. Moreover, Figs. 4 and 6 show that when
crashes were substituted using the RBF technique, the STAR-VARS algorithm
estimated the sensitivity indices of the most important parameters

More importantly, as the number of crashes increases, the rankings of the parameters in terms of their importance may change. Figures 7 and 8 show the number of times out of 50 independent runs that the rankings of the parameters were equal to the “true” ranking for the STAR-VARS and variance-based GSA algorithms. In all 50 runs, regardless of the number of model crashes, the rankings obtained by the STAR-VARS using the RBF technique were equal to the true ranking, indicating a high degree of robustness in terms of parameter ranking. The performance of single NN slightly decreased when the crash percentage was more than 15 %, while the STAR-VARS algorithm wrongly determined the rankings in more than 50 % of the replicates using the median substitution technique (see Fig. 7c and d). This highlights the fact that the rankings can be estimated much more accurately than the sensitivity indices in the presence of simulation crashes. In addition, it can be seen that while the RBF-based strategy performed perfectly in this example, the performance of the single NN technique was comparably good (Fig. 7). However, for the variance-based technique, only the rankings of the most important parameters were equal to the true ranking, regardless of the number of model crashes and the utilized crash-handling strategy (Fig. 8). Moreover, the performance reduction of the single NN technique was greater when the variance-based method was employed. In fact, the variance-based algorithm wrongly estimated the rankings in more than 30 % of the replicates using the single NN technique when the ratio of failure was 15 % (Fig. 8c) and 20 % (Fig. 8d).

Comparison of the crash-handling strategies in estimating the
parameter rankings for the HBV-SASK model using the STAR-VARS algorithm when the
ratio of failure was

Comparison of the crash-handling strategies in estimating the
parameter rankings for the HBV-SASK model using the variance-based algorithm
when the ratio of failure was

Finally, Fig. 9 presents the performance of the single NN (Fig. 9a and c)
and RBF (Fig. 9b and d) strategies in approximating the fill-in values for
the missing responses when 5 % (Fig. 9a and b) and 20 % (Fig. 9c and d) of
the HBV-SASK simulations were deemed failures. As shown, RBF
outperformed the single NN technique in terms of closeness to the true NS
values. For example, with 20 % of the model runs failing, the linear
regression had an

Scatterplots of the true NS values versus the imputed NS values
when the ratio of failure was 5 %

We present the GSA results of the MESH model by categorizing the 111 parameters of the model into three groups as shown in Fig. 10 (for more details on grouping, see Sheikholeslami et al., 2019). Figures 11–13 present the sensitivity analysis results obtained by the STAR-VARS algorithm for the MESH model when we applied different crash-handling strategies. These groups are labeled according to their importance; i.e., Group 1 (Fig. 11) contains the strongly influential parameters, the parameters in Group 2 (Fig. 12) are moderately influential, and Group 3 (Fig. 13) contains the weakly influential parameters.

Grouping of the 111 parameters of the MESH model. The parameters are sorted from the most influential (to the left) to the least influential (to the right). This grouping is based on the results of the RBF method.

Sensitivity analysis results of the MESH model using different
crash-handling strategies for the most influential parameters. To better
illustrate the results, the highly influential parameters in Group 1 (see
Fig. 10) are separately shown in panels

Sensitivity analysis results of the MESH model for moderately
influential parameters using different crash-handling strategies. To better
illustrate the results, the moderately influential parameters in Group 2
(see Fig. 10) are separately shown in panels

Sensitivity analysis results of the MESH model using different
crash-handling strategies for weakly and/or non-influential parameters in Group 3
(see Fig. 10). Panels

The four most influential parameters in Group 1 are

For the other 15 influential parameters in Group 1 (Fig. 11b),
there is general agreement between the three crash-handling techniques
about the sensitivity indices calculated by the STAR-VARS except for the
parameter

Figure 12 illustrates the sensitivity indices for the moderately influential parameters (i.e., Group 2). For all 78 of these parameters, the sensitivity analysis results were highly dependent on the chosen crash-handling strategy. As can be seen, the sensitivity indices associated with the median substitution and RBF techniques are higher than those obtained by the single NN technique (this difference is more pronounced for the parameters in Fig. 12a and c than for those in Fig. 12b).

Finally, the results of the sensitivity analysis for the weakly or non-influential (Group 3) parameters of the MESH model are plotted in Fig. 13. The STAR-VARS algorithm identified these parameters as weakly influential (very low IVARS-50 values) using the proposed crash-handling techniques. However, the associated sensitivity indices obtained by the RBF imputation method are 2 orders of magnitude larger for the parameters in Fig. 13a and c and about 4 orders of magnitude larger for the parameters in Fig. 13b and d compared to those obtained by the single NN and median substitution methods.

It is important to note that in high-dimensional DESMs, when the number of parameters is very large, the estimation of sensitivity indices is likely not robust to sampling variability. On the other hand, parameter ranking (the order of relative sensitivity) is often more robust to sampling variability and converges more quickly than factor sensitivity indices (see, e.g., Vanrolleghem et al., 2015; Razavi and Gupta, 2016b; Sheikholeslami et al., 2019). To investigate how different crash-handling strategies can affect the ranking of the model parameters in terms of their importance, Fig. 14 compares the rankings obtained by the RBF, single NN, and median substitution techniques.

Comparing rankings of the MESH model parameters obtained by
different crash-handling strategies using the STAR-VARS algorithm. Panels

As shown in Fig. 14a, the single NN and median substitution techniques resulted in similar parameter rankings for the strongly influential (Group 1) and weakly influential (Group 3) parameters, while for the moderately influential parameters (Group 2) the rankings are significantly different. Meanwhile, the RBF and median substitution techniques yielded very different rankings except for the strongly influential parameters (Fig. 14b). Furthermore, Fig. 14c indicates that the single NN and RBF methods provided similar rankings for the influential parameters.

A closer examination, however, reveals that rankings can be contradictory
for some of the parameters when using different crash-handling strategies
(see Fig. 14d–f). For example, consider the soil moisture suction
coefficient for crops (

Our further investigations of the MESH model revealed at least two possible
causes for many of the simulation failures, i.e., the threshold
behavior of some parameters and oversaturation of the soil layers. For
example, the threshold behavior of

Figure 15. A 2-D projection of the MESH parameters for successful (blue dots) and crashed (red dots) simulations.

Furthermore, from our analysis we found that the oversaturation of the soil
layer might happen, especially at lower values of the soil permeable depth
(

As can be seen from Fig. 15, very high values of the parameters

Due to the extremely large parameter space of high-dimensional DESMs, it may
require many properly distributed sample points (

Because non-substitution procedures rely on constructing a statistical
model from the observed crashes in order to predict and avoid crashes in
follow-up experiments, they require good coverage of the parameter domain to
attain a reliable statistical model. This issue also challenges the use of
imputation-based methods. For example, in NN techniques (both single and
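The idea of predicting likely crashes from observed ones can be sketched with a simple k-nearest-neighbor classifier over the evaluated samples; this is an assumed stand-in for illustration, not the specific statistical models used by non-substitution procedures in the literature:

```python
import numpy as np

def predict_crash(x_new, X_obs, crashed, k=3):
    """Flag a candidate sample as likely to crash if a majority of its
    k nearest evaluated samples crashed.

    x_new   : (d,) candidate parameter vector
    X_obs   : (n, d) evaluated parameter samples
    crashed : (n,) boolean array, True where the simulation failed
    Illustrative sketch only.
    """
    d = np.linalg.norm(X_obs - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(d)[:k]                 # indices of k closest samples
    return bool(crashed[nearest].mean() > 0.5)  # majority vote
```

Such a classifier is only as good as the coverage of the sample: sparse sampling of a high-dimensional domain leaves large unvisited regions where its predictions are essentially guesses.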

A crucial consideration in the use of any sampling strategy is the

Given this, regardless of the method chosen for handling the simulation crash
problem in GSA, it is advisable to spend some time up front finding an
optimal sample set before submitting it for evaluation to
computationally expensive DESMs. It is therefore prudent to
use improved sampling algorithms such as progressive Latin hypercube
sampling (PLHS; Sheikholeslami and Razavi, 2017),
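For reference, plain Latin hypercube sampling, the non-progressive ancestor of PLHS, can be sketched in a few lines (this is basic LHS, not the PLHS algorithm itself):

```python
import numpy as np

def latin_hypercube(n, d, seed=None):
    """Basic Latin hypercube sample of n points in the d-dimensional unit cube.

    Each dimension is split into n equal strata, and each stratum is sampled
    exactly once, which stratifies all 1-D marginals.
    """
    rng = np.random.default_rng(seed)
    # Independently permute the stratum indices 0..n-1 in each dimension,
    # then draw one uniform point inside each assigned stratum.
    strata = rng.permuted(np.tile(np.arange(n), (d, 1)), axis=1).T  # (n, d)
    return (strata + rng.random((n, d))) / n
```

Scaling each column to a parameter's feasible range then yields a space-filling design over the model's parameter space.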

We conclude this section by highlighting a point that should receive careful attention when applying substitution-based methods in handling model crashes. In addition to the numerical artifacts in simulation models, some combinations of parameter values, which may not be physically justified, can also lead to simulation failures. As a result, there is a risk that substituting data for these crashed runs can contaminate the assessment of parameter importance. Preventing this type of risk requires knowledge about the reasonable parameter ranges in DESMs. This type of crash can be significantly reduced by selecting plausible ranges of parameters based on physical knowledge or information of the problem (a process referred to as “parameter space refinement”; see, e.g., Li et al., 2019; Williamson et al., 2013). However, DESMs often consist of many interacting, uncertain parameters, and therefore very little may be known a priori about the implausible regions of the parameter space.

Understanding the complex physical processes in Earth and environmental systems and predicting their future behaviors routinely rely on high-dimensional, computationally expensive models. These models are commonly subjected to model calibration and/or uncertainty and sensitivity analysis. If a simulation failure or crash occurs during any of these processes, the model stops functioning and requires user intervention. There are many possible reasons for a simulation failure, including inconsistent integration time steps or grid resolutions, lack of convergence, and threshold behaviors in the model. Determining whether these "defects" stem from the numerical schemes used or from programming bugs requires analyzing a high-dimensional parameter space and characterizing the implausible regions responsible for crashes, which imposes a heavy computational burden on analysts. More importantly, crashed simulations are very costly for global sensitivity analysis (GSA) algorithms because they can prevent the completion of the analysis and introduce ambiguity into the GSA results.

These challenges motivated us to implement missing-data-imputation strategies for handling simulation crashes in the GSA context. These strategies substitute plausible values for the failed simulations in the absence of a priori knowledge about the nature of the failures. Our focus here was on simple yet computationally frugal techniques that mitigate the effect of model crashes on the GSA of dynamical Earth system models (DESMs). We therefore utilized three techniques, namely median substitution, single nearest neighbor, and emulation-based substitution (using radial basis functions as the surrogate model), to fill in values for the failed simulations using the available non-missing model responses. The high efficiency of the proposed substitution-based approach is particularly important when dealing with the GSA of computationally expensive models, mainly because it does not require repeating the entire experiment.
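The three strategies can be sketched as follows, assuming model responses for crashed runs are recorded as NaN; the Gaussian RBF here is a minimal stand-in for a fully tuned surrogate (the shape parameter and regularization values are illustrative assumptions, not the settings used in the study):

```python
import numpy as np

def impute_crashes(X, y, strategy="rbf"):
    """Fill in responses for crashed runs.

    X : (n, d) array of parameter samples
    y : (n,) model responses, NaN where the simulation failed
    Sketch of the three substitution strategies compared in the paper.
    """
    ok, bad = ~np.isnan(y), np.isnan(y)
    y_filled = y.copy()
    if strategy == "median":
        # replace every failed run with the median of the successful runs
        y_filled[bad] = np.median(y[ok])
    elif strategy == "nn":
        # single nearest neighbor in parameter space
        for i in np.where(bad)[0]:
            d = np.linalg.norm(X[ok] - X[i], axis=1)
            y_filled[i] = y[ok][np.argmin(d)]
    elif strategy == "rbf":
        # Gaussian radial basis function interpolation of successful runs
        Xo = X[ok]
        eps = 1.0  # shape parameter (would be tuned in practice)
        K = np.exp(-eps * np.sum((Xo[:, None] - Xo[None, :]) ** 2, axis=2))
        w = np.linalg.solve(K + 1e-10 * np.eye(len(Xo)), y[ok])
        Kq = np.exp(-eps * np.sum((X[bad][:, None] - Xo[None, :]) ** 2, axis=2))
        y_filled[bad] = Kq @ w
    return y_filled
```

The filled-in response vector can then be passed to the GSA algorithm exactly as if all runs had succeeded, which is what allows the analysis to proceed without rerunning the experiment.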

We compared the performance of our approach in GSA of two modeling case
studies in Canada, including a 10-parameter HBV-SASK conceptual hydrologic
model and a 111-parameter MESH land surface–hydrology model. Our analyses
revealed the following.

Overall, emulation-based substitution can effectively handle the simulation crashes and produce promising sensitivity analysis results compared to the single nearest-neighbor and median substitution techniques.

As expected, the performance of the proposed methods deteriorates as the ratio of failure increases. The rate of degradation depends on the number of model parameters (the dimensionality of the parameter space).

We observed in our experiments that the crash-handling strategy used (i.e., median substitution, single NN, or RBF) has minimal influence on the rankings of the strongly and weakly influential parameters identified by the GSA algorithms, while for the moderately influential parameters different strategies yielded different rankings.

Future work should include extending the proposed crash-handling approach to time-varying sensitivity analysis of DESMs, because a comprehensive GSA requires full consideration of the dynamical nature of the models. Our proposed approach can be integrated with any time-varying sensitivity analysis algorithm, for example the recently developed generalized global sensitivity matrix (GGSM) method (Gupta and Razavi, 2018; Razavi and Gupta, 2019). This would help us further understand the temporal variation of parameter importance and model behavior. Finally, another possible future direction is to apply and test other types of emulation techniques, such as kriging and support vector machines, for handling model crashes.

The MATLAB codes for the proposed crash-handling approach and the HBV-SASK
model are included in the VARS-TOOL software package, which is a
comprehensive, multi-algorithm toolbox for sensitivity and uncertainty
analysis (Razavi et al., 2019). VARS-TOOL is freely available for
noncommercial use and can be downloaded from

Parameters of the MESH model and their corresponding groups are listed in Table A1. A description of the parameters and their feasible ranges can be found in Haghnegahdar et al. (2017).

Grouping of 111 MESH model parameters. These groups are numbered in order of importance.

Comparison of the proposed crash-handling strategies in
a sensitivity analysis of the HBV-SASK model using the variance-based
algorithm for different ratios of failure. The CDFs of the sensitivity
indices for strongly influential parameters

Comparison of the proposed crash-handling strategies in
a sensitivity analysis of the HBV-SASK model using the variance-based
algorithm for different ratios of failure. The CDFs of the sensitivity
indices for moderately influential parameters

Comparison of the proposed crash-handling strategies in
a sensitivity analysis of the HBV-SASK model using the variance-based
algorithm for different ratios of failure. The CDFs of the sensitivity
indices for weakly influential parameters (

All authors contributed to conceiving the ideas of the study. RS and SR designed the method and experiments. RS carried out the simulations for the first case study. AH performed the MESH simulations for the second case study. RS developed the MATLAB codes for the proposed crash-handling approach and conducted all the experiments. RS wrote the paper with contributions from SR and AH. All authors contributed to the interpretation of the results and structuring and editing the paper at all stages.

The authors declare that they have no conflict of interest.

The authors were financially supported by the Integrated Modelling Program for Canada (IMPC;

This paper was edited by Steve Easterbrook and reviewed by two anonymous referees.