Surrogate-assisted Bayesian inversion for landscape and basin evolution models

The complex and computationally expensive features of the forward landscape and sedimentary basin evolution models pose a major challenge in the development of efficient inference and optimization methods. Bayesian inference provides a methodology for estimation and uncertainty quantification of free model parameters. In our previous work, parallel tempering Bayeslands was developed as a framework for parameter estimation and uncertainty quantification for the landscape and basin evolution modelling software Badlands. Parallel tempering Bayeslands features high-performance computing with dozens of processing cores running in parallel to enhance computational efficiency. Although parallel computing is used, the procedure remains computationally challenging since thousands of samples need to be drawn and evaluated. In large-scale landscape and basin evolution problems, a single model evaluation can take from several minutes to hours, and in certain cases, even days. Surrogate-assisted optimization has been with successfully applied to a number of engineering problems. This motivates its use in optimisation and inference methods suited for complex models in geology and geophysics. Surrogates can speed up parallel tempering Bayeslands by developing computationally inexpensive surrogates to mimic expensive models. In this paper, we present an application of surrogate-assisted parallel tempering where that surrogate mimics a landscape evolution model including erosion, sediment transport and deposition, by estimating the likelihood function that is given by the model. We employ a machine learning model as a surrogate that learns from the samples generated by the parallel tempering algorithm. The results show that the methodology is effective in lowering the overall computational cost significantly while retaining the quality of solutions.


Introduction
The Bayesian methodology provides a probabilistic approach for the estimation of free parameters in complex models [4,5]. Hence, a deterministic geophysical forward model can be seen as a probabilistic model via Bayesian inference which provides a rigorous approach to uncertainty quantification as opposed to optimization methods. The approach is also known as Bayesian inversion which has been used for landscape evolution [1,6], geological reef evolution models [7] and other geo-scientific models [4,8]. A Markov Chain Monte Carlo method (MCMC) can be used to implement Bayesian inference for estimation and uncertainty quantification of free parameters [9,10,11,5]. Parallel tempering is a MCMC method that [12,13] features multiple replicas to provide global and local exploration which makes them suitable for irregular and multi-modal distributions [14,15]. In contrast to canonical MCMC sampling methods, parallel tempering can be more easily implemented in a multicore or parallel computing architecture [16].
In our previous work, parallel tempering Bayeslands was presented as a framework for parameter estimation and uncertainty quantification for Badlands, a landscape and basin evolution evolution software [1]. Parallel tempering Bayeslands features high performances parallel computing to enhance the efficiency of estimating free parameters of a Badlands model. Although parallel computing is used, the procedure remains computationally challenging since thousands of samples need to be drawn and evaluated [1]. In large scale landscape evolution problems, a single model can take hours to days. Hence, it is useful to find ways to improve parallel tempering Bayeslands. Such problems are common for complex forward models of physical processes that can take several hours to days and months to evaluate a single model run. One of the ways to address this problem is using surrogate-assisted estimation.
Surrogate assistant optimization refer to the use of statistical and machine learning models to built approximate simulation of the actual model [17]. Many optimization methods lack a rigorous approach for uncertainty quantification, leading to Bayesian inversion as an alternative, particularly for complex geophysical numerical models [8,4]. The major advantage of a surrogate model is its computational efficiency when compared to the equivalent numerical physical forward model [2,3]. In the optimization literature, surrogate usage is also known as response surface methodologies [18,19] that have been applicable for a wide range of engineering problems [20,21] such as aerodynamic wing design [2]. A number of approaches have been used to improve the way surrogates are utilized. [3] combined global and local surrogate models to accelerate evolutionary optimization. [22] presented a generalized surrogate-assisted evolutionary computation framework to unify diverse surrogate models during optimization and taking into account uncertainty in estimation. Jin [17] reviewed a range of problems such as single, multi-objective, dynamic, constrained, and multi-modal optimization problems [23]. In the Earth sciences, examples for surrogate assisted approaches include modeling water resources [24,25], atmospheric general circulation models [26], computational oceanography [27], carbon-dioxide (CO2) storage and oil recovery [28] and debris flow models [29].
Given that Bayesian inversion is implemented using parallel tempering, parallel computing infrastructure is required and the challenge is how to incorporate surrogates across different processing cores. Recently, surrogate-assisted parallel tempering has been developed for Bayesian neural networks which presents a global-local surrogate framework where surrogate training is executed in the master processing core that is used to manage the replicas running in parallel [30]. The method gives promising results, retaining classification accuracy while lowering computational time.
In this paper, we present an application of surrogate-assisted parallel tempering [30] for Bayesian inversion of surface process models that employ parallel computing infrastructure. We use the Badlands landscape evolution model [31] as a case study to demonstrate the framework. Overall, the framework features the surrogate-model which mimics the Badlands model and estimates the likelihood function to evaluate the proposed parameters. We employ a neural network model as the surrogate that learns from the history of samples proposed by the parallel tempering algorithm. The entire framework is developed in a parallel computing infrastructure to take advantage of parallelism for surrogate-assisted parallel tempering. We apply the method to several selected benchmark landscape evolution and sediment transport/deposition problems and show the quality of the estimation of the likelihood given by the surrogate when compared to the actual Badlands model.

Bayesian inference via Parallel tempering
Bayesian inference is based on Bayes theorem and typically implemented by employing MCMC sampling methods to update the probability for a hypothesis as more information becomes available. The hypothesis is given by a prior probability distribution (also known as the prior) that expresses one's belief about a quantity (or free parameter in a model) before some data is taken into account. Therefore, MCMC methods provide a probabilistic approach for estimation of free parameters in a wide range of models [32,33]. The likelihood function is a function of the parameters of a given model provided specific observed data which in the case of landscape evolution would be the ground-truth topography or the observed sedimentary record. Hence, the likelihood function can be seen as a fitness measure of the proposals. In order to evaluate the likelihood function, one would need to run the given model which in our case is the Badlands model. The likelihood function is used with the Metropolis-criteria to either accept or reject a proposal. When accepted, the proposal becomes part of the posterior distribution which essentially provides the estimation of the free parameter with uncertainties. The sampling process is iterative and requires thousands of samples to be drawn until convergence is reached. In our case, convergence is defined by a predefined number of samples or until the likelihood function has reached a specific value. Convergence essentially means that the posterior distribution of the given parameters generate Badlands model outputs that resemble ground-truth data [1].
As noted earlier, parallel tempering is a MCMC method that features parallelism with enhanced exploration capabilities. It features a number of replicas with slight variations in the acceptance criteria through relaxation of the likelihood with a temperature ladder that affects the acceptance criterion. The replicas associated with higher temperature levels have more chance in accepting weaker proposals (solutions) which could help in escaping a local minimum. Given an ensemble of N replicas defined by the temperature ladder, the state of the ensemble is specified by X = x 1 , x 2 , ..., x N , where x i is the replica at temperature level T i . The equilibrium distribution of the ensemble, X is given by where E(x i ) is the energy function and Z(T i ) = exp(− 1 T i E(x i ))dx i is the partition function of the replica at T i . A Markov chain is constructed to sample E(x i ) at each temperature level T i . At every iteration, the Markov chains can feature two types of transitions that include 1) the Metropolis transition and 2) a replica transition.
In the Metropolis transition phase, each replica is sampled independently to perform local Monte Carlo moves defined by the temperature which is implemented by a change in the energy function, E(x i ) for each temperature level T i . The configuration x * i is sampled from a proposal distribution q i (.|x i ) and the Metropolis-Hastings ratio at temperature level T i is given as where, L, represents the likelihood at the local replica and the new state is accepted with probability min(1, W L (x i → x * i )). The detailed balance condition holds for each replica and therefore it holds for the ensemble system.
The Replica transition phase considers the exchange of current state between two neighbouring replicas based on the Metropolis-Hasting acceptance criteria. Hence, given a probability α, pairs of replica defined by two neighboring temperature levels, i and i + 1 are exchanged.
The exchange of neighboring replicas that provide an efficient balance between local and global exploration [8]. The temperature ladder and replica exchange have been of focus of investigation in the past [34,35,36,14]. There is a consensus that they need to be tailored for different types of problems given by their likelihood landscape. In this paper, the selection of temperature spacing between the replicas is carried out using a Geometric spacing methodology [37].
where i = 1, . . . , M and T max is maximum temperature which is user defined and dependent on the problem.

Badlands and Bayeslands
Landscape evolution models incorporate different driving forces such as tectonics or climate variability [38,39,40,41,42] and combine empirical data and conceptual methods into a set of mathematical equations. Badlands (basin and landscape dynamics) [31] is an example of such a model that can be used to reconstruct landscape evolution and associated sediment fluxes [43,44]. We use Badlands [31,45,46] to simulate landscape evolution and sediment transport/deposition of selected areas in order to provide estimation with uncertainty quantification of the free parameters such as precipitation and erodibility.
The Badlands model simulates landscape dynamics which requires an initial topography that is exposed to climate and geological factors over time. In order to create test problems for Badlands, a set of climate and geological parameters defined by θ needs to be predefined to determine landscape evolution over a given timescale T . The final (ground-truth) topography at time T and expected sediment deposits at selected intervals in time are used to evaluate the quality of proposals during sampling in Bayeslands [1]. Bayeslands features parallel tempering for the estimation of free parameters and uncertainty quantification in model outputs for landscape simulation [1]. Bayeslands estimates a set of free parameters given by θ that is constrained by some data D. Bayeslands samples the posterior distribution p(θ|D) using principles of Bayes rule where, p(D|θ) is the likelihood of the data given the parameters, p(θ) is the prior, and p(D) is a normalizing constant and equal to p(D|θ)p(θ)dθ. θ represents the set free parameters such as precipitation and erodibility in the Badlands model and the data D represents the real topography. The prior distribution (also known as prior) refers to one's belief in the distribution of the parameter without taking into account the evidence or data. The prior distribution is adjusted by sampling from the posterior with given likelihood function that takes into account the data and the model. The goal of Bayeslands is to find estimate the θ given the posterior distribution such that the simulated topography by Badlands can match the real topography D.

Benchmark landscape evolution problems
We select two benchmark landscape problems presented in parallel tempering Bayeslands [1] that were adapted from earlier work [6]. These include Continental Margin (CM) and Synthetic-Mountain landscape evolution problems which have been chosen due to their computational time required for running a single model. Both of these problems use less than ten seconds to run a single model on a single central processing unit (CPU). These problems are well suited for a parameter evaluation for the proposed surrogate-assisted Bayesian inversion framework. In order to demonstrate an application which is computationally expensive, we introduce another problem, which features the landscape evolution of Tasmania, Australia, for a million years that features the region shown in Figure  1. The Synthetic-Mountain landscape evolution is a synthetic problem while the Continental-Margin problem is a real-world problem based on the topography of a region along the eastern margin of the South Island of New Zealand as shown in Figure  1. We then use Badlands to evolve the initial landscape with parameter settings such as rainfall and erodibility given in Table 1 and Table 2 and create the respective problems synthetic ground-truth topography.
The initial and synthetic ground-truth topographies along with erosion-deposition that shows sediment formation for these problems appear in Figure 2 and 3, respectively. Note that the figure shows that the Synthetic-Mountain is flat in the beginning, then given constant uplift rate along with weathering with constant rainfall parameter value, a mountain is formed. We note that we use present-day topography as the initial topography in the Continental-Margin and Tasmania problems, whereas, a synthetic flat region is used as Synthetic-Mountain initial topography. Each of these problems involve an erosion-deposition model history that is used to generate synthetic ground-truth data for the final model state that we then attempt to recover. Hence, the likelihood function given in the following subsection takes both the landscape topography and erosion-deposition ground-truth into account. The Continental-Margin and Tasmania cases feature six free parameters (Table  2) whereas the Synthetic-Mountain features 5 free parameters. Note that the marine diffusion coefficients are absent for the Synthetic-Mountain problem since the region does not cover or overlap with coastal and marine areas. The main reason behind choosing the two benchmark problems is due to their nature, i.e the Synthetic-Mountain problem features uplift rate which is not featured in the Continental-Margin problem. The Continental-Margin problem features other parameters such as the marine coefficients. The Tasmania problem features a much bigger region hence more computational time is used for running a single model. The common feature in all three problems is that they model both the topography erosion-deposition to show sediment formation over time. Furthermore, the priors were drawn from a uniform distribution with lower and upper limit given in Table 3.

Bayeslands Model
The likelihood function captures the quality of topography simulation along with quality of successive erosion-deposition which denotes the sediment thickness evolution through time. This makes the sampling problem more challenging but useful for certain applications. More specifically, the likelihood function evaluates the quality of the proposals by taking into account the difference between the final simulated Badlands topography and the ground-truth topography. The likelihood function also considers the difference between the simulated and groundtruth sediment thickness at selected time intervals. Hence, we use the likelihood function from [1] which is given as follows.
Let the initial topography be denoted by D 0 , with D 0 = (D 0,s 1 . . . , D 0,s n ), where s i corresponds to site s i , with coordinates latitude, u i , and longitude, v i . Suppose that we are         interested in the topography t max years into the future, we will denote this by D t , with D T defined to be the current topography. Hence, the model that generates the process is given by for t = 0, 1, . . . , T and i = 1, . . . , n, where θ are the parameters of the Badlands model and f t,s i (θ) is the output of the Badlands forward model. This essentially states that the topography is function of the Badlands forward model given parameters θ, plus some Gaussian noise with zero mean and variance τ 2 . The likelihood function L e (θ) is given by where the subscript e, in L e (θ), denotes the elevation likelihood.
We note that Badlands produces successive time-dependent topographies; however, only the final topography D T is used for the calculation of the elevation likelihood, because usually little ground-truth information is available for the detailed evolution of surface topography. In contrast, sediments preserve the stratigraphic record of the time-dependence of sedimentation and can be used to ground-truth the time-dependent evolution of surface process models that include sediment transport and deposition, as is the case for Badlands. We therefore define another random variable z t = (z t,s 1 . . . , z t,s m ) which represents the sedimentary record at sites s 1 , . . . , s m . We assume that observed values of z t are a function of the Badlands ground-truth forward model, with parameter θ and some Gaussian noise then the sediment likelihood, L s (θ) is giving a total likelihood L(θ): The complete set of unknowns in the model is given by θ, τ 2 and σ 2 .

Surrogate-assisted parallel tempering for Badlands model
The surrogate model learns from the relationship between the set of input parameters and response given by the true model. In our case, the input is the set of proposals giving by the sampler in the parallel tempering algorithm which features the proposals for the Badlands model parameters such as rainfall and erodibility. The likelihood estimation by the surrogate model is called the pseudo-likelihood.
In a paralle computing environment we need to take into account the cost of inter-process communication which must be limited to avoid computational overhead. As given in our past implementation [1], the swap interval refers to the number of iterations after which each replica pauses and can undergo a replica transition. After the swap proposal is accepted or rejected, the replicas are resumed and they continue iterating while undergoing Metropolis transition in between the swap intervals. We note that the surrogate-assisted estimation is incorporated into the multi-core parallel tempering algorithm. In terms of the training procedure for the surrogate, [30] used a surrogate interval that determines the frequency of training by collecting the history of past samples with their likelihood from the respective replicas.
Taking into account that the true model is represented as y = f (x), the surrogate model provides an approximation in the formŷ =f (x), such that y =ŷ + e where e represents the difference or error. The task of the surrogate model is to provide pseudo-likelihood such that the the true likelihood is estimated by training from history of proposals which is given by the set of input x r,s and likelihood y s where s represents the sample and r represents the replica. Hence, the training dataset Φ for the surrogate is developed by fusion of x r,s across all the replica for a given surrogate interval ψ. Therefore, this can be formulated as follows.
where x r,s represents the set of parameters proposed at sample s, y r,s = log p(y A D,T |x r,s ) is the Gaussian likelihood which is dependent on data and the geo-scientific model, M is the total number of replicas. The training surrogate dataset Θ features input Φ and response λ at the end of every surrogate interval denoted by s + ψ. Hence, the pseudo likelihoodŷ is given bŷ y =f (Θ), wheref is the surrogate model. The likelihood in training data is relaxed with respect of the temperature since it has been changed by taking L local /T r for given replica r. We undo this change by multiplying the likelihood by the respective temperature which is a data processing step for surrogate model. Algorithm 1 provides the details for execution of surrogateassisted parallel tempering for the Badlands model. The algorithm begins by initializing the replicas that sample θ n that represent Badlands model parameters such as rainfall and erodibility. This is done by drawing from a uniform distribution in a range [α, α] where α. The temperature ladder employs geometric ladder as given in Equation 4. Other key parameters include: 1. replica swap-interval R swap , 2. maximum number of samples for each replica R max , 3. surrogate interval, S interval , and 4. surrogate probability S prob . All of these values are determined experimentally.
After these are determined, the algorithm begins sampling for the respective replicas. Initially, the first surrogate interval considers the evaluation of all the proposals by the true likelihood function. Afterwards the data from the respective replicas are concatenated into training data Θ and used for training the surrogate model as shown in State 31 of Algorithm 1. Once the surrogate is trained, it can be used to provide the pseudolikelihood.
Given that the implementation considers each replica executed on a separate processing unit, a master processing unit is used to manage all the respective replicas as shown in Figure  4. The master process executes all the replicas in parallel and manages them by taking into account replica swap and surrogate training via the surrogate interval. The communication between the master process and the replica process requires interprocess communication protocols which is shown in Figure 4 and implemented by multi-processing libraries 1 .
The pseudo-likelihood is utilized according to the surrogate probability as shown in State 6 in Algorithm 1. The surrogate model keeps updating its knowledge gained by data through observing the true likelihood from all of the replicas. The surrogate model is re-trained when remaining surrogate intervals are reached until the maximum sampling time is reached. At each every training interval, the surrogate model trains with knowledge from the previous state. Hence, the surrogate model remains up-to-date and through transfer of knowledge form previous intervals, it gets better in estimation for the pseudolikelihood. We note that only the samples associated with the true-likelihood becomes part of the surrogate training dataset.
Note that the training is done in the master process which features the global surrogate model as given in Figure 4. The replica processes provide the training dataset by file output which is read and concatenated by the master process. The way this is implemented is through having copies of the surrogate model (untrained one) in each of the replicas. After training, the knowledge (i.e. weights of surrogate model) are transferred to each of the replicas as demonstrated in Figure 4. At the time of estimation for the pseudo-likelihood in each replica, we call the local surrogate that contains the knowledge from the global surrogate gained from the training data in the previous surrogate interval. The framework is flexible and hence the surrogate model at hand can be chosen by the user according to the nature of the likelihood surface.
The quality of estimation from the surrogate model can be validated by the root mean squared error (RMSE) which considers the difference between the likelihood and the pseudolikelihood. This can be seen as a regression problem with multiinput (parameters) and single output (likelihood). The RMSE is calculated by the following where, y i andŷ i are the true likelihood and the pseudolikelihood value respectively and N is the number of cases the surrogate is employed during sampling.

Surrogate model
The choice of the surrogate model needs to consider the computational resources taken for training the model during the sampling process. We note that Gaussian process, neural networks, and radial basis functions [47], have been popular choices for surrogates in the literature.
In our case, we consider the inference problem that can feature, dozens, hundreds to thousands of parameters, hence the model needs to be efficiently trained without taking lots of computational resources. Moreover, the flexibility of the model to have incremental training is also needed. Therefore, we rule out Gaussian process models since they have imitations in training given that the size of the dataset increases [48]. We use neural networks as the choice of the surrogate model in this study. The training data and neural network model can be formulated as follows.
The data given to the surrogate model is Φ and λ as in Equation (7), where Φ is the input and λ is the desired output of the model. The prediction of the model is denoted byλ. We explain the surrogate models used in the paper as follows.
In our surrogate model, we consider a single hidden layer feedforward neural network as shown below. Given input x t , f (x t ) is computed by a feedforward neural network with one hidden layer defined by the function where δ o and δ h are the bias weights for the output o and hidden h layer, respectively. v j is the weight which maps the hidden layer h to the output layer. w dh is the weight which maps x t to the hidden layer h and g is the activation function for the hidden and output layer units.
The only difference is that we use a different activation function g(.) We use ReLU (rectified linear unitary function) as the activation function. The learning or optimization task then is to iteratively update the weights and biases to minimize the cross entropy loss J (W, b). This can be done using gradient update of weights using Adam (adaptive moment estimation) learning algorithm [49] and stochastic gradient descent [50,51]. We experimentally evaluate them for training feedforward network for the surrogate model in the next section.

Design of Experiments
We provide an experimental study of the proposed surrogateassisted parallel tempering (SAPT-Bayeslands) framework for selected landscape evolution problems. We compare the results with our parallel tempering Bayeslands framework (PT-Bayeslands) presented in an earlier study [1]. The first part the experiments feature the accuracy of the surrogates in comparison with the actual model while the second part features the integration of SAPT for the Badlands model. We used Keras neural networks library [52] for implementation of the surrogate. The open-source software package along with benchmark problems and experimental results is given here 2 .
We first carry out an investigation of the effects of different surrogate training procedures and parameter evaluation for SAPT-Bayeslands using smaller problems. Afterwards, we apply the methodology to our selected landscape evolution problems. More specifically, the experiments are designed as follows.
• We generate a dataset for training and testing the surrogate for the Synthetic-Mountain and Continental-Margin landscape evolution problems. We use the neural network model for the surrogate and evaluate different training techniques.
• We evaluate if transfer of knowledge from previous surrogate interval is better than no transfer of knowledge for Synthetic-Mountain and Continental-Margin problems. Note this is done only with the data generated from previous step.
• We integrate the surrogate model into parallel tempering (SAPT-Bayeslands) and evaluate the effectiveness of the surrogate in terms of prediction of likelihood and overall time reduced is evaluated. Due to the requirement of extensive experimentation, only Synthetic-Mountain and Continental-Margin problems are considered.
• SAPT-Bayeslands is applied to the Tasmania landscape evolution problem and compared with PT-Bayeslands.
In SAPT-Bayeslands and PT-Bayeslands, we employ a random-walk proposal which is implemented by perturbing the chain in the respective replica with a small amount of Gaussian noise with a parameter specific step-size or standard deviation. The step-size β i for parameter i is chosen to be a combination of a fixed step size φ = 0.02, common to all parameters, multiplied by the range of possible values for parameter i so that where, a i and b i represent the maximum and minimum limits of the priors for parameter and are given in Table 2.
Similarly, the geometric temperature ladder with maximum temperature T max = 10 was used for determining the temperature level for for each of the replica. In trial experiments, the selection of these parameters depended on the accuracy. We used a replica exchange or swap interval, R swap = 10 that determines when to check whether to swap with the neighboring replica. In previous work [1], it was observed that increasing the number of replicas up to a certain point does not necessarily mean that the computational time is lowered or better sampling is achieved. In this work, we limit the number of replicas  as R num = 10 for all experiments along with fixed maximum samples of 10 000 samples. We use a 15% burn-in which discards the portion of initial samples. This is a standard practice required for convergence which shows that the sampling discards the invariant and only considers the joint posterior distribution. The performance quality of the SAPT-Bayeslands and PT-Bayeslands framework is evaluated in terms of total simulation time, and root-mean-squared-error (RMSE) of the predicted elevation and erosion-deposition in the topography.

Surrogate accuracy
In order to implement the surrogate model, we needs to evaluate the training algorithm such as Adam and stochastic gradient descent (SGD). Furthermore, we also evaluate certain parameters such as the size of the surrogate interval (batch-ratio), the neural network topology for the surrogate and the effectiveness of either training from scratch or to utilize previous knowledge for surrogate training (transfer and train). We create a training dataset from the cases where the true likelihood was used which compromises the history of the set of parameters proposed with the corresponding likelihood. This is done for standalone evaluation of the surrogate model which further ensures that the experiments are reproducible since different experimental runs will create different dataset depending on the exploration during sampling. Hence, we create a benchmark data set from history of samples proposed with their likelihood 3 . We then evaluate the neural network model designated for the surrogate using two major training algorithms which featured the Adam optimizer and stochastic gradient descent. The parameters that define the neural network surrogate model used for the experiments are given in Table 5. Note that the train size in Table 5 refers to the maximum size of the data set. The training is done in batches where the batch ratio determines the training data set size as shown in Table 6. Table 6 presents the results for the experiments that took account of the training data collected during sampling for two benchmark problems (Continental-Margin and Synthetic-Mountain). The MSE indicates the performance of the surrogates after the likelihood values (outcomes) in the dataset are normalized between [0,1]. Note that, we report the mean value of the mean-squared-error (MSE) for the given batch ratio from ten experiments. The batch ratio is taken in relation to the maximum number of samples across the chains (R max /R num ). Although in most cases, the accuracy of the neural network is slightly better when training from scratch with combined data, howsoever, there is a huge trade off with the time required to train the network. The results show that the transfer and train methodology in general requires much lower computational time when compared to training from scratch by combined data. Moreover, in comparison of SGD and Adam training algorithms, we observe that SGD achieves slightly better accuracy than Adam for Continental-Margin problem. However, Adam, having adaptive learning rate, outperforms SGD in terms of the time required to train the network. Thus, it can be summarized that transfer and train method is better since it saves significant computation time with a minor trade-off with accuracy.

Surrogate-assisted parallel tempering Bayeslands
In the experiments, we investigated the effects of the surrogate probability (s-prob) and surrogate interval (batch-ratio) on the the accuracy and time duration of the experiments. The accuracy of the prediction is evaluated by the mean square error (RMSE) of the predicted topography with the synthetic real topography, where elevation and erosion-deposition are reported. Note that the mean and standard deviation ( mean and std) of the accepted values of accuracy of prediction over the sampling is reported. The time is measured in seconds. Table 7 and 8 shows the performance of the respective methods (PT-Bayeslands and SAPT-Bayeslands) with respective parameter settings for the Continental-Margin and Synthetic-Mountain problem. We observe that for the results regarding SAPT-Bayeslands, there not a significant difference in accuracy of elevation or erosion-deposition prediction given different values of surrogate probability. Howsoever, there is a significant difference in terms of the computational time saved. It is evident that greater surrogate probability gives more usage of surrogates through which more computational time is saved. Furthermore, we notice that there is not a significant difference in accuracy of prediction or computational time given difference values of the batch-ratio. Figure 5 and 6 provides a visualization in the elevation prediction accuracy when compared to actual ground-truth between the two methods. Note that the prediction of erosion-deposition for 10 chosen points taken at selected locations shown in Table 4 is also given. Although both methods provide erosion-deposition prediction for 4 successive time intervals, we only show the final time interval due to lack of space for the respective problems. We notice that although the prediction accuracy is lower by SAPT-Bayeslands, the visualization shows that the mean prediction of the topography is close to ground-truth which is well covered by the credible interval. Figure 8 and Figure 9 show the the true likelihood and prediction by the surrogate for the Continental-Margin and Synthetic-Mountain problems, respectively. We notice that at certain intervals given in Figure 8, given by different replica, there is inconsistency in the predictions. Moreover, Figure 9 shows that the log-likelihood is very chaotic and hence there is difficulty in providing robust prediction at certain points in time given by samples for the respective replica. Table 9 gives the results for Tasmania which is a bigger and computationally expensive problem. We select a good combination of the set of parameters evaluated in the previous experiments (s-prob = 0.5 and batch-ratio is 0.15). We used maximum of 10 000 samples with 10 replicas. We by notice that the performance of SAPT-Bayeslands is similar to PT-Bayeslands as shown in Figure 7 while 41.27 percentage of time is saved.

Discussion
We observe that the surrogate probability is directly related to the computational performance; this is obvious since computational time depends on how often the surrogate is used. Our concern is about the prediction performance especially while increasing the use of the surrogate as it could lower the accuracy which can results in poor estimation of the parameters. According to the results, the accuracy is well retained we give higher probability to the use of surrogates. In general, SAPT-Bayeslands achieves a lower prediction accuracy when compared to PT-Bayeslands. However, given the cross-section visualization, we find that the accuracy given in prediction by the surrogate based framework is not so poor. Moreover, application to a more computationally intensive problem (Tasmania) shows that a significant reduction in computational time is achieved.
The initial evaluation for the setup surrogate model shows that its is best to use a transfer learning approach where the knowledge from the past surrogate interval is utilized and refined with new surrogate data. This consumes much less time than accumulating data and training the surrogate from scratch at every surrogate interval. We note that in cases when the surrogate model is used, there is no prediction given by the model. Hence, the predictions (elevation and erosion-deposition) during sampling are gathered only from the true Badlands model evaluation rather than the surrogate. In this way, one could argue that the surrogate model is not mimicking the true model; however, we are guiding the sampling algorithm towards forming better proposals without evaluation of the true model. A direction forward is in incorporating other forms of surrogates which could be in terms of running low resolution Badlands model as the surrogate which would be computationally faster in evaluating the proposals. Furthermore, computationally efficient implementations of landscape evolution models that only feature landscape evolution [53] could be used as the surrogate while Badlands that features both landscape evolution and erosion-deposition formation could be used as the true model. Computationally efficient implementations of landscape evolution models that consider parallel processing [54] could also be used in the Bayeslands framework. In this case, the challenge would be in allocating special processing cores for Badlands and others for parallel tempering.
The surrogate framework was adapted from [30] with major difference of featuring gradient-based proposals. Gradientbased learning or parameter estimation has been very popular in machine learning due to availability of gradient information. Due to the complexity in geological or geophysical numerical forward models, it is difficult to obtain gradients which has been the case of Badlands landscape evolution model. We use random-walk proposals which is a canonical sampling approach with a number of limitations. Hence, we need to incorporate advanced meta-heuristic techniques to form non-gradient based proposals for efficient search. Our study is limited to a fairly small seat of free parameters and a major challenge would be to develop surrogate models with an increased set of parameters.

Conclusions
We presented a novel application of surrogate-assisted parallel tempering that features parallel computing for landscape evolution models using Badlands. Initially, we experimented with two different approaches for training the surrogate model where we found that transfer learning based approach is beneficial and could help reduce computational time of the surrogate. Using this approach, we present the experiments that featured evaluating certain key parameters of the surrogate-based framework. In general, we observe that the proposed framework lowers computational time significantly, while maintaining the required quality in parameter estimation and uncertainty quantification.
In future work, we envision to apply the proposed framework to more complex applications such as evolution of continental-    scale landscapes and basins over millions of years. The approach could be used for other forward models such as those that feature geological reef development or lithospheric deformation. Furthermore, the posterior distribution of our parameters require multi-modal sampling methods; hence a combination of meta-heuristics for proposals with surrogate assisted parallel tempering could improve exploration features and also help in lowering the computational costs.