the Creative Commons Attribution 4.0 License.
DecTree v1.0 – chemistry speedup in reactive transport simulations: purely data-driven and physics-based surrogates
Michael Kühn
Download
- Final revised paper (published on 29 Jul 2021)
- Preprint (discussion started on 02 Mar 2021)
Interactive discussion
Status: closed
- RC1: 'Comment on gmd-2020-445', Anonymous Referee #1, 08 Mar 2021
This is a well-conceived paper on creating surrogate regressor models to speed up the expensive physics solvers found in geochemical models. The authors present clever feature engineering techniques based on geochemical knowledge and understand their modeling system well by employing a Tweedie distribution to scale their training data. The use of a mass-balance metric to call the original solver once the machine learning (ML) model starts to drift is interesting. They present a balanced view on using machine learning and combining it with physics-informed domain knowledge. The paper fits into the scope of GMD.
Overall, the manuscript needs only very minor revision, but better justifications for the ML models must be made, as they currently remain overly simplistic and potentially misleading.
Suggestions:
~Line 60: “For this reason, we argue that the most sensible choice for a surrogate modelling framework is that of multiple multivariate regression: one multivariate regressor - making use of many or all inputs as predictors - is trained independently for each distinct output variable, while the choice of regressor may vary from variable to variable.”
So, each output variable has its own ML model? This seems like it is not well explained, and a reader might believe this to be more computationally expensive. A better framing is that it is difficult to get all output variables to perform well with a single ML model. A citation that shows this here:
Kelp, M. M., Jacob, D. J., Kutz, J. N., Marshall, J. D., & Tessum, C. W. (2020). Toward stable, general machine-learned models of the atmospheric chemical system. Journal of Geophysical Research: Atmospheres, 125, e2020JD032759. https://doi.org/10.1029/2020JD032759
where you can optimize a specific output variable’s performance above others (thereby retaining the same input/output dimensionality), or just simply make new ML models for each output. This shows that one ML model is not effective in capturing all variables and that multiple specialized models must be made.
~Line 66: “In praxis, the CPU-time, the user interactions and the overall skills required for optimally training complex regressors cannot be underestimated and may prove overwhelming for a geoscientist. The whole process of hyperparameter tuning, required by most advanced machine learning algorithms, while being an active area of development and research, is still hardly fully automatable.”
I do not like this reasoning too much, yes neural networks (NNs) take longer and are more difficult to train, but they evaluate faster than regression trees and can take better advantage of GPUs for an even greater speed gain.
Furthermore, grid search, Bayesian hyperparameter tuning, genetic neural algorithms are all automatable hyperparameter tuning methods that would be simple to implement with your 1-D, low-dimensional system (though I am not suggesting you do that here).
~Line 85: “Since chemistry is inherently an embarassing parallel task, the speedup achieved on a single CPU as in this work will transfer - given large enough simulation grids making up for the overhead - on parallel computations.”
Not sure what this means. Gas-phase chemistry is already parallelizable in air quality applications of atmospheric transport.
Paragraph starting at Line 210: “The choice of the regressor for each output is actually arbitrary, and nothing forbids to have different regressors for each variables, or even different regressors in different regions of parameter space of each variable. Without going into details on all kinds of algorithms that we tested, we found that decision-tree based methods such as Random Forest and their recent gradient boosting evolutions appear the most flexible and successful for our purposes. Their edge can in our opinion be resumed by: (1) implicit feature selection by construction, meaning that the algorithm automatically recognizes which input variables are most important for the estimation of the output; (2) no need for standardisation of both inputs and outputs; (3) ability to deal with any probability distributions; (4) fairly quick to train with sensible hyperparameter defaults; (5) extremely efficient implementations available.”
I do not like the reasoning of this paragraph from an ML perspective. In terms of your reasoning:
1) The feature selection automation from RFs is not accurate if your inputs are colinear in any way, and I do not think your 7 features are completely independent of one another, 2) your next paragraph makes sense about this though you contradict it immediately in the following paragraph when writing about scaling outputs. I would not say that this is too much of an advantage, it just removes an annoyance, 3) most other ML algorithms other than Gaussian processes can deal with any type of distribution as they are (for the most part) nonparametric learners, 4) they are quicker to train than NNs but also slower to evaluate than a NN, 5) perhaps, but I would not say that a NN is much harder when using Python or R.
The main edge in terms of using an RF rather than a NN as a surrogate is that an RF will always fall back on data it has seen and predict a mean state when it does not know what to output, whereas a NN is a regressor that will extrapolate outside of its training domain and accumulate errors much faster than an RF. Thus, RF (or xgboost) is typically more robust than a NN.
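The clamping behaviour the reviewer describes can be made concrete with a minimal, purely illustrative sketch (hypothetical data, not from the paper): a depth-1 regression tree predicts leaf means, so its output is bounded by the training targets, while a fitted line extrapolates without bound.

```python
# Illustrative sketch (hypothetical data): tree-based regressors predict
# leaf means and therefore clamp to the training-target range, while a
# least-squares line extrapolates freely outside the training domain.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def fit_stump(xs, ys, split):
    """Depth-1 regression tree: mean target on each side of `split`."""
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    return sum(left) / len(left), sum(right) / len(right)

xs, ys = [0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 4.0, 6.0, 8.0]
a, b = fit_line(xs, ys)
lo, hi = fit_stump(xs, ys, split=2.0)

x_new = 10.0                           # far outside the training domain
line_pred = a * x_new + b              # extrapolates: 20.0
tree_pred = hi if x_new > 2.0 else lo  # clamps to a leaf mean: 7.0
```

The tree returns the mean of the data it has seen (here 7.0, the mean of the right leaf), whereas the line continues its trend and accumulates error the further it extrapolates.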
Figure 2, 12, 13: A little confused here, what are the axes labels?
~Line 278: Can xgboost use a GPU? I know that RFs cannot really use them but perhaps a factor of 10x gain may be had by using a GPU in the future.
Figure 14: Why do you not call the PHREEQC solver here? Did it never fail the mass balance criteria or is that not implemented here. Wondering what would happen if you routinely called the PHREEQC solver (e.g. at every 10% of simulation progress time), and ran it for several iterations if the physics solver can be used to dampen error?
General clarifications:
For your simulations, you call the full physics solver when the output variables fail the mass balance screening. Is it the case that once the surrogate fails the mass balance test the solver is just continually called or is there a dampening of error from using the full solver? Do you expect this relationship to change with a higher-dimensional system or higher grid resolution (2D, 3D model)?
I thought the mass balance screening + feature engineering justification + Figure 11 reasoning was excellent. Well done!
Citation: https://doi.org/10.5194/gmd-2020-445-RC1
- AC3: 'Reply on RC1', Marco De Lucia, 26 May 2021
We thank the anonymous reviewer for their comments and suggestions.
Comment: Line 60: "For this reason, we argue that the most sensible choice for a surrogate modelling framework is that of multiple multivariate regression: one multivariate regressor - making use of many or all inputs as predictors - is trained independently for each distinct output variable, while the choice of regressor may vary from variable to variable."
So, each output variable has its own ML model? This seems like it is not well explained, and a reader might believe this to be more computationally expensive. A better framing is that it is difficult to get all output variables to perform well with a single ML model. A citation that shows this here (https://doi.org/10.1029/2020JD032759) where you can optimize a specific output variable's performance above others (thereby retaining the same input/output dimensionality), or just simply make new ML models for each output. This shows that one ML model is not effective in capturing all variables and that multiple specialized models must be made.
Response: We agree with this suggestion and rephrased the whole paragraph, including the suggested citation, in order to clarify our choice in pursuing multiple multivariate regression.
The new text reads:
"With algorithms such as Artificial Neural Networks (ANN) it is possible to train one single network and hence in practice one single surrogate model for all output variables at once. While ANN in particular usually require long CPU times for training and quite large training datasets, they offer large speedups when used for predictions (Jatnieks et al., 2016; Prasianakis et al., 2020) and furthermore they can efficiently leverage GPUs (Graphic Processing Units) for even larger acceleration. It is however difficult to achieve the required accuracy simultaneously for all output variables (Kelp et al., 2020). For this reason, we focus on a more flexible approach, multiple multivariate regression: one distinct multivariate regressor - i.e., making use of many or all inputs as predictors - is trained independently for each distinct output variable. This approach allows using different specialized models from variable to variable, including different regression methods altogether, preprocessing and hyperparameter tuning, while not necessarily requiring larger computing resources."
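The "multiple multivariate regression" setup described in the new text can be sketched as follows. This is a minimal illustration, not the paper's implementation: the output names, the toy data, and the trivial mean-regressor are all placeholders for the actual DecTree machinery.

```python
# Sketch of multiple multivariate regression: all outputs share the same
# input matrix X, but each output variable gets its own independently
# trained model (placeholder MeanRegressor; any regressor could be used).

class MeanRegressor:
    """Placeholder model: predicts the training mean of its one output."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)

# Hypothetical training data: two input features per sample.
X = [[1.0, 0.1], [2.0, 0.2], [3.0, 0.3]]
outputs = {"pH": [7.0, 7.2, 7.4], "Calcite": [0.1, 0.2, 0.3]}

# One specialized surrogate per output variable.
surrogates = {name: MeanRegressor().fit(X, y) for name, y in outputs.items()}
preds = {name: model.predict(X) for name, model in surrogates.items()}
```

Because each output has its own model, the regression method, preprocessing, and hyperparameters can differ per variable without affecting the others.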
Comment: Line 66: "In praxis, the CPU-time, the user interactions and the overall skills required for optimally training complex regressors cannot be underestimated and may prove overwhelming for a geoscientist. The whole process of hyperparameter tuning, required by most advanced machine learning algorithms, while being an active area of development and research, is still hardly fully automatable."
I do not like this reasoning too much, yes neural networks (NNs) take longer and are more difficult to train, but they evaluate faster than regression trees and can take better advantage of GPUs for an even greater speed gain. Furthermore, grid search, Bayesian hyperparameter tuning, genetic neural algorithms are all automatable hyperparameter tuning methods that would be simple to implement with your 1-D, low-dimensional system (though I am not suggesting you do that here).
Response: It is indeed our opinion that the geoscientific modeller, given the currently required curricular skills, can be overwhelmed when confronted with complex tuning of AI models. This observation was directed at the necessity of increasing the weight of data science/AI in this domain rather than at the lack of fully automatic parameter tuning infrastructures/methods. In any case, we reformulated this paragraph and mentioned the popular frameworks for hyperparameter tuning in the discussion.
Comment: Line 85: "Since chemistry is inherently an embarassing parallel task, the speedup achieved on a single CPU as in this work will transfer - given large enough simulation grids making up for the overhead - on parallel computations." Not sure what this means. Gas-phase chemistry is already parallelizable in air quality applications of atmospheric transport.
Response: We mean here that, in an operator-splitting approach, at each coupling time step one single, independent simulation of chemistry is computed for each grid element. In this sense, this is a case of "embarrassing parallelism". Rephrased accordingly.
Comment: Paragraph starting at Line 210: "The choice of the regressor for each output is actually arbitrary, and nothing forbids to have different regressors for each variables, or even different regressors in different regions of parameter space of each variable. Without going into details on all kinds of algorithms that we tested, we found that decision-tree based methods such as Random Forest and their recent gradient boosting evolutions appear the most flexible and successful for our purposes. Their edge can in our opinion be resumed by: (1) implicit feature selection by construction, meaning that the algorithm automatically recognizes which input variables are most important for the estimation of the output; (2) no need for standardisation of both inputs and outputs; (3) ability to deal with any probability distributions; (4) fairly quick to train with sensible hyperparameter defaults; (5) extremely efficient implementations available." I do not like the reasoning of this paragraph from an ML perspective. In terms of your reasoning:
1. The feature selection automation from RFs is not accurate if your inputs are colinear in any way, and I do not think your 7 features are completely independent of one another,
2. your next paragraph makes sense about this though you contradict it immediately in the following paragraph when writing about scaling outputs. I would not say that this is too much of an advantage, it just removes an annoyance,
3. most other ML algorithms other than Gaussian processes can deal with any type of distribution as they are (for the most part) nonparametric learners,
4. they are quicker to train than NNs but also slower to evaluate than a NN,
5. perhaps, but I would not say that a NN is much harder when using Python or R.
The main edge in terms of using an RF rather than a NN as a surrogate is that an RF will always fall back on data it has seen and predict a mean state when it does not know what to output, whereas a NN is a regressor that will extrapolate outside of its training domain and accumulate errors much faster than an RF. Thus, RF (or xgboost) is typically more robust than a NN.
Response: We agree with this comment. In our mind we did not actually intend to make a comparison between decision-tree methods and Neural Networks, however these remarks were incorporated in the manuscript since they are valuable to understand our reasoning. In detail:
1. Chemistry is indeed a highly non-linear process, and this is reflected in weak collinearity between the features. They are not independent, but the dependency is non-linear, as one can see in the exemplary choice of "engineered features" used in the physics-informed decision tree approach.
2. We noted this fact about scaling of labels because it contradicts the documentation and the theory of decision-tree-based regressors such as xgboost.
3. We agree with this remark and removed this point.
4. Agreed and reformulated accordingly.
5. Removed. Added the robustness property here.
Comment: Figure 2, 12, 13: A little confused here, what are the axes labels?
Response: We added axes and labels in figure 2; the other figures displaying variable profiles across the 1D simulation domain are left unchanged, since they act as graphical comparisons and the axes are always the same.
Comment: Line 278: Can xgboost use a GPU? I know that RFs cannot really use them but perhaps a factor of 10x gain may be had by using a GPU in the future.
Response: Yes, a GPU backend is available for xgboost, however we did not try it. This has been now noted in the manuscript.
Comment: Figure 14: Why do you not call the PHREEQC solver here? Did it never fail the mass balance criteria or is that not implemented here. Wondering what would happen if you routinely called the PHREEQC solver (e.g. at every 10% of simulation progress time), and ran it for several iterations if the physics solver can be used to dampen error?
Response: In the "physics informed approach" we cast the geochemical problem in such a way that the mass balance is honored by construction, since the regressed variables are the $\Delta$ calcite and $\Delta$ dolomite, and the changes in aqueous concentrations depend linearly on them (Cl and pH are treated separately). This is in contrast to the purely data-driven approach, in which we basically assume no knowledge of the collinearity between concentration changes and mineral deltas. We have now added some clarifications in the description of the physics-based approach.
The second part of this question is addressed together with the next one. Briefly, calling PHREEQC does not dampen errors if the error or "unphysicality" is already present in the features at the beginning of the iterations.
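The "mass balance by construction" idea can be sketched as follows: with the two mineral deltas as the only regressed outputs, the aqueous element changes follow linearly from the stoichiometry of calcite (CaCO3) and dolomite (CaMg(CO3)2). This is an illustrative reading of the response, not the paper's code; the sign convention (positive delta = precipitation, removing mass from solution) is our assumption.

```python
# Hedged sketch: aqueous element changes implied linearly by the two
# regressed mineral deltas. Assumed sign convention: positive mineral
# delta = precipitation (mass leaves the solution).

def aqueous_deltas(d_calcite, d_dolomite):
    """Element changes in solution [mol] from the mineral deltas [mol]."""
    return {
        "Ca": -(d_calcite + d_dolomite),      # one Ca per formula unit each
        "Mg": -d_dolomite,                    # only dolomite carries Mg
        "C":  -(d_calcite + 2 * d_dolomite),  # CO3 groups per formula unit
    }

# Calcite precipitates while dolomite dissolves (hypothetical values):
d = aqueous_deltas(d_calcite=0.01, d_dolomite=-0.005)
```

However the regressor errs on the two deltas, the aqueous concentrations computed this way always balance exactly against the mineral changes, which is why no separate mass-balance screening is needed in this approach.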
Comment: General clarifications: For your simulations, you call the full physics solver when the output variables fail the mass balance screening. Is it the case that once the surrogate fails the mass balance test the solver is just continually called or is there a dampening of error from using the full solver?
Response: At each iteration the full physics solver is called only for the grid elements where the surrogates fail the mass balance (this is valid only for the "purely data-driven" approach). The process repeats in full at the following iteration, regardless of whether a surrogate or a full physics simulation had been retained in the previous iteration. Note (also with respect to the reviewer's previous remark) that calling the full physics simulator resets the mass balance error "locally" for the given grid elements and time step. However, these calls do not ensure that the corresponding results are brought back to adhere perfectly to the reference or "true" state; we are only guaranteed that the new solution is completely physical, given the inputs at the start of the iteration. It may be that accumulated inaccuracies in specific grid elements have already produced diverging trajectories for the chemical process. We do not have a way, at runtime, to check or evaluate this, and we must rely on the assumption that the previous iterations did not introduce significant inaccuracies. As figure 3 displays, inaccuracies propagate in time - and hence in space - through the coupled simulations even when many full physics geochemical simulations are called. We added a paragraph explaining this reasoning to section 2.4.
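The per-element fallback described in this response can be sketched as below. The helper names (`surrogate`, `full_solver`, `mass_balance_error`) are hypothetical stand-ins for the actual DecTree/PHREEQC machinery, and the toy functions in the usage example are purely illustrative.

```python
# Sketch of one coupling iteration with per-element fallback: the surrogate
# is tried everywhere, and the full solver is called only for grid elements
# whose surrogate result fails the mass-balance screening.

TOL = 1e-6  # illustrative mass-balance tolerance

def chemistry_step(cells, surrogate, full_solver, mass_balance_error):
    """Advance chemistry for all cells; fall back to the solver as needed."""
    results = []
    for state in cells:
        candidate = surrogate(state)
        if mass_balance_error(state, candidate) > TOL:
            candidate = full_solver(state)  # resets the local balance error
        results.append(candidate)
    # The next iteration starts fresh from these results, regardless of
    # whether a surrogate or a full-physics result was retained.
    return results

# Toy usage: the surrogate is exact for small states but drifts for large
# ones; the "full solver" halves the state; the error metric compares both.
out = chemistry_step(
    cells=[0.5, 2.0],
    surrogate=lambda s: s * 0.5 if s < 1.0 else s * 0.4,
    full_solver=lambda s: s * 0.5,
    mass_balance_error=lambda s, c: abs(s * 0.5 - c))
```

In the toy run the first cell keeps its surrogate result, while the second cell triggers the fallback and receives the full-solver result instead.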
Comment: Do you expect this relationship to change with a higher-dimensional system or higher grid resolution (2D, 3D model)?
Response: We argue that the dimensionality (2D or 3D) of the hydrodynamic problem does not have much impact on this fact, if all other hypotheses are equal to our present 1D example. In such cases one will however also have much larger datasets to train the surrogate in the first place. These concepts are now mentioned in manuscript's discussion.
Of course things will change when considering variable local velocities, variable permeability and porosity and other hydrodynamic processes such as diffusion, or when time step must be treated as independent variable. There are possible workarounds there, some of which we briefly addressed in the discussion, but we believe this is outside the scope of the present paper.
Comment: I thought the mass balance screening + feature engineering justification + Figure 11 reasoning was excellent. Well done!
Response: Thank you.
Citation: https://doi.org/10.5194/gmd-2020-445-AC3
- RC2: 'Comment on gmd-2020-445', Glenn Hammond, 31 Mar 2021
“DecTree v1.0 - Chemistry speedup in reactive transport…” demonstrates the performance of two machine learning algorithms relative to PHREEQC (accessed through R-PHREEQC) for reactive transport motivated by Engesgaard and Kipp’s Test Case B from 1992. The algorithms include: (1) a fully data-driven black-box approach that trains multivariate regressors with PHREEQC and (2) a physics-based surrogate model that predicts future states based on nonlinear extrapolation of the current state through a trained decision tree. The latter is termed “feature engineering”. Both algorithms reproduce the PHREEQC result rather well. Algorithm #1 provides speedups up to 4x on the finest grid with error <~1% while #2 provides up to ~7x speedup with 0.3% error.
This model paper has technical merit and details new science, the feature engineering in algorithm #2. The authors acknowledge the simplifying assumptions in the conceptual model. For instance,
Line 445 - “The simplifications concerning the transport and the coupling itself (stationary flow; pure advection with dispersive full explicit forward Euler scheme; no feedback of chemistry on porosity and permeability; initially homogeneous medium; kinetic rate not depending on reactive surfaces) are obviously severe, but most of them should only marginally affect the validity…”
I believe that increasing complexity will more than “marginally affect” the results. The addition of high-ionic strength solution, aqueous speciation with a large pH swing, complex rate expressions for mineral precipitation-dissolution (transition state theory with temperature-dependent rate constants, rate limiters, prefactors, etc.) significantly increases nonlinearity and may degrade ML algorithm performance. It is unclear how these processes can be adequately simplified (i.e. for the physics-based surrogate approach). Perhaps this test problem would be marginally impacted, but for real-world reactive transport in heterogeneous porous media, the robustness of presented ML algorithms is not clear to me. But that is future research and does not diminish the results presented here.
Specific Comments:
In the context of parallel computing, how is training implemented across processes (e.g. in the case of distributed-memory computing)?
Line 21 – “Chemistry usually represents the bottleneck for coupled simulations with up to 90% of computational time”. Bear in mind that if only 90% of the run time is chemistry, infinite speedup of chemistry by ML will only provide 10x speedup overall.
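The reviewer's point is Amdahl's law: if a fraction f of the runtime is chemistry, accelerating only chemistry bounds the achievable overall speedup at 1/(1-f). A one-line sketch:

```python
# Amdahl's law: overall speedup when a fraction f of the runtime is
# accelerated by factor s. With f = 0.9 the ceiling is 10x, and a
# realistic 10x chemistry speedup yields only about 5.3x overall.

def overall_speedup(f, s):
    """Total speedup with fraction f sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

limit = 1.0 / (1.0 - 0.9)                # s -> infinity: 10x at most
ten_x_chem = overall_speedup(0.9, 10.0)  # finite 10x chemistry speedup
```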
Line 26 – “parallelization alone still is not sufficient to ensure numerical convergence of the simulations”. What is meant by the term “numerical convergence”? Is this convergence to an accurate solution? Grid or conceptual model refinement can improve accuracy, but HPC enables the solution of more unknowns in less time. If the convergence is nonlinear convergence, HPC seldom improves solver convergence and usually degrades it (hopefully slightly).
Line 49 – “Any regression algorithm can be employed to replace the “full physics” equation-based geochemical model.” If this is true, does it need to be stated?
Line 56 – beforehands -> beforehand
Line 107 – Please clarify what is meant by “Charge imbalance and redox potential (pe) can be safely disregarded for this redox-insensitive model…”
Line 123 – How is the term “state variables” an abuse of language?
Line 162 – A figure showing the inhomogeneous distribution of parameter space for a simple, real problem scenario would be a nice addition to the paper. Would Figure 5 work?
Line 203 – There is much discussion early in the manuscript regarding the choice of forward Euler and CFL=1 to avoid numerical dispersion, but here the authors are attributing the difference in solution between the various grid resolutions to “numerical dispersion”. Would it be better to attribute the error to time truncation error? No mass should be diffusing between cells.
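The effect under discussion can be seen in a minimal sketch (illustrative only, not the paper's transport code): an explicit upwind step with Courant number nu = u*dt/dx translates a sharp front exactly at nu = 1, while the identical scheme smears it for nu < 1.

```python
# Minimal 1D explicit upwind advection step. At Courant number nu = 1 the
# scheme is exact (pure translation); at nu < 1 the same scheme introduces
# numerical dispersion that smears sharp fronts.

def upwind_step(c, nu):
    """One forward-Euler upwind step (inflow boundary keeps c[0])."""
    return [c[0]] + [c[i] - nu * (c[i] - c[i - 1]) for i in range(1, len(c))]

front = [1.0, 1.0, 0.0, 0.0, 0.0]     # sharp concentration front
exact = upwind_step(front, nu=1.0)    # front advances one cell, unsmeared
smeared = upwind_step(front, nu=0.5)  # front is partially smeared
```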
Line 280 – Please provide the basis for projecting speedups of 25-50. I believe that it would be better to demonstrate the speedups for 10^5-10^6 elements or with more complex chemistry, rather than to state it. This is especially the case with increasingly complex chemistry as the ML will be more complicated.
Line 310 – “…is the first, natural feature engineering tentative.” Is a bit confusing to me. “tentative” is an adjective. Could it state, “is the first feature to be engineered”?
Line 312 – “…, but also not relying on.” Reword?
Line 339 – in facts to -> in fact to
Line 379 – If a different “learning” is required for each grid, please elaborate regarding what “learning” would be required for an unstructured or non-uniform grid? Would each grid cell have its own learning? Such a discussion would make the paper more informative for real-world applications.
Line 430 – How would the authors handle variable time step size?
Figure 12 (~Line 439) – Moving the upper (Reference/Surrogate) legend to the middle of the right side would improve the figure.
Line 441 – with the own -> with their own
Figure 13 (Line ~442) – What causes the oscillatory behavior for pH at late times for Grid 200?
Line 459 – Please explain how the simplifications only marginally affect the validity of the benchmarks concerning achievable speedup. From previous discussion, it is not clear that the algorithms employed in the benchmarks can handle increased complexity (non-uniform flow velocity, non-uniform grids, non-uniform time stepping, heterogeneity, …). If that is the case, the simplifications invalidate the algorithm for complex problems.
Citation: https://doi.org/10.5194/gmd-2020-445-RC2
- AC1: 'Reply on RC2', Marco De Lucia, 26 May 2021
Many thanks to Dr. Hammond for his valuable and constructive comments.
Comment: Line 445 - "The simplifications concerning the transport and the coupling itself (stationary flow; pure advection with dispersive full explicit forward Euler scheme; no feedback of chemistry on porosity and permeability; initially homogeneous medium; kinetic rate not depending on reactive surfaces) are obviously severe, but most of them
should only marginally affect the validity..." I believe that increasing complexity will more than "marginally affect" the results. The addition of high-ionic strength solution, aqueous speciation with a large pH swing, complex rate expressions for mineral precipitation-dissolution (transition state theory with temperature-dependent rate constants, rate limiters, prefactors, etc.) significantly increases nonlinearity and may degrade ML algorithm performance. It is unclear how these processes can be adequately simplified (i.e. for the physics-based surrogate approach). Perhaps this
test problem would be marginally impacted, but for real-world reactive transport in heterogeneous porous media, the robustness of presented ML algorithms is not clear to me. But that is future research and does not diminish the results presented here.
Response: We agree on the poor formulation of this thesis. The corresponding discussion has been thoroughly rephrased.
Comment: In the context of parallel computing, how is training implemented across processes (e.g. in the case of distributed-memory computing)?
Response: We did not address nor face this problem for this paper. We assumed that all the results from the "training simulations" are available to the ML algorithm for training and test from the beginning; this is implicit when we write "pre-calculated training dataset". Once the whole dataset is available, it depends on the regressor and its implementation whether the training can be run in parallel, and on which hardware. In the provided examples and datasets we always used only one CPU, except for the training of xgboost, which used 4 threads on a shared-memory machine, as reported in the manuscript.
This question is of course appropriate when considering a parallel reactive transport architecture in which the training of the ML model begins at runtime together with the coupled reactive transport simulation. However, it is not really possible to comprehensively answer this question, because the answer largely depends on the actual method used for regression, specifically on whether it supports "incremental training" or not, and on which architecture it can be parallelized (GPU, shared memory, HPC cluster/MPI parallelization).
Specifically for our physics-informed surrogate approach, at the moment we did not implement any "forests of trees" based on random subsets of the original training dataset. This would be an obvious way to achieve parallelization for both the learning and the prediction phase.
Comment: Line 21 "Chemistry usually represents the bottleneck for coupled simulations with up to 90% of computational time". Bear in mind that if only 90% of the run time if chemistry, infinite speedup of chemistry by ML will only provide 10x speedup overall.
Response: We were imprecise in this sentence. Our own previous work and several cited authors report runtimes of coupled reactive transport simulations where geochemistry routinely takes up around 90 % of the total CPU time with moderately complex chemistry, but it can be well over 99 %. We rephrased accordingly.
Comment: Line 26 -- "parallelization alone still is not sufficient to ensure numerical convergence of the simulations". What is meant by the term "numerical convergence"? Is this convergence to an accurate solution? Grid or conceptual model refinement can improve accuracy, but HPC enables the solution of more unknowns in less time. If the convergence is nonlinear convergence, HPC seldom improves solver convergence and usually degrades it (hopefully slightly).
Response: Agreed, we were again imprecise in this formulation. We meant - and now also write: "However, the problem of difficult numerical convergence for the geochemical sub-process, routinely encountered by many practitioners, is not solved by parallelisation."
Comment: Line 49 -- "Any regression algorithm can be employed to replace the "full physics" equation-based geochemical model." If this is true, does it need to be stated?
Response: No, we removed that phrase.
Comment: Line 107 -- Please clarify what is meant by "Charge imbalance and redox potential (pe) can be safely disregarded for this redox-insensitive model..."
Response: We clarified by reformulating the paragraph, which now reads: "The implemented advection relies on transport of total elemental concentrations instead of the actual dissolved species, an allowable simplification since all solutes are subjected to the same advection equation (Parkhurst et al., 2015). Total dissolved O, H and solution charge should be included among the state variables and thus transported, but since this problem is redox-insensitive, we can disregard charge imbalance and only transport pH instead of H and O, disregarding changes in water mass. pH is defined in terms of the activity of protons,

pH = -log10 [H+],

and is hence not additive. If we further assume that the activity coefficient of protons stays constant throughout the simulation, the activity [H+] can be actually transported. The resulting simplified advective model shows absolutely insignificant errors when compared to the same problem simulated, e.g., with PHREEQC's ADVECTION keyword (not shown)."
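The quoted simplification can be sketched numerically (an illustration of the idea, not the paper's code): because pH is logarithmic and hence not additive, the advection is applied to the proton activity [H+] = 10^(-pH), and the result is converted back to pH afterwards.

```python
# Sketch of the pH handling quoted above: transport the proton activity
# [H+] = 10**(-pH), which is additive, instead of pH itself. Assumes a
# constant activity coefficient, as stated in the response.
import math

def advect_ph(ph_profile, nu=1.0):
    """Advect a 1D pH profile by one upwind step in activity space."""
    h = [10.0 ** (-ph) for ph in ph_profile]               # pH -> activity
    h_new = [h[0]] + [h[i] - nu * (h[i] - h[i - 1])
                      for i in range(1, len(h))]           # upwind advection
    return [-math.log10(a) for a in h_new]                 # activity -> pH

# With nu = 1, an acid front (pH 4 into pH 7 water) moves exactly one cell:
profile = advect_ph([4.0, 7.0, 7.0], nu=1.0)
```

Averaging pH values directly would be wrong by orders of magnitude in concentration; working in activity space keeps the mixing linear.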
Comment: Line 203 -- There is much discussion early in the manuscript regarding the choice of forward Euler and CFL=1 to avoid numerical dispersion, but here the authors are attributing the difference in solution between the various grid resolutions to "numerical dispersion". Would it be better to attribute the error to time truncation error? No mass should be diffusing between cells.
Response: To generate the reference simulations and thus the training data for this data-driven approach we used different grid refinements but fixed time steps across all of them. This means that in the coarser grids we generate numerical dispersion, which in turn makes the simulations actually different hydrodynamical problems rather than
different refinements of the same hydrodynamical problem. Since, however, chemistry is exactly the same across the three grids, this choice allows us to:
1. eliminate the influence of the time step;
2. increase the information content in the training dataset;
3. spread the "perturbations" induced by transport (which is now not pure advection but advection + numerical dispersion) to a given partial geochemical result.
If we had run all three reference simulations with their own ν=1 time step, we would have obtained completely equal trajectories for the chemistry of each grid element, just sampled at different times. Introducing numerical dispersion in two of them also introduces non-linear changes in aqueous concentrations, thus "spreading" the actual sampling of chemistry. Hence, using the three reference simulations as training data "covers more ground" in the parameter space expected for chemistry, and adds information content to the dataset.
Comment: Line 280 -- Please provide the basis for projecting speedups of 25-50. I believe that it would be better to demonstrate the speedups for 10^5-10^6 elements or with more complex chemistry, rather than to state it. This is especially the case with increasingly complex chemistry as the ML will be more complicated.
Response: These figures were obtained by naive extrapolation of the trend lines of Figure 4 and from unfinished work. We removed any speculation about achievable speedups.

Comment: Line 459 -- Please explain how the simplifications only marginally affect the validity of the benchmarks concerning achievable speedup. From the previous discussion, it is not clear that the algorithms employed in the benchmarks can handle increased complexity (non-uniform flow velocity, non-uniform grids, non-uniform time stepping, heterogeneity, ...). If that is the case, the simplifications invalidate the algorithm for complex problems.

Response: We now address some of these issues in the discussion. They open up too many topics to cover thoroughly here; however, for variable time stepping and non-uniform grids, for example, we devised some further improvements which will be the object of future work. More details in the next specific answer.

Comment: Line 430 -- How would the authors handle variable time step size?
Response: A first possibility is to train the surrogates for a small, fixed time step and simply repeat it within one chemistry iteration to match the new required time step, which we did for the physics-informed decision tree approach. Of course, this would only allow simulation of multiples of the "training time step". Another immediate idea, based on the previous one but extending it to non-multiples of the training time step $\Delta t_t$, is to interpolate the predicted outputs between the $(n-1)\cdot\Delta t_t$ and $n\cdot\Delta t_t$ inner iterations (the samples used for interpolation may be more than two, of course) such that the required time step falls within that interval. The non-linearity of the interpolated variables would need to be accounted for, and mass and charge balance would need to be reassessed. A more general way would be to treat the time step as a completely independent variable. This would mean that "time step size" becomes an "input", a separate column in the training data required for the surrogate to operate. This would of course require a quite large training dataset, which can be mitigated, in reference reactive transport simulations with variable time stepping, by simply outputting partial results of chemistry within a required large time step.

We outlined these ideas in the discussion section, clearly labelling them as "future work".
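The role of the Courant number $\nu$ in the explicit upwind scheme, central to the numerical-dispersion discussion above, can be illustrated with a small sketch (not the authors' code; the grid size, front shape, and inflow boundary treatment are assumptions):

```python
# Minimal sketch (not the authors' code): first-order explicit upwind
# advection, showing that nu = 1 translates a sharp front exactly while
# nu < 1 smears it (numerical dispersion), as discussed above.

def upwind_step(c, nu):
    """One forward-Euler upwind step with Courant number nu = v*dt/dx."""
    # The inflow cell keeps its value (assumed Dirichlet-like boundary).
    return [c[0]] + [c[i] - nu * (c[i] - c[i - 1]) for i in range(1, len(c))]

def advect(c, nu, nsteps):
    """Apply nsteps upwind steps to the concentration profile c."""
    for _ in range(nsteps):
        c = upwind_step(c, nu)
    return c

# A sharp concentration front: 1.0 in the first two cells, 0.0 elsewhere.
front = [1.0, 1.0] + [0.0] * 8

exact = advect(front, 1.0, 3)    # nu = 1: front moved 3 cells, still sharp
smeared = advect(front, 0.5, 6)  # same travelled distance, front smeared
```

At $\nu = 1$ the update collapses to a pure shift of each cell's value downstream, whereas at $\nu = 0.5$ each step averages neighbouring cells: this averaging is the numerical dispersion that, per the response above, "spreads" the sampled chemistry in parameter space.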
Minor comments
- Line 56 -- beforehands -> beforehand
Corrected.
- Line 123 -- How is the term "state variables" an abuse of language?
Rephrased.
- Line 162 -- A figure showing the inhomogeneous distribution of parameter space for a simple, real problem scenario would be a nice addition to the paper. Would Figure 5 work?
While this is a valuable suggestion for future work, we do not believe it belongs in the manuscript.
- Line 310 -- "...is the first, natural feature engineering tentative." Is a bit confusing to me. "tentative" is an adjective. Could it state, "is the first feature to be engineered"?
Rephrased.
- Line 312 -- "..., but also not relying on." Reword?
Amended.
- Line 339 -- in facts to -> in fact to
Corrected.
- Line 379 -- If a different "learning" is required for each grid, please elaborate regarding what "learning" would be required for an unstructured or non-uniform grid? Would each grid cell have its own learning? Such a discussion would make the paper more informative for real-world applications.
Added a paragraph to the discussion.
- Figure 12 (Line 439) -- Moving the upper (Reference/Surrogate) legend to the middle of the right side would improve the figure.
Corrected.
- Line 441 -- with the own -> with their own
Corrected.
- Figure 13 (Line ~442) -- What causes the oscillatory behavior for pH at late times for Grid 200?
These are inaccuracies introduced by the surrogate.

Citation: https://doi.org/10.5194/gmd-2020-445-AC1
-
AC1: 'Reply on RC2', Marco De Lucia, 26 May 2021
-
RC3: 'Comment on gmd-2020-445', P. Trinchero, 06 Apr 2021
The manuscript presents two different approaches to solving reactive transport problems using surrogate models. The first approach relies on purely data-driven regressors, whereas in the second approach the system is informed by physics-based rules. The first approach should be considered a black box, although geochemical calculations are performed on demand when the mass balance error exceeds the selected tolerance, and as such its predictive capability degrades rapidly in the out-of-sample region. The physics-based approach provides a compartmentalization of the parameter space and thus allows simpler univariate regressions to be employed. More importantly, by tying the system to the underlying physical/chemical processes, the physics-based surrogate model offers more reliable predictions, even in the out-of-sample region. The drawback of the physics-based approach is that the definition of the underlying rules might be tedious, which makes it less attractive for practitioners.
I think that the work is interesting for the geoscientific community and the manuscript is generally well written. However, I do have some major concerns that should be carefully considered and possibly addressed before the ms is accepted in its final form.
My first concern is that the limited scope of the proposed numerical exercise is not properly acknowledged and discussed. The proposed surrogate models are, in fact, tested using an over-simplified system consisting of a 1D domain with a geochemical system defined by a few primary species and two mineral reactions only. Since the focus of the work is on the geochemical part of the reactive transport problem, it appears reasonable to consider a simple 1D system (although real media are characterized by complex flow and transport patterns that have a significant impact on the underlying reactions). However, real geochemical systems are also significantly more complex than the calcite dissolution problem considered here. They typically involve several primary species, a number of aqueous complexation reactions, different mineral assemblages and a number of additional non-linear reactions such as surface complexation, fractionation, redox reactions, etc. In the conclusion section the authors state that "The simplification ...only marginally affect the validity of the benchmarks concerning the achievable speedup of geochemistry in a broad class of problems.". I think that this statement is unfactual, since the use of the proposed framework for modelling complex reactive transport problems is not obvious and clearly has not been demonstrated here. I warmly suggest that the authors revise the discussion section by clearly stating the limits of this study and the challenges that remain in modelling more complex and realistic systems using this surrogate-based approach.
My second concern is somehow related to the previous one: I found that parts of the manuscript lack factuality. This is the case, for instance, of line 280, where it is said that "extrapolating to grids with 10^5 or 10^6 elements, speedups in the order of 25-50 are achievable for this chemical problem.". Besides the fact that, according to what is stated in the introduction ("Chemistry usually represents the bottleneck for coupled simulations with up to 90% of computational time"), the maximum theoretical speedup cannot be higher than 10x, I think that extrapolating from a grid of a few hundred cells to a model with a million elements is simply not correct. Along the same line, arguing that numerical dispersion is an advantage since "it spreads the perturbation" (line 205) is at least questionable. More generally, the authors should use a plainer language and avoid expressions such as (in capital) "subjected to the VERY same advection equation", "can be considered WITH AN ABUSE OF LANGUAGE state variables", "is NOTHING ELSE THAN the maximum value", etc.
Other minor comments are stated below.
The list of references is a bit poor and is biased towards works performed by the authors. For instance, in line 25, when a short discussion on HPC is introduced, relevant references to efforts and achievements made to parallelise reactive transport problems should be included.
In line 30, a single reference (of one of the authors) is used to introduce the issue of heterogeneity of subsurface formations. Since heterogeneity is not addressed here, I suggest removing this paragraph. Otherwise, just consider that, within the branch of stochastic hydrology, there are several works that have addressed the influence of heterogeneity on reactive transport modelling and thus appropriate references should be included.
Line 80: change “computations run” with “computation runs”
Eq. (1): any reason why not to use a more stable and less restrictive implicit scheme?
“Note that equation 1 is not written in terms of porosity, so that effectively the Darcy velocity is assumed equal to the fluid flux or, alternatively, porosity is equal to unity. This assumption does not have any impact on the calculations besides the initial scaling of the system.”. This is unclear or at least I am missing something here. If porosity is constant (as it is in this work) the transport equation can be written either in terms of Darcy velocity or in terms of groundwater velocity. This does not imply any scaling.
“Methods such as kriging offer error models, based, e.g., on the distance of the estimated point from the nearest training data”. This is wrong. Kriging provides estimates of the variance given the distance and a chosen covariance function.
Line 235 “We used the max value of each label divided by 1·10-5 as scale.”. This seems to be very empirical. Please elaborate a bit more.
Most of the figures lack axes and scales.
Last sentence at page 12. This sentence is not clear, please reformulate it.
Citation: https://doi.org/10.5194/gmd-2020-445-RC3
-
AC2: 'Reply on RC3', Marco De Lucia, 26 May 2021
Many thanks to Dr. Trinchero for his valuable comments.
Comment: My first concern is that the limited scope of the proposed numerical exercise is not properly acknowledged and discussed. The proposed surrogate models are, in fact, tested using an over-simplified system consisting of a 1D domain with a geochemical system defined by a few primary species and two mineral reactions only. Since the focus of the work is on the geochemical part of the reactive transport problem, it appears reasonable to consider a simple 1D system (although real media are characterized by complex flow and transport patterns that have a significant impact on the underlying reactions). However, real geochemical systems are also significantly more complex than the calcite dissolution problem considered here. They typically involve several primary species, a number of aqueous complexation reactions, different mineral assemblages and a number of additional non-linear reactions such as surface complexation, fractionation, redox reactions, etc. In the conclusion section the authors state that "The simplification... only marginally affect the validity of the benchmarks concerning the achievable speedup of geochemistry in a broad class of problems.". I think that this statement is unfactual, since the use of the proposed framework for modelling complex reactive transport problems is not obvious and clearly has not been demonstrated here. I warmly suggest that the authors revise the discussion section by clearly stating the limits of this study and the challenges that remain in modelling more complex and realistic systems using this surrogate-based approach.
Response: We agree with this comment, also raised by the second reviewer, and accordingly reformulated the discussion in the revised manuscript.
Our original statement about "a broad class of problems" has now been made more precise. We meant that there are many real-life examples dominated by advection and with only two minerals: for example, flow-through experiments in controlled settings. We added a more detailed discussion of these issues in the reformulated manuscript discussion.
Comment: My second concern is somehow related to the previous one: I found that parts of the manuscript lack factuality. This is the case, for instance, of line 280, where it is said that "extrapolating to grids with 10$^5$ or 10$^6$ elements, speedups in the order of 25-50 are achievable for this chemical problem.". Besides the fact that, according to what is stated in the introduction ("Chemistry usually represents the bottleneck for coupled simulations with up to 90% of computational time"), the maximum theoretical speedup cannot be higher than 10x, I think that extrapolating from a grid of a few hundred cells to a model with a million elements is simply not correct.
Response: The projected speedup figures are based on a naive extrapolation to larger grids and preliminary, unfinished simulation results. We accordingly removed this statement.
Our own previous work and different cited authors report runtimes of coupled reactive transport simulations where geochemistry takes up routinely around 90 % of the total CPU time, i.e., with a moderately complex chemistry, but it can be as high as over 99 %. This is now correctly mentioned in the introduction in the revised manuscript.
Comment: Along the same line, arguing that numerical dispersion is an advantage since "it spreads the perturbation" (line 205) is at least questionable.
Response: We agree, and reformulated this phrase. What we meant is that this "spreading" is an advantage from the machine learning and parameter-space sampling standpoint. In our data-driven approach, we are using the "true" data from three coupled reactive transport simulations to train the surrogates. In this case, the three different grid resolutions have the exact same chemistry but different hydrodynamic settings, since the different $\nu$ introduce numerical dispersion. If one thinks of transport as an operator which "perturbs" the previous chemical state, the presence of numerical dispersion has the effect of spreading the aqueous concentrations around the "central values" that they would have if all the simulations were run with no numerical dispersion (assuming here dispersion-free simulations with the same time steps). Since two of our three simulations have dispersion, they introduce into the simulations, and hence into the training dataset, points which would otherwise not have been included.
Comment: More generally, the authors should use a plainer language and avoid expressions such as (in capital) "subjected to the VERY same advection equation", "can be considered WITH AN ABUSE OF LANGUAGE state variables" "is NOTHING ELSE THAN the maximum value", etc.
Response: We changed all these expressions.
Comment: The list of references is a bit poor and is biased towards works performed by the authors. For instance, in line 25, when a short discussion on HPC is introduced, relevant references to efforts and achievements made to parallelise reactive transport problems should be included. In line 30, a single reference (of one of the authors) is used to introduce the issue of heterogeneity of subsurface formations. Since heterogeneity is not addressed here, I suggest removing this paragraph. Otherwise, just consider that, within the branch of stochastic hydrology, there are several works that have addressed the influence of heterogeneity on reactive transport modelling, and thus appropriate references should be included.

Response: We only partially agree with this critique. This work in fact does not deal at all with HPC; we just wanted to point out an argument which is in our opinion fundamental, even if "culturally difficult" for many in the community to accept. Throwing more CPUs at a reactive transport problem in order to obtain an accurate solution of a complex PDE can be regarded as overkill if one considers that there are many hidden and not-so-hidden uncertainties in the parametrization of that PDE. Concretely, for reactive transport there are huge uncertainties in the thermodynamics, the kinetics, and the actual geological heterogeneity of the subsurface. What we are trying to explore here is the possibility of obtaining less accurate solutions at a fraction of the computational cost. Of course, this manuscript only represents some first steps in this direction, and no definitive solution. Furthermore, we cited all peer-reviewed papers we know of which deal with model reduction and machine learning applied to reactive transport, and specifically with substituting the geochemical process with surrogates. If we still missed any, we would be grateful for a hint. Nevertheless, we integrated some further references about HPC, uncertainty, and heterogeneity in reactive transport simulations, while removing some of our own citations.
Comment: Eq. (1): any reason why not to use a more stable and less restrictive implicit scheme?
Response: First and foremost, we implemented the same advection scheme used by the PHREEQC simulator in order to have a direct comparison when replacing the geochemical subprocess with surrogates. Our focus is namely on geochemistry, not on hydrodynamics. Furthermore, at this early stage we restricted our analysis to fixed time steps in order not to have to deal with time as a free variable for geochemistry. Within this context, there is in our opinion no need for a better implementation of advection. These assumptions are now more clearly stated in the paper and also better discussed in the conclusion and future work.
Comment: "Note that equation 1 is not written in terms of porosity, so that effectively the Darcy velocity is assumed equal to the fluid flux or, alternatively, porosity is equal to unity. This assumption does not have any impact on the calculations besides the initial scaling of the system.". This is unclear or at least I am missing something here. If porosity is constant (as it is in this work) the transport equation can be written either in terms of Darcy velocity or in terms of groundwater velocity. This does not imply any scaling.
Response: We agree, and made this statement more precise in the revised manuscript. The advection is written in terms of groundwater or seepage velocity, and the scaling only concerns how to set up the batch geochemical simulations in terms of mineral amounts. We used the convention of moles of minerals per kg of solvent in the porous medium.
Comment: "Methods such as kriging offer error models, based, e.g., on the distance of the estimated point from the nearest training data". This is wrong. Kriging provides estimates of the variance given the distance and a chosen covariance function.
Response: If one has already performed kriging, then the variogram (or generalized covariance) model is already known and fixed. The error model (the kriging variance) is then cheap to compute. Nevertheless, we amended the sentence to avoid any misunderstanding.
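The point made in this response, that once the covariance model is fixed the kriging variance is cheap to evaluate, can be shown with a minimal sketch (not from the manuscript): simple kriging with a single training point and an exponential covariance, both chosen here for brevity.

```python
# Minimal illustration (assumptions: simple kriging, one datum,
# exponential covariance): the estimation variance is a closed-form,
# cheap function of the distance h to the training point.
import math

def cov(h, sill=1.0, rng=10.0):
    """Exponential covariance model C(h) = sill * exp(-h / range)."""
    return sill * math.exp(-h / rng)

def sk_variance(h, sill=1.0, rng=10.0):
    """Simple-kriging variance at distance h from the single datum:
    C(0) - lambda * C(h), with the kriging weight lambda = C(h) / C(0)."""
    c = cov(h, sill, rng)
    return sill - c * c / sill

# At the datum itself the variance is zero; far away it tends to the sill.
```

This is consistent with both the reviewer's and the authors' statements: the variance depends only on the distance and the (already fitted) covariance model, so evaluating it adds negligible cost.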
Comment: Line 235 "We used the max value of each label divided by 10-5 as scale.". This seems to be very empirical. Please elaborate a bit more.
Response: It was indeed an empirical finding, noted in the paper only because it contradicts the corresponding statements in the documentation of xgboost and, generally speaking, of tree-based regression methods. It was found by trial and error. We have no further explanation for this issue; it could be specific to the software package (xgboost) or to the values of our labels, which are very small. These facts are now reported in the manuscript.
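The empirical scaling described in this response can be sketched as follows (a hypothetical stand-in, not the authors' code; the example label values and helper names are invented, and the actual training with xgboost is omitted):

```python
# Hedged sketch of the empirical label scaling discussed above: each label
# (output variable) is divided by scale = max(label) / 1e-5 before training,
# so the scaled label's maximum is 1e-5; predictions are multiplied back.

def make_scaler(y, target=1e-5):
    """Return (scale, forward, inverse) transforms for one label column."""
    scale = max(y) / target              # the authors' empirical choice
    fwd = lambda vals: [v / scale for v in vals]
    inv = lambda vals: [v * scale for v in vals]
    return scale, fwd, inv

# Example label column (e.g. an aqueous concentration, values invented):
y = [2.0e-3, 5.0e-4, 1.3e-3]
scale, fwd, inv = make_scaler(y)
y_scaled = fwd(y)        # max(y_scaled) is now 1e-5
# ... one regressor per output variable would be trained on y_scaled ...
y_back = inv(y_scaled)   # predictions rescaled back to physical units
```

The transform is trivially invertible, so it only changes the numerical range the regressor sees, which is presumably why its effect on tree-based methods was unexpected.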
Other comments
- Line 80: change "computations run" with "computation runs"
Corrected.
- Most of the figures lack axes and scales.
We added scales and axis labels in the first figure (Figure 2) where the variables' profiles are displayed. The following profiles are left unchanged since they only serve as a graphical comparison.
- Last sentence at page 12. This sentence is not clear, please reformulate it.
Corrected.

Citation: https://doi.org/10.5194/gmd-2020-445-AC2
Peer review completion
full-physics simulations are called only if the surrogate predictions are implausible. Furthermore, we devise and evaluate a decision tree surrogate approach designed to inject domain knowledge into the surrogate by defining engineered features based on the law of mass action or on stoichiometric reaction equations.
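The coupling logic summarized in this blurb, calling the full-physics geochemical solver only when the surrogate prediction is implausible, can be sketched as follows (all functions and the toy mass-balance criterion are hypothetical stand-ins for the authors' actual implementation):

```python
# Hedged sketch of surrogate-with-fallback coupling: the surrogate is
# trusted unless its prediction violates a plausibility check, here a
# toy total-mass balance (the real criterion is problem-specific).

def total_mass(state):
    """Toy mass balance: sum of all species amounts (hypothetical)."""
    return sum(state.values())

def step_chemistry(state, surrogate, full_solver, tol=1e-6):
    """Advance chemistry one step; fall back to the full solver when the
    surrogate's mass-balance error exceeds the tolerance."""
    predicted = surrogate(state)
    error = abs(total_mass(predicted) - total_mass(state))
    if error > tol:
        return full_solver(state), "full"
    return predicted, "surrogate"

# Toy stand-ins for a trained surrogate and the geochemical solver:
def good_surrogate(s):
    return dict(s)                            # conserves total mass

def drifting_surrogate(s):
    return {k: v * 1.1 for k, v in s.items()} # 10 % mass drift

def full_solver(s):
    return dict(s)                            # placeholder for the real solver

state = {"Ca": 1e-3, "C": 2e-3}
_, used1 = step_chemistry(state, good_surrogate, full_solver)
_, used2 = step_chemistry(state, drifting_surrogate, full_solver)
```

The design intent, per the blurb, is that expensive full-physics calls are triggered only by implausible predictions, so the achievable speedup depends directly on how often the surrogate stays within tolerance.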