This work is distributed under the Creative Commons Attribution 4.0 License.
Universal Differential Equations for glacier ice flow modelling
Jordi Bolibar
Facundo Sapienza
Fabien Maussion
Redouane Lguensat
Bert Wouters
Fernando Pérez
Abstract. Geoscientific models are facing increasing challenges to exploit growing datasets coming from remote sensing. Universal Differential Equations (UDEs), aided by differentiable programming, provide a new scientific modelling paradigm enabling both complex functional inversions to potentially discover new physical laws and data assimilation from heterogeneous and sparse observations. We demonstrate an application of UDEs as a proof of concept to learn the creep component of ice flow, i.e. a nonlinear diffusivity differential equation, of a glacier evolution model. By combining a mechanistic model based on a 2D Shallow Ice Approximation Partial Differential Equation with an embedded neural network, i.e. a UDE, we can learn parts of an equation as nonlinear functions that then can be translated into mathematical expressions. We implemented this modelling framework as ODINN.jl, a package in the Julia programming language, providing high performance, source-to-source automatic differentiation (AD) and seamless integration with tools and global datasets from the Open Global Glacier Model in Python. We demonstrate this concept for 17 different glaciers around the world, for which we successfully recover a prescribed artificial law describing ice creep variability by solving ca. 500,000 Ordinary Differential Equations in parallel. Furthermore, we investigate which tools in the scientific machine learning ecosystem in Julia are best suited to differentiate and optimize large nonlinear diffusivity UDEs. This study represents a proof of concept for a new modelling framework aiming at discovering empirical laws for large-scale glacier processes, such as the variability of ice creep and basal sliding for ice flow, and new hybrid surface mass balance models.
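The core idea the abstract describes, replacing Glen's creep parameter A inside the SIA diffusivity with a small learned function of the long-term air temperature, can be sketched as follows. This is a minimal NumPy illustration, not ODINN.jl code: `creep_A` is a hypothetical stand-in for the embedded neural network, and the constants are the usual textbook SIA values.

```python
import numpy as np

# Standard SIA constants (Glen exponent n = 3); values are the usual textbook ones.
n = 3.0
rho, g = 900.0, 9.81  # ice density (kg m^-3), gravitational acceleration (m s^-2)

def creep_A(T, theta):
    """Hypothetical stand-in for the embedded neural network A(T; theta):
    a tiny one-hidden-layer MLP mapping the climate temperature normal T
    to a positive creep parameter. The real ODINN.jl architecture differs."""
    W1, b1, W2, b2 = theta
    h = np.tanh(W1 * T + b1)      # hidden layer
    return np.exp(W2 @ h + b2)    # exponential keeps A > 0

def sia_diffusivity(H, S, dx, T, theta):
    """SIA diffusivity D = 2A/(n+2) * (rho*g)^n * H^(n+2) * |grad S|^(n-1),
    with Glen's creep parameter A replaced by the learned function A(T; theta)."""
    dSdy, dSdx = np.gradient(S, dx)   # surface slope components
    grad_norm = np.hypot(dSdx, dSdy)
    A = creep_A(T, theta)
    return 2.0 * A / (n + 2.0) * (rho * g) ** n * H ** (n + 2.0) * grad_norm ** (n - 1.0)
```

In the UDE setting, `theta` is then optimized by differentiating through the PDE solver so that modelled surface velocities match the reference ones.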
Jordi Bolibar et al.
Status: closed

RC1: 'Comment on gmd-2023-120', Douglas Brinkerhoff, 07 Jul 2023
 AC1: 'Response to Reviewer 1', Jordi Bolibar, 20 Sep 2023

RC2: 'Comment on gmd-2023-120', Anonymous Referee #2, 15 Aug 2023
Review of Bolibar et al., GMDD, 2023: Universal Differential Equations for glacier ice flow modelling, https://doi.org/10.5194/gmd-2023-120

The manuscript presents an approach that combines inverse methods ("data assimilation") with machine learning-based parameter estimation using the concept of Universal Differential Equations (UDEs) or Neural Ordinary Differential Equations (NODEs). The approach has the potential to enable a wider and more rigorous use of diverse observations to constrain or calibrate physical models. For it to work, the notion of differentiable programming becomes key, i.e., the PDE representing the physical model and the UDE embedded within the PDE need to be "differentiable" in the sense that an adjoint model needs to be generated. This adjoint model serves to compute the gradient of the "cost" or "loss" function with respect to uncertain/unknown parameters. The gradient, in turn, is an essential ingredient for gradient-based optimization. A versatile approach to generate the adjoint of the PDE/UDE system is by means of automatic differentiation.

The approach is demonstrated by calibrating the glacier flow model ODINN.jl on a range of glacier observations from the Randolph Glacier Inventory. The example is simple, meant as a prototype demonstration of the method. The PDE consists of the SIA model for glacier flow. The parameterization within the SIA to be improved upon is Glen's creep parameter A and its functional representation on the climate temperature normal T. Parameter(ization) A is replaced by a neural network (NN), and the task is to estimate the NN parameters \theta. The manuscript concludes with a discussion of challenges and opportunities.

The manuscript presents an exciting approach of combining physics-based inverse modeling with machine learning-based approaches to improve the way that observational data may be used to rigorously constrain or calibrate models.
The seamless integration of PDEs and UDEs for model calibration or parameter learning represents in my view a promising application of machine learning for surrogate modeling in combination with physics-based models. I recommend publication after minor revisions. My comments are largely intended to improve the manuscript's clarity.

Main (but minor) comments:

lines 136-137 and eqn. (3): "Instead of considering that the diffusivity D(θ) is the output of a universal approximator, we are going to replace the creep parameter A in Equation (3)..." I agree with this approach, but perhaps a sentence to motivate this choice might be warranted. Also, eqn. (7) seems to be the crux in connecting the PDE with the UDE. It might be worth, then, to be very explicit that A(θ) becomes the surrogate model embedded within the PDE. To connect eqn. (7) with the general eqn. (5), perhaps relate A(θ) with the universal approximator U(θ) defined in eqn. (5).

line 141: "...by prescribing an artificial law for which a reference dataset is generated..." This is perhaps somewhat imprecise. As stated, it gives the impression that a reference dataset is generated for A, when in fact the actual dataset to be used (and generated via A) are the velocities obtained as solution and used in the loss function.

lines 142/143: "...we have used the relationship between ice temperature and A from Cuffey and Paterson (2010), and replaced ice temperatures with a relationship between A and T" This is a crucial methodological step in this work to generate the reference data. This step should be described more clearly. I think what is done here is to provide a synthetic solution for a known functional form A(T) described in Cuffey and Paterson (2010), which then will be attempted to be inferred. I feel this relationship should be reproduced here to make this paper more accessible and self-contained.

Figure 2:

Perhaps two such schematics should be drawn, one using the functional form of A as taken from Cuffey and Paterson (2010) to generate the reference data (with added noise to A), and the other one as is(?)

The box for the "Physical law": wouldn't it make sense to state A as a function of the NN, instead of D, because the rest of the functional form of D remains unchanged?

Also, why is there an arrow from the PDE solver to the Mass balance box?
line 223 and again 523: "checkpointing": Accurate checkpointing is not actually an interpolation between stored checkpoints. AD-related checkpointing methods are designed to balance storing vs. recomputation with the aim to recover the required state exactly. Griewank and Walther (2008), chapter 12, cover this. In the context of the latest implementation in Julia, see Schanen et al. (2023).

Equation (8): A typo in the denominator, should be "\nabla_{\theta} A" ("A" missing).

line 350: "Julia source code is automatically differentiable, ...": This is certainly too strong of a statement. While sometimes claimed by members of the core Julia community, it is currently an ambition which clearly is not (yet) realized in its generality. To move toward this goal requires further developments in general-purpose automatic differentiation. Arguably, the latest developments of Enzyme.jl (Moses et al. 2022) are a promising step in this direction.

Other minor issues:

lines 59-60 and again line 67: The statement "Nonetheless, all efforts so far have been applied to the inversion of scalar parameters, i.e. parameters that are stationary for a single inversion given a dataset." is not quite accurate (or I misunderstood it, in which case it may need clarification). The study by Goldberg et al. (2015; see references) inverts for time-varying and spatially distributed open boundary conditions. You may mean something else though with this sentence(?)

line 118: The use of the term "observed" may be misleading in the present context. While true in general, the "observed" data used in this study are actually simulated reference data (from a known functional relationship). It may be worth clarifying this. Also, to be notationally clear, these "observed"/simulated reference velocities are the elements V_1^k in eqn. (6), right?

line 128: In the case where the \omega_k are the diagonal elements of a covariance matrix, this would essentially amount to a simple weighted least-squares optimization problem (with weights equal to the inverse variances).

line 179:
"... output of the prescribed law ...": Just to clarify: the noise is added to the output of the prescribed law A(T), i.e., one generates A(T) + \epsilon, with \epsilon being the added noise?

Next sentence: "This setup is used to compute the reference solutions, ...": It might again be worth clarifying that these are the reference solutions on the velocities V_1^k used in eqn. (6).
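The loss the comments above refer to is, in the weighted case, a standard weighted least-squares objective over the reference velocities. Since eqn. (6) is not reproduced on this page, the exact misfit form below is an assumption (a per-glacier mean squared error), but it illustrates the inverse-variance weighting the reviewer mentions.

```python
import numpy as np

def weighted_loss(V_pred, V_ref, omega):
    """Weighted least-squares loss over k glaciers: sum_k omega_k * MSE_k,
    where V_pred[k] and V_ref[k] are the predicted and reference surface
    velocity fields. Choosing omega_k as inverse variances recovers the
    weighted least-squares problem the reviewer points out."""
    return sum(w * np.mean((vp - vr) ** 2)
               for w, vp, vr in zip(omega, V_pred, V_ref))
```

With omega_k = 1/sigma_k^2, glaciers with noisier reference velocities contribute proportionally less to the gradient used in the optimization.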
Section 3.3: It might be useful to replace everywhere "mass balance" (and "MB") by "surface mass balance" ("SMB").

line 197: "H matrix": not defined what this refers to.

Sections 3.4, 3.5: I suggest merging these two sections under the title "Sensitivity methods, differentiation, and optimization" and converting current section 3.5 to 3.4.3.

line 255: "unrepresented" -> "underrepresented"

line 257: "2i2c": please add a reference

line 263: "with respect TO a predictor"

line 270/271: "until it finds an optimal non-overfitted solution". Could you briefly elaborate how you determined that there is no overfitting.

lines 299-301: How should one interpret this? Shouldn't the functional form A(T) be different in both cases then? I.e., the error in the SMB should be compensated by the NN model of A(T)?

lines 306-308: "This weak dependence...": This is certainly a valid perspective. However, if one generalizes the problem to also use altimetric data to constrain ice height, then this no longer applies and weak dependence on SMB may become problematic(?)

Subsubsection 5.2.1: Automatic differentiation is independent of SciML. I would give it its own subsection.

line 355: There is a newer reference available: Moses et al. 2021

lines 373, 380 (and elsewhere): What you call "backward mode" is usually referred to in the AD community as "reverse mode" (see, e.g., Griewank and Walther 2008).

line 389: "the use of AD for optimization...": Not quite right. What you probably mean is "the use of adjoints for optimization". AD is merely a way (albeit a powerful one) to generate code, i.e., the adjoint operator, that efficiently computes the required gradient.

lines 399-401: Sentences: "The gradient calculated by making variation of the parameter θ capture the variations of Err(θ,hyper(θ)), which lead to spurious gradients. On the other side, automatic differentiation compute gradients using one single evaluation of θ, meaning that it differentiates only the term Solver(u0, t0, t1,θ), being robust to the error term." I would love to see this formulated more clearly. As formulated, I find it not easy to understand.

line 402: obtain -> obtained

Figure A1: Should also define what the points indicated as diamonds are. Also, there exists a classification of numerical discretizations in the atmospheric modeling community, going back to Arakawa and Lamb (1977), e.g., https://en.wikipedia.org/wiki/Arakawa_grids. Does your grid correspond to any of these?

REFERENCES:

Arakawa, A. and Lamb, V. R., 1977: Computational design of the basic dynamical processes of the UCLA general circulation model. Methods in Computational Physics: Advances in Research and Applications, 17, 173–265. https://doi.org/10.1016/B978-0-12-460817-7.50009-4

Goldberg, D. N., Heimbach, P., Joughin, I. & Smith, B. (2015). Committed retreat of Smith, Pope, and Kohler Glaciers over the next 30 years inferred by transient model calibration. The Cryosphere, 9(6), 2429–2446. https://doi.org/10.5194/tc-9-2429-2015

Griewank, A. and Walther, A., 2008: Evaluating Derivatives. https://epubs.siam.org/doi/book/10.1137/1.9780898717761

Moses, W. S., Churavy, V., Paehler, L., Hückelheim, J., Narayanan, S. H. K., Schanen, M. & Doerfert, J. (2021). Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme. SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 1–18. https://doi.org/10.1145/3458817.3476165

Schanen, M., Narayanan, S. H. K., Williamson, S., Churavy, V., Moses, W. S. & Paehler, L. (2023). Transparent Checkpointing for Automatic Differentiation of Program Loops Through Expression Transformations. Lecture Notes in Computer Science, 483–497. https://doi.org/10.1007/978-3-031-36024-4_37

Citation: https://doi.org/10.5194/gmd-2023-120-RC2

AC2: 'Response to Reviewer 2', Jordi Bolibar, 20 Sep 2023
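The store-vs-recompute checkpointing the reviewer describes, where intermediate states are recovered exactly by recomputation from the nearest stored checkpoint rather than interpolated, can be illustrated with a minimal sketch. The `step` dynamics here are a toy explicit ODE step, not the SIA model, and the uniform checkpoint spacing is a simplification of real AD checkpointing schedules.

```python
def step(u, dt):
    # One explicit Euler step of a toy ODE du/dt = -u^3 (placeholder dynamics).
    return u + dt * (-u ** 3)

def forward_with_checkpoints(u0, dt, n_steps, every):
    """Forward sweep that stores only every `every`-th state
    instead of all n_steps intermediate states."""
    ckpts = {0: u0}
    u = u0
    for i in range(1, n_steps + 1):
        u = step(u, dt)
        if i % every == 0:
            ckpts[i] = u
    return u, ckpts

def state_at(i, ckpts, dt, every):
    """Recover the exact state at step i (as needed by a reverse/adjoint sweep)
    by recomputing forward from the nearest stored checkpoint.
    No interpolation between checkpoints is involved."""
    base = (i // every) * every
    u = ckpts[base]
    for _ in range(i - base):
        u = step(u, dt)
    return u
```

Because the recomputation replays the identical floating-point operations, the recovered state matches the original trajectory bit for bit, which is the sense in which checkpointing is "exact".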

Viewed
HTML: 900
PDF: 405
XML: 18
Total: 1,323
BibTeX: 10
EndNote: 6