Slate: extending Firedrake’s domain-specific abstraction to hybridized solvers for geoscience and beyond

Within the finite element community, discontinuous Galerkin (DG) and mixed finite element methods have become increasingly popular in simulating geophysical flows. However, robust and efficient solvers for the saddle point and elliptic systems arising from these discretizations remain an ongoing challenge. One possible approach for addressing this issue is to employ a method known as hybridization, where the discrete equations are transformed such that classic static condensation and local post-processing methods can be employed. However, it is challenging to implement hybridization as performant parallel code within complex models whilst maintaining a separation of concerns between applications scientists and software experts. In this paper, we introduce a domain-specific abstraction within the Firedrake finite element library that permits the rapid execution of these hybridization techniques within a code-generating framework. The resulting framework composes naturally with Firedrake’s solver environment, allowing for the implementation of hybridization and static condensation as runtime-configurable preconditioners via the Python interface to the Portable, Extensible Toolkit for Scientific Computation (PETSc), petsc4py. We provide examples derived from second-order elliptic problems and geophysical fluid dynamics. In addition, we demonstrate that hybridization shows great promise for improving the performance of solvers for mixed finite element discretizations of equations related to large-scale geophysical flows.


Introduction
The development of simulation software is an increasingly important aspect of modern scientific computing, in the geosciences in particular. Such software requires a vast range of knowledge spanning several disciplines, ranging from applications expertise to mathematical analysis to high-performance computing and low-level code optimization. Software projects developing automatic code generation systems have become quite popular in recent years, as such systems help create a separation of concerns in which each contributor can focus on a particular complexity independently of the rest. This allows for agile collaboration between computer scientists with hardware and software expertise, computational scientists with numerical algorithm expertise, and domain scientists such as meteorologists, oceanographers and climate scientists. Examples of such projects in the domain of finite element methods include FreeFEM++ (Hecht, 2012), Sundance (Long et al., 2010), the FEniCS Project (Logg et al., 2012a), Feel++ (Prud'Homme et al., 2012), and Firedrake (Rathgeber et al., 2016).
The finite element method (FEM) is a mathematically robust framework for computing numerical solutions of partial differential equations (PDEs) that has become increasingly popular in fluids and solids models across the geosciences, with a formulation that is highly amenable to code generation techniques. A description of the weak formulation of the PDEs, together with appropriate discrete function spaces, is enough to characterize the finite element problem. Both the FEniCS and Firedrake projects employ the Unified Form Language (UFL) (Alnaes et al., 2014) to specify the finite element integral forms and discrete spaces necessary to properly
define the finite element problem. UFL is a highly expressive domain-specific language (DSL) embedded in Python, which provides the necessary abstractions for code generation systems.
There are classes of finite element discretizations resulting in discrete systems that can be solved more efficiently by directly manipulating local tensors. For example, the static condensation technique for the reduction of global finite element systems (Guyan, 1965; Irons, 1965) produces smaller globally coupled linear systems by eliminating interior unknowns to arrive at an equation for the degrees of freedom defined on cell interfaces only. This procedure is analogous to the point-wise elimination of variables used in staggered finite difference codes, such as the ENDGame dynamical core (Melvin et al., 2010; Wood et al., 2014) of the UK Meteorological Office (Met Office), but requires the local inversion of finite element systems. For finite element discretizations of coupled PDEs, the hybridization technique provides a mechanism for enabling the static condensation of more complex linear systems. First introduced by Fraeijs de Veubeke (1965) and analyzed further by Brezzi and Fortin (1991), Cockburn et al. (2009a), and Boffi et al. (2013), the hybridization method introduces Lagrange multipliers enforcing certain continuity constraints. Local static condensation can then be applied to the augmented system to produce a reduced equation for the multipliers. Methods of this type are often accompanied by local post-processing techniques, which exploit the approximation properties of the Lagrange multipliers. This enables the manufacturing of fields exhibiting superconvergent phenomena or enhanced conservation properties (Arnold and Brezzi, 1985; Brezzi et al., 1985; Bramble and Xu, 1989; Stenberg, 1991; Cockburn et al., 2009b, 2010b). These procedures require invasive manual intervention during the equation assembly process in intricate numerical code. In this paper, we provide a simple yet effective high-level abstraction for localized dense linear algebra on systems derived from finite element problems.
Using embedded DSL technology, we provide a means to enable the rapid development of hybridization and static condensation techniques within an automatic code generation framework. In other words, the main contribution of this paper is in solving the problem of automatically translating from the mathematics of static condensation and hybridization to compiled code. This automated translation facilitates the separation of concerns between applications scientists and computational/computer scientists and facilitates the automated optimization of compiled code. This framework provides an environment for the development and testing of numerics relevant to the Gung-Ho Project, an initiative by the UK Met Office in designing the next-generation atmospheric dynamical core using mixed finite element methods (Melvin et al., 2019). Our work is implemented in the Firedrake finite element library and the PETSc solver library (Balay et al., 1997, 2019), accessed via the Python interface petsc4py (Dalcin et al., 2011).
The rest of the paper is organized as follows. We introduce common notation used throughout the paper in Sect. 1.1. The embedded DSL, called "Slate", is introduced in Sect. 2, which allows concise expression of localized linear algebra operations on finite element tensors. We provide some contextual examples for static condensation and hybridization in Sect. 3, including a discussion on post-processing. We then outline in Sect. 4 how, by interpreting static condensation techniques as a preconditioner, we can go further and automate many of the symbolic manipulations necessary for hybridization and static condensation. We first demonstrate our implementation on a manufactured problem derived from a second-order elliptic equation, starting in Sect. 5. The first example compares a hybridizable discontinuous Galerkin (HDG) method with an optimized continuous Galerkin method. Section 5.2 illustrates the composability and relative performance of hybridization for compatible mixed methods applied to a semi-implicit discretization of the nonlinear rotating shallow water equations. Our final example in Sect. 5.3 demonstrates time-step robustness of a hybridizable solver for a compatible finite element discretization of a rotating linear Boussinesq model. Conclusions follow in Sect. 6.

Notation
We begin by establishing notation used throughout this paper. Let T_h denote a tessellation of the computational domain Ω ⊂ R^n, consisting of polygonal elements K associated with a mesh size parameter h, and let ∂T_h = {e ∈ ∂K : K ∈ T_h} denote the set of facets of T_h. The set of facets interior to the domain is denoted by E°_h := ∂T_h \ ∂Ω. Similarly, we denote the set of exterior facets as E^∂_h := ∂T_h ∩ ∂Ω. For brevity, we denote the finite element integral forms over T_h and any facet set Γ ⊂ ∂T_h by

(u, v)_{T_h} = Σ_{K ∈ T_h} ∫_K u · v dx,    ⟨u, v⟩_Γ = Σ_{e ∈ Γ} ∫_e u · v ds,

where dx and ds denote appropriate integration measures. The operation · should be interpreted as standard multiplication for scalar functions or a dot product for vector functions. For any double-valued vector field w on a facet e ∈ ∂T_h, we define the jump of its normal component across e by

[[w]]_e = w|_{e^+} · n_{e^+} + w|_{e^-} · n_{e^-},   e ∈ E°_h,
[[w]]_e = w|_e · n_e,   e ∈ E^∂_h,

where + and − denote arbitrarily but globally defined sides of the facet. Here, n_{e^+} and n_{e^-} are the unit normal vectors with respect to the positive and negative sides of the facet e.

A system for localized algebra on finite element tensors
We present an expressive language for dense linear algebra on the elemental matrix systems arising from finite element problems. The language, which we call Slate, provides typical mathematical operations performed on matrices and vectors; hence the input syntax is comparable to high-level linear algebra software such as MATLAB. The Slate language provides basic abstract building blocks which can be used by a specialized compiler for linear algebra to generate low-level code implementations. Slate is heavily influenced by the Unified Form Language (UFL) (Alnaes et al., 2014;Logg et al., 2012a), a DSL embedded in Python which provides symbolic representations of finite element forms. The expressions can be compiled by a form compiler, which translates UFL into low-level code for the local assembly of a form over the cells and facets of a mesh. In a similar manner, Slate expressions are compiled to low-level code that performs the requested linear algebra element-wise on a mesh.

An overview of Slate
To clarify conventions and the scope of Slate, we start by establishing our notation for a general finite element form, following the convention of Alnaes et al. (2014). We define a real-valued multi-linear form as an operator which maps a list of arguments v = (v_0, …, v_{α−1}) ∈ V_0 × ⋯ × V_{α−1} into R:

a : V_0 × ⋯ × V_{α−1} → R,   v ↦ a(v),

where a is linear in each argument v_k. The arity of a form is α, an integer denoting the total number of form arguments. In traditional finite element nomenclature (for α ≤ 2), V_0 is referred to as the space of test functions and V_1 as the space of trial functions. Each V_k is referred to as an argument space. Forms with arity α = 0, 1, or 2 are best interpreted as the more familiar mathematical objects: scalars (0-forms), linear forms or functionals (1-forms), and bilinear forms (2-forms), respectively.
If a given form a is parameterized by one or more coefficients, say c = (c_0, …, c_q) ∈ C_0 × ⋯ × C_q, where {C_k}_{k=0}^q are coefficient spaces, then we write a(c; v) = a(c_0, …, c_q; v_0, …, v_{α−1}). From here on, we shall work exclusively with forms that are linear in v and possibly nonlinear in the coefficients c. This is reasonable since nonlinear methods based on Newton iterations produce linear problems via Gâteaux differentiation of a nonlinear form corresponding to a PDE (also known as the form Jacobian). We refer the interested reader to Alnaes et al. (2014, Sect. 2.1.2) for more details. For clarity, typical examples of multi-linear forms of arity α = 0, 1, and 2 that frequently appear in finite element discretizations are functionals such as ∫_Ω c dx, linear forms such as ∫_Ω c v_0 dx, and bilinear forms such as ∫_Ω c ∇v_1 · ∇v_0 dx, respectively. In general, a finite element form will consist of integrals over various geometric domains: integration over cells T_h, interior facets E°_h, and exterior facets E^∂_h. Therefore, we express a general multi-linear form in terms of integrals over each set of geometric entities:

a(c; v) = Σ_{K ∈ T_h} ∫_K I^C_K(c; v) dx + Σ_{e ∈ E°_h} ∫_e I^{E,°}_e(c; v) ds + Σ_{e ∈ E^∂_h} ∫_e I^{E,∂}_e(c; v) ds,

where I^C_K is an integrand on the cell K, I^{E,°}_e is an integrand on the interior facet e ∈ E°_h, and I^{E,∂}_e is an integrand defined on the exterior facet e ∈ E^∂_h. The form a(c; v) describes a finite element form globally over the entire problem domain.
Here, we will consider the case where the interior facet integrands I^{E,°}_e(c; v) can be decomposed into two independent parts on each interior facet e: one for the positive restriction (+) and one for the negative restriction (−). That is, for each e ∈ E°_h, we may write I^{E,°}_e(c; v) = I^{E,+}_e(c; v) + I^{E,−}_e(c; v). This allows us to express the integral over an interior facet e connecting two adjacent elements, say K^+ and K^−, as the sum of the integrals contributed by each side. The local contribution of Eq. (10) in each cell K, along with its associated facets e ⊂ ∂K, is then

a_K(c; v) = ∫_K I^C_K(c; v) dx + Σ_{e ⊂ ∂K ∩ E°_h} ∫_e I^{E,±}_e(c; v) ds + Σ_{e ⊂ ∂K ∩ E^∂_h} ∫_e I^{E,∂}_e(c; v) ds.

We call Eq. (12) the cell-local contribution of a(c; v). To make matters concrete, let us suppose a(c; v) is a bilinear form with arguments v_0 ∈ V_0 and v_1 ∈ V_1, expanded in the finite element bases of the two argument spaces, so that the cell-local contribution produces entries A_{K,ij}. By construction, A_{K,ij} can be non-zero only if the basis functions with global indices i and j take non-zero values in K. Now we introduce the cell-node map i = e(K, î) as the mapping from the local node number î in K to the global node number i. Suppose there are n and m nodes defining the degrees of freedom for V_0 and V_1, respectively, in K. Then all non-zero entries of A_{K,ij} arise from integrals involving basis functions with local indices corresponding to the global indices i, j. These local contributions are collected in the n × m dense matrix A_K, which we call the element tensor. The global matrix A is assembled by accumulating the entries of each element tensor through the cell-node maps:

A_{ij} = Σ_{K ∈ T_h} Σ_{î, ĵ : e(K,î) = i, e(K,ĵ) = j} A_{K,îĵ}.

For details on the general evaluation of finite element basis functions and multi-linear forms, we refer the reader to Kirby (2004), Kirby and Logg (2006), Logg et al. (2012b), and Homolya et al. (2018). Further details on the global assembly of finite element operators, with a particular focus on code generation, are summarized in the work of Logg and Wells (2010) and Markall et al. (2013).
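The cell-wise accumulation described above can be sketched in plain NumPy. The following is an illustrative toy (a hypothetical 1-D mesh with piecewise-linear elements; the names `cell_node_map` and `A_K` are ours, not Firedrake API), showing how element tensors scatter into a global matrix through the cell-node map:

```python
import numpy as np

# Hypothetical 1-D mesh of 4 intervals with P1 elements: 5 global nodes.
num_cells, num_nodes = 4, 5
# Cell-node map e(K, i_hat): local node number -> global node number.
cell_node_map = np.array([[k, k + 1] for k in range(num_cells)])

h = 0.25  # uniform cell width
# Element (stiffness) tensor A_K for -u'' on one interval; identical per cell here.
A_K = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])

# Global assembly: accumulate each element tensor through the cell-node map.
A = np.zeros((num_nodes, num_nodes))
for K in range(num_cells):
    idx = cell_node_map[K]
    A[np.ix_(idx, idx)] += A_K

# Interior rows recover the familiar (-1/h, 2/h, -1/h) stencil: 0, -4, 8, -4, 0.
print(A[2])
```

In a real code the scatter is of course performed into a sparse data structure; within Firedrake this indirection is handled by PyOP2, as noted below.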
In standard finite element software packages, the element tensor is mapped entry-wise into a global sparse array using the cell-node map e(K, ·). Within Firedrake, this operation is handled by PyOP2 (Rathgeber et al., 2012) and serves as the main user-facing abstraction for global finite element assembly. For many applications, one may want to produce a new global operator by algebraically manipulating different element tensors. This is relatively invasive in numerical code, as it requires bypassing direct operator assembly to produce the new tensor. This is precisely the scope of Slate.
Like UFL, Slate relies on the grammar of the host language: Python. The entire Slate language is implemented as a Python module which defines its types (classes) and operations on said types. Together, this forms a high-level language for expressing dense linear algebra on element tensors. The Slate language consists of two primary abstractions for linear algebra:

1. terminal element tensors corresponding to multi-linear integral forms (matrices, vectors, and scalars) or assembled data (for example, coefficient vectors of a finite element function) and

2. expressions consisting of algebraic operations on terminal tensors.
The composition of binary and unary operations on terminal tensors produces a Slate expression. Such expressions can be composed with other Slate objects in arbitrary ways, resulting in concise representations of complex algebraic operations on locally assembled arrays. We summarize all currently supported Slate abstractions here.

Terminal tensors
In Slate, one associates a tensor with data on a cell either by using a multi-linear form or assembled coefficient data:

- Tensor(a(c; v)) associates a form a(c; v), expressed in UFL, with its cell-local element tensor a_K(c; v). The form arity α of a_K(c; v) determines the rank of the corresponding Tensor; i.e., scalars, vectors, and matrices are produced from scalars, linear forms, and bilinear forms, respectively. The shape of the element tensor is determined by both the number of arguments and the total number of degrees of freedom local to the cell.
- AssembledVector(f), where f is some finite element function. The function f ∈ V is expressed in terms of the finite element basis {Φ_i} of V as f = Σ_i f_i Φ_i. The result is the local coefficient vector of f on K:

f_K = {f_{e(K,î)}}_{î=1}^n,

where e(K, î) is the local node numbering and n is the number of nodes local to the cell K.

Symbolic linear algebra
Slate supports typical binary and unary operations in linear algebra, with a high-level syntax close to mathematics. At the time of this paper, these include the following.
- A + B, the addition of two equally shaped tensors: A_K + B_K.

- A * B, a contraction over the last index of A and the first index of B. This is the usual multiplicative operation on matrices, vectors, and scalars: A_K B_K.
--A, the additive inverse (negation) of a tensor: −A K .
-A.T, the transpose of a tensor: A K T .
-A.inv, the inverse of a square tensor: A K −1 .
- A.solve(B, decomposition="..."), the result, X_K, of solving a local linear system A_K X_K = B_K, optionally specifying a factorization strategy.
- A.blocks[indices], where A is a tensor from a mixed finite element space. This allows for the extraction of sub-blocks, which are indexed by field (slices are allowed). For example, if a matrix A corresponds to the bilinear form a : V × W → R, where V = V_0 × ⋯ × V_n and W = W_0 × ⋯ × W_m are product spaces consisting of finite element spaces {V_i}_{i=0}^n, {W_i}_{i=0}^m, then the element tensor has the block form (Eq. 18)

A_K = [A_{ij}],   0 ≤ i ≤ n,   0 ≤ j ≤ m,

where the sub-block A_{ij} couples the fields V_i and W_j. The associated submatrix of Eq. (18) with indices i = (p, q), p = {p_1, …, p_r}, q = {q_1, …, q_c}, collects the blocks A_{p_a q_b}, where p ⊆ {0, …, n}, q ⊆ {0, …, m}.
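The block-extraction semantics can be mimicked on a plain array. The NumPy sketch below (illustrative only; `block`, `sizes`, and `offsets` are our names, not Slate's implementation) slices a two-field element tensor into its field sub-blocks in the manner of A.blocks[i, j]:

```python
import numpy as np

# A hypothetical element tensor for a two-field problem: field 0 has 3 local
# degrees of freedom, field 1 has 2, giving a 5x5 blocked local matrix.
sizes = [3, 2]
offsets = np.concatenate([[0], np.cumsum(sizes)])  # [0, 3, 5]

rng = np.random.default_rng(0)
A_K = rng.standard_normal((5, 5))

def block(A, i, j):
    """Extract sub-block (i, j) of a blocked element tensor,
    analogous to A.blocks[i, j] in Slate."""
    ri = slice(offsets[i], offsets[i + 1])
    rj = slice(offsets[j], offsets[j + 1])
    return A[ri, rj]

A00 = block(A_K, 0, 0)   # 3x3 field-0 coupling
A01 = block(A_K, 0, 1)   # 3x2 cross-coupling between the fields
```

In Slate the analogous indexing is symbolic: no data is copied until the surrounding expression is compiled and assembled.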
Each Tensor object knows all the information about the underlying UFL form that defines it, such as form arguments, coefficients, and the underlying finite element space(s) it operates on. This information is propagated through as unary or binary transformations are applied. The unary and binary operations shown here provide the necessary algebraic framework for a large class of problems, some of which we present in this paper. In Firedrake, Slate expressions are transformed into low-level code by a linear algebra compiler. The compiler interprets Slate expressions as a syntax tree, where the tree is visited to identify what local arrays need to be assembled and the sequence of array operations. At the time of this work, our compiler generates C++ code, using the templated library Eigen (Guennebaud et al., 2015) for dense linear algebra. The translation from Slate to C++ is fairly straightforward, as all operations supported by Slate have a representation in Eigen.
The compiler pass will generate a single "macro" kernel, which performs the dense linear algebra operations represented in Slate. The resulting code will also include (often multiple) function calls to local assembly kernels generated by TSFC (Two Stage Form Compiler) (Homolya et al., 2018) to assemble all necessary sub-blocks of an element tensor. All code generated by the linear algebra compiler conforms to the application programming interface (API) of the PyOP2 framework, as detailed by Rathgeber et al. (2012, Sect. 3). Figure 1 provides an illustration of the complete tool chain.
Most optimization of the resulting dense linear algebra code is handled directly by Eigen. In the case of unary and binary operations such as A.inv and A.solve(B), stable default behaviors are applied by the linear algebra compiler. For example, A.solve(B) without a specified factorization strategy will default to using an in-place LU factorization with partial pivoting. For local matrices smaller than 5 × 5, the inverse is translated directly into Eigen's A.inverse(), which employs stable analytic formulas. For larger matrices, the linear algebra compiler replaces A.inv with an LU factorization. Currently, we only support direct matrix factorizations for solving local linear systems. However, it would not be difficult to extend Slate to support more general solution techniques like iterative methods.
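The rationale for the factorization-based default can be illustrated in NumPy (a sketch of the algebra only; Slate lowers to Eigen in C++, not to NumPy): a solve via factorization and an explicit inverse agree on well-conditioned local matrices, but the factorization route avoids forming the inverse and is the numerically preferable default.

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.standard_normal((8, 8))
A = R @ R.T + 8.0 * np.eye(8)      # a well-conditioned local "element" matrix
B = rng.standard_normal((8, 3))

# What A.solve(B) lowers to conceptually: a factorization-based solve
# (Eigen's default here is an in-place LU with partial pivoting).
X_solve = np.linalg.solve(A, B)

# What A.inv * B would suggest literally: form the inverse, then multiply.
X_inv = np.linalg.inv(A) @ B

# Both agree for well-conditioned A; the solve avoids the explicit inverse.
assert np.allclose(X_solve, X_inv)
```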

Examples
We now present examples and discuss solution methods which require element-wise manipulations of finite element systems and their specification in Slate. We stress here that Slate is not limited to these model problems; rather, these examples are chosen for clarity and to demonstrate key features of the Slate language. For our discussion, we use a model elliptic equation, Eqs. (23)-(26), defined in a computational domain Ω ⊂ R^n.

Figure 1. The Slate language wraps UFL objects describing the finite element system. The resulting Slate expressions are passed to a specialized linear algebra compiler, which produces a single macro kernel assembling the local contributions and executes the dense linear algebra represented in Slate. The kernels are passed to Firedrake's PyOP2 interface, which wraps the Slate kernel in a mesh-iteration kernel. Parallel scheduling, code generation, and compilation occur after the PyOP2 layer.

Hybridization of mixed methods
To motivate our discussion in this section, we start by recalling the mixed method for Eqs. (23)-(26). Methods of this type seek approximations (u_h, p_h) in finite-dimensional subspaces U_h × V_h. The space U_h consists of H(div)-conforming piecewise vector polynomials, where choices of U(K) typically include the Raviart-Thomas (RT), Brezzi-Douglas-Marini (BDM), or Brezzi-Douglas-Fortin-Marini (BDFM) elements (Raviart and Thomas, 1977; Nédélec, 1980; Brezzi et al., 1985, 1987). The space V_h is the Lagrange family of discontinuous polynomials. These spaces are of particular interest when simulating geophysical flows, since choosing the right pairing results in stable discretizations with desirable conservation properties and avoids spurious computational modes; we refer the reader to Cotter and Shipton (2012) for further details. The discrete formulation, Eqs. (29)-(30), seeks (u_h, p_h) ∈ U_h,0 × V_h, where U_h,0 is the subspace of U_h with functions whose normal components vanish on ∂Ω_N. The discrete system is obtained by first expanding the solutions in terms of the finite element bases: u_h = Σ_{i=1}^{N_u} U_i Ψ_i and p_h = Σ_{i=1}^{N_p} P_i ξ_i, where {Ψ_i} and {ξ_i} are bases for U_h and V_h, respectively. Here, U_i and P_i are the coefficients to be determined. As per standard Galerkin-based finite element methods, taking w = Ψ_j, j ∈ {1, …, N_u} and φ = ξ_j, j ∈ {1, …, N_p} in Eqs. (29)-(30) produces the discrete saddle point system of Eq. (32), where U and P are the coefficient vectors. Methods to efficiently invert such systems include H(div) multigrid (Arnold et al., 2000) (requiring complex overlapping Schwarz smoothers), global Schur complement factorizations (which require an approximation to the inverse of the dense elliptic Schur complement D + B A^{-1} B^T), or auxiliary space multigrid (Hiptmair and Xu, 2007). Here, we focus on a solution approach using a hybridized mixed method (Arnold and Brezzi, 1985; Brezzi and Fortin, 1991; Boffi et al., 2013).
The hybridization technique replaces the original system with a discontinuous variant, decoupling the velocity degrees of freedom between cells. This is done by replacing the discrete solution space for u_h with the "broken" space U^d_h, defined as

U^d_h := {w ∈ [L^2(Ω)]^n : w|_K ∈ U(K), ∀K ∈ T_h}.

The vector finite element space U^d_h is a subspace of [L^2(Ω)]^n consisting of local H(div) functions, but normal components are no longer required to be continuous on ∂T_h. The approximation space for p_h remains unchanged. Next, Lagrange multipliers are introduced as an auxiliary variable in the space M_h, defined only on cell interfaces:

M_h := {γ ∈ L^2(∂T_h) : γ|_e ∈ M(e), ∀e ∈ ∂T_h},

where M(e) denotes a polynomial space defined on each facet. We call M_h the space of approximate traces. Functions in M_h are discontinuous across vertices in two dimensions and vertices or edges in three dimensions. Deriving the hybridizable mixed system is accomplished through integration by parts over each element K: testing with w ∈ U^d_h(K) and integrating Eq. (23) over the cell K produces a cell-local equation in which the trace function λ_h is introduced in the surface integral as an approximation to p|_∂K. An additional constraint equation, called the transmission condition, is added to close the system. The resulting hybridizable formulation, Eqs. (41)-(43), is posed for (u^d_h, p_h, λ_h), where M_h,0 denotes the space of traces vanishing on ∂Ω_D. The transmission condition Eq. (43) enforces both the continuity of u^d_h · n across element boundaries as well as the boundary condition u^d_h · n = g on ∂Ω_N. If the space of Lagrange multipliers M_h is chosen appropriately, then the broken velocity u^d_h, albeit sought a priori in a discontinuous space, will coincide with its H(div)-conforming counterpart. Specifically, the formulations in Eqs. (41)-(42) and (29)-(30) are solving equivalent problems if the normal components of w ∈ U_h lie in the same polynomial space as the trace functions (Arnold and Brezzi, 1985). The discrete matrix system arising from Eqs.
(41)-(43) has the general form of a three-field block system, Eq. (44), for the coefficient vectors U^d, P, and Λ, produced by expanding functions in terms of the finite element bases for U^d_h, V_h, and M_h as before. Upon initial inspection, it may not appear advantageous to replace our original formulation with this augmented equation set; the hybridizable system has substantially more total degrees of freedom. However, Eq. (44) has a considerable advantage over Eq. (32) in the following ways.
1. Since both U^d_h and V_h are discontinuous spaces, U^d and P are coupled only within the cell. This allows us to simultaneously eliminate both unknowns via local static condensation to produce a significantly smaller global (hybridized) problem for the trace unknowns Λ:

S Λ = E,   (45)

where S ← {S_K}_{K ∈ T_h} and E ← {E_K}_{K ∈ T_h} are assembled via the local element tensors in Eqs. (46) and (47). Note that the inverse of the block matrix in Eqs. (46) and (47) is never evaluated globally; the elimination can be performed locally by performing a sequence of Schur complement reductions within each cell.
3. Once Λ is computed, both U^d and P can be recovered locally in each element. This can be accomplished in a number of ways. One way is to compute P_K by solving the local system in Eq. (48), followed by solving Eq. (49) for U^d_K. Similarly, one could rearrange the order in which each variable is reconstructed.
4. If desired, the solutions can be improved further through local post-processing. We highlight two such procedures for U d and P, respectively, in Sect. 3.3. Figure 2 displays the corresponding Slate code for assembling the trace system, solving Eq. (45), and recovering the eliminated unknowns. For a complete reference on how to formulate the hybridized mixed system Eqs. (41)-(43) in UFL, we refer the reader to Alnaes et al. (2014). Complete Firedrake code using Slate to solve a hybridizable mixed system is also publicly available in Zenodo/Tabula-Rasa (2019, "Code verification"). We remark that, in the case of this hybridizable system, Eq. (44) contains zero-valued blocks which can simplify the resulting expressions in Eqs. (46)-(47) and (48)-(49). This is not true in general and therefore the expanded form using all sub-blocks of Eq. (44) is presented for completeness.
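The elimination and recovery steps can be illustrated with a toy block system in NumPy. This is a schematic of the algebra only: the block names M, C, F, G are generic placeholders for the cell-local sub-blocks of a hybridizable system (not the paper's Eq. 44, whose sign conventions we do not reproduce), with M collecting the cell-coupled (U^d, P) blocks and C the trace coupling.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy cell-local blocks: x = (U, P) couples only within the cell (block M),
# while the trace unknowns Lam couple through the block C.
n_up, n_lam = 6, 3
R = rng.standard_normal((n_up, n_up))
M = R @ R.T + n_up * np.eye(n_up)   # invertible cell-local block
C = rng.standard_normal((n_lam, n_up))
F = rng.standard_normal(n_up)
G = np.zeros(n_lam)

# Static condensation: eliminate (U, P) locally, leaving a system for Lam.
S = -C @ np.linalg.solve(M, C.T)          # local Schur complement S_K
E = G - C @ np.linalg.solve(M, F)         # condensed right-hand side E_K
Lam = np.linalg.solve(S, E)               # globally: assemble S_K, E_K, then solve

# Local recovery: back-substitute for the eliminated unknowns.
UP = np.linalg.solve(M, F - C.T @ Lam)

# The recovered solution satisfies the original full system.
full = np.block([[M, C.T], [C, np.zeros((n_lam, n_lam))]])
x = np.concatenate([UP, Lam])
b = np.concatenate([F, G])
assert np.allclose(full @ x, b)
```

In the actual implementation this condense/recover cycle runs cell by cell; only S and E are ever assembled globally, which is the source of the method's efficiency.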

Hybridization of discontinuous Galerkin methods
The HDG method is a natural extension of discontinuous Galerkin (DG) discretizations. Here, we consider a specific HDG discretization, namely the LDG-H method (Cockburn et al., 2010b). Other forms of HDG that involve local lifting operators can also be implemented in this software framework by the introduction of additional local (i.e., discontinuous) variables into the definition of the local solver.
Deriving the LDG-H formulation follows exactly from standard DG methods. All prognostic variables are sought in the discontinuous spaces U_h × V_h ⊂ [L^2(Ω)]^n × L^2(Ω), where U(K) and V(K) are vector and scalar polynomial spaces, respectively. Within a cell K, integration by parts yields the cell-local equations (50)-(51). Now, we define the numerical fluxes p̂ and û to be functions of the trial unknowns and a new independent unknown λ_h in the trace space M_h:

û = u_h + τ (p_h − λ_h) n,   p̂ = λ_h,

where λ_h ∈ M_h is a function approximating p on ∂T_h and τ is a positive stabilization function that may vary on each facet e ∈ ∂T_h. We further require that λ_h satisfies the Dirichlet condition for p on ∂Ω_D in an L^2-projection sense. The full LDG-H formulation is given by Eqs. (54)-(57).
Equation (56) is the transmission condition, which enforces the continuity of û · n across ∂T_h; this ensures that the numerical flux is single-valued on the facets. Equation (57) ensures λ_h satisfies the Dirichlet condition. Hence, the LDG-H method defines a conservative DG method (Cockburn et al., 2010b). Note that the choice of τ has a significant influence on the expected convergence rates of the computed solutions.
The LDG-H method retains the advantages of standard DG methods while also enabling the assembly of reduced linear systems through static condensation. The matrix system arising from Eqs. (54)-(57) has the same general form as the hybridized mixed method in Eq. (44), except all sub-blocks are now populated with non-zero entries due to the coupling of trace functions with both p h and u h . However, all previous properties of the discrete matrix system from Sect. 3.1 still apply. The Slate expressions for the local elimination and reconstruction operations will be identical to those illustrated in Fig. 2. For the interested reader, a unified analysis of hybridization methods (both mixed and DG) for second-order elliptic equations is presented in Cockburn et al. (2009a) and Cockburn (2016).

Local post-processing
For both mixed (Arnold and Brezzi, 1985; Brezzi et al., 1985; Bramble and Xu, 1989; Stenberg, 1991) and discontinuous Galerkin methods (Cockburn et al., 2009b, 2010b), it is possible to locally post-process solutions to obtain superconvergent approximations (gaining 1 order of accuracy over the unprocessed solution). These methods can be expressed as local solves on each element and are straightforward to implement using Slate. In this section, we present two post-processing techniques: one for scalar fields and another for the vector unknown. The Slate code follows naturally from previous discussions in Sect. 3.1 and 3.2, using the standard set of operations on element tensors summarized in Sect. 2.1.

Post-processing of the scalar solution
Our first example is a modified version of the procedure presented by Stenberg (1991) for enhancing the accuracy of the scalar solution. This was also highlighted within the context of hybridizing eigenproblems by Cockburn et al. (2010a). This post-processing technique can be used for both the hybridizable mixed and LDG-H methods. We proceed by posing the finite element systems cell-wise.
Let P_k(K) denote a polynomial space of degree ≤ k on a cell K ∈ T_h. Then, for a given pair of computed solutions u_h, p_h of the hybridized methods, we define the post-processed scalar p*_h ∈ P_{k+1}(K) as the unique solution of the local problem in Eqs. (58)-(59), where 0 ≤ l ≤ k. Here, the space P^{⊥,l}_{k+1}(K) denotes the L^2-orthogonal complement of P_l(K). This post-processing method directly uses the definition of the flux u_h, the approximation of −κ∇p. In practice, the space P^{⊥,l}_{k+1}(K) may be constructed using an orthogonal hierarchical basis, and solving Eqs. (58)-(59) amounts to inverting a symmetric positive definite system in each cell of the mesh.
At the time of this work, Firedrake does not support the construction of such a finite element basis. However, we can introduce Lagrange multipliers to enforce the orthogonality constraint. The resulting local problem then becomes the following mixed system: find (p*_h, ψ) ∈ P_{k+1}(K) × P_l(K) satisfying Eqs. (60)-(61), where 0 ≤ l ≤ k. The local problems Eqs. (60)-(61) and Eqs. (58)-(59) are equivalent, with the Lagrange multiplier ψ enforcing orthogonality of test functions in P_{k+1}(K) with functions in P_l(K). This post-processing method produces a new approximation which superconverges at a rate of k + 2 for hybridized mixed methods (Stenberg, 1991; Cockburn et al., 2010a). For the LDG-H method, k + 2 superconvergence is achieved when τ = O(1) and τ = O(h), but only k + 1 convergence is achieved when τ = O(1/h) (Cockburn et al., 2009b, 2010b). We demonstrate the increased accuracy in computed solutions in Sect. 5.1. An abridged example using Firedrake and Slate to solve the local linear systems is provided in Fig. 3.
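The multiplier trick generalizes beyond this particular post-processing method: any cell-local problem with a linear side constraint can be posed as a small saddle point solve. The NumPy sketch below shows the generic structure under stated assumptions (K is a symmetric positive definite stand-in for the cell-local operator, and Z encodes a constraint Zᵀx = 0; it is not the actual finite element system of Eqs. 60-61):

```python
import numpy as np

rng = np.random.default_rng(3)

# Generic local problem: solve K x = f subject to the side constraint
# Z^T x = 0, imposed weakly via a Lagrange multiplier psi.
n, m = 6, 2                       # e.g. dim P_{k+1}(K) and dim P_l(K)
R = rng.standard_normal((n, n))
K = R @ R.T + n * np.eye(n)       # symmetric positive definite local operator
Z = rng.standard_normal((n, m))   # constraint matrix: require Z^T x = 0
f = rng.standard_normal(n)

# Mixed (saddle point) form of the constrained local problem:
#   [ K    Z ] [ x   ]   [ f ]
#   [ Z^T  0 ] [ psi ] = [ 0 ]
sys = np.block([[K, Z], [Z.T, np.zeros((m, m))]])
rhs = np.concatenate([f, np.zeros(m)])
sol = np.linalg.solve(sys, rhs)
x, psi = sol[:n], sol[n:]

# The multiplier enforces the constraint exactly (up to round-off).
assert np.allclose(Z.T @ x, 0.0)
```

Because each such system is cell-local and small, inverting it per element is cheap, which is what makes the Slate expression of Eqs. (60)-(61) practical.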

Post-processing of the flux
Our second example illustrates a procedure that uses the numerical flux of an HDG discretization for Eqs. (23)-(26). Within the context of the LDG-H method, we can use the numerical trace in Eq. (52) to produce a vector field that is H (div)-conforming. The technique we outline here follows that of Cockburn et al. (2009b).
Let T_h be a mesh consisting of simplices. On each cell K ∈ T_h, we define a new function u*_h to be the unique element of the local Raviart-Thomas space [P_k(K)]^n + x P_k(K) satisfying Eqs. (62)-(63) for all facets e on ∂K, where û_h is the numerical flux defined in Eq. (52). This local problem produces a new velocity u*_h with the following properties.
1. u*_h converges at the same rate as u_h for all choices of τ producing a solvable system for Eqs. (54)-(57). 2. u*_h is H(div)-conforming: its normal components are continuous across element facets. 3. Additionally, the divergence of u*_h converges at a rate of k + 1.
The Firedrake implementation using Slate is similar to the scalar post-processing example (see Fig. 3); the cell-wise linear systems Eqs. (62)-(63) can be expressed in UFL, and therefore the necessary Slate expressions to invert the local systems follow naturally from the set of operations presented in Sect. 2.1. We exploit the sensitive parameter dependency of the post-processing methods to validate our software implementation in Zenodo/Tabula-Rasa (2019, "Code verification").

Static condensation as a preconditioner
Slate enables static condensation approaches to be expressed very concisely. Nonetheless, applying a particular approach to different variational problems using Slate still requires a certain amount of code repetition. By formulating each form of static condensation as a preconditioner, code can be written once and then applied to any mathematically suitable problem. Rather than writing the static condensation by hand, in many cases it is sufficient to select the appropriate Slate-based preconditioner. It is helpful to frame the problem in the context of the solver library, PETSc. Firedrake uses PETSc as its main solver abstraction framework and can provide operator-based preconditioners for solving linear systems as PC objects expressed in Python via petsc4py (Dalcin et al., 2011). For a comprehensive overview of solving linear systems with PETSc, we refer the interested reader to Balay et al. (2019, Sect. 4).
Suppose we wish to solve a linear system: Ax = b. We can think of (left) preconditioning the system in residual form: by an operator P (which may not necessarily be linear) as a transformation into an equivalent system of the form

Geosci. Model Dev., 13, 735-761, 2020, www.geosci-model-dev.net/13/735/2020/

Figure 3 caption: The corresponding symbolic local tensors are defined in lines 9 and 11. The Slate expression for directly inverting the local system is written in line 12. In line 16, a Slate-generated kernel is produced which solves the resulting linear system in each cell. Since we are not interested in the multiplier, we only return the block corresponding to the new pressure field.
Given a current iterate x_i, the residual at the ith iteration is simply r_i ≡ b − Ax_i, and P acts on the residual to produce an approximation to the error ε_i ≡ x − x_i. If P is an application of an exact inverse, the residual is converted into an exact (up to numerical roundoff) error. We will denote the application of a particular Krylov subspace method (KSP) for the linear system Eq. (64) as K_x(r(A, b)). Upon preconditioning the system via P as in Eq. (65), we write K_x(P r(A, b)).
If Eq. (66) is solved directly via P = A^{-1}, then P r(A, b) = A^{-1}b − x, and Eq. (66) becomes K_x(r(I, A^{-1}b)), producing the exact solution of Eq. (64) in a single iteration of K. Having established notation, we now present our implementation of static condensation via Slate by defining the appropriate operator P.
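The remark about exact inverses can be verified in a few lines of NumPy, using preconditioned Richardson iteration as a simple stand-in for the Krylov method K (the operator and data below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)       # synthetic SPD operator
b = rng.standard_normal(n)
x_exact = np.linalg.solve(A, b)

def richardson(A, b, P_apply, iters):
    """Preconditioned Richardson: x_{i+1} = x_i + P r_i, with r_i = b - A x_i."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x + P_apply(b - A @ x)
    return x

# With P = A^{-1}, the residual is mapped to the exact error, so a single
# iteration recovers the solution up to roundoff.
Ainv = np.linalg.inv(A)
x1 = richardson(A, b, lambda r: Ainv @ r, iters=1)
assert np.linalg.norm(x1 - x_exact) < 1e-8 * np.linalg.norm(x_exact)
```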

Interfacing with PETSc via custom preconditioners
The implementation of preconditioners for the systems considered in this paper requires the manipulation not of assembled matrices, but rather of their symbolic representation. To do this, we use the preconditioning infrastructure developed by Kirby and Mitchell (2018), which gives preconditioners written in Python access to the symbolic problem description. In Firedrake, this means all derived preconditioners have direct access to the UFL representation of the PDE system. From this mathematical specification, we manipulate the problem appropriately via Slate and provide operators assembled from Slate expressions to PETSc for further algebraic preconditioning. Using this approach, we have developed a static condensation interface for the hybridization of H(div) × L² mixed problems and a generic interface for statically condensing finite element systems. The advantage of writing even the latter as a preconditioner is the ability to switch out the solution scheme at runtime, even when the system is nested inside a larger set of coupled equations or a Newton-based nonlinear solver.

A static condensation interface for hybridization
As discussed in Sect. 3.1 and 3.2, one of the main advantages of using a hybridizable variant of a DG or mixed method is that such systems permit the use of cell-wise static condensation. To facilitate this, we provide a PETSc PC static condensation interface: firedrake.SCPC. This preconditioner takes the discretized system as in Eq. (44) and performs the local elimination and recovery procedures. Slate expressions are generated from the underlying UFL problem description. More precisely, the incoming system has the form where X_e is the vector of unknowns to be eliminated, X_c is the vector of unknowns for the condensed field, and R_e and R_c are the incoming right-hand sides. The partitioning in Eq. (67) is determined by the solver option pc_sc_eliminate_fields. Field indices are provided in the same way one configures solver options to PETSc. These indices determine which field(s) to statically condense into. For example, on a three-field problem (with indices 0, 1, and 2), setting -pc_sc_eliminate_fields 0,1 will configure firedrake.SCPC to eliminate fields 0 and 1 cell-wise; the resulting condensed system is associated with field 2. The firedrake.SCPC preconditioner can be interpreted as a Schur complement method for Eq. (67), where S = A_{c,c} − A_{c,e} A_{e,e}^{-1} A_{e,c} is the Schur complement operator for the X_c system. The distinction here from block preconditioners via the PETSc fieldsplit option (Brown et al., 2012), for example, is that P does not require global actions; by design, A_{e,e} can be inverted locally and S is sparse. As a result, S can be assembled or applied exactly, up to numerical roundoff, via Slate-generated kernels.
In practice, the only globally coupled system requiring iterative inversion is that for S: where R_s = R_c − A_{c,e} A_{e,e}^{-1} R_e is the condensed right-hand side and P_1 is a preconditioner for S. Once X_c is computed, X_e is reconstructed cell-wise by inverting the local system X_e = A_{e,e}^{-1}(R_e − A_{e,c} X_c). By construction, this preconditioner is suitable for both hybridized mixed and HDG discretizations. It can also be used within other contexts, such as the static condensation of continuous Galerkin discretizations (Guyan, 1965; Irons, 1965) or primal-hybrid methods (Devloo et al., 2018). As with any PETSc preconditioner, solver options can be specified for inverting S via the appropriate options prefix (condensed_field). The resulting KSP for Eq. (69) is compatible with existing solvers and external packages provided through the PETSc library. This allows users to experiment with a direct method and then switch to a more parallel-efficient iterative solver without changing the core application code.
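The elimination and recovery algebra behind firedrake.SCPC can be sketched on a dense toy system (plain NumPy; the block sizes and matrices are synthetic, and the real implementation performs these steps cell-wise via Slate-generated kernels rather than on a monolithic matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
ne, nc = 6, 3                     # eliminated / condensed unknowns (toy sizes)
M = rng.standard_normal((ne + nc, ne + nc))
K = M @ M.T + (ne + nc) * np.eye(ne + nc)    # block system [[Aee, Aec], [Ace, Acc]]
Aee, Aec = K[:ne, :ne], K[:ne, ne:]
Ace, Acc = K[ne:, :ne], K[ne:, ne:]
Re = rng.standard_normal(ne)
Rc = rng.standard_normal(nc)

# Condensed (Schur complement) system for X_c, as in Eq. (69):
S  = Acc - Ace @ np.linalg.solve(Aee, Aec)
Rs = Rc - Ace @ np.linalg.solve(Aee, Re)
Xc = np.linalg.solve(S, Rs)

# Local back-substitution for the eliminated unknowns:
Xe = np.linalg.solve(Aee, Re - Aec @ Xc)

# The condensed solve plus reconstruction reproduces the full direct solve.
X = np.linalg.solve(K, np.concatenate([Re, Rc]))
assert np.allclose(np.concatenate([Xe, Xc]), X)
```

In the hybridized setting A_{e,e} is block diagonal over cells, which is what makes the elimination and back-substitution purely local operations.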

Preconditioning mixed methods via hybridization
The preconditioner firedrake.HybridizationPC expands on the previous one, this time taking an H (div)×L 2 system and automatically forming the hybridizable problem. This is accomplished through manipulating the UFL objects representing the discretized PDE. This includes replacing argument spaces with their discontinuous counterparts, introducing test functions on an appropriate trace space, and providing operators assembled from Slate expressions in a similar manner as described in Sect. 4.1.1.
More precisely, let AX = R be the incoming mixed saddle point problem, where R = [R_U, R_P]^T, X = [U, P]^T, and U and P are the velocity and scalar unknowns, respectively. This preconditioner then replaces AX = R with the extended problem: where Λ are the Lagrange multipliers, R̂ = [R̂_U, R_P]^T, R̂_U and R_P are the right-hand sides for the flux and scalar equations, respectively, and a hat (ˆ) indicates modified matrices and covectors involving discontinuous functions. Here, X̂ = [U^d, P]^T are the hybridizable (discontinuous) unknowns to be determined, and C X̂ = R_g is the matrix representation of the transmission condition for the hybridizable mixed method (see Eq. 43).
The application of firedrake.HybridizationPC can be interpreted as the Schur complement reduction of Eq. (70): where S is the Schur complement matrix S = −C Â^{-1} C^T. As before, a single global system for the multipliers Λ can be assembled cell-wise using Slate-generated kernels. Configuring the solver for inverting S is done via the PETSc options prefix -hybridization. The recovery of U^d and P happens in the same manner as in firedrake.SCPC.
Since the hybridizable flux solution is constructed in the broken H(div) space U^d_h, we must project the computed solution into U_h ⊂ H(div). This can be done cheaply via local facet averaging. The resulting solution is then updated via U ← Π_div U^d, where Π_div : U^d_h → U_h is a projection operator. This ensures that the residual for the original mixed problem is properly evaluated to test for solver convergence. With P as in Eq. (71), the preconditioning operator for the original system AX = R then has the form We note here that assembly of the right-hand side for the multiplier system requires special attention. Firstly, when Neumann conditions are present, R_g is not necessarily 0. Since the hybridization preconditioner has access to the entire Python context (which includes the list of boundary conditions and the spaces in which they are applied), surface integrals on the exterior boundary are added where appropriate and incorporated into the generated Slate expressions. A more subtle issue requiring extra care is the incoming right-hand side, which is tested in the H(div) space U_h.
The situation we are given is that we have For consistency, we also require for any w ∈ U_h that We can construct such an R̂_U satisfying Eq. (73) in the following way. By construction, we have Eq. (74) for each basis function Φ_i ∈ U_h, where Φ_i^+ and Φ_i^- are basis functions corresponding to the positive and negative restrictions associated with the ith facet node.⁴ We then define our broken right-hand side via the local definition Eq. (75), where N_i is the number of cells that the degree of freedom corresponding to Φ_i ∈ U_h is topologically associated with. Using Eq. (74), Eq. (75), and the fact that R_U is linear in its argument, we can verify that our construction of R̂_U satisfies Eq. (73).
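The division by the cell multiplicity N_i in Eq. (75) can be illustrated on a toy assembly of two cells sharing one facet degree of freedom (plain NumPy; the dof maps and values are hypothetical, not produced by Firedrake):

```python
import numpy as np

# Two cells sharing global dof 1 (a facet dof); dofs 0 and 2 are interior.
cell_dofs = [np.array([0, 1]), np.array([1, 2])]   # cell -> global dof map
R = np.array([3.0, 4.0, 5.0])                      # assembled R_U, tested in U_h

# Multiplicity N_i: number of cells each global dof is associated with.
N = np.zeros_like(R)
for dofs in cell_dofs:
    N[dofs] += 1

# Broken right-hand side: each cell-local copy receives R_i / N_i (Eq. 75).
R_broken = [R[dofs] / N[dofs] for dofs in cell_dofs]

# Consistency (Eq. 73): for a conforming test function w (one value per
# global dof, identical on both broken copies), summing the broken
# contributions recovers the original action w . R_U.
w = np.array([1.5, -2.0, 0.5])
broken_action = sum(float(w[dofs] @ Rb) for dofs, Rb in zip(cell_dofs, R_broken))
assert np.isclose(broken_action, float(w @ R))
```

The shared dof's value 4.0 is split into two copies of 2.0, so any test function that agrees across the facet sees the same total, which is precisely the consistency requirement above.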

Numerical studies
We now present results utilizing the Slate DSL and our static condensation preconditioners for a set of test problems. Since we are using the interfaces outlined in Sect. 4, Slate is accessed indirectly and requires no manually written solver code for hybridization, static condensation, or local recovery. All parallel results were obtained on a single fully loaded compute node of dual-socket Intel E5-2630v4 (Xeon) processors with 2 × 10 cores (2 threads per core) running at 2.2 GHz. In order to avoid potential memory effects due to the operating system migrating processes between sockets, we pin MPI processes to cores. The verification of the generated code is performed using parameter-sensitive convergence tests. The study consists of running a variety of discretizations spanning the methods outlined in Sect. 3. Details and numerical results are public and can be viewed in Zenodo/Tabula-Rasa (2019) (see "Code and data availability"). All results are in full agreement with the theory.

HDG method for a three-dimensional elliptic equation
In this section, we take a closer look at the LDG-H method for the model elliptic equation (sign-definite Helmholtz): where f and g are chosen such that the analytic solution is p = exp{sin(πx) sin(πy) sin(πz)}. We use a regular mesh consisting of 6·N³ tetrahedral elements (N ∈ {4, 8, 16, 32, 64}). First, we reformulate Eqs. (76)-(77) as the mixed problem: We start with polynomial approximations from linear up to cubic for the LDG-H discretization of Eqs. (78)-(80). Additionally, we compute a post-processed scalar approximation p*_h of the HDG solution, which raises the approximation order of the computed solution by one degree. In all numerical studies here, we set the HDG parameter τ = 1. All results were computed in parallel, utilizing a single compute node (described previously). A continuous Galerkin (CG) discretization of the primal problem Eqs. (76)-(77) serves as a reference for this experiment. Due to the superconvergence of the post-processed HDG solution, we use CG discretizations of polynomial orders 2, 3, and 4. This takes into account the enhanced accuracy of the HDG solution, despite its being initially computed as a lower-order approximation. We therefore expect both methods to produce equally accurate solutions to the model problem.

⁴ These are the two broken parts of Φ_i on a particular facet connecting two elements. That is, for two adjacent cells, a basis function in U_h for a particular facet node can be decomposed into two basis functions in U^d_h defined on their respective sides of the facet.
Our aim here is not to compare the performance of HDG and CG, which has been investigated elsewhere (for example, see Kirby et al., 2012;Yakovlev et al., 2016). Instead, we provide a reference that the reader might be more familiar with in order to evaluate whether our software framework produces a sufficiently performant HDG implementation relative to what might be expected.
To invert the CG system, we use a conjugate gradient solver with Hypre's BoomerAMG implementation of algebraic multigrid (AMG) as a preconditioner (Falgout et al., 2006). For the HDG method, we use the preconditioner described in Sect. 4.1.1 and the same solver setup as the CG method for the trace system. While the trace operator is indeed symmetric and positive definite, one should keep in mind that the performance of off-the-shelf AMG packages on the HDG trace system is not yet well understood. As a result, efforts on developing more efficient multigrid strategies are a topic of ongoing interest (Cockburn et al., 2014; Kronbichler and Wall, 2018).
To avoid over-solving, we iterate to a relative tolerance such that the discretization error is minimal for a given mesh. In other words, the solvers are configured to terminate when there is no further reduction in the L 2 error of the computed solution compared with the analytic solution. This means we are not iterating to a fixed solver tolerance across all mesh resolutions. Therefore, we can expect the total number of Krylov iterations (for both the CG and HDG methods) to increase as the mesh resolution becomes finer. The rationale behind this approach is to directly compare the execution time to solve for the best possible approximation to the solution given a fixed resolution.

Error versus execution time
The total execution time is recorded for the CG and HDG solvers, which includes the setup time for the AMG preconditioner, matrix assembly, and the time to solution for the Krylov method. In the HDG case, we include all setup costs, the time spent building the Schur complement for the traces, local recovery of the scalar and flux approximations, and post-processing. The L 2 error against execution time and Krylov iterations to reach discretization error for each mesh are summarized in Fig. 4.
The HDG method of order k − 1 (HDG_{k−1}) with post-processing produces, as expected, a solution which is as accurate as the CG method of order k (CG_k). While the full HDG system is never explicitly assembled, the larger execution time is a result of several factors. The primary factor is that the total number of trace unknowns for the HDG_1, HDG_2, and HDG_3 discretizations is roughly 4, 3, and 2 times larger, respectively, than the corresponding number of CG unknowns; each iteration is therefore more expensive. We also observe that the trace system requires more Krylov iterations to reach discretization error, although this gap narrows relative to the CG method as the approximation order increases. Further analysis of multigrid methods for HDG systems is required to draw firmer conclusions. The main computational bottleneck in HDG methods is the global linear solver, so we expect our implementation to be dominated by the cost of inverting the trace operator. If one considers just the time-to-solution, the CG method is clearly ahead of the HDG method. However, the superior scaling, local conservation, and stabilization properties of the HDG method make it a particularly appealing choice for fluid dynamics applications (Yakovlev et al., 2016; Kronbichler and Wall, 2018). The development of good preconditioning strategies for the HDG method is therefore critical for its competitive use.

Breakdown of solver time
The HDG method requires many more degrees of freedom than CG or primal DG methods, largely because it simultaneously approximates the primal solution and its velocity. The global matrix for the traces is significantly larger than the one for the CG system at low polynomial order, and the execution time for HDG is then compounded by a more expensive global solve. Figure 5 displays a breakdown of total execution times on a simplicial mesh consisting of 1.5 million elements. The execution times have been normalized by the CG total time, showing that the HDG method is roughly 3 times more expensive than the CG method. This is expected given the larger degree-of-freedom count and more expensive global solve. The raw numerical breakdown of the HDG and CG solvers is shown in Table 1. We isolate each component of the HDG method contributing to the total execution time. Local operations include static condensation (trace operator assembly), forward elimination (right-hand-side assembly for the trace system), backwards substitution to recover the scalar and velocity unknowns, and local post-processing of the scalar solution. For all k, our HDG implementation is solver-dominated, as expected.
Trace operator assembly, right-hand-side assembly, and backwards substitution are all dominated by the cost of inverting a local square mixed matrix coupling the scalar and velocity unknowns, which is performed directly via an LU factorization. All three stages should therefore take times of the same order of magnitude. We observe that this is the case across all degrees, with times ranging between approximately 6% and 11% of total execution time for assembling the condensed system. Back-substitution takes roughly the same time as the static condensation and forward elimination stages (approximately 12% of execution time on average). Finally, the additional cost of post-processing is negligible (roughly 2% of execution time across all degrees), a small price for an increase of one order of accuracy.
We note that local tensors are not cached: each pass performing the local eliminations and backwards reconstructions rebuilds the local element tensors. It is not clear at this time whether the performance gained from avoiding the rebuilds would offset the memory cost of storing the local matrices. Moreover, in time-dependent problems where the operators may contain state-dependent variables, the local matrices must be rebuilt in each time step regardless.

Hybridizable mixed methods for the shallow water equations
A primary motivation for our interest in hybridizable methods is the development of efficient solvers for problems in geophysical flows. In this section, we present results from integrating the nonlinear rotating shallow water equations on the sphere using test case 5 (flow past an isolated mountain) of Williamson et al. (1992). For our discretization approach, we use the framework of compatible finite elements (Cotter and Shipton, 2012; Cotter and Thuburn, 2014). The model equations we consider are the vector-invariant rotating nonlinear shallow water system defined on a two-dimensional spherical surface embedded in R³: where u is the fluid velocity, D is the depth field, f is the Coriolis parameter, g is the acceleration due to gravity, b is the bottom topography, and (·)^⊥ ≡ ẑ × (·), with ẑ being the unit normal to the surface. See Appendix A for a complete description of the discretization in time and space. The system Eq. (83) is the matrix equation corresponding to the linearized equations in Eqs. (A3)-(A4). The Picard updates ΔU and ΔD are sought in the mixed finite element spaces U_h ⊂ H(div) and V_h ⊂ L², respectively. Stable mixed finite element pairings correspond to the well-known RT and BDM mixed methods, such as RT_k × DG_{k−1} or BDM_k × DG_{k−1}. These also fall within the set of compatible mixed spaces ideal for geophysical fluid dynamics (Cotter and Shipton, 2012; Natale et al., 2016; Melvin et al., 2019). In particular, the lowest-order RT method (RT_1 × DG_0) on a structured quadrilateral grid (such as the latitude-longitude grid used by many operational dynamical cores) is analogous to the Arakawa C-grid finite difference discretization.
In staggered finite difference models, the standard approach for solving Eq. (83) is to neglect the Coriolis term and eliminate the velocity unknown ΔU to obtain a discrete elliptic equation for ΔD, for which smoothers like Richardson iterations or relaxation methods are convergent. This is more problematic in the compatible finite element framework, since A has a dense inverse. Instead, we use the preconditioner described in Sect. 4.1.2 to form the equivalent hybridizable formulation, where both ΔU and ΔD are eliminated locally to produce a sparse elliptic equation for the Lagrange multipliers.

Figure 5. Breakdown of the CG_k and HDG_{k−1} execution times on a 6 · 64³ simplicial mesh.

Atmospheric flow over a mountain
As a test problem, we solve test case 5 of Williamson et al. (1992) on the surface of a sphere with radius R = 6371 km. We refer the reader to Cotter and Shipton (2012) and Shipton et al. (2018) for a more comprehensive study of mixed finite elements for shallow water systems of this type. We use the mixed finite element pairs (RT_1, DG_0) (lowest-order RT method) and (BDM_2, DG_1) (next-to-lowest-order BDM method) for the velocity and depth spaces. A mesh of the sphere is generated from seven refinements of an icosahedron, resulting in a triangulation T_h consisting of 327 680 elements in total. The grid information for both mixed methods is summarized in Table 2. We run for a total of 25 time steps, with a fixed number of four Picard iterations in each time step. We compare the overall simulation time using two different solver configurations for the implicit linear system. First, we use a flexible variant of the generalized minimal residual method (GMRES)⁵ acting on the system Eq. (83) with an approximate Schur complement preconditioner:

⁵ We use a flexible version of GMRES on the outer system since we use an additional Krylov solver to iteratively invert the Schur complement.
where S̃ = M + (gH Δt²/4) B diag(A)^{-1} B^T and diag(A) is a diagonal approximation to the velocity mass matrix (plus the addition of a Coriolis matrix). The Schur complement system is inverted via GMRES due to the asymmetry introduced by the Coriolis term, with the inverse of S̃ as the preconditioning operator. The sparse approximation S̃ is inverted using PETSc's smoothed aggregation multigrid (GAMG). The Krylov method is set to terminate once the preconditioned residual norm is reduced by a factor of 10⁸. A^{-1} is approximated by a single application of incomplete LU (ILU) factorization with zero fill-in.
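The quality of the diag(A) surrogate can be illustrated with a toy analogue (plain NumPy; all matrices are synthetic stand-ins and the scalar coefficient gH Δt²/4 is absorbed into the operators). When the skew Coriolis contribution is small relative to the velocity mass matrix, the surrogate Schur complement is spectrally close to the exact dense one, which is what makes it an effective preconditioner:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 3
Mu  = np.diag(rng.uniform(1.0, 2.0, n))    # toy velocity mass matrix
Ksk = rng.standard_normal((n, n))
Ksk = Ksk - Ksk.T                          # Coriolis-like skew-symmetric term
A   = Mu + 0.01 * Ksk                      # small time step: A = M_u + (dt/2) C
B   = rng.standard_normal((m, n))          # toy weak divergence operator
Mp  = np.diag(rng.uniform(1.0, 2.0, m))    # toy depth/pressure mass matrix

S_exact  = Mp + B @ np.linalg.solve(A, B.T)          # dense exact Schur complement
S_approx = Mp + B @ np.diag(1.0 / np.diag(A)) @ B.T  # sparse surrogate via diag(A)

# diag(A) = M_u here (the skew part has zero diagonal), so for small dt the
# preconditioned spectrum clusters around 1.
eigs = np.linalg.eigvals(np.linalg.solve(S_approx, S_exact))
assert np.max(np.abs(eigs - 1.0)) < 0.5
```

As the text notes, increasing the time step strengthens the Coriolis contribution, pushing diag(A) further from A and spreading this spectrum, which degrades the surrogate.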
Next, we use only the application of our hybridization preconditioner (no outer Krylov method), which replaces the original linearized mixed system with its hybridizable equivalent. After hybridization, we have the following extended problem for the Picard updates: Note that the space M_h is chosen such that the trace functions, when restricted to a facet e ∈ ∂T_h, are in the same polynomial space as u_h · n|_e. Moreover, it can be shown that the Lagrange multiplier λ_h is an approximation to the depth unknown Δt g D/2 restricted to ∂T_h. The resulting three-field problem in Eqs. (85)-(87) produces the following matrix equation: where Â is the discontinuous operator coupling X̂ = [ΔU^d, ΔD]^T and R_X = [R_u, R_D]^T are the problem residuals. An exact Schur complement factorization of Eq. (88) is performed, using Slate to generate the local elimination kernels. We use the same solver options for inverting the Lagrange multiplier system as for the inversion of S̃ in Eq. (84). The increments ΔU^d and ΔD are recovered locally using Slate-generated kernels. Once recovery is complete, ΔU^d is projected back into the conforming H(div) finite element space via ΔU ← Π_div ΔU^d. Table 3 displays a summary of our findings. The advantages of a hybridizable method over a mixed method are clearly realized in this experiment. When using hybridization, we observe a significant reduction in time spent in the implicit solver compared to the approximate Schur complement approach. This is primarily because we have reduced the number of "outer" iterations to zero; the hybridization preconditioner performs an exact factorization of the global hybridizable system. This is empirically supported by the per-application solve times.
The values reported in Table 4 show the average cost of a single outer GMRES iteration (which includes the application of P_SC) and a single application of P_hybrid. Hybridization and the approximate Schur complement preconditioner are comparable in terms of average execution time, with hybridization being slightly faster. This further demonstrates that the primary cause of the longer execution time of the latter is the additional outer iterations induced by using an approximate factorization. In terms of overall time-to-solution, the hybridizable methods are clearly ahead of the original mixed methods.
We also measure the relative reduction in the residual of the linear system Eq. (83). Our hybridization preconditioner reduces the residual by a factor of 10⁸ on average, which coincides with the specified relative tolerance for the Krylov method on the trace system. In other words, the reduction in the residual for the trace system translates into a reduction in the residual for the mixed system by the same factor.
The test case was run up to day 15 at a coarser resolution (20 480 simplicial cells with Δx ≈ 210 km) and a time-step size of Δt = 500 s. Snapshots of the simulation, using the semi-implicit scheme described in Appendix A, are provided in Fig. 6. The results we have obtained for days 5, 10, and 15 are comparable to the corresponding results of Nair et al. (2005), Ullrich et al. (2010), and Kang et al. (2020). We refer the reader to Shipton et al. (2018) for further demonstrations of shallow water test cases featuring the use of the hybridization preconditioner described in Sect. 4.1.2.

Hybridizable methods for a linear Boussinesq model
As a final example, we consider the simplified atmospheric model obtained from a linearization of the compressible Boussinesq equations in a rotating domain: where u is the fluid velocity, p the pressure, b the buoyancy, Ω the planetary angular rotation vector, c the speed of sound (≈ 343 m s⁻¹), and N the buoyancy frequency (≈ 0.01 s⁻¹). Equations (90)-(92) permit fast-moving acoustic waves driven by perturbations in b. This is the model presented in Skamarock and Klemp (1994), which uses a quadratic equation of state to avoid some of the complications of the full compressible Euler equations (the hybridization of which we shall address in future work). We solve these equations subject to the rigid-lid condition u · n = 0 on all boundaries.
Our domain consists of a spherical annulus, with the mesh constructed from a horizontal "base" mesh of the surface of a sphere of radius R, extruded upwards by a height H . The vertical discretization is a structured one-dimensional grid, which facilitates the staggering of thermodynamic variables, such as b. We consider two kinds of meshes: one obtained by extruding an icosahedral sphere mesh and another from a cubed sphere.
Since our mesh has a natural tensor product structure, we construct suitable finite element spaces by taking the tensor product of a horizontal space with a vertical space. To ensure our discretization is "compatible", we use the one- and two-dimensional finite element de Rham complexes. We can then construct the three-dimensional complex:

Table 3. Preconditioner solve times for a 25-step run with Δt = 100 s. These are cumulative times in each stage of the two preconditioners throughout the entire profile run. We display the average iteration count (rounded to the nearest integer) for both the outer and the inner Krylov solvers. The significant speedup when using hybridization is a direct result of eliminating the outermost solver.
Here, HCurl and HDiv denote operators which ensure that the correct Piola transformations are applied when mapping from physical to reference element. We refer the reader to McRae et al. (2016) for an overview of constructing tensor product finite element spaces in Firedrake. For the analysis of compatible finite element discretizations and their relation to the complex Eqs. (93)-(96), we refer the reader to Natale et al. (2016). Each discretization used in this section is constructed from more familiar finite element families, shown in Table 5.

Compatible finite element discretization
A compatible finite element discretization of Eqs. (90)-(92) constructs solutions in the following finite element spaces: where W̊²_h is the subspace of W²_h ⊂ H(div) whose functions w satisfy w · n = 0 on ∂Ω, and W^b_h is the scalar version of the vertical velocity space.⁶ That is, W^b_h and W^{2,v}_h have the same number of degrees of freedom but differ in how they are pulled back to the reference element.

⁶ The choice of W^b_h in Eq. (97) corresponds to a Charney-Phillips vertical staggering of the buoyancy variable, which is the desired approach for the UK Met Office's Unified Model (Melvin et al., 2019).

Figure 6. Snapshots (view from the northern pole) from the isolated mountain test case: the surface height (m) at days 5, 10, and 15. The snapshots were generated on a mesh with 20 480 simplicial cells, a BDM_2 × DG_1 discretization, and Δt = 500 s. The linear system during each Picard cycle was solved using the hybridization preconditioner.

Table 5. Vertical and horizontal spaces (and the corresponding mixed method) for the three-dimensional compatible finite element discretization of the linear Boussinesq model. The RT_k and BDFM_{k+1} methods are constructed on triangular prism elements, while the RTCF_k method is defined on extruded quadrilateral elements.
To obtain the discrete system, we multiply Eqs. (90)-(92) by test functions w ∈ W̊²_h, φ ∈ W³_h, and η ∈ W^b_h and integrate by parts. We introduce the increments δu_h ≡ u^{n+1}_h − u^n_h and set u_0 ≡ u^n_h (similarly for δp_h, p_0, δb_h, and b_0). Using an implicit midpoint rule discretization, we need to solve the following mixed problem at each time step: where the residuals are r_u = −Δt(w, 2Ω × u_0)_{T_h}, r_p = −c² Δt(φ, ∇ · u_0)_{T_h}, and r_b = −N² Δt(η, u_0 · ẑ)_{T_h}. The resulting matrix equations have the form where A_u = M_u + (Δt/2) C, C is the asymmetric matrix associated with the Coriolis term, M_u, M_p, and M_b are mass matrices, D is the weak divergence operator, and Q is an operator containing the vertical components of δu_h. In the absence of orography, we can use the point-wise expression for the buoyancy and eliminate δb_h from Eq. (101) by substituting Eq. (102) into Eq. (98). This produces the following mixed velocity-pressure system: where Ã_u and R̃_u are the modified velocity operator and right-hand side, respectively. Note that in our elimination strategy, Ã_u corresponds to the bilinear form obtained after eliminating the buoyancy at the equation level. A similar construction holds for R̃_u. Once Eq. (103) is solved, δb_h is reconstructed by solving Eq. (105). Equation (105) can be efficiently inverted using the conjugate gradient method.
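The buoyancy elimination can be checked algebraically on a generic three-field block system (plain NumPy; the blocks are synthetic and the sign conventions are generic rather than those of Eqs. (98)-(102)): eliminating the field with a block-diagonal mass matrix yields a modified velocity operator and right-hand side whose two-field solve, followed by local reconstruction, reproduces the full solve.

```python
import numpy as np

rng = np.random.default_rng(4)
nu, npr, nb = 5, 3, 5

def spd(k):
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

Auu, App = spd(nu), spd(npr)
Mb  = np.diag(rng.uniform(1.0, 2.0, nb))     # buoyancy mass matrix (block diagonal)
Aup = 0.3 * rng.standard_normal((nu, npr))   # toy pressure-gradient coupling
Q   = 0.3 * rng.standard_normal((nb, nu))    # toy vertical-component operator
ru, rp, rb = (rng.standard_normal(k) for k in (nu, npr, nb))

# Full three-field system: buoyancy couples only to the velocity block.
K = np.block([[Auu, Aup, Q.T],
              [Aup.T, App, np.zeros((npr, nb))],
              [Q, np.zeros((nb, npr)), Mb]])
full = np.linalg.solve(K, np.concatenate([ru, rp, rb]))

# Eliminate b via b = Mb^{-1}(rb - Q u), giving the modified velocity
# operator and right-hand side of the reduced velocity-pressure system.
Auu_mod = Auu - Q.T @ np.linalg.solve(Mb, Q)
ru_mod  = ru - Q.T @ np.linalg.solve(Mb, rb)
up = np.linalg.solve(np.block([[Auu_mod, Aup], [Aup.T, App]]),
                     np.concatenate([ru_mod, rp]))
u, p = up[:nu], up[nu:]
b = np.linalg.solve(Mb, rb - Q @ u)          # cheap local reconstruction

assert np.allclose(np.concatenate([u, p, b]), full)
```

Because Mb is block diagonal (a mass matrix), both the elimination and the final reconstruction are local operations, mirroring why Eq. (105) is cheap to invert.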

Preconditioning the mixed velocity-pressure system
The primary difficulty is finding efficient solvers for Eq. (103). This was studied by Mitchell and Müller (2016) in the context of developing a preconditioner that is robust against parameters such as $\Delta t$ and the mesh resolution; however, the implicit treatment of the Coriolis term was not taken into account. We consider two preconditioning strategies. One strategy, proposed by Mitchell and Müller (2016), is to build a preconditioner based on the Schur complement factorization of the operator in Eq. (103), in which $H = M_p + \frac{c^2\Delta t^2}{4} D \tilde{A}_u^{-1} D^T$ is the dense pressure Helmholtz operator. Because we have chosen to include the Coriolis term, the operator $H$ is nonsymmetric, with a modified mass matrix $\tilde{M}_u$ appearing in place of $M_u$. As $\Delta t$ increases, the contribution of $C$ becomes more prominent in $H$, making sparse approximations of $H$ more challenging; we elaborate on this further below when we present the results of our second solver strategy. Our preferred strategy solves the hybridizable formulation of the system Eq. (103). The system Eqs.
(108)-(110) is automatically formed by the Firedrake preconditioner firedrake.HybridizationPC. We then locally eliminate the velocity and pressure after hybridization, producing the condensed problem Eq. (111), where $\bar{\cdot}$ denotes matrices or vectors with discontinuous test and trial functions. The nonsymmetric operator $H_\partial$ is inverted using a preconditioned generalized conjugate residual (GCR) Krylov method, as suggested in Thomas et al. (2003). For our choice of preconditioner, we follow strategies outlined in Elman et al. (2001) and employ an algebraic multigrid method (V cycle) with GMRES (five iterations) smoothers on the coarse levels. The GMRES smoothers are preconditioned with block ILU on each level. On the finest level, block ILU produces a line smoother (necessary for efficient solution on thin domains) when the trace-variable nodes are numbered in vertical lines, as is the case in our Firedrake implementation. On the coarser levels, less is known about the properties of ILU under the AMG coarsening strategies, but, as we shall see, the observed performance suggests that ILU still behaves as a line smoother. More discussion of multigrid for nonsymmetric problems can be found in Bramble et al. (1988, 1994) and Mandel (1986). A gravity wave test using our solution strategy and hybridization preconditioner is illustrated in Fig. 7 for a problem on a condensed Earth (radius scaled down by a factor of 125) with a 10 km lid.

Figure 7. Buoyancy perturbation ($y$-$z$ cross section) at $t = 3600$ s from a simple gravity wave test ($\Delta t = 100$ s). The initial condition (in lat-long coordinates) for the velocity is a simple solid-body rotation, $u = 20 e_\lambda$, where $e_\lambda$ is the unit vector pointing in the direction of decreasing longitude. A buoyancy anomaly is defined via $b = \frac{d^2}{d^2 + q^2}\sin(\pi z / 10\,000)$, where $q = R\cos^{-1}(\cos(\phi)\cos(\lambda - \lambda_\phi))$, $d = 5000$ m, $R = 6371\,\mathrm{km}/125$ is the planet radius, and $\lambda_\phi = 2/3$. The equations are discretized using the lowest-order method RTCF$_1$, with 24 576 quadrilateral cells in the horizontal and 64 extrusion levels. The velocity-pressure system is solved using hybridization.
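The trace solver described above can be expressed as a runtime options dictionary passed to PETSc through Firedrake's solver interface. The sketch below is illustrative: the `hybridization` options prefix and `firedrake.HybridizationPC` follow Firedrake's conventions, but the specific option values (tolerances, smoother settings) are assumptions rather than the exact settings used for these runs; the archived solver configurations cited in the code availability section are authoritative.

```python
# Runtime PETSc options (petsc4py-style nested dict) sketching the solver
# described above: GCR on the condensed trace system, preconditioned by an
# algebraic multigrid V cycle with GMRES(5) smoothers and block ILU.
# Option values are illustrative assumptions.
parameters = {
    "mat_type": "matfree",                 # never assemble the global mixed operator
    "ksp_type": "preonly",                 # outer solve is the hybridization PC itself
    "pc_type": "python",
    "pc_python_type": "firedrake.HybridizationPC",
    "hybridization": {                     # options for the trace system
        "ksp_type": "gcr",                 # GCR for the nonsymmetric H_partial
        "ksp_rtol": 1e-5,
        "pc_type": "gamg",                 # algebraic multigrid V cycle
        "mg_levels": {
            "ksp_type": "gmres",           # GMRES smoother ...
            "ksp_max_it": 5,               # ... with five iterations per level
            "pc_type": "bjacobi",
            "sub_pc_type": "ilu",          # block ILU (a line smoother on the finest level)
        },
    },
}
```

Passing such a dictionary as `solver_parameters` to a Firedrake variational solver selects the hybridization preconditioner and its trace solver entirely at runtime, without changes to the model code.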

Robustness against acoustic Courant number with implicit Coriolis
In this final experiment, we repeat a study similar to that presented in Mitchell and Müller (2016). Our setup closely resembles the gravity wave test of Skamarock and Klemp (1994) extended to a spherical annulus. We initialize the velocity in a simple solid-body rotation and introduce a localized buoyancy anomaly. A Coriolis term is introduced as a function of the Cartesian coordinate $z$ and is constant along lines of latitude ($f$ plane): $2\Omega = 2\Omega_r \frac{z}{R}\hat{z}$, with angular rotation rate $\Omega_r = 7.292 \times 10^{-5}\,\mathrm{s}^{-1}$. We fix the resolution of the problem and run the solver over a range of $\Delta t$, which we report in terms of the horizontal acoustic Courant number $\lambda_C = \frac{c\Delta t}{\Delta x}$, where $c$ is the speed of sound and $\Delta x$ is the horizontal resolution.
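Since the study sweeps $\Delta t$ through the Courant number, the time step corresponding to a target $\lambda_C$ follows directly from inverting the definition $\lambda_C = c\Delta t/\Delta x$. A minimal sketch, in which the sound speed and grid spacing are placeholder values rather than those of the experiment:

```python
# Determine the model time step that yields a target horizontal acoustic
# Courant number lambda_C = c * dt / dx.
def time_step(courant, sound_speed, dx):
    """Invert lambda_C = c * dt / dx for dt."""
    return courant * dx / sound_speed

c = 340.0    # speed of sound (m/s) -- assumed for illustration
dx = 100e3   # horizontal resolution (m) -- assumed for illustration

# Sweep over a range of Courant numbers (powers of 2, as in the study):
dts = {lc: time_step(lc, c, dx) for lc in (2, 4, 8, 16, 32, 64)}
assert dts[4] == 2 * dts[2]  # dt scales linearly with lambda_C at fixed resolution
```

Because $\Delta t$ scales linearly with $\lambda_C$ at fixed resolution, the parameter axis of the study can be read interchangeably as a time step axis or a Courant number axis.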
Note that the range of Courant numbers used in this paper exceeds what is typical in operational forecast settings (typically between O(2) and O(10)). The grid setup mirrors that of actual global weather models: we extrude a spherical mesh of the Earth upwards to a height of 85 km. The setup for the different discretizations (including degrees of freedom for the velocity-pressure and hybridized systems) is presented in Table 6.
It was shown in Mitchell and Müller (2016) that a sparse approximation of the pressure Schur complement of the form $\hat{H} = M_p + \frac{c^2\Delta t^2}{4} D\,\mathrm{Diag}(\tilde{A}_u)^{-1} D^T$ serves as a good preconditioner, leading to a system amenable to multigrid methods and a Courant-number-independent solver. When the Coriolis term is included, however, this is no longer the case: the diagonal approximation $\mathrm{Diag}(\tilde{A}_u)$ becomes worse with increasing $\lambda_C$.
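For contrast with such sparse approximations, it is worth recalling what the exact Schur complement factorization accomplishes: with the true Schur complement in hand, a block system is solved by one elimination and one back substitution, with no outer iteration. A scalar-block sketch, with all numerical values hypothetical:

```python
from fractions import Fraction as F

# Illustrative 2x2 block system with scalar "blocks" (values hypothetical):
#   [A  B] [u]   [f]
#   [C  D] [p] = [g]
A, B, C, D = F(4), F(1), F(2), F(3)
f, g = F(5), F(7)

# Exact Schur complement of the (1,1) block: S = D - C A^{-1} B
S = D - C * (1 / A) * B

# Block forward elimination, then back substitution:
p = (g - C * (1 / A) * f) / S   # solve the Schur (pressure) system
u = (f - B * p) / A             # recover the eliminated (velocity) unknown

# Verify against the original block equations -- no iteration was needed
assert A * u + B * p == f and C * u + D * p == g
```

Hybridization exploits exactly this structure, except that its Schur complement (the trace system) is both exact and sparse, so no approximation quality is sacrificed.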
To demonstrate this, we solve the gravity wave system on a low-resolution grid (10 km lid, 10 vertical levels, maintaining the same cell aspect ratio as in Table 6) using the Schur complement factorization Eq. (106). LU factorizations are used to apply both $\tilde{A}_u^{-1}$ and $\hat{H}^{-1}$. The Schur complement $H$ is inverted using $\hat{H}$-preconditioned GMRES iterations, and a flexible GMRES algorithm is used on the full velocity-pressure system. If $\hat{H}^{-1}$ is a good approximation to $H^{-1}$, then we should see low iteration counts in the Schur complement solve. Figure 8 shows the results of this study for a range of Courant numbers.
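In Firedrake, this approximate-Schur solver can be configured entirely through PETSc fieldsplit options. The dictionary below is a sketch of such a configuration; option names follow PETSc conventions, but the values are assumptions, and the mechanism for supplying $\hat{H}$ as the preconditioning matrix for the pressure block (e.g. via an auxiliary operator) is omitted. The archived solver configurations record the exact settings used for Fig. 8.

```python
# Sketch of the approximate-Schur-complement solver as PETSc options:
# flexible GMRES on the full mixed system, LU to apply the velocity-block
# inverse, and GMRES on the Schur complement preconditioned by an LU
# factorization of the sparse approximation H-hat. Values illustrative.
schur_parameters = {
    "ksp_type": "fgmres",                     # flexible outer Krylov method
    "pc_type": "fieldsplit",
    "pc_fieldsplit_type": "schur",            # Schur complement factorization
    "pc_fieldsplit_schur_fact_type": "full",
    "fieldsplit_0": {                         # velocity block
        "ksp_type": "preonly",
        "pc_type": "lu",                      # direct application of A_u^{-1}
    },
    "fieldsplit_1": {                         # pressure (Schur complement) block
        "ksp_type": "gmres",                  # GMRES preconditioned by H-hat
        "ksp_rtol": 1e-5,
        "pc_type": "lu",                      # LU factorization of H-hat
    },
}
```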
For the lower-order methods, the number of iterations needed to invert $H$ grows slowly but remains under control. Increasing the approximation degree by one, however, degrades performance: as $\Delta t$ increases, the number of Krylov iterations needed to invert the system to a relative tolerance of $10^{-5}$ grows rapidly. This sparse approximation is clearly not robust against the Courant number. The degradation can be explained by the fact that diagonalizing the velocity operator fails to take into account the Coriolis term, whose entries appear in off-diagonal positions in the operator. Even traditional mass lumping (row-wise sums) does not help, as the Coriolis contributions effectively cancel in the row sums due to the asymmetry of $C$.
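The effect can be seen in miniature on a single Coriolis-coupled 2×2 block. The sketch below is a toy model, not the actual discretization: it compares the exact inverse of $mI + aJ$, with $J$ the 2×2 rotation generator, against the inverse of its diagonal, showing that the relative error grows with the off-diagonal (Coriolis) strength $a \propto \Delta t$, mirroring the degradation in Fig. 8.

```python
# Toy 2x2 "velocity block" coupling the two horizontal components of a
# single node through a skew-symmetric Coriolis term (values hypothetical):
#   A_u = m*I + a*J,  J = [[0, 1], [-1, 0]],  a ~ (dt/2)*omega
def apply_inverse(m, a, rhs):
    """Exact inverse of [[m, a], [-a, m]] applied to rhs."""
    det = m * m + a * a
    x, y = rhs
    return ((m * x - a * y) / det, (a * x + m * y) / det)

def diag_inverse(m, rhs):
    """Inverse of Diag(A_u) = m*I, which drops the off-diagonal Coriolis term."""
    x, y = rhs
    return (x / m, y / m)

def rel_error(m, a, rhs=(1.0, 1.0)):
    """Relative error of the diagonal approximation against the exact inverse."""
    ex, ey = apply_inverse(m, a, rhs)
    dx, dy = diag_inverse(m, rhs)
    return ((ex - dx) ** 2 + (ey - dy) ** 2) ** 0.5 / (ex ** 2 + ey ** 2) ** 0.5

# The diagonal approximation degrades as a (hence dt) grows:
errors = [rel_error(1.0, a) for a in (0.1, 1.0, 10.0)]
assert errors[0] < errors[1] < errors[2]
```

In this toy setting the relative error is proportional to $a$, so a diagonal (or lumped) approximation that was adequate at small $\Delta t$ inevitably deteriorates as the Courant number grows.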
Hybridization avoids this problem entirely: we always construct an exact Schur complement and only need to solve the trace system Eq. (111). We now show that this approach (described in Sect. 5.3.2) is much more robust to changes in $\Delta t$. We use the same workstation as for the three-dimensional CG/HDG problem in Sect. 5.1 (executed with a total of 40 MPI processes). Figure 9 shows the parameter test for all the discretizations described in Table 6. We see that, in terms of the total number of GCR iterations needed to invert the trace system, hybridization is far more controlled as the Courant number increases. Iteration counts remain largely constant throughout the entire parameter range, varying by only an iteration or two. It is not until $\lambda_C > 32$ that we begin to see a larger jump in the number of GCR iterations. This is expected, since the Coriolis operator causes the problem to become singular for very large Courant numbers. However, unlike with the approximate Schur complement solver, iteration counts remain under control. In particular, each method (lowest and higher order) remains constant throughout the critical range (shaded in gray in Fig. 9). In Fig. 9b, we display the ratio of execution time to the time-to-solution at the lowest Courant number of 2. We perform this normalization to better compare the lower- and higher-order methods (and discretizations on triangular prisms vs. extruded quadrilaterals). The calculation of the ratios includes the time needed to eliminate and reconstruct the hybridized velocity-pressure variables. The fact that the hybridization solver remains close to 1 demonstrates that the entire solution procedure is largely $\lambda_C$-independent until around $\lambda_C = 32$. The overall trend largely follows the number of Krylov iterations to reach solver convergence, as our hybridization approach is solver dominated, with local operations such as forward elimination and local recovery taking approximately one-third of the total execution time for each method. The percentage breakdown of the hybridization solver is similar to that presented in Sect. 5.1.2.

Table 6. Grid setup and discretizations for the acoustic Courant number study. The total numbers of degrees of freedom (dofs) for the mixed (velocity and pressure) and hybridizable (velocity, pressure, and trace) discretizations are shown in the last two columns (millions). The vertical resolution is fixed across all discretizations.

Figure 8. Number of Krylov iterations to invert the Helmholtz system using $\hat{H}^{-1}$ as a preconditioner. The preconditioner is applied via a direct LU factorization within a GMRES method on the entire pressure equation. While the lowest-order methods grow slowly over the Courant number range, the higher-order (by only one approximation order) methods quickly degrade and diverge after the critical range $\lambda_C = O(2)$-$O(10)$. At $\lambda_C > 32$, the solvers take over 150 iterations.
Implicit treatment of the Coriolis term has been discussed for semi-implicit discretizations of large-scale geophysical flows (Temperton, 1997; Cullen, 2001; Nechaev and Yaremchuk, 2004). The addition of the Coriolis term presents a particular challenge for the solution of finite element discretizations of these equations, since it increases the difficulty of finding a good sparse approximation of the nonsymmetric elliptic operator. Hybridization shows promise here, as it allows for the assembly of an elliptic equation that both captures the effects of rotation and results in a sparse linear system.

Conclusions
We have presented Slate and shown how this language can be used to create concise mathematical representations of localized linear algebra on the tensors corresponding to finite element forms. We have shown how this DSL can be used in tandem with UFL in Firedrake to implement solution approaches making use of automated code generation for static condensation, hybridization, and localized post-processing. Setup and configuration are done at runtime, allowing one to switch between discretizations at will. In particular, this framework alleviates much of the difficulty of implementing such methods within intricate numerical code and paves the way for future low-level optimizations. In this way, the framework in this paper can help enable the rapid development and exploration of new hybridization and static condensation techniques for a wide class of problems. We remark that the reduction of global matrices via element-wise algebraic static condensation, as described in Guyan (1965) and Irons (1965), is also possible using Slate, as are other, more general static condensation procedures outside the context of hybridization.

Figure 9. Courant number parameter test run on a fully loaded compute node. Both panels display the hybridized solver for each discretization described in Table 6. Panel (a) displays the total iteration count (preconditioned GCR) to solve the trace system to a relative tolerance of $10^{-5}$. Panel (b) displays the relative work of each solver, which takes into account the time required to forward eliminate and locally recover the velocity and pressure.
Our approach to preconditioner design revolves around its composable nature, in that these Slate-based implementations can be seamlessly incorporated into complicated solution schemes. In particular, there is ongoing research into the design of dynamical cores for numerical weather prediction using implementations of hybridization and static condensation with Slate (Bauer and Cotter, 2018; Shipton et al., 2018). The performance of such methods for geophysical flows is a subject of ongoing investigation.
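As a concrete illustration of this composability, element-wise static condensation can itself be selected purely through runtime solver options via Firedrake's Slate-based static condensation preconditioner, firedrake.SCPC. The sketch below is illustrative rather than a configuration used in this paper: the field indices refer to a hypothetical three-field problem, and the option values are assumptions following Firedrake's conventions.

```python
# Runtime options selecting Slate-based static condensation (SCPC).
# Field indices and option values here are illustrative assumptions.
sc_parameters = {
    "mat_type": "matfree",              # the global operator is never assembled
    "ksp_type": "preonly",
    "pc_type": "python",
    "pc_python_type": "firedrake.SCPC",
    "pc_sc_eliminate_fields": "0, 1",   # locally eliminate fields 0 and 1
    "condensed_field": {                # solver for the condensed system
        "ksp_type": "gmres",
        "ksp_rtol": 1e-8,
        "pc_type": "lu",
    },
}
```

Because the condensed-field solver is itself an options sub-dictionary, any PETSc solver, including multigrid configurations like those used earlier, can be composed with the condensation step without changes to the model code.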
In this paper, we have provided some examples of hybridization procedures for compatible finite element discretizations of geophysical flows. These approaches avoid the difficulty in constructing sparse approximations of dense elliptic operators. Static condensation arising from hybridizable formulations can best be interpreted as producing an exact Schur complement factorization on the global hybridizable system. This eliminates the need for outer iterations from a suitable Krylov method to solve the full mixed system and replaces the original global mixed equations with a condensed elliptic system. More extensive performance benchmarks, which require detailed analysis of the resulting operator systems arising from hybridization, are a necessary next step to determine whether hybridization provides a scalable solution strategy for compatible finite elements in operational settings.
Code and data availability. The contribution in this paper is available through open-source software provided by the Firedrake Project: https://www.firedrakeproject.org/ (last access: 30 January 2020). We cite archives of the exact software versions used to produce the results in this paper. For all components of the Firedrake project used in this paper, see Zenodo/Firedrake (2019). The numerical experiments, full solver configurations, code verification (including local processing), and raw data are available in Zenodo/Tabula-Rasa (2019).
Author contributions. THG is the principal author and developer of the software presented in this paper and main author of the text. Authors LM and DAH assisted and guided the software abstraction as a domain-specific language and edited text. CJC contributed to the formulation of the geophysical fluid dynamics and the design of the numerical experiments and edited text.
Competing interests. David A. Ham is an executive editor of the journal. The other authors declare they have no other competing interests.