Geoscientific Model Development

Implementation of multirate time integration methods for air pollution modelling

Explicit time integration methods are characterised by a small numerical effort per time step. In the application to multiscale problems in atmospheric modelling, this benefit is often more than compensated by stability problems and step size restrictions resulting from stiff chemical reaction terms and from a locally varying Courant-Friedrichs-Lewy (CFL) condition for the advection terms. Splitting methods may be applied to combine implicit and explicit methods efficiently (IMEX splitting). Complementarily, multirate time integration schemes allow for a local adaptation of the time step size to the grid size. In combination, these approaches lead to schemes which are efficient in terms of evaluations of the right-hand side. Special challenges arise when these methods are to be implemented. For an efficient implementation, it is crucial to locate and exploit redundancies. Furthermore, the more complex programme flow may lead to computational overhead which, in the worst case, more than compensates the theoretical gain in efficiency. We present a general splitting approach which allows both for IMEX splittings and for local time step adaptation. The main focus is on an efficient implementation of this approach for parallel computation on computer clusters.


Introduction
Atmospheric processes can be described using advection-diffusion-reaction equations. The advection term describes transport due to wind, while diffusion describes turbulent mixing on spatial scales below the cell size. Both of these terms can efficiently be solved using explicit Runge-Kutta (RK) methods. So-called Runge-Kutta-Chebyshev methods were developed for the coupled treatment of advection-diffusion problems (Verwer, 1996; Verwer et al., 2004). Depending on the specific simulation, the reaction terms may describe microphysical processes or chemical reactions of pollutants including source terms. These terms are usually stiff, as the characteristic times of the involved processes differ significantly. Employing explicit methods is not efficient in this case, as stability requirements limit the time step size to impractically small values. Implicit methods have proven to be much more efficient for the chemistry problem.
Implicit/explicit (IMEX) splittings have been developed which allow for an efficient solution of advection-diffusion-reaction equations. Complementarily, explicit multirate methods have been developed since the 1980s (e.g., Osher and Sanders, 1983; Tang and Warnecke, 2005) which allow for the efficient solution of problems that can be split into non-stiff sub-systems with differing characteristic times, for example, advection on locally refined grids. Generally, multirate schemes result in a significant reduction of computational costs only if the effort to compute the part which requires smaller time steps is low in comparison to the total effort.
In the majority of current atmospheric models simple operator splittings are employed, while the usage of multirate methods is rare. In Schlegel et al. (2009, 2012) we presented a generic splitting approach which can be employed to construct a multirate-IMEX scheme. The current paper is concerned with the efficient implementation of this scheme in the state-of-the-art Multiscale Atmospheric Chemistry and Transport model COSMO-MUSCAT (Wolke et al., 2004; Hinneburg et al., 2009) developed at the Leibniz Institute for Tropospheric Research in Leipzig.

Published by Copernicus Publications on behalf of the European Geosciences Union.
The remainder of this paper is structured as follows. First, we will mathematically outline a generic splitting scheme. The subsequent section is concerned with the focus of this paper, i.e., the details of a practical implementation. We shall present the programme flow, details on data exchange and a balancing approach in separate subsections. Finally, we present two scenarios and discuss the obtained reduction of computational cost for each of them.

Mathematical preliminaries
In Schlegel et al. (2009) we presented a general splitting that may be employed to generate multirate methods, called recursive flux splitting multirate (RFSMR). The approach is based on an IMEX splitting presented by Knoth and Wolke (1998a), which we briefly outline here. These schemes can be seen as a higher-order generalisation of the popular source splitting approach (Verwer et al., 2002). Consider an equation

    w′ = F(w) + G(w).

In the context of this paper G represents advection with comparatively low Courant numbers. The other term, F, may represent diffusion-reaction or advection with higher Courant numbers.
Denote the time substeps in the explicit method by τ_i = t_0 + Δt c_i with c_i monotonically increasing in i. Then the algorithm computing an approximate solution w_1 for time t_1 from an approximate solution w_0 at time t_0 can be outlined as follows:

    W_1 = w_0,                                                             (1)
    r_i = (1 / (c_i − c_{i−1})) Σ_{j=1}^{i−1} (a_{ij} − a_{i−1,j}) G(W_j), (2)
    v_i(τ_{i−1}) = W_{i−1},                                                (3)
    v_i′(τ) = r_i + F(v_i(τ)),  τ ∈ [τ_{i−1}, τ_i],  i = 2, ..., s + 1,    (4)
    W_i = v_i(τ_i),                                                        (5)
    w_1 = W_{s+1},                                                         (6)

with s denoting the number of stages of an explicit Runge-Kutta (ERK) method with parameters (a, b, c) in standard RK notation. Additionally, r_i denotes a source term correlated to the advection term G. For simplicity we define a_{s+1,j} = b_j, thus avoiding a separate treatment of the summation stage. For stages i with c_i = c_{i−1} we replace (3), ..., (5) with a purely explicit step,

    W_i = W_{i−1} + Δt Σ_{j=1}^{i−1} (a_{ij} − a_{i−1,j}) G(W_j),

which we will call a correction step.
Defining F(v) ≡ 0 we obtain the underlying explicit method, subsequently called the outer method. Generally we require w′ = G(w) to be non-stiff. Solving the inner system Eq. (4) with an implicit integrator leads to an IMEX splitting; the system v′ = r_i + F(v) then may be stiff. To obtain an explicit multirate method, the inner system must be solved using an explicit Runge-Kutta method. In the latter context v′ = r_i + F(v) is required to be non-stiff, but it may impose stricter time step restrictions than G. This situation arises, e.g., when transport is simulated on a locally refined grid. To make a distinction possible, we denote the parameters of the outer method with a superscript "O", in contrast to the parameters of the inner method (employed for the solution of Eq. 4), denoted by a superscript "I". An explicit multirate method based on the above splitting then follows, with the tilde parameters denoting the increments per RK stage for an explicit base method with s stages. Note that the time step ratio R, i.e., the ratio of the time step of the outer method (macro time step) to the time step of the inner method (micro time step), depends only on the nodes c_i^O of the outer method. Formally, the inner method is executed s^O times with time step sizes of Δt (c_i^O − c_{i−1}^O). This is no severe limitation, however, as the inner method may for instance be replaced by a composition of n steps of size Δt/n to obtain the desired time step ratio, see Table 1. Finally, the splitting may be applied recursively to obtain a multirate-IMEX splitting, see Schlegel et al. (2012).
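For illustration, the algorithm above can be condensed into a short Python sketch for a scalar split ODE. The names, the tableau and the test problem are ours and not part of MUSCAT, and forward Euler merely stands in for the inner integrator; with F ≡ 0 the inner integration becomes exact and the scheme reduces to the underlying outer method, which gives a simple check.

```python
import numpy as np

def rfsmr_step(w0, dt, F, G, c, a, b, inner_substeps=4):
    """One step of the generic splitting: an outer ERK (a, b, c) treats the
    slow term G; the inner system v' = r_i + F(v) is integrated here with
    forward-Euler substeps.  Stages with c_i == c_{i-1} become correction
    steps (purely explicit)."""
    s = len(b)
    cs = list(c) + [1.0]                         # c_{s+1} = 1 (summation stage)
    A = [list(row) for row in a] + [list(b)]     # a_{s+1,j} = b_j
    W = np.asarray(w0, dtype=float)
    Gw = [G(W)]                                  # G(W_1), extended per stage
    for i in range(1, s + 1):                    # stages 2, ..., s+1
        dc = cs[i] - cs[i - 1]
        incr = sum((A[i][j] - A[i - 1][j]) * Gw[j] for j in range(i))
        if dc == 0.0:                            # correction step
            W = W + dt * incr
        else:
            r = incr / dc                        # piecewise-constant source term
            h = dc * dt / inner_substeps
            for _ in range(inner_substeps):      # inner method: forward Euler
                W = W + h * (r + F(W))
        Gw.append(G(W))
    return W

# With F == 0 the scheme reduces to the outer method (here: explicit
# midpoint), independently of inner_substeps.
w1 = rfsmr_step(1.0, 0.1, F=lambda w: 0.0 * w, G=lambda w: -w,
                c=[0.0, 0.5], a=[[0.0, 0.0], [0.5, 0.0]], b=[0.0, 1.0])
```

For w′ = −w, w_0 = 1 and Δt = 0.1, the explicit midpoint result is 1 − 0.1 · 0.95 = 0.905, which the sketch reproduces.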
Based on a partitioned Runge-Kutta (PRK) formulation (see for instance Hairer et al., 1987) we have proven that the methods constructed using the algorithm (1)-(6) are second order accurate in time if all of the employed base methods are at least of second order accuracy (Schlegel et al., 2012). Even third order accuracy can be obtained if the base methods themselves satisfy third order conditions and one additional order condition is satisfied by the outer method. Third order accuracy in time has been documented by numerical tests and has also been proven formally. Order conditions for partitioned Runge-Kutta methods can be found for instance in Jackiewicz and Vermiglio (1998).
In order to apply this multirate approach to the advection equation, the advection operator must be split. Commonly a splitting by components is employed. Unfortunately, the methods generated via RFSMR generally have unequal summation weights b^fast ≠ b^slow (see Table 1 for an example), so that the methods do not preserve the linear invariants of the system, such as the total mass of pollutants. Mass conservation, however, is a strict requirement for atmospheric models. The solution of this problem is to employ a splitting by fluxes. Applied to a decomposition of the domain into slow and fast blocks, flux splitting means that every block computes the fluxes leaving its cells. Thus, fast fluxes leaving cells with a stricter local Courant-Friedrichs-Lewy (CFL) restriction are updated more frequently per macro time step than slow fluxes. As the individual cell outfluxes are computed in exactly the same way and based on the same concentration vector as the corresponding cell influxes, this kind of splitting guarantees mass conservation independently of the partitioned time integration scheme.
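That the flux form conserves mass independently of the update frequency can be checked with a small sketch: every interface flux is subtracted from the donor cell and added to the receiver, so the total is unchanged no matter how often individual fluxes are updated. The setup (cell count, random flux values, the fast/slow split) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
c = rng.random(n)                # cell averages, periodic domain, unit volumes
mass0 = c.sum()

def apply_flux(c, iface, flux, dt):
    # Flux across the interface between cell iface and cell iface+1
    # (periodic): it leaves one cell and enters the other, so the update
    # is antisymmetric and cannot change the total mass.
    c[iface] -= dt * flux
    c[(iface + 1) % len(c)] += dt * flux

# "fast" interfaces are updated twice per macro step with dt/2 ...
for k in range(2):
    for iface in range(n // 2, n):
        apply_flux(c, iface, rng.random(), 0.5)
# ... "slow" interfaces once with the full macro step
for iface in range(n // 2):
    apply_flux(c, iface, rng.random(), 1.0)

assert abs(c.sum() - mass0) < 1e-12   # total mass conserved to rounding
```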

Implementation details
As the name already suggests, the recursive flux splitting multirate algorithm is especially suitable for a recursive implementation. This kind of implementation makes it simple to exploit redundancies. The primary aim of multirate methods is to reduce computational cost, usually measured in evaluations of the right-hand side. An objection obvious to programmers is that the bottleneck on modern hardware is memory access rather than the actual computations on the CPU. This holds all the more if a multi-node cluster is employed and data has to be exchanged across the network. Fortunately, RFSMR can be implemented with only little memory overhead; additionally, the communication workload is reduced.
The algorithm is implemented in the multiscale atmospheric chemistry and transport model MUSCAT (Wolke and Knoth, 2000; Knoth and Wolke, 1998b) developed at the Leibniz Institute for Tropospheric Research in Leipzig. This model is used for scientific process studies as well as, online coupled with a meteorological model, for several air quality applications in local and regional areas. MUSCAT describes the transport as well as chemical transformations for several gas phase species and particle populations in the atmosphere. The spatial discretisation of the mass balance equations is performed by finite volume techniques on a hierarchical grid structure. A second order IMEX scheme is applied for the time integration. The step size control during the implicit integration leads to load imbalances. Therefore, a dynamic workload balancing is implemented (Wolke et al., 2004). This is done using the ParMetis libraries (Karypis et al., 2011; Karypis, 1999). The MUSCAT code is mainly written in Fortran 90/95 and uses a few additional C libraries.
This section is organised as follows: first, we shall explain how data is organised in our model, then present the programme flow of the local computations and finally explain data exchange strategies. Since the workload balancing gets more complicated as the programme complexity increases, we will also comment on this issue.

Data organization and spatial structure
In MUSCAT data is organised hierarchically: the three-dimensional computational domain is decomposed statically into rectangular blocks. This decomposition is applied in the horizontal direction only, so that every block ranges from ground level through the top of the simulated domain. An important feature is that each block may have its own spatial refinement level. Thus, selected regions may be examined in more detail. The cell size always is the macro cell size divided by an integer power of two. If two blocks are adjacent, their spatial refinement levels are allowed to differ by at most a factor of two. Cell size, adjacency to other blocks, temporal refinement level, etc., form the block meta data. Even for parallel processing, the meta data of all blocks is present on all processors. The main part of the data consists of concentration arrays and a number of difference arrays associated with the stage vectors of the employed explicit Runge-Kutta method. As opposed to the meta data, the main data of a block is present only on the processor the block is associated with. Additional variables include geometric data per cell (volume, extent per axis, etc.) defined on initialisation and meteorological data such as wind speed or the density of air. The latter may be provided by the COSMO model (Schättler et al., 2011; Steppeler et al., 2003) of the German weather service via online coupling or by a simple test driver.
The concentration array is declared in such a way that cell-local values, i.e., the concentrations of the different tracers or species inside one cell, are directly adjacent in physical memory. The cell data in turn is organised so that cells within one column (i.e., cells at the same horizontal position) have minimal distance in memory. This layout is advantageous for the computations executed most frequently, i.e., cell-local chemistry and column-local (vertical) diffusion. A number of fully coupled implicit chemistry-diffusion steps have to be performed per advection step; this part contributes the main share of the overall computational cost. All arrays have the same shape and dimension, making vectorised operations possible.
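The described layout can be illustrated with NumPy; the shapes and counts are invented for the example, and in the Fortran code the same effect is obtained by making the species index the first, fastest-varying one.

```python
import numpy as np

nspec, nz, nx, ny = 16, 40, 8, 8     # counts invented for the example
# Species index fastest, vertical index next, so that one cell's species
# vector and one column's cells lie contiguously in memory.
conc = np.zeros((ny, nx, nz, nspec))
cell = conc[0, 0, 0, :]              # all species of one cell
column = conc[0, 0, :, :]            # one column: nz cells times nspec species
assert conc.strides[-1] == conc.itemsize             # species adjacent
assert conc.strides[-2] == nspec * conc.itemsize     # column cells next
assert cell.flags["C_CONTIGUOUS"] and column.flags["C_CONTIGUOUS"]
```

Cell-local chemistry and column-local diffusion then operate on contiguous memory, which is the point of the layout.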
We employ an extended array declaration for the individual blocks, where the extended array includes both the actual cells of the block and the surrounding halo or ghost cells which are needed for coupling with adjacent blocks. Thus, all block-local computations may be done on a logically Cartesian array. The extended arrays are only used where necessary. For instance, the difference arrays associated with the stage vectors are only defined for the inner cells. Additional data structures for the exchange of mass fluxes with neighbour blocks are needed along the respective boundaries, see Fig. 1. These data structures correspond to the source term r occurring in Eqs. (4) and (11). Employing a limited third order upwind spatial discretisation (Hundsdorfer and Verwer, 2003), the halo has to be one cell wide, while the additional data structures overlap with the halo cells and the outermost row of actual cells.
Though the grid of the block is logically Cartesian, its geometrical interpretation may be different. First of all, the simulation domain usually represents a volume above a spherical surface. Furthermore, perpendicularly to the interface the ghost cells have the extent of the actual cells they overlap with, see Fig. 2. Case (d) only occurs at block corners and may be complicated for neighbour blocks with different spatial refinement levels. These different possible relations have to be taken into account when data is exchanged between blocks. Keep in mind that the domain is decomposed in the horizontal direction only, so that all of the above holds not only for cells in one vertical layer, but also for the columns, ranging from the bottom through the top of the simulation domain.

Programme flow
In this section we shall discuss the MUSCAT programme flow. For simplicity we shall concentrate on the main integration loop, omitting initialisation, finalisation and output routines. Algorithm (1)-(6) translates directly into an implementation. The following pseudo code evolves all blocks on the given time level from t_0 to t_0 + Δt. Here and subsequently we will employ the terms "time level" and "temporal refinement level" synonymously; in the pseudo code we also shortly write "level". The macro time step is equivalent to the lowest level.

 1: procedure RFSMR(t_0, Δt, level)
 2:   for i = 2, ..., s + 1 do                (loop over Runge-Kutta stages)
 3:     for all processor-local blocks k do
 4:       if [time level of block k] = level then
 5:         compute local advective fluxes f_{i−1}
 6:         ...
 7:         r := ...                          (source term, see Eq. 2)
 8:         exchange fluxes with neighbour blocks on the same level:
            r(equal) := r(neighbour)
 9:         r := r + r(equal)
10:       end if
11:     end for

At this point, all advective fluxes on time level "level" are computed and known on all blocks containing the corresponding cell boundary. Note that for exchanges we write the variables without superscript if the variable is block local, and with a superscript "neighbour" if the variable belongs to another block. The source term r (see line 7) corresponds to the source term r_i as mathematically defined in Eq. (2). Since source terms computed during earlier explicit stages are not needed anymore, we employ a single variable. Source terms r(slow) computed on a lower time level are defined before the routine is called. Now a case distinction has to be made, whether or not the forward step in time associated with the current Runge-Kutta stage, Δt (c_i − c_{i−1}), is greater than zero.
www.geosci-model-dev.net/5/1395/2012/ Geosci. Model Dev., 5, 1395-1405, 2012

12:       if there are blocks on a higher level then
13:         ...
14:         substeps := 2 (c_i − c_{i−1})
15:         send weighted boundary fluxes to neighbours on the higher level:
            r(neighbour)(slow) := ...
...

In this pseudo code we assume that DIFFREACT(Y, r, τ_0, Δτ) solves the initial value problem given by

    c′(τ) = r + F(c(τ), τ),   c(τ_0) = Y,

with F representing a time dependent diffusion-reaction term, and stores the result c(τ_0 + Δτ) in the variable Y. The number of substeps to be taken on the next higher time level can be chosen dynamically in line 14. This allows us to use a different base method on the next level to ensure a time step ratio of 2 between successive time levels. For the method (RK2a) given in Table 2, two substeps will be employed between the first and second stage, while no step will be taken between the second stage and the summation stage; this is equivalent to the formal construction as shown in Table 1. For (RK2b), one step on the next temporal level will be done both for the second stage and the summation stage, again leading to a time step ratio of two. Generally, this time step ratio is possible for any explicit Runge-Kutta method with monotonically increasing nodes c such that 2 (c_i − c_{i−1}) is a non-negative integer for all i. The above synopsis of the actual implementation already shows that, due to the recursive calls, the programme flow will naturally be structured into multiple phases corresponding to the different time levels. This fact significantly complicates balancing. Furthermore, there are different kinds of exchanges, which shall be discussed in the following section.
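The substep counts described for (RK2a) and (RK2b) follow directly from the nodes of the outer method. A minimal sketch, with node values inferred from the behaviour described above rather than taken from Table 2:

```python
def substep_schedule(c):
    """Substeps on the next higher time level per outer stage, for a time
    step ratio of 2.  The node list c includes the summation node c_{s+1}=1;
    a zero entry corresponds to a correction step."""
    return [int(round(2 * (c[i] - c[i - 1]))) for i in range(1, len(c))]

# Nodes assumed from the described behaviour (illustrative):
rk2a = [0.0, 1.0, 1.0]   # full step, then a correction stage
rk2b = [0.0, 0.5, 1.0]   # two half steps

assert substep_schedule(rk2a) == [2, 0]   # two substeps, then no step
assert substep_schedule(rk2b) == [1, 1]   # one substep per stage
```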

Data exchange
In the above pseudo code, the following data exchanges occur. Exchanges of fluxes have to be performed between blocks on one level (line 8) or from a lower to a higher time level (line 15). Exchanges of concentrations have to be performed between blocks on a single level (lines 23 and 30) or from a higher to a lower level (line 31). While the expression "exchange of fluxes" is illustrative, it is not exact: stored and exchanged are not the advective fluxes across cell boundaries, but the time derivatives of the concentrations per cell computed from the net fluxes. If the boundary fluxes are denoted f_{i±1/2,j} and f_{i,j±1/2} with spatial indexes i, j and cell volume V_{i,j}, then the concentration tendency due to advection reads

    dc_{i,j}/dt = −(f_{i+1/2,j} − f_{i−1/2,j} + f_{i,j+1/2} − f_{i,j−1/2}) / V_{i,j}.

Due to the data structure, i.e., the data available for a block and the algorithm employed for the calculation of block-local fluxes, each boundary flux is computed exactly once and sent to the neighbouring block. It is convenient to store the received time derivatives in an additional variable, allowing for a distinction between fluxes given by neighbours on the same and on a lower temporal refinement level. Exchange of concentrations always overwrites the previous concentration of the receiving cell. Considering the geometrical aspects of the various exchanges, it is important to ensure that multiple representations of the same physical volume contain equivalent data. As all exchanged quantities are either cell average values or time derivatives of cell average values, this can be accomplished by averaging and constant interpolation of these quantities. We interpret the cell volumes V ⊂ Ω as subsets of the simulation domain Ω. Further, we denote the original quantity and the original volume as given by the sending block with a superscript "S", and the copied quantity of the receiving block and the correlated volume with a superscript "R". The exchanges for a generic quantity q for different configurations then read

    q^R = q^S                           (equal volumes),
    q^R = Σ_k (V^S_k / V^R) q^S_k       (averaging, V^R = ∪_k V^S_k),
    q^R = q^S                           (constant interpolation, V^R ⊂ V^S),

corresponding to the cell relations (a), (b) and (c) as illustrated in Fig. 2. More complicated relations hold for diagonal exchanges; for a better understanding see Fig. 4. There a flux across a specific cell boundary is computed from data present in the upper left block. This flux in turn is represented by a source term marked "+" and a sink term marked "−". As the same physical region is represented four times, these source and sink terms have to be exchanged to the other blocks shown. The exchange is complicated by the fact that the source term influences "half a ghost cell" of the lower right block. For this specific configuration the problem can be solved by adding half of the source term to the ghost cell.
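The averaging and constant interpolation rules for cases (b) and (c) can be sketched in a few lines; the helper names and the cell values are illustrative:

```python
import numpy as np

def to_coarse(q_fine, v_fine):
    """Case (b): fine -> coarse ghost cell, volume-weighted average."""
    return np.sum(v_fine * q_fine) / np.sum(v_fine)

def to_fine(q_coarse, n):
    """Case (c): coarse -> fine ghost cells, constant interpolation."""
    return np.full(n, q_coarse)

q = np.array([1.0, 3.0])          # fine cell averages
v = np.array([2.0, 2.0])          # fine cell volumes
qc = to_coarse(q, v)              # average over the coarse volume
assert qc == 2.0
assert np.allclose(to_fine(qc, 2), [2.0, 2.0])
# the integral quantity q * V is preserved by the averaging
assert np.isclose(qc * v.sum(), np.sum(q * v))
```

Because the averaging is volume weighted, the integral of the exchanged quantity over the represented region is identical on both sides, which is what "equivalent data" requires.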
If, as in our implementation, the spatial resolution of directly (i.e., not diagonally) adjacent blocks is allowed to differ by at most a factor of two, nine different configurations of diagonal overlaps have to be distinguished. Note that diagonal exchanges are obsolete if all blocks have the same temporal refinement level, i.e., if a classical time integration scheme is employed.
For efficient parallel execution it is desirable to minimise communication cost. Generally it is more efficient to exchange one big chunk of data instead of several small chunks of equal cumulative size. For this reason inter-process exchange is implemented as a gathering or packing of data, an exchange using MPI routines and finally an unpacking of data. As meta data regarding all blocks, including adjacency information, is present on all processors, every participant of an exchange knows a priori which data to send and/or to receive. To reduce the amount of data to be exchanged, interpolation and averaging are arranged such that as little data as possible has to be transferred. This means that averaging is done before packing, while interpolation is done after unpacking.
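The pack-exchange-unpack pattern amounts to concatenating the per-neighbour pieces into one buffer and splitting it again on the receiving side. A sketch with the actual MPI call elided (sizes and variable names are illustrative):

```python
import numpy as np

# Halo data destined for one neighbour: several small arrays are packed into
# a single buffer so that one message replaces several small ones.
pieces = [np.arange(4.0), np.arange(6.0), np.arange(2.0)]
sizes = [p.size for p in pieces]
sendbuf = np.concatenate(pieces)          # pack (gather)
# ... the buffer would now be handed to an MPI send/receive pair; the copy
# below merely stands in for the network transfer ...
recvbuf = sendbuf.copy()
offsets = np.concatenate(([0], np.cumsum(sizes)))
unpacked = [recvbuf[offsets[i]:offsets[i + 1]] for i in range(len(sizes))]
assert all(np.array_equal(p, u) for p, u in zip(pieces, unpacked))
```

Since the block meta data is replicated, both sides can compute the same `sizes` and `offsets` without any extra handshake message.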

Fig. 3. Exchanges in the course of a macro time step for a method based on RK2a, see also Table 1. The c_i^O and c_i^I denote the nodes of the outer and inner base methods.

The receiving block's outer cells are updated from their geometrical counterparts; the rationale for this is that the halo cells of the sending block contain a better approximation than the receiving block's outer cells. An illustration of the exchanges in the course of one macro time step is given in Fig. 3. The last exchange is necessary to update the higher level block's outer cells and halo with the result of the correction stage performed on the macro time level.

Balancing
If a simulation is to be distributed over multiple cores of one processor, multiple processors or even multiple nodes of a computing cluster, the simulation has to be split into several parts. In the context of air pollution modelling this means a decomposition of the simulation domain into blocks. The available computing elements then have to be assigned to these blocks such that idle times are minimised. For classical time integration schemes, this can easily be implemented by performing workload balancing. Thus it can be ensured that every processor has approximately the same amount of work to do. The Metis/ParMetis libraries (Karypis et al., 2011) provide sophisticated balancing algorithms. Balancing problems are interpreted as a class of graph theoretical problems which may be subsumed as "minimise the edge cut without violating node balancing constraints". Blocks of the decomposed simulation domain are mapped to nodes of the graph; necessary data exchanges are mapped to the edges connecting these nodes. Consequently, minimising the edge cut is equivalent to minimising the communication between partitions or processors.
The more complex programme flow in the multirate context calls for a more sophisticated approach to balancing. Naive workload balancing is not sufficient to minimise idle times, as the programme flow is structured in multiple phases due to the recursive calls, see Fig. 5 for an example. Assuming that the same computational cost is associated with all four blocks shown there, both presented block distributions are optimal in the sense of workload balancing. However, the worst case distribution will lead to an unnecessarily high amount of idle time.
Optimally not only the overall workload is balanced, but also the workload for each phase, i.e., for each temporal refinement level.The Metis/ParMetis libraries offer the possibility of multi constraint balancing, (Karypis, 1999).Classical balancing (single constraint) considers one scalar weight per node and aims at an even distribution of this scalar weight within a margin of tolerance.Opposed to this multiconstraint balancing considers a vector of weights for each node and aims at an even distribution for each vector dimension.For our algorithm that naively means that the weight vector w for a block reads 8 Setup: -Blocks #1 and #2 on temporal refinement level 0 -Blocks #3 and #4 on temporal refinement level 1.
- integration schemes this can easily be implemented by performing workload balancing.Thus it can be provided that every processor has approximately the same amount of work to do.The Metis/ParMetis libraries (Karypis et al., 2003) provide sophisticated balancing algorithms.Balancing problems are interpreted as a class of graph theoretical problems which may be subsumed as "minimize the edge cut without violating node balancing constraints".Blocks of the decomposed simulation domain are mapped to nodes of the graph, necessary data exchanges are mapped to the edges connecting these nodes.Consequently minimizing the edge cut is equivalent to minimizing the communication between partitions or processors.
The more complex program flow in the multirate context calls for a more sophisticated approach to balancing.Naive workload balancing is not sufficient to minimize idle times, as the program flow is structured in multiple phases due to recursive calls, see Fig. 5 for an example.Assuming that the same computational cost is associated with all four mentioned blocks, both presented block distributions are optimal in the sense of workload balancing.However the worst case distribution will lead to an unnecessarily high amount of idle times.
Optimally, not only the overall workload is balanced but also the workload for each phase, i.e. for each temporal refinement level. The Metis/ParMetis libraries offer the possibility of multi-constraint balancing (Karypis, 1999). Classical balancing (single constraint) considers one scalar weight per node and aims at an even distribution of this scalar weight within a margin of tolerance. Opposed to this, multi-constraint balancing considers a vector of weights for each node and aims at an even distribution for each vector dimension. For our algorithm, that naively means that the weight vector w for a block reads

    w_l = C for l = L,  w_l = 0 otherwise  (l = 0, ..., N - 1),

with N denoting the number of time levels throughout the simulation, L the block's time level and C the number of columns within the block. Choosing this approach will probably not lead to satisfactory results, as the constraints leave only little margin for optimisation. As a compromise, we employ three constraints correlated to the highest temporal refinement level L_max, the second-highest temporal refinement level and the remaining levels. The rationale for this is that in most setups the two highest refinement levels will cause the bigger part of the computational cost. Consequently, balancing blocks on these levels will lead to an acceptable trade-off between constraints and idle times. We define the three-dimensional weight vector as follows:

    w = (C, 0, 0)        if L = L_max,
    w = (0, C, 0)        if L = L_max - 1,
    w = (0, 0, 2^L C)    otherwise.

The factor 2^L in the third case is needed to consider the relative computational cost of blocks on potentially different time levels. The scaling for the first two components of the weighting vector is not necessary, as only their relative weights are taken into account by the ParMetis library. To prioritise balancing of the highest temporal refinement levels over the remaining levels, a smaller and, thus, stricter margin of tolerance is provided for the first component of the weight vector.

^1 Not to be confused with the cost reduction due to the application of a multirate scheme.
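As a sketch, the per-block weight assignment might look as follows. The helper `weight_vector` and all block attributes are hypothetical, and the component definition follows the three-constraint scheme described above:

```python
# Sketch of the three-constraint weight vector described above; the block
# attributes (columns C, time level L) are invented for illustration.

def weight_vector(C, L, L_max):
    """Three-component balancing weight for a block with C columns on
    temporal refinement level L (L_max = highest level in the setup)."""
    if L == L_max:
        return (C, 0, 0)          # finest time level: own constraint
    if L == L_max - 1:
        return (0, C, 0)          # second-finest level: own constraint
    return (0, 0, (2 ** L) * C)   # remaining levels share one constraint;
                                  # 2**L accounts for relative cost per level

blocks = [(100, 3), (80, 2), (200, 0), (150, 1)]  # (C, L) pairs, L_max = 3
for C, L in blocks:
    print(weight_vector(C, L, L_max=3))
```

In an actual run, these vectors would be passed to the multi-constraint partitioner (e.g. ParMetis) together with per-constraint tolerance margins, the first component receiving the strictest margin.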

Computational tests with realistic scenarios show ambivalent results. While multi-constraint balancing is suitable to minimise idle times during computation, it generally leads to higher communication cost, as the optimisation of communication is hindered by more constraints. Thus, it is generally recommendable for such simulations in which local computations take significantly more time than data exchange, e.g. simulations involving computationally expensive chemistry or microphysics. If, in contrast to that, a simulation is communication dominated, relatively little parallelisation speedup can be expected even for an optimal distribution of blocks on different processors.

Results
The implementation described above was tested with academic and realistic scenarios. Results show good agreement of the solutions obtained with the multirate splitting and classical time integration. We shall present two test cases. The first test is designed to make optimal use of the multirate approach by employing a grid with a small region of interest and a homogeneous, diagonal wind field. The other test case is taken from an earlier study, with a realistic wind field provided by the COSMO model (Hinneburg et al., 2009). Results of the latter case shall demonstrate the potential of multirate schemes for realistic scenarios as well as show remaining deficiencies.

Academic test case
An important characteristic of parallel programmes is the speedup^1 when solving the problem on multiple processors. For the scenario discussed here, we observed not quite an ideal (i.e. linear) speedup, but the overhead is small enough to justify parallel execution. Furthermore, due to a sophisticated balancing approach making use of ParMetis' multi-constraint partitioning capabilities, the parallelisation speedup is comparable to the one obtained for the much simpler case of single-rate time integration. The domain is square in the horizontal directions. A comparatively small region is refined, see Table 3 and Fig. 6. Sources are located within the most finely resolved region.
In this domain, we tested two kinds of model equations: pure advection with a uniform wind field, and advection-diffusion-reaction with the same wind field, vertical diffusion and a chemistry model involving point sources in the finest region and 258 different chemical reactions of 98 reactants. The overall computational cost of the full system is about a factor of 100 larger than that of the pure advection case. These tests are each run on different numbers of processors. We compare the computational cost of the multirate approach to the computational cost without temporal refinement, denoted single-rate. Results are shown in Fig. 7.
As the time step is chosen to be directly proportional to the grid size, the naive cost reduction is approximately 52 %.

For a pure advection problem, we achieve an average multirate cost reduction of 59 %. Here, the exceeding of the naive cost reduction can be explained by reduced communication. The behaviour for the full advection-diffusion-reaction system, however, is not intuitive. In the latter case, the multirate approach is applied only to the advection operator, whose evaluation contributes about 1 % to the total computational cost. However, there still is a significant cost reduction of about 36 %. The reason for this is the behaviour of the term solved implicitly depending on the source term r, see Eq. (4). Each update of this source term introduces a discontinuity in the right-hand side of the equation. In combination with the error control of the second-order implicit solver, this causes smaller implicit steps or even makes expensive restarts necessary (Knoth and Wolke, 1998a). Fewer updates take place if the outer system is solved using a larger time step, thus indirectly improving the efficiency of the implicit solving.
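The bookkeeping behind such naive cost estimates can be sketched as follows, assuming the time step halves with each spatial refinement level. The cell counts used here are hypothetical and do not reproduce the 52 % figure of the actual grid:

```python
# Sketch: naive multirate cost reduction from counting advection evaluations.
# With the time step proportional to the grid size, a cell on refinement
# level l is advanced 2**l times per macro step (level 0 = coarsest).
# The cell counts per level are hypothetical.

def naive_cost_reduction(cells_per_level):
    levels = len(cells_per_level)
    # Single-rate: every cell is stepped at the rate of the finest level.
    singlerate = sum(cells_per_level) * 2 ** (levels - 1)
    # Multirate: each cell is stepped at its own level's rate.
    multirate = sum(n * 2 ** l for l, n in enumerate(cells_per_level))
    return 1.0 - multirate / singlerate

# Hypothetical distribution: many coarse cells, few finest cells.
print(round(naive_cost_reduction([4000, 2000, 1000, 500]), 2))
```

The estimate counts right-hand side evaluations only; it ignores communication, which is why the measured multirate reduction can exceed (pure advection) or fall short of (realistic wind fields) the naive figure.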

Realistic test case
The following test case is taken from an earlier study performed by Hinneburg et al. (2009), examining the effects of emissions of two power plants in Germany. One plant is located near Lippendorf, the other near Boxberg, both in the federal state of Saxony. Emissions are modelled by point sources, which for the larger part represent the chimneys of the power plants, and area sources representing the emissions according to land usage. Meteorological data are provided by the COSMO model via online coupling.
Again, we employ a grid with four levels of refinement with small, highly resolved regions around the power plants, totalling about 1 % of the overall area, see Table 4 and Fig. 8. The high resolution in this context is chosen to ensure an accurate description of the near-field chemistry around the power plants by reducing numerical diffusion. Exactly equal parts of the total area are on the coarsest and second-coarsest refinement levels. Assuming a homogeneous wind field, we would expect a slightly higher reduction of computational cost than for the academic test case with an equal diffusion-reaction setup, as a smaller fraction of cells is on the finest refinement level. The actually obtained cost reduction is lower, see also Fig. 9. This results from the fact that the wind fields provided by the COSMO model exhibit strong variability in all of the computational domain. A more sophisticated time step selection, based on the characteristic times of the individual blocks rather than solely on the spatial resolution, could be constructed with relative ease. Complications arise for parallel execution: if the temporal refinement level of a block is changed, a redistribution of the blocks is necessary. This holds for either of the balancing approaches discussed in Sect. 3.4.
A further defect concerns the parallelisation speedup: while the problem scales well for up to eight processors, employing sixteen processors yields only very little improvement. At least in part, this results from inhomogeneities due to the distribution of the point sources. The strongest point sources are inhomogeneously distributed on the blocks with the highest temporal and spatial refinement level. Each of the point sources induces significantly increased cost for the chemistry solver. For up to eight processors, the blocks containing the point sources are distributed evenly among all processors; for more processors, this is not the case. This problem is complicated further by the temporally varying wind field. Due to higher concentrations, the speed of chemical reactions inside the plume is significantly higher than in free air. If the plume is transported into a previously empty cell, the computational cost of this cell is increased for the next time step. This effect cannot easily be taken into consideration a priori. However, employing dynamic repartitioning based on the measured workload per block seems to be a promising way to tackle this problem.
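A workload-based repartitioning criterion along these lines might be sketched as follows; the function names, the timings and the 10 % tolerance are placeholder assumptions, not part of the actual implementation:

```python
# Sketch: decide when to repartition from measured per-block runtimes.
# The timings and the tolerance are hypothetical placeholders.

def load_imbalance(block_times, partition, n_procs):
    """Ratio of the most loaded processor's load to the average load."""
    loads = [0.0] * n_procs
    for block, t in block_times.items():
        loads[partition[block]] += t
    avg = sum(loads) / n_procs
    return max(loads) / avg

def needs_repartition(block_times, partition, n_procs, tol=1.10):
    # Trigger a (costly) redistribution only if the measured imbalance
    # exceeds the tolerance, e.g. 10 % above the average load.
    return load_imbalance(block_times, partition, n_procs) > tol

# A plume entering block 3 raises its chemistry cost for the next step.
times = {1: 1.0, 2: 1.0, 3: 2.5, 4: 1.0}
print(needs_repartition(times, {1: 0, 2: 0, 3: 1, 4: 1}, n_procs=2))
```

Measuring actual runtimes sidesteps the a priori prediction problem described above: whatever makes a block expensive, plume chemistry included, shows up in its timing for the next repartitioning decision.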
Since correcting both of the mentioned defects involves the implementation of a complex repartitioning routine, it seems recommendable to implement a joint solution.

Conclusions
In this paper, we presented details on an efficient implementation of a general splitting approach. This approach is employed to obtain a multirate-IMEX splitting, i.e. advection for different physical domains is solved with different explicit time steps depending on the grid size, while diffusion-reaction equations are solved implicitly. We have shown that the presented implementation is efficient in the sense that a good fraction of the theoretical speedup can be obtained in practice. As practical tests have shown, even the efficiency of the implicit solving is improved due to synergetic effects. A reasonable parallel speedup can be achieved by employing the multi-constraint balancing approach described in Sect. 3.4.
Further work should include a more sophisticated selection of the temporal refinement level, as opposed to choosing the temporal refinement level equal to the spatial refinement level.Additionally, the system could be improved by implementation of dynamic repartitioning.

Fig. 1. Illustration of the block structure for the simulation domain as well as one generic block in MUSCAT.
needed along the respective boundaries, see Fig. 1. These data structures correspond to the source term r occurring in Eqs. (2).

Fig. 2. Geometrical cell structure of adjacent blocks. Halo cells are depicted with thinner contours. Connectors between blocks indicate multiple representations of the same physical volume.
The possible relations between actual cells and ghost cells representing the same physical volume are also marked in Fig. 2. The respective volume may be represented by: (a) one inner cell and one ghost cell, (b) two inner cells and one ghost cell of a coarser neighbour, (c) one inner cell and two ghost cells of a finer neighbour or (d) inner cell(s) and ghost cell(s) for each of two neighbours.

While flux exchange always involves both the outermost actual cells and the halo cells, the exchange of concentrations may occur in different ways. If concentrations are exchanged on a single time level, the exchange is a copy from the outermost actual cells of the sender to the halo cells of the receiver, possibly complicated by inter-process communication. If, on the other hand, concentrations are sent to a block on a different time level, concentrations from both ghost cells and outermost actual cells are required.

Fig. 4. Multiple representation of the same physical region (marked by red rectangles) and source/sink terms caused by a specific advective flux.


Fig. 5. Parallel programme flow for different distributions of four blocks on two processors. Thick lines indicate exchanges.

Fig. 6. Illustration of the spatial structure for the academic test run. Hatchings correspond to different grid sizes.


Fig. 7. Computational cost for the academic test case. The displayed simulation time was normalised such that the single-rate setup on one processor corresponds to unity. The actual computational cost of the pure advection setup is about 1 % of the full setup.

Fig. 8. Illustration of the spatial structure for the realistic test run. Hatchings correspond to different grid sizes.


Fig. 9. Computational cost for the realistic test case. The displayed simulation time was normalised such that the single-rate setup on one processor corresponds to unity.

Table 1. RFMSMR(RK2a), an example of a second-order multirate scheme with time step ratio R = 2; redundant stages omitted.

Table 2. Examples of two-stage, second-order explicit Runge-Kutta methods.

Table 3. Synopsis of spatial structure for academic test case.


Table 4. Synopsis of spatial structure for realistic test case.
