The Parallelized Large-Eddy Simulation Model (PALM) version 4.0 for atmospheric and oceanic ﬂows: model formulation, recent developments, and future perspectives

. In this paper we present the current version of the Parallelized Large-Eddy Simulation Model (PALM) whose core has been developed at the Institute of Meteorology and Climatology at Leibniz Universität Hannover (Germany). PALM is a Fortran 95-based code with some Fortran 2003 extensions and has been applied for the simulation of a variety of atmospheric and oceanic boundary layers for more than 15 years. PALM is optimized for use on massively parallel computer architectures and was recently ported to general-purpose graphics processing units. In the present paper we give a detailed description of the current version of the model and its features, such as an embedded Lagrangian cloud model and the possibility to use Cartesian topography. Moreover, we discuss recent model developments and future perspectives for LES applications.


Introduction
In meteorology, large-eddy simulation (LES) has been used since the early 1970s for various research topics on turbulent flows at large Reynolds numbers in the atmospheric boundary layer (ABL).The first investigations using LES were performed by Lilly (1967) and Deardorff (1973Deardorff ( , 1974)).Nowadays, thanks to the increasing power of modern supercomputers, the technique is well known and widely spread within the boundary-layer meteorology community (see the review in Mason, 1994).Numerous studies in boundary-layer re-search that made use of LES have been published since then, with gradually increasing model resolution over the years (Moeng, 1984;Mason, 1989;Wyngaard et al., 1998;Sullivan et al., 1998;Sorbjan, 2007;Maronga, 2014, among many others).A detailed intercomparison of different LES codes can be found in Beare et al. (2006).LES models solve the three-dimensional (3-D) prognostic equations for momentum, temperature, humidity, and other scalar quantities.The principle of LES is based on the separation of scales.Turbulence scales that are larger than a certain filter width are directly resolved, whereas the effect of smaller scales is parametrized by a subgrid-scale (SGS) turbulence model.As the bulk part of the energy is contained in the large eddies, about 90 % of the turbulence energy can be resolved by means of LES (e.g., Heus et al., 2010).In practice, the filter width often depends on the grid resolution and therefore on the phenomenon that is studied.Typical filter widths can thus range from 50-100 m for phenomena on a regional scale like Arctic cold-air outbreaks (e.g., Gryschka and Raasch, 2005) down to 0.5-2 m for LES of the urban boundary layer with very narrow streets (e.g., Kanda et al., 2013), or for simulations of the stable boundary layer (e.g., Beare et al., 2006).
In this overview paper we describe the Parallelized LES Model (PALM) whose core has been developed at the Institute of Meteorology and Climatology (IMUK) at Leibniz Universität Hannover (Germany).The model is based on the non-parallelized LES code described by Raasch and Etling (1991).The parallelized version was developed about 6 years later and its first formulation can be found in Raasch and Schröter (2001).Therewith, PALM was one of the first parallelized LES models for atmospheric research at all.Many people have helped in developing the code further over the past 15 years, and large parts of the code have been added, optimized and improved since then.For example, embedded models such as a Lagrangian cloud model (LCM) as part of a Lagrangian particle model (LPM) and a canopy model have been implemented.Also, an option for Cartesian topography is available.Moreover, the original purpose of the model to study atmospheric turbulence was extended by an option for oceanic flows.Thus, Raasch and Schröter (2001) can no longer be considered an adequate reference for current and future research articles.
In the present paper we will provide a comprehensive description of the current version 4.0 of PALM.The idea for this overview paper was also partly inspired by Heus et al. (2010), who gave a detailed description of the Dutch Atmospheric Large-Eddy Simulation (DALES) model.
In the course of the release of PALM 4.0, a logo was designed, showing a palm tree -a reference to the acronym PALM (see Fig. 1).
Over the last 15 years, PALM has been applied for the simulation of a variety of boundary layers, ranging from heterogeneously heated convective boundary layers (e.g., Raasch and Harbusch, 2001;Letzel and Raasch, 2003;Maronga and Raasch, 2013), urban canopy flows (e.g., Park et al., 2012;Kanda et al., 2013), and cloudy boundary layers (e.g., Riechelmann et al., 2012;Hoffmann et al., 2015;Heinze et al., 2015).Moreover, it has been used for studies of the oceanic mixed layer (OML, e.g., Noh et al., 2010Noh et al., , 2011) ) and recently for studying the feedback between atmosphere and ocean by Esau (2014).PALM also participated in the first intercomparison of LES models for the stable boundary layer, as part of the Global Energy and Water Cycle Experiment Atmospheric Boundary Layer Study initiative (GABLS, Beare et al., 2006).In this experiment, PALM was for the first time successfully used with extremely fine grid spacings of down to 1 m.From the very beginning, PALM was designed and optimized to run very high-resolution setups and large model domains efficiently on the world's biggest supercomputers.
The paper is organized as follows: Sect. 2 deals with the description of the model equations, numerical methods and parallelization principles.Section 3 describes the embedded models such as cloud physics, canopy model and LPM, followed by an overview of the technical realization (Sect.4).In Sect. 5 we will outline topics of past applications of PALM and discuss both upcoming code developments and future perspectives of LES applications in general.Section 7 gives a summary.

Model formulation
In this section we will give a detailed description of the model.We will confine ourselves to the atmospheric formulation and devote a separate section (see Sect. 2.7) to the ocean option.By default, PALM has six prognostic quantities: the velocity components u, v, w on a Cartesian grid, the potential temperature θ , specific humidity q v or a passive scalar s, and the SGS turbulent kinetic energy (SGS-TKE) e.The separation of resolved scales and SGS is implicitly achieved by averaging the governing equations (see Sect. 2.1) over discrete Cartesian grid volumes as proposed by Schumann (1975).Moreover, it is possible to run PALM in a direct numerical simulation mode by switching off the prognostic equation for the SGS-TKE and setting a constant eddy diffusivity.For a list of all symbols and parameters that we will introduce in Sect.2.1, see Tables 1 and 2.

Governing equations
The model is based on the non-hydrostatic, filtered, incompressible Navier-Stokes equations in Boussinesqapproximated form.In the following set of equations, angle brackets denote a horizontal domain average.A subscript 0 indicates a surface value.Note that the variables in the equations are implicitly filtered by the discretization (see above), but that the continuous form of the equations is used here for convenience.A double prime indicates SGS variables.The overbar indicating filtered quantities is omitted for readability, except for the SGS flux terms.The equations for the conservation of mass, energy and moisture, filtered over a grid volume on a Cartesian grid, then read as ∂s ∂t = − ∂u j s ∂x j − ∂ ∂x j u j s + s .
(5) 1005 J kg −1 K −1 Heat capacity of dry air at constant pressure g 9.81 m s −2 Gravitational acceleration L V 2.5 × 10 6 J kg −1 Latent heat of vaporization p 0 1000 hPa Reference air pressure R d 287 J kg −1 K −1 Specific gas constant for dry air R v 461.51 J kg −1 K −1 Specific gas constant for water vapor κ 0.4 Kármán constant ρ kg m −3 Density of dry air ρ 0 1.0 kg m −3 Density of dry air at the surface ρ l,0 1003 kg m −3 Density of liquid water 0.729 × 10 −4 rad s −1 Angular velocity of the Earth Here, i, j, k ∈ {1, 2, 3}.u i are the velocity components (u 1 = u, u 2 = v, u 3 = w) with location x i (x 1 = x, x 2 = y, x 3 = z), t is time, f i = (0, 2 cos(φ), 2 sin(φ)) is the Coriolis parameter with being the Earth's angular velocity and φ being the geographical latitude.u g,k are the geostrophic wind speed components, ρ 0 is the density of dry air, π * = p * + 2 3 ρ 0 e is the modified perturbation pressure with p * being the perturbation pressure and the SGS-TKE e = 1 2 u i u i , and g is the gravitational acceleration.The potential temperature is defined as with the current absolute temperature T and the Exner function with p being the hydrostatic air pressure, p 0 = 1000 hPa a reference pressure, R d the specific gas constant for dry air, and c p the specific heat of dry air at constant pressure.The virtual potential temperature is defined as with the specific gas constant for water vapor R v , and the liquid water specific humidity q l .For the computation of q l , see the descriptions of the embedded cloud microphysical models in Sects.3.1 and 3.3.Furthermore, L V is the latent heat of vaporization, and q v and s are source/sink terms of q v and s, respectively.

Turbulence closure
One of the main challenges in LES modeling is the turbulence closure.The filtering process yields four SGS covariance terms (see Eqs. 1-5) that cannot be explicitly calculated.
In PALM, these SGS terms are parametrized using a 1.5order closure after Deardorff (1980).PALM uses the modified version of Moeng and Wyngaard (1988) and Saiki et al. (2000).The closure is based on the assumption that the energy transport by SGS eddies is proportional to the local gradients of the mean quantities and reads where K m and K h are the local SGS eddy diffusivities of momentum and heat, respectively.They are related to the SGS-TKE as follows: Here, c m = 0.1 is a model constant and = 3 √ x y z with x, y, z being the grid spacings in the x, y and z directions, respectively.The SGS mixing length l depends on height z (distance from the wall when topography is used), , and stratification, and is calculated as for ∂θ v ∂z ≤ 0.  K m s −1 Upward vertical kinematic heat flux q kg kg −1 Total water content q l kg kg −1 Liquid water specific humidity Transport velocity of the indexed velocity component at the outlet A prognostic variable (u, v, w, θ/θ l , q v /q, s, e) ϕ LS Large-scale value of ϕ The pressure term in Eq. ( 16) is parametrized as and is the SGS dissipation rate within a grid volume, given by = 0.19 + 0.74 Since θ v depends on θ , q v , and q l (see Eq. 8), the vertical SGS buoyancy flux w θ v depends on the respective SGS fluxes (Stull, 1988, Chap. 4.4.5): with and the vertical SGS flux of liquid water, calculated as Note that this parametrization of the SGS buoyancy flux (Eq.19) differs from that used with bulk cloud microphysics (see Sect. 3.1.8).

Discretization
The model domain in PALM is discretized in space using finite differences and equidistant horizontal grid spacings ( x, y).The grid can be stretched in the vertical direction well above the ABL to save computational time in the free atmosphere.The Arakawa staggered C-grid (Harlow and Welch, 1965;Arakawa and Lamb, 1977) is used, where scalar quantities are defined at the center of each grid volume, whereas velocity components are shifted by half a grid width in their respective direction so that they are defined at the edges of the grid volumes (see Fig. 2).It is thus possible to calculate the derivatives of the velocity components at the center of the volumes (same location as the scalars).By the same token, derivatives of scalar quantities can be calculated at the edges of the volumes.In this way it is possible to calculate derivatives over only one grid length and the effective spatial model resolution can be increased by a factor of 2 in comparison to non-staggered grids.
By default, the advection terms in Eqs. ( 1)-( 5) are discretized using an upwind-biased fifth-order differencing Figure 2. The Arakawa staggered C-grid.The indices i, j and k refer to grid points in the x, y and z directions, respectively.Scalar quantities ϕ are defined at the center of the grid volume, whereas velocities are defined at the edges of the grid volumes.
scheme in combination with a third-order Runge-Kutta timestepping scheme after Williamson (1980).Wicker and Skamarock (2002) compared different time and advection differencing schemes and found that this combination gives the best results regarding accuracy and algorithmic simplicity.However, the fifth-order differencing scheme is known to be overly dissipative.It is thus also possible to use a second-order scheme after Piacsek and Williams (1970).The latter scheme is non-dissipative, but it suffers from immense numerical dispersion.Time discretization can also be achieved using second-order Runge-Kutta or first-order Euler schemes.

Pressure solver
The Boussinesq approximation requires incompressibility of the flow, but the integration of the governing equations formulated in Sect.2.1 does not provide this feature.Divergence of the flow field is thus inherently produced.Hence, a predictor-corrector method is used where an equation is solved for the modified perturbation pressure after every time step (e.g., Patrinos and Kistler, 1977).In a first step, the pressure term −(1/ρ 0 )∂π * /∂x i is excluded from Eq. ( 1) during time integration.This yields a preliminary velocity u t+ t i,pre at time t + t.Emerging divergences can then be attributed to the pressure term.Subsequently, the prognostic velocity can be decomposed in a second step as The third step then is to stipulate incompressibility for u t+ t i : www.The result is a Poisson equation for π * : The exact solution of Eq. ( 25) would give a π * that yields a u t+ t i free of divergence when used in Eq. ( 23).In practice, a numerically efficient reduction of divergence by several orders of magnitude is found to be sufficient.Note that the differentials in Eqs. ( 23)-( 25) are used for convenience and that the model code uses finite differences instead.When employing a Runge-Kutta time stepping scheme, the formulation above is used to solve the Poisson equation for each substep.π * is then calculated from its weighted average over these substeps.
In the case of cyclic lateral boundary conditions, the solution of Eq. ( 25) is achieved by using a direct fast Fourier transform (FFT).The Poisson equation is Fourier transformed in both horizontal directions; the resulting tridiagonal matrix is solved along the z direction, and then transformed back (see, e.g., Schumann and Sweet, 1988).PALM provides the inefficient but less restrictive Singleton FFT (Singleton, 1969) and the well-optimized Temperton FFT (Temperton, 1992).External FFT libraries can be used as well, with the FFTW (Frigo and Johnson, 1998) being the most efficient one.Alternatively, the iterative multigrid scheme can be used (e.g., Hackbusch, 1985).This scheme uses an iterative successive over-relaxation (SOR) method for the inner iterations on each grid level.The convergence of this scheme is steered by the number of so-called V-or W-cycles to be carried out for each call of the scheme and by the number of SOR iterations to be carried out on each grid level.As the multigrid scheme does not require periodicity along the horizontal directions, it allows for using non-cyclic lateral boundary conditions.

Boundary conditions
PALM offers a variety of boundary conditions.Dirichlet or Neumann boundary conditions can be chosen for u, v, θ , q v , and p * at the bottom and top of the model.For the horizontal velocity components the choice of Neumann (Dirichlet) boundary conditions yields free-slip (no-slip) conditions.Neumann boundary conditions are also used for the SGS-TKE.Kinematic fluxes of heat and moisture can be prescribed at the surface instead (Neumann conditions) of temperature and humidity (Dirichlet conditions).At the top of the model, Dirichlet boundary conditions can be used with given values of the geostrophic wind.By default, the lowest grid level (k = 0) for the scalar quantities and horizontal velocity components is not staggered vertically and defined at the surface (z = 0).In case of free-slip boundary conditions at the bottom of the model, the lowest grid level is defined below the surface (z = −0.5 • z) instead.Vertical velocity is assumed to be zero at the surface and top boundaries, which implies using Neumann conditions for pressure.
Following Monin-Obukhov similarity theory (MOST), a constant flux layer can be assumed as the boundary condition between the surface and the first grid level where scalars and horizontal velocities are defined (k = 1, z MO = 0.5• z).It is then required to provide the roughness lengths for momentum z 0 and heat z 0,h .Momentum and heat fluxes as well as the horizontal velocity components are calculated using the following framework.The formulation is theoretically only valid for horizontally averaged quantities.In PALM we assume that MOST can also be applied locally and we therefore calculate local fluxes, velocities, and scaling parameters.
Following MOST, the vertical profile of the horizontal wind velocity 1 2 is given in the surface layer by where κ = 0.4 is the Von Kármán constant and m is the similarity function for momentum in the formulation of Businger-Dyer (see, e.g., Panofsky and Dutton, 1984): Here, L is the Obukhov length, calculated as The scaling parameters θ * and q * are defined by MOST as with the friction velocity u * defined as In PALM, u * is calculated from u h at z MO by vertical integration of Eq. ( 26) over z from z 0 to z MO .
From Eqs. ( 26) and ( 30) it is possible to derive a formulation for the horizontal wind components, viz.
Vertical integration of Eq. ( 31) over z from z 0 to z MO then yields the surface momentum fluxes u w 0 and v w 0 .
The formulations above all require knowledge of the scaling parameters θ * and q * .These are deduced from vertical integration of over z from z 0,h to z MO .The similarity function h is given by h Note that this implementation of MOST in PALM requires the use of data from the previous time step.The following steps are thus carried out in sequential order.First of all, θ * and q * are calculated by integration of Eq. ( 32) using the value of z MO /L from the previous time step.Second, the new value of z MO /L is derived from Eq. ( 28) using the new values of θ * and q * , but using u * from the previous time step.Then, the new values of u * , and subsequently u w 0 as well as v w 0 , are calculated by integration of Eqs. ( 26) and (31), respectively.At last, Eq. ( 29) is employed to calculate the new surface fluxes w θ 0 and w q v 0 .In the special case, when surface fluxes are prescribed instead of surface temperature and humidity, the first and last steps are omitted and θ * and q * are directly calculated using Eq. ( 29) instead.
Furthermore, the flat bottom of the model can be replaced by a Cartesian topography (see Sect. 2.5.4).
By default, lateral boundary conditions are set to be cyclic in both directions.Alternatively, it is possible to opt for noncyclic conditions in one direction, i.e., a laminar or turbulent inflow boundary (see Sect. 2.5.1) and an open outflow boundary on the opposite side (see Sect. 2.5.3).The boundary conditions for the other direction have to remain cyclic.
In order to prevent gravity waves from being reflected at the top boundary, a sponge layer (Rayleigh damping) can be applied to all prognostic variables in the upper part of the model domain (Klemp and Lilly, 1978).Such a sponge layer should be applied only within the free atmosphere, where no turbulence is present.
The model is initialized by horizontally homogeneous vertical profiles of potential temperature, specific humidity (or a passive scalar), and the horizontal wind velocities.The latter can also be provided from a 1-D precursor run (see Sect. 3.5).Uniformly distributed random perturbations with a user-defined amplitude can be imposed to the fields of the horizontal velocity components to initiate turbulence.

Laminar and turbulent inflow boundary conditions
In case of laminar inflow, Dirichlet boundary conditions are used for all quantities, except for the SGS-TKE e and perturbation pressure π * , for which Neumann boundary conditions are used.Vertical profiles, as taken for the initialization of the simulation, are used for the Dirichlet boundary conditions.In order to allow for a fast turbulence development, random perturbations can be imposed on the velocity fields within a certain area behind the inflow boundary (inlet).These perturbations may persist for the entire simulation.For the purpose of preventing gravity waves from being reflected at the inlet, a relaxation area can be defined after Davies (1976).So far, it was found to be sufficient to implement this method for temperature only.This is hence realized by an additional term in the prognostic equation for θ (see Eq. 3): Here, θ inlet is the stationary inflow profile of θ , and C relax is a relaxation coefficient, depending on the distance d from the inlet, viz.
with D being the length of the relaxation region and F inlet being a damping factor.

Turbulence recycling
If non-cyclic horizontal boundary conditions are used, PALM offers the possibility of generating time-dependent turbulent inflow data by using a turbulence recycling method.
The method follows the one described by Lund et al. (1998), with the modifications introduced by Kataoka and Mizuno (2002).Figure 3 gives an overview of the recycling method used in PALM.The turbulent signal ϕ (y, z, t) is taken from a recycling plane that is located at a fixed distance x recycle from the inlet: where ϕ y (z, t) is the line average of a prognostic variable ϕ ∈ {u, v, w, θ, e} along y at x = x recycle .ϕ (y, z) is then added to the mean inflow profile ϕ inflow y (z) at x inlet after each time step: with the inflow damping function φ(z), which has a value of 1 below the initial boundary-layer height, and which is linearly damped to 0 above, in order to inhibit growth of the boundary-layer depth.ϕ inlet y (z) is constant in time and either calculated from the results of the precursor run or prescribed by the user.The distance x recycle has to be chosen to be much larger than the integral length scale of the respective turbulent flow.Otherwise, the same turbulent structures could be recycled repeatedly, so that the turbulence spectrum is illegally modified.It is thus recommended to use a precursor run for generating the initial turbulence field of the main run.The precursor run can have a comparatively small domain along the horizontal directions.In that case the domain of the main run is filled by cyclic repetition of the precursor run data.Note that the turbulence recycling has not been adapted for humidity and passive scalars so far.
Turbulence recycling is frequently used for simulations with urban topography.In such a case, topography elements should be placed sufficiently downstream of x recycle to prevent effects on the turbulence at the inlet.

Open outflow boundary conditions
At the outflow boundary (outlet), the velocity components u i meet radiation boundary conditions, viz.
as proposed by Orlanski (1976).Here ∂/∂n is the derivative normal to the outlet (i.e., ∂/∂x in Fig. 3) and U u i a transport velocity that includes wave propagation and advection.Rewriting Eq. (38) yields the transport velocity that is calculated at interior grid points next to the outlet at the preceding time step for each velocity component.If the transport velocity, calculated by means of Eq. ( 39), is outside the range 0 ≤ U u i ≤ / t, it is set to the respective threshold value that is exceeded.Because this local determination of U u i can show high variations in case of complex turbulent flows, it is averaged laterally to the direction of the outflow, so that it varies only in the vertical direction.Alternatively, the transport velocity can be set to the upper threshold value (U u i = / t) for the entire outlet.Equations ( 38) and (39) are discretized using an upstream method following Miller and Thorpe (1981).As the radiation boundary condition does not ensure conservation of mass, a mass flux correction can be applied at the outlet.

Topography
The Cartesian topography in PALM is generally based on the mask method (Briscolini and Santangelo, 1989) and allows for explicitly resolving solid obstacles such as buildings and orography.The implementation makes use of the following simplifications: 1. the obstacle shape is approximated by (an appropriate number of) full grid cells to fit the grid, i.e., a grid cell is either 100 % fluid or 100 % obstacle, 2. so far, only bottom surface-mounted obstacles are permitted (no holes or overhanging structures), and 3. the obstacles are fixed (not moving).
These simplifications transform the 3-D obstacle dimension into a 2.5-D topography.This reduced dimension format conforms to the digital elevation model (DEM) format.DEMs of city morphologies have become increasingly available worldwide due to advances in remote sensing technologies.Consequently, it is sufficient to provide 2-D topography height data to mask obstacles and their faces in PALM.The model domain is then separated into three subdomains (see Fig. 4): A. grid points in free fluid without adjacent walls, where the standard PALM code is executed, B. grid points next to walls that require extra code (e.g., wall functions), and C. grid points within obstacles that are excluded from calculations.
Additional topography code is only executed in grid volumes of subdomain B. The faces of the obstacles are always located where the respective wall-normal velocity components u, v, and w are defined (cf.Fig. 2) so that the impermeability boundary condition can be implemented by setting the respective wall-normal velocity component to zero.
An exception is made for the fifth-order advection scheme, where the numerical stencil at grid points adjacent to obstacles would require data within the obstacle.In order to avoid this behavior, the order of the advection scheme is successively degraded at respective grid volumes adjacent to obstacles, i.e., from the fifth order to third order at the second grid point above/beside an obstacle and from the third order to a second order at grid points directly adjacent to an obstacle.
Wall surfaces in PALM can be aligned horizontally (bottom surface or rooftop, i.e., always facing upwards) or vertically (facing the north, east, south or west direction).At horizontal surfaces, PALM allows us to either specify the surface values (θ , q v , s) or to prescribe their respective surface fluxes.The latter is the only option for vertically oriented surfaces.Simulations with topography require the application of MOST between each wall surface and the first computational grid point.For vertical walls, neutral stratification is assumed for MOST.The topography implementation has been validated by Letzel et al. (2008) and Kanda et al. (2013).Park and Baik (2013) have recently extended the vertical wall boundary conditions for non-neutral stratifications and validated their results against wind tunnel data.Up to now, however, these modifications have not been included in PALM 4.0.Figure 5 shows exemplarily the development of turbulence structures induced by a densely built-up artificial island off the coast of Macau, China (see also animation in Knoop et al., 2014).The approaching flow above the sea exhibits relatively weak turbulence due to the smooth water surface.Within the building areas, strong turbulence is generated by additional wind shear (due to the walls of isolated buildings) and due to a general increase in surface roughness.
The technical realization of the topography will be outlined in Sect.4.3.

Large-scale forcing
Processes occurring on larger scales (LS) than usually considered in LES and that affect the local LES scales have to be prescribed by additional source terms.These LS processes include pressure gradients via the geostrophic wind, subsidence and horizontal advection of scalars.In case of cyclic boundary conditions, this forcing is prescribed homogeneously in the horizontal directions and thus depends on height and time only.The relation between LS pressure (p LS ) gradient and geostrophic wind is given by and enters Eq. ( 1).LS vertical advection (subsidence or ascent) tendencies can be prescribed for the scalar prognostic variables ϕ ∈ {θ, q, s} by means of The so-called subsidence velocity w LS and the geostrophic wind components u g and v g can either be prescribed gradient-wise or they can be provided in an external file.Moreover, an external pressure gradient can be applied for simulations with Coriolis force switched off, which is usually required for simulations to be compared with wind tunnel experiments.
To account for less idealized flow situations, timedependent surface fluxes (or surface temperature and humidity) can be prescribed.Moreover, LS horizontal advective These tendencies are typically derived from larger-scale models or observations and should be spatially averaged over a large domain so that local-scale perturbations are avoided.
Newtonian relaxation (nudging) towards given large-scale profiles ϕ LS can be used for ϕ ∈ {u, v, θ, q, s} via τ LS is a relaxation timescale that, on the one hand, should be chosen large enough on the order of several hours to allow an undisturbed development of the small-scale turbulence in the LES model.On the other hand, it should be chosen small enough to account for synoptic disturbances (Neggers et al., 2012).In this way, the nudging can prevent considerable model drift in time.

Ocean option
PALM allows for studying the OML by using an ocean option where the sea surface is defined at the top of the model, so that negative values of z indicate the depth.Hereafter, we keep the terminology and use the word surface and index 0 for variables at the sea surface and top of the ocean model.For a list of ocean-specific parameters, see Table 3.
The ocean version differs from the atmospheric version by a few modifications, which are handled in the code by distinction of cases, so that both versions share the same basic code.In particular, seawater buoyancy and static stability depend not only on θ, but also on the salinity Sa.In order to account for the effect of salinity on density, a prognostic equation is added for Sa (in PSU, practical salinity unit): where Sa represents sources and sinks of salinity.Furthermore, θ v is replaced by potential density ρ θ in the buoyancy term of Eq. ( 1) in the stability-related term of the SGS-TKE equation (Eq.16) as well as in the calculation of the mixing length (Eq.15) ρ θ is calculated from the equation of state of seawater after each time step using the algorithm proposed by Jackett et al. (2006).The algorithm is based on polynomials depending on Sa, θ , and p (see Jackett et al., 2006, Table A2).At the moment, only the initial values of p enter this equation.
The ocean is driven by prescribed fluxes of momentum, heat and salinity at the top.The boundary conditions at the bottom of the model can be chosen as for atmospheric runs, including the possibility to use topography at the sea bottom.
Note that the current version of the ocean option does not account for the effect of surface waves (e.g., Langmuir circulation and wave-breaking).Parametrization schemes might, however, be provided within the user interface (see Sect. 4.5) and have been used, e.g., by Noh et al. (2004).The ocean option in its current state was recently used for simulations of the ocean mixed layer by Esau (2014), who investigated indirect air-sea interactions by means of the atmosphereocean coupling scheme that will be described in Sect.2.8.Note that most previous PALM studies of the OML used the atmospheric code, subsequent inversion of the z axis and appropriate normalization of the results, instead of using the relatively new ocean option (e.g., Noh et al., 2004Noh et al., , 2009)).

Coupled atmosphere-ocean simulations
A coupled mode for the atmospheric and oceanic versions of PALM has been developed in order to allow for studying the interaction between turbulent processes in the ABL and OML.The coupling is realized by the online exchange of information at the sea surface (boundary conditions) between two PALM runs (one atmosphere and one ocean).The atmospheric model uses a constant flux layer and transfers the kinematic surface fluxes of heat and moisture as well as the momentum fluxes to the oceanic model.Flux conservation between the ocean and the atmosphere requires an adjustment of the fluxes for the density of water ρ l,0 : Since evaporation leads to cooling of the surface water, the kinematic flux of heat in the ocean depends on both the atmospheric kinematic surface fluxes of heat and moisture and is calculated by Here, c p,l is the specific heat of water at constant pressure.Since salt does not evaporate, evaporation of water also leads to an increase in salinity in the ocean subsurface.This process is modeled after Steinhorn (1991) by a negative (downward) salinity flux at the sea surface: Sea surface values of potential temperature and the horizontal velocity components are transferred as surface boundary conditions to the atmosphere: The time steps for atmosphere and ocean are set individually and are not required to be equal.The coupling is then executed at a user-prescribed frequency.At the moment, the coupling requires equal extents of the horizontal model domains in both atmosphere and ocean.In order to account for the fact that eddies in the ocean are generally smaller but usually have lower velocities than in the atmosphere, it is beneficial to use different grid spacings in both models (i.e., finer grid spacing in the ocean model).In this case, the coupling is realized by a two-way bi-linear interpolation of the data fields at the sea surface.Furthermore, it is possible to perform uncoupled precursor runs for both atmosphere and ocean, followed by a coupled restart run.In this way it is possible to reduce the computational load due to different spin-up times in atmosphere and ocean.
As mentioned above, this coupling has been successfully applied for the first time in the recent study of Esau (2014).Furthermore, we would encourage the atmospheric and oceanic scientific community to consider the coupled atmosphere-ocean LES technique for further applications in the future.

Embedded models
PALM offers several optional embedded models that can be switched on for special purposes.In this section we will describe the embedded cloud microphysics model (Sect.3.1, Table 4), the LPM for use of Lagrangian particles as passive tracers (Sect.3.2, Table 5), the LCM that uses the LPM for the simulation of explicit cloud droplets and aerosols (Sect.3.3), and the canopy model (Sect.3.4, Table 6).Moreover, we will outline the one-dimensional (1-D) version of PALM in Sect.3.5, which is used for creating steady-state wind profiles to be used as initialization of the 3-D model.

Cloud microphysics
PALM offers an embedded bulk cloud microphysics representation that takes into account the liquid water specific humidity and warm (i.e., no ice) cloud-microphysical processes.Therefore, PALM solves the prognostic equations for the total water content instead of q v , and for a linear approximation of the liquid water potential temperature (e.g., Emanuel, 1994) instead of θ as described in Sect.2.1.Since q and θ l are conserved quantities for wet adiabatic processes, condensation/evaporation is not considered for these variables.
Liquid-phase microphysics are parametrized following the two-moment scheme of Seifert andBeheng (2001, 2006), which is based on the separation of the droplet spectrum into droplets with radii < 40 µm (cloud droplets) and droplets with radii ≥ 40 µm (rain droplets).The model predicts the first two moments of these partial droplet spectra, namely cloud and rain droplet number concentration (N c and N r , respectively) as well as cloud-and rainwater specific humidity (q c and q r , respectively).Consequently, q l is the sum of both q c and q r .The moments' corresponding microphysical tendencies are derived by assuming the partial droplet spectra to follow a gamma distribution that can be described by the predicted quantities and empirical relationships for the distribution's slope and shape parameters.For a detailed derivation of these terms, see Seifert andBeheng (2001, 2006).
We employ the computational efficient implementation of this scheme as used in the UCLA-LES (Savic-Jovcic and Stevens, 2008) and DALES (Heus et al., 2010) models.We thus solve only two additional prognostic equations for N r and q r : with the sink/source terms N r and q r , and the SGS fluxes N c and q c then are a fixed parameter and a diagnostic quantity, respectively.
In the next subsections we will describe the diagnostic determination of q c .From Sect.3.1.2on, the microphysical processes considered in the sink/source terms of θ l , q, N r and q r , are used in the formulations of Seifert and Beheng (2006) unless explicitly specified.Section 3.1.8gives an overview of the necessary changes for the turbulence closure (cf.Sect.2.2) using q and θ l instead of q v and θ , respectively.

Diffusional growth of cloud water
The diagnostic estimation of q c is based on the assumption that water supersaturations are immediately removed by the diffusional growth of cloud droplets only.This can be justified since the bulk surface area of cloud droplets exceeds that of rain drops considerably (Stevens and Seifert, 2008).Following this saturation adjustment approach, q c is obtained by where q s is the saturation specific humidity.Because q s is a function of T (not predicted), q s is computed from the liquid water temperature T l = θ l in a first step: using an empirical relationship for the saturation water vapor pressure p v, s (Bougeault, 1981): q s (T ) is subsequently calculated from a first-order Taylor series expansion of q s at T l (Sommeria and Deardorff, 1977): with (66)

Autoconversion
In the following Sects.3.1.2-3.1.4,we describe collision and coalescence processes by applying the stochastic collection equation (e.g., Pruppacher and Klett, 1997, Chap. 15.3) in the framework of the described two-moment scheme.As two species (cloud and rain droplets, hereafter also denoted as c and r, respectively) are considered only, there are three possible interactions affecting the rain quantities: autoconversion, accretion, and self-collection.Autoconversion summarizes all merging of cloud droplets resulting in rain drops (c + c → r).Accretion describes the growth of rain drops by the collection of cloud droplets (r + c → r).Self-collection denotes the merging of rain drops (r + r → r).The local temporal change of q r due to autoconversion is Assuming that all new rain drops have a radius of 40 µm corresponding to the separation mass m sep = 2.6 × 10 −10 kg, the local temporal change of N r is Here, K auto = 9.44 × 10 9 m 3 kg −2 s −1 is the autoconversion kernel, µ c = 1 is the shape parameter of the cloud droplet gamma distribution and m c = ρ q c /N c is the mean mass of cloud droplets.τ c = 1 − q c /(q c + q r ) is a dimensionless timescale steering the autoconversion similarity function (69) The increase in the autoconversion rate due to turbulence can be considered optionally by an increased autoconversion kernel depending on the local kinetic energy dissipation rate after Seifert et al. (2010).

Accretion
The increase in q r by accretion is given by ∂q r ∂t accr = K accr q c q r accr (τ c )(ρ 0 ρ) with the accretion kernel K accr = 4.33 m 3 kg −1 s −1 and the similarity function Turbulence effects on the accretion rate can be considered after using the kernel after Seifert et al. (2010).

Self-collection and breakup
Self-collection and breakup describe merging and splitting of rain drops, respectively, which affect the rainwater drop number concentration only.Their combined impact is parametrized as with the breakup function break = 0 for r r < 0.15 × 10 −3 m, depending on the volume-averaged rain drop radius the equilibrium radius r eq = 550 × 10 −6 m and the breakup kernel with the self-collection kernel K self = 7.12 m 3 kg −1 s −1 .

Evaporation of rainwater
The evaporation of rain drops in subsaturated air (relative water supersaturation S < 0) is parametrized following Seifert (2008): where with K v = 2.3 × 10 −5 m 2 s −1 being the molecular diffusivity water vapor in air and λ h = 2.43 × 10 −2 W m −1 K −1 being the heat conductivity of air.Here, N r λ µ r +1 r / (µ r + 1) denotes the intercept parameter of the rain drop gamma distribution with being the gamma function.Following Stevens and Seifert (2008), the slope parameter reads as with µ r being the shape parameter, given by In order to account for the increased evaporation of falling rain drops, the so-called ventilation effect, a ventilation factor f v is calculated optionally by a series expansion considering the rain drop size distribution (Seifert, 2008, Appendix).
The complete evaporation of rain drops (i.e., their evaporation to a size smaller than the separation radius of 40 µm) is parametrized as with γ = 0.7 (see also Heus et al., 2010).

Sedimentation of cloud water
As shown by Ackerman et al. (2009), the sedimentation of cloud water has to be taken into account for the simulation of stratocumulus clouds.They suggest the cloud water sedimentation flux to be calculated as based on a Stokes drag approximation of the terminal velocities of log-normal distributed cloud droplets.Here, k = 1.2 × 10 8 m −1 s −1 is a parameter and σ g = 1.3 the geometric SD of the cloud droplet size distribution (Geoffroy et al., 2010).The tendency of q results from the sedimentation flux divergences and reads as

Sedimentation of rainwater
The sedimentation of rainwater is implemented following Stevens and Seifert (2008).The sedimentation velocities are based on an empirical relation for the terminal fall velocity after Rogers et al. (1993).They are given by and The resulting sedimentation fluxes F N r and F q r are calculated using a semi-Lagrangian scheme and a slope limiter (see Stevens and Seifert, 2008, their Appendix A).The resulting tendencies read as and ∂q ∂t sed, r = ∂q r ∂t sed, r . (85)

Turbulence closure
Using bulk cloud microphysics, PALM predicts liquid water temperature θ l and total water content q instead of θ and q v .Consequently, some terms in Eq. ( 19) are unknown.We thus follow Cuijpers and Duynkerke (1993) and calculate the SGS buoyancy flux from the known SGS fluxes w θ l and w q .
In unsaturated air (q c = 0) Eq. ( 19) is then replaced by with and in saturated air (q c > 0) by

Recent applications
The two-moment cloud microphysics scheme has been used within the framework of the HD(CP) 21 project to produce LES simulation data for the evaluation and benchmarking of ICON-LES (Dipankar et al., 2015).Figure 6

Lagrangian particle model (LPM)
The embedded LPM allows for studying transport and dispersion processes within turbulent flows.In the following we will describe the general modeling of particles, including passive particles that do not show any feedback on the turbulent flow.In Sect.3.3 we will describe the use of Lagrangian particles as explicit cloud droplets.

Formulation of the LPM
Lagrangian particles can be released in prescribed source volumes at different points in time.The particles then obey where x p,i describes the particle location in the x i direction (i ∈ {1, 2, 3}) and u p,i is the respective velocity component Shown is the 3-D field of q c (white to gray colors) as well as rainwater (q r > 0, blue) on 26 April 2013.The simulation had a grid spacing of 50 m on a 50 × 50 km 2 domain.Large-scale advective tendencies for θ l and q were taken from COSMOE-DE (regional model of the German Meteorological Service, DWD) analyses.The copyright for the underlying satellite image is held by Cnes/Spot Image, Digitalglobe.
of the particle.Particle trajectories are calculated by means of the turbulent flow fields provided by PALM for each time step.The location of a certain particle at time t + t L is calculated by where x ps,i is the spatial coordinate of the particle source point and t L is the applied time step in the Lagrangian particle model.Note that the latter is not necessarily equal to the time step of the LES model.The integral in Eq. ( 92) is evaluated using either a Runge-Kutta (second-or third-order) or (first-order) Euler time-stepping scheme.
The velocity of a weightless particle that is transported passively by the flow is determined by and for non-passive particles (e.g., cloud droplets) by du p,i dt = considering Stoke's drag, gravity and buoyancy (on the righthand side, from left to right).Note that Eq. ( 94) is solved analytically assuming all variables but u p,i as constants for one time step.Here, u i (x p,i ) is the velocity of air at the particles' location gathered from the eight adjacent grid points of the LES by tri-linear interpolation (see Sect. 4.2).Since Stoke's drag is only valid for radii ≤ 30 µm (e.g., Rogers and Yau, 1989), a nonlinear correction is applied to the Stokes drag relaxation timescale: Here, r is the radius of the particle, ν = 1.461 × 10 −5 m 2 s the molecular viscosity of air, and ρ p,0 the density of the par-ticle.The particle Reynolds number is given by Following Lamb (1978) and the concept of LES modeling, the Lagrangian velocity of a weightless particle can be split into a resolved-scale contribution u res p and an SGS contribution u sgs p : u res p,i is determined by interpolation of the respective LES velocity components u i to the position of the particle.The SGS part of the particle velocity at time t is given by where du sgs p,i describes the temporal change of the SGS particle velocity during a time step of the LPM based on Thomson (1987).Note that the SGS part of u p,i in Eq. ( 92) is always computed using the (first-order) Euler time-stepping scheme.Weil et al. (2004) developed a formulation of the Langevin equation under the assumption of isotropic Gaussian turbulence in order to treat the SGS particle dispersion in terms of a stochastic differential equation.This equation reads as and is used in PALM for the determination of the change in SGS particle velocities.Here, Thomson, 1987).ζ is a vector composed of Gaussian-shaped random numbers, with each component neither spatially nor temporally correlated.The factor where e res is the resolved-scale TKE as resolved by the numerical grid, ensuring that the temporal change of the modeled SGS particle velocities is, on average (horizontal mean), smaller than the change in the resolved-scale particle velocities (Weil et al., 2004).Values of e and are provided by the SGS model (see Eqs. 16 and 18,respectively).The first term on the right-hand side of Eq. ( 99) represents the influence of the SGS particle velocity from the previous time step (i.e., inertial "memory").This effect is considered by the Lagrangian timescale after Weil et al. (2004): which describes the time span during which u sgs p (t − t L ) is correlated with u sgs p (t).The applied time step of the particle model hence must not be larger than τ L .In PALM, the particle time step is set to be smaller than τ L /40.The second term on the right-hand side of Eq. ( 99) ensures that the assumption of well-mixed conditions by Thomson (1987) is fulfilled on the subgrid scales.This term can be considered as drift correction, which shall prevent an over-proportional accumulation of particles in regions of weak turbulence (Rodean, 1996).The third term on the right-hand side of Eq. ( 99) is of a stochastic nature and describes the SGS diffusion of particles by a Gaussian random process.For a detailed derivation and discussion of Eq. ( 99), see Thomson (1987), Rodean (1996) and Weil et al. (2004).
The required values of the resolved-scale particle velocity components, e, and are obtained from the respective LES fields using the eight adjacent grid points of the LES and tri-linear interpolation on the current particle location (see Sect. 4.2).An exception is made in case of no-slip boundary conditions set for the resolved-scale horizontal wind components below the first vertical grid level above the surface.
Here, the resolved-scale particle velocities are determined from MOST (see Sect. 2.5) in order to capture the logarithmic wind profile within the height interval of z 0 to z MO .The available values of u * , w u 0 , and w v 0 are first bi-linearly interpolated to the horizontal location of the particle.In a second step the velocities are determined using Eqs.( 30)-( 31).Resolved-scale horizontal velocities of particles residing at height levels below z 0 are set to zero.The LPM allows us to switch off the transport by the SGS velocities.

Boundary conditions and release of particles
Different boundary conditions can be used for particles.They can be either reflected or absorbed at the surface and top of the model.The lateral boundary conditions for particles can either be set to absorption or cyclic conditions.
The user can explicitly prescribe the release location and events as well as the maximum lifetime of each particle.Moreover, the embedded LPM provides an option for defining different groups of particles.For each group the horizontal and vertical extensions of the particle source volumes as well as the spatial distance between the released particles can be prescribed individually for each source area.In this way it is possible to study the dispersion of particles from different source areas simultaneously.

Recent applications
The embedded LPM has been recently applied for the evaluation of footprint models over homogeneous and heterogeneous terrain (Steinfeld et al., 2008;Markkanen et al., 2009Markkanen et al., , 2010;;Sühring et al., 2014).For example, Steinfeld et al. (2008) calculated vertical profiles of crosswind-integrated particle concentrations for continuous point sources and found good agreement with the convective tank experiments of Willis and Deardorff (1976), as well as with LES results presented by Weil et al. (2004).Moreover, Steinfeld et al. (2008) calculated footprints for turbulence measurements and showed the benefit of the embedded LPM for footprint prediction compared to Lagrangian dispersion models with fully parametrized turbulence.Noh et al. (2006) used the LPM to study the sedimentation of inertial particles in the OML.Moreover, the LPM has been used for visualizing urban canopy flows as well as dust-devil-like vortices (Raasch and Franke, 2011).

Lagrangian cloud model (LCM)
The LCM is based on the formulation of the LPM (Sect.3.2).For the LCM, however, the Lagrangian particles represent droplets and aerosols.The droplet advection and sedimentation are given by Eqs. ( 94) and (95) with ρ p,0 = ρ l,0 .At present it is computationally not feasible to simulate a realistic number of particles.A single Lagrangian particle thus represents an ensemble of identical particles (i.e., same radius, velocity, mass of solute aerosol) and is referred to as a "super-droplet".The number of particles in this ensemble is referred to as the "weighting factor".For example, q l of a certain LES grid volume results from all Lagrangian particles located therein considering their individual weighting factor A n : with N p being the number of particles inside the grid volume of size V , and r n being the radius of the particle.The concept of weighting factors and super-droplets in combination with LES has also been used similarly by Andrejczuk et al. (2008) and Shima et al. (2009) for warm clouds, as well as by Sölch and Kärcher (2010) for ice clouds.

Diffusional growth
The growth of a particle by diffusion of water vapor, i.e., condensation and evaporation, is described by with the coefficients and depending primarily on the diffusion of water vapor in air and heat conductivity of air, respectively.f v is the ventilation factor, which accounts for the increased diffusion of water vapor, particularly the accelerated evaporation of large drops precipitating from a cloud (e.g., Pruppacher and Klett, 1997, Chap.13.2.3): Here, Re p is the particle Reynolds number.The relative water supersaturation S is computed from the LES values of θ and q v , tri-linearly interpolated to the particle's position.The equilibrium saturation term S eq considers the impact of surface tension as well as the physical and chemical properties of the solute aerosol on the equilibrium saturation of the droplet.In order to take into account these effects, the optional activation model for fully soluble aerosols must be switched on: with coefficients for surface tension and physical and chemical properties Here, ϑ is the temperature-dependent surface tension, and M l = 18.01528 g mol −1 the molecular mass of water.Depending on the simulation setup (e.g., continental or maritime conditions), the physical and chemical properties of the aerosol, its mass m s , molecular mass M s , and the van't Hoff factor F vH , indicating the degree of the solute aerosol's dissociation, are prescribed.As discussed by Hoffmann et al. (2015), the aerosol mass (or equivalently aerosol radius) can be specified by an additional particle feature allowing the initialization of aerosol mass distributions, i.e., varying aerosol masses among the simulated particle ensemble.
In summary, diffusional growth is the major coupling between the LES and LCM model.The change in water vapor during one time step is considered in the prognostic equations for potential temperature (see Eq. 3) and specific humidity (see Eq. 4) by Here, r n and r * n are the radius of the nth droplet before and after diffusional growth, respectively.Since the diffusional growth (see Eq. 103) is a stiff differential equation, we use a fourth-order Rosenbrock method (Press et al., 1996;Grabowski et al., 2011), adapting its internal time step for both a computationally efficient and numerically accurate solution.

Collision and coalescence
Collision and coalescence are computed using a statistical approach that allows the collision of all droplets that are currently located in the same LES grid volume.For this purpose, two quantities are predicted: the weighting factor, i.e., the number of droplets represented by a super-droplet, and the bulk mass of all droplets represented by a superdroplet, m n = A n (4/3) π ρ l r 3 n .For the collision of a superdroplet with a super-droplet smaller in radius, we assume that the larger droplets merge with a certain number of smaller droplets.Thereby, the weighting factor of the larger super-droplet is kept constant, while bulk mass and consequently radius increase (see Fig. 7a).On the other hand, the weighting factor and bulk mass of the smaller superdroplet decrease according to the number of droplets lost to the larger super-droplet, keeping the smaller super-droplet's radius constant.As described in Riechelmann et al. (2015), we allow the droplets represented by a single super-droplet to collide among each other.These internal collisions only decrease the weighting factor of the super-droplet but not the bulk mass.Consequently, internal collisions increase the super-droplet's radius (see Fig. 7b).The collision kernel K, which describes the collision probability of two droplets, can either be a purely gravitational one (Hall, 1980) or include turbulence effects (Ayala et al., 2008).
We arrange the droplets by radius such that r n r n+1 .The weighting factor after one collision/coalescence time step then reads as The asterisk denotes a quantity after one collision/coalescence time step.On the right-hand side, we consider the initial weighting factor (first term), the loss of droplets due to internal collisions (second term), and the loss of droplets due to collision with all larger droplets (third term).Note that collision with smaller droplets does not change the weighting factor of the larger droplet.
Since the mass of all droplets represented by a single super-droplet is not a useful quantity, we predict the volumeaveraged radius of all droplets represented by a super-droplet directly: On the right-hand side, the nominator (first pair of round brackets) contains the initial mass (first term), the gain of mass due to collisions with all smaller droplets (second term), and the loss of mass due to collisions with all larger droplets (third term).The denominator (second pair of round brackets) is identical to Eq. ( 110) divided by A n .

Recent applications
The LCM was validated against traditional bulk models for the BOMEX 2 experiment (see Siebesma et al., 2003)   (2015) investigated cloud droplet activation.Figure 8 shows the spatial distribution of simulated droplets and their respective radius within a simulated cumulus cloud.It appears that the largest drops (in terms of radius) are located at the top and the edges of the cloud, whereas smaller droplets tend to be located at the cloud base.

Canopy model
The embedded plant canopy model allows for studying the turbulent flow inside and above vegetation canopy.It is well known that vegetation canopy effects on the surfaceatmosphere exchange of momentum, energy and mass can be rather complex and can significantly modify the structure of the ABL, particularly in its lower part (e.g., Raupach et al., 1996;Dupont and Brunet, 2009).It is thus not possible to describe such processes by means of the roughness length and surface fluxes of sensible and latent heat.The canopy model in PALM accounts for the vertically extended drag, release of heat, plant evaporation and leaf-air interactions that are functions of height within the canopy layer.Dynamical effects of the plant canopy are based on the assumption that the canopy acts as a sink for momentum due to form (pressure) and viscous drag forces.This sink for momentum is modeled following Shaw and Schumann (1992) and Watanabe (2004) by adding the term C u i to Eq. (1): Here, C u i represents the net resolved-scale dynamical effect of the canopy, averaged over the respective grid volume.c d is the canopy drag coefficient with typical values around 0.2 (e.g., Cescatti and Marcolla, 2004), and LAD is the leaf area density (available leaf area per unit volume).As an example, LAD is rather constant with height within crop fields, whereas it is often very heterogeneous in forests, where most of the leaf area is concentrated in the trees' crown space (e.g., Yi, 2008).
The effect of the canopy on the SGS turbulence is considered by adding a similar sink term to the prognostic equation for SGS-TKE (see Eq. 16): This approach was suggested by Shaw and Schumann (1992) and is based on the assumption that SGS-TKE is dissipated by the canopy due to the rapid dissipation of wake turbulence in the lee of plant elements.This rapid breakdown of turbulence is also known as the spectral shortcut (e.g., Shaw and Patton, 2003).This type of canopy model has been successfully applied by various authors to study turbulent flows inside and above homogeneous as well as heterogeneous canopies such as forest edges (Cassiani et al., 2008;Finnigan et al., 2009;Dupont and Brunet, 2009, among others).
In case of incoming solar radiation the plant canopy acts as a source of heat.It is assumed that this warming of the foliage by solar radiation results in a warming of the surrounding air.This process is considered by adding a source term C θ to the prognostic equation for θ (see Eq. 3): In order to account for the fact that solar radiation can penetrate different layers of the canopy, based on the leaf area, an exponential decay function for the upward vertical kinematic heat flux Q θ after Brown and Covey (1966) is used.Q θ is derived at each height inside the canopy by means of the downward cumulative leaf area index (LAI): with where Q θ (z c ) is the prescribed heat flux at the top of the canopy layer z c and η is the extinction coefficient set to 0.6.Additionally, contributions by sinks/sources for q and s are considered in the canopy model by adding additional terms C ϕ to the scalar transport equations (see Eqs. 4-5): where ϕ ∈ {q, s} and c ϕ is a user-defined scalar exchange coefficient.ϕ c,0 and ϕ are the scalar concentrations at a leaf surface and in the surrounding air volume, respectively.This approach is based on the assumption that the scalar sink/source strength depends on the concentration gradient between the leaf surface and the surrounding air (e.g., Watanabe, 2004).

Recent applications
PALM simulations with the embedded canopy model were recently performed by Kanani et al. (2014c) to study the flow adjustment downstream of a transition from an unforested (clearing) to a forested surface patch.In this study the LES results were validated against multidimensional field and wind-tunnel data.In the high-resolution follow-up study of Kanani-Sühring and Raasch (2015), a detailed analysis of the turbulent scalar transport within the canopy layer was successfully performed for the first time by means of LES. Figure 9 shows exemplarily the flow at a forest edge, where an internal boundary layer developed above the forest due to the extended drag of the canopy.See also associated animations in Kanani et al. (2014a, b).

1-D model for precursor runs
The initial profiles of the horizontal wind components in PALM can be prescribed by the user by piecewise linear gradients or by directly using observational data.Alterna- where i ∈ {1, 2}, e, K h , K m and with MOST applied between the surface and the first vertical grid level, also L, u * as well as u i u 3 (where i ∈ {1, 2}).
The 1-D model assumes the profiles of θ and q v , as prescribed by the user, to be constant in time.The model solves the prognostic equations for u i and e: and The dissipation rate is parametrized by after Detering and Etling (1985).The mixing length is calculated after Blackadar (1997) as The turbulent fluxes are calculated using a first-order closure: where K m and K h are calculated as with the similarity functions h and m (see Eqs. 33 and 27, respectively), using the gradient Richardson number: Note that the distinction of cases in Eq. ( 126) is done with the value of Ri from the previous time step.
Moreover, a Rayleigh damping can be switched on to speed up the damping of inertial oscillations.The 1-D model is discretized in space using finite differences.Discretization in time is achieved using the third-order Runge-Kutta time-stepping scheme (Williamson, 1980).Dirichlet boundary conditions are used at the top and bottom boundaries of the model, except for e, for which Neumann conditions are set at the surface (see also Sect.2.5).

Technical realization
The model has been developed to run on Unix platforms.The PALM code is written according to the Fortran standard and split into several source code files.In the following Sect.4.1 we will give a condensed overview of the general code structure and program flow.The embedded LPM requires a special data structure, which has been recently changed, in order to handle billions of particles.We will thus devote Sect.4.2 to this new particle structure.
The PALM code is optimized for use on massively parallel architectures using the Message Passing Interface (MPI, e.g., Gropp et al., 1999) and Open Multiprocessing (OpenMP)3 (see Sect. 4.4).
The model steering is achieved by Fortran NAMELIST parameter lists that have to be provided by the user.The model operation will be described in detail in Sect.4.6.The code also offers an interface that can be used to add user code extensions and that will be described in detail in Sect.4.5.Data handling in PALM (see Sect. 4.7) is mainly based on the Network Common Data Form (netCDF)4 .Restart data are written in Fortran binary format.Finally, Sect.6 deals with the code management.

General code structure
The PALM source code layout follows similar coding standards that have been developed for other community models like the NEMO ocean dynamics model5 .Special emphasis is placed on providing extensive comment sections within the code in order to illustrate the functionality of specific model parts.
The source code is subdivided into a series of Fortran files.Most of them contain single subroutines only.These are called from the main PALM routine (palm.f90)and wherever needed.Each file features a header, containing a description and its history of modifications.The data handling between the subroutines is usually realized via Fortran modules defined in a separate file (modules.f90)instead of using parameter lists.The code contains several machine dependent segments, e.g., calls of routines from external libraries such as MPI, netCDF and FFTW6 , and which may not be available on some machines.These segments are activated/deactivated using C-preprocessor directives, which allow us to compile alternative parts of the code.
Three-dimensional arrays of prognostic variables (u = u, v = v, w = w, θ = pt, q v = q, s = s, e = e and Sa = sa) are stored at the last two time levels of the Runge-Kutta substeps.These arrays are declared as (e.g., the u-wind component) u(k, j, i) on the respective subdomain of each REAL, DIMENSION(:,:,:), POINTER :: u, v, which speeds up the swapping of time levels after each time step, as it is not required to move the data in the memory.
A condensed overview of the program flow of PALM is shown in Fig. 10.At the beginning of the model run (hereafter referred to as "job"), the model setup is read from a Fortran NAMELIST file that is provided by the user, and optionally additional files for large-scale forcing and topography.PALM allows for conducting so-called restart jobs and job chains, where long-lasting model runs can be split into smaller ones.This does not only meet the requirements of most supercomputing systems, it also provides the user the opportunity to modify the setup between runs, or, e.g., performing a set of parameter studies based on the same precursor run.For job chains, the current state of the model is saved as binary data at the end of the run and read as input for the subsequent restart run.After model initialization, possibly using a 1-D model precursor run (see Sect. 3.5), the time integration loop is executed until a termination is initiated.The latter might be caused by either the fact, that the desired simulation time has been reached, or by the need to initiate a restart of the job chain.The latter can be the case when the current job is running out of CPU time, or when the user has manually forced a restart.PALM can be used on cacheoptimized as well as on vector processors.Moreover, General Purpose Computing on Graphics Processing Units (GPGPU) can be used.Each machine architecture requires specially optimized code to be executed within computationally expensive loops of the prognostic equations.This is realized by a Fortran INTERFACE so that different code branches are executed in the prognostic_equations.f90 subroutine.
In most cases, the large computational grid with a very large number of grid points does not allow for processing the raw model data in a post-processing step, because then the input/output (I/O) time and the required hard disc space would easily exceed the available resources.Therefore, PALM calculates many standard quantities (e.g., variances, turbulent fluxes, and even higher-order moments) online during the run.Also, temporal averages of vertical profiles, cross sections, and 3-D data can be created this way.The user interface allows the user to easily extend this output (see Sect. 4.5).
After each time step we check whether data output (see also Sect.4.7) is required, depending on the user settings.

Particle code structure
This section will give a brief summary of the particle code structure and the changes carried out for PALM 4.0.These changes aim at reaching a significantly improved performance of the LPM in comparison to the previous versions described by Steinfeld et al. (2008) and Riechelmann et al. (2012).

END TYPE particle_type
Here, x, y, z, radius and age are some components of the derived data type of the intrinsic data type REAL.Several other components of all intrinsic data types (or even other derived data types) can be defined (e.g., location, velocity).In general, the particles are stored in an allocatable array of the derived data type

TYPE(particle_type), DIMENSION(:), & ALLOCATABLE :: particles
An element of particles defines a complete particle with its entire features, which can be accessed by the selector %, e.g., the radius and age of the particles by particles(n)%radius and particles(n)%age, respectively, where n is the index of a certain particle.In the old PALM version, all particles of the respective subdomain were stored in such a 1-D array.Since many quantities derived from the LPM depend solely on the particles located in a certain grid volume, e.g., the collision and coalescence process of the LCM (see Sect. 3.3.2),the order in which these particles are stored in memory determines heavily the CPU time for the LPM.In general, N 2 operations, where N is the number of all simulated particles, are needed to identify the particles located in the vicinity of another particle (see Riechelmann et al., 2012).In the previous versions of the LPM, this number of operations was reduced to N by sorting the particles according to the grid volumes in which they are located.However, due to the large number of O(10 6 ) particles stored, sorting was inefficient and also demanded a temporary array of the same size during sorting.
Therefore, from PALM 4.0 on, all particles are stored in a new array structure based on another derived data type named particle_grid_type, which contains, as a component, a 1-D array of the derived data type particle_type: and allocated using the same dimensions as used for a scalar of the LES model.In this way, all particles currently located in a certain LES grid volume are permanently stored in the particle array, assigned to this grid volume: Here, n_par is the number of particles located in the grid volume defined by the indices k, j, and i.The small www.geosci-model-dev.net/8/2515/2015/Geosci.Model Dev., 8, 2515-2551, 2015 size of this particle array at each grid volume (typically containing O(10 2 ) particles) allows the de-allocation and allocation of the particle array during the simulation, adapting its size to the number of required particles.This was (practically) not possible in the previous version of the LPM due to the large size of the full particle array (O(10 6 ) particles), which required a temporary array of the same size during reallocation.A temporary array is still required in the present version, but its size could be reduced by 4 orders of magnitude.However, as a particle moves from one grid volume to another, its data have to be copied from the 1-D array of the previous grid volume to the 1-D array of the new volume, and finally deleted from previous one, which consumes CPU time itself.Overall, the new particle structure reduces the CPU time of the LPM by 9 %, since sorting of particles is not required anymore.Moreover, large temporary arrays are no longer required, which increases the available memory by almost a factor of 2 (which doubles the hypothetical number of allocatable particles for future studies).
From PALM 4.0 on, the LPM features an optimized version of the tri-linear interpolation of LES data fields on the location of the particle.In general, the particles located in a certain grid volume are stored in an arbitrary order.Because of the staggered grid, indices of the eight surrounding grid points required for interpolation may differ from particle to particle (e.g., a particle in the lower left corner of a scalar grid box requires other data for interpolation than a particle in the upper right corner).This would require to re-calculate the respective index values for every new particle.By dividing every grid volume in eight subgrid boxes, two in every spatial direction, the same set of LES data can be used for all particles located in the same subgrid box (see example in Fig. 11).Therefore, the particles belonging to the same subgrid box are stored contiguously in memory reducing the CPU time substantially for the different subroutines depending on the interpolation of LES fields substantially (e.g., advection by 64 %, condensational growth by 50 %, whole LPM by 22 %), whereas the time needed for additional sorting increases the CPU time by only 3 %.
In summary, these optimizations reduce the CPU time of the LPM by 40 % and almost halve its memory demand.For simulations with hundreds of millions of particles, the LPM consumes more than 95 % of the overall CPU time of PALM and the memory demand of the particles is the limiting factor for these simulations (see high-end applications, e.g., Riechelmann et al., 2012;Lee et al., 2014;Sühring et al., 2015).The present version of the LPM now allows for larger numbers of particles.

Topography implementation
The topography implementation described in Sect.2.5.4 allows the use of 2-D topography height data in PALM.Currently, the topography data have to be provided within a rastered ASCII file.After reading and mapping of these Figure 11.Two-dimensional example of the optimized interpolation algorithm.Interpolating a scalar quantity (e.g., temperature) bi-linearly on a particle (blue dot) located in a certain LES grid box (thick black line) includes four values of LES data (red squares).Note that these values are the same for all particles located in the yellow subgrid box.Thus, by sorting all particles inside a grid box by their respective subgrid box, the indices required for interpolation need to be determined just once for all particles located in that subgrid box, and not repeatedly for all particles inside the entire grid box.This algorithm applies analogously for the velocity components located at the edges of the grid box.data to the horizontal grid in PALM, they can be directly incorporated into the standard loop structure of the Fortran code as a lower vertical index for all integration loops.Therefore, PALM employs two 2-D height index arrays (e.g., nzb_w_inner(j, i) and nzb_w_outer(j, i) for the velocity component w) to separate the domain into four regions based on the vertical index k (see Fig. The additional topography code is executed in regions B and C only.As the velocity components are defined on a different (staggered) grid than the scalar quantities (see Fig. 2), three extra pairs of 2-D height index arrays are defined: two for the horizontal velocities and one for scalar quantities (e.g., nzb_s_inner and nzb_s_outer for scalar quantities).

Parallelization and optimization
The parallelization of the code is achieved by a 2-D domain decomposition method along the x and y directions with equally sized subdomains.The method has not been changed in general since the formulation given by Raasch and Schröter (2001).In the following we will show that this method still allows for sufficient scalability on up to 50 000 processor cores (also referred to as processor elements, PEs).Ghost layers are added at the side boundaries of the subdomains in order to account for the local data dependencies, which are caused by the need to compute finite differences at these positions.The number of ghost layers that are used PALM depend on the order of the advection scheme, with three layers for the fifth-order Wicker-Skamarock scheme and one layer for the secondorder Piacsek-Williams scheme.Ghost layer data are exchanged after every time step.An anti-cyclic index order (i.e., (k, j, i)) is chosen for the 3-D arrays in order to speed up the data exchange.The anti-cyclic order guarantees that the ghost layer data are stored as long consecutive blocks in the memory, which allows us to access them in the fastest way.
The solution of the Poisson equation is complicated by the 2-D decomposition, because non-local data dependencies appear in all three directions, if the equation is solved with the FFT method (see Sect. 2.4).Due to the domain decomposition, the processor elements cannot perform standard FFTs along the x or y directions as their memory contains only a part of the full data.The general method to overcome the problem is to re-order the 3-D pressure/divergence data among the PEs using a transposition technique described in Raasch and Schröter (2001).The transposition is done using the MPI routine MPI_ALLTOALL and requires an additional re-sorting of the data in the local memory of the PEs before and after MPI_ALLTOALL is called.A similar method with MPI_ALLTOALL, replaced by MPI_SENDRECV, has been recently presented by Sullivan and Patton (2011).
Only local data dependencies appear if the Poisson equation is solved with the multigrid scheme.However, this method requires frequent exchange of ghost layers during every iteration step of the SOR solver, as well as for the restriction and prolongation step.The number of ghost layer data rapidly decreases for the coarser grid levels, so that the MPI transfer time may become latency bound.The domain decomposition affects the coarsening of grid levels at the point when the subdomain array of the current grid level contains only a single grid point along one of the spatial directions.In case that the number of grid points of the total (coarsened) domain allows further coarsening, array data from all subdomains are gathered and further processed on the main PE (hereafter PE 0), and results are redistributed to the other PEs in the respective prolongation step.However, this method is very inefficient and not used by default.Instead, coarsening is just stopped at that level, where subdomains contain only two grid points along at least one of the three spatial directions.The precision of the multigrid method depends on the iteration count.Using two W-cycles and two SOR iterations for each grid level typically reduces the velocity divergence by about 4 orders of magnitude, which turned out to be sufficient in most of our applications.With these settings, and for a numerical grid of about 2000 3 grid points, the multigrid method requires about the same time as the FFT Pois- The scaling behavior of PALM 4.0 is presented in Fig. 12a for a test case with 2160 3 grid points and the FFT Poisson solver.Tests have been performed on the Cray XC40 of the North-German Computing Alliance (HLRN).The machine has 1128 compute nodes, each equipped with two 12core Intel-Haswell CPUs, plus 744 compute nodes equipped with two 12-core Intel-Ivy Bridge CPUs, and an Aries interconnect.Additionally, runs with 4320 3 grid points have been carried out with up to 43 200 cores, starting with a minimum of 11 520 cores (see Fig. 12b).Runs with fewer cores could not be carried out as the data would not have fit into the memory.
Ideally, for a so-called strong scaling test, where the same setup is run on different numbers of cores, the wall-clock time of a run should decrease by a factor of 2 if the core number is doubled, which is shown in Fig. 12a and b (black lines).Figure 12b shows that the code scales very well up to 20 000 cores and is still acceptable for larger numbers (gray line).The decreasing scalability for larger core numbers is mainly caused by a performance drop of the MPI_ALLTOALL routine (brown line).In contrast, the pure computational part, i.e., the calculation of the prognostic equations (red line), scales up perfectly to the maximum number of cores.
While the general parallelization methods used in version 4.0 do not differ from the first version, a large number of code optimizations have been carried out since then.Only the most important ones shall be briefly discussed at this point, namely the scalar optimization for different processor types; and overlapping of computation and inter-processor communication.
The original PALM code calculated the different contributions to the tendency terms (i.e., advection, buoyancy, diffusion, etc.) and the final prognostic equation for each prognostic quantity in separate 3-D loops over the three spatial directions.In case of large 3-D arrays that do not fit into the cache of cache-based processors like Intel-Xeon or AMD-Athlon, the array data have to be reloaded from the main memory for each 3-D loop, which is extremely time consuming.For this reason, the outer loops over i and j have been extracted from each 3-D loop, now forming a 2-D loop over all tendencies and prognostic equations, e.g., DO i = nxl, nxr DO j = nys, nyn DO k = nzb+1, nzt !--advection term tend(k,j,i) =... ENDDO DO k = nzb+1, nzt !--diffusion term tend(k,j,i) = tend(k,j,i) +...

ENDDO ... ! further tendencies ENDDO ENDDO
In this way, array data used in the first loop can be reused from the cache by the following loops, since the size of 1-D data columns along k is usually small enough to fit completely into the cache.Figure 12a shows that this loop structure gives a performance gain for the computation of the prognostic equations of 40 % compared with the 3-D loop structure.The overall performance of the code improves by about 15 %.Nonetheless, both methods are implemented in the code in separate branches, since the 3-D loops give a much better performance on vector-based hardware like NEC-SX or accelerator boards (e.g., Nvidia K20), since they allow the compilers to generate much longer vectors than for the single loops along the z direction.
From Fig. 12b it is evident that for large-size setups with a huge number of grid points and more than a few thousand PEs, the solution of the Poisson equation dominates the time consumption of the simulation.This is because the FFT and the data transpositions with MPI_ALLTOALL scale less well than the other parts of the code.The FFT time increases nonlinearly with N log(N ), where N is the total number of grid points along x or y.The MPI_ALLTOALL time also increases nonlinearly with the number of cores.While the scaling problem is not easy to address, the Poisson solver can be sped up by overlapping the computation (FFT and tri-diagonal equation solver) and the communication among the PEs (MPI_ALLTOALL).PALM solves the Poisson equation in different steps in a sequential order.So far, first, the complete 3-D data subdomain array is transposed, followed by the FFT along x, followed by the next transposition, followed by the FFT along y, etc.The FFT calculation cannot start unless the complete 3-D array is transposed.Now, in PALM 4.0, the 3-D arrays are processed in 2-D slices (e.g., in z-x slices for the transposition from z to x).The slices are processed in a loop along the remaining direction (which is y in this case) with alternating transfer (MPI_ALLTOALL of a 2-D slice) and computation (FFT of this slice).This allows some overlapping of the computation and communication parts because of the following reason: on the Cray XC30 system mentioned above, for example, every compute node is populated with two processor dies, containing an Intel CPU with 12 cores each.This allows 24 MPI processes on the node.However, every node is equipped with two MPI channels only.If a data transfer is issued on all MPI processes simultaneously, the transfer cannot be done totally in parallel because individual MPI processes have to wait for the free channel.This behavior allows computation and transfer in parallel.For example, if PE 0 is the first to get a free MPI channel, it can start computation as soon as the transfer has been finished.All other PEs consecutively start computation after transfer.When the last transfer is finished, PE 0 has already finished computation and can immediately start the next transfer.The fact that not all PEs have simultaneous access to an MPI channel allows for parallel transfer and computation without any extra programming effort, such as asynchronous MPI_ALLTOALL or doing the transfer using hyperthreads.
Breaking up the workload as described above also improves performance due to better cache utilization, because the transposed data are still in the cache when needed by the FFT.The differences in performance between the sequential and the overlapping method are displayed in Fig. 12a.The FFT Poisson solver is sped up about 15 % for up to 2500 cores in case that overlapping is used.For higher core numbers, the current realization of overlapping becomes inefficient because the data chunks handled by MPI_ALLTOALL get too small.The latency thus dominates the transfer time.In order to overcome this problem, several 2-D slices can be transposed at the same time.We will implement this technique in one of the next PALM revisions in the near future.
Besides the parallelization by domain decomposition, PALM is also fully parallelized on the loop level using the shared-memory OpenMP programming model and can be run in so-called hybrid mode, e.g., with two MPI processes and 12 OpenMP threads per MPI process started on each node.
A typical PALM setup uses 2-D domain decomposition with one MPI process on every processor core.The hybrid mode normally does not give advantages, because the OpenMP parallelization creates another synchronization level so that the total computation time will not decrease (typically it even increases by a few percent).Anyhow, for the following special cases hybrid parallelization may have some advantages: -with many processor cores per CPU, a 1-D domain decomposition plus OpenMP parallelization may show better performance because the number of transpositions is reduced from 6 to 2, -since the hybrid mode enlarges the subdomain sizes (because of fewer MPI processes), load imbalance problems caused by, e.g., inhomogeneously distributed buildings or clouds may be reduced, because larger subdomains provide a better chance to get about the same number of buildings/clouds per subdomain, and -for the multigrid Poisson solver, the hybrid mode allows us to generate more grid levels on the subdomains because of their larger size, which may help to improve the solver convergence.
Since the speedup behavior depends on many factors like the problem size, the virtual 2-D processor grid, the network, etc., the actual speedup is difficult to predict and should be tested individually for every setup.
The data exchange in the case of coupled oceanatmosphere runs is realized with MPI.There are two methods available.The first one, based on MPI-1, splits the default communicator (MPI_COMM_WORLD) into two parts, with the respective number of PEs assigned to the atmosphere and the ocean part as given by external parameters.The second method starts the respective number of MPI tasks for the atmosphere and the ocean independently (e.g., by two calls of mpiexec) and uses MPI-2 routines MPI_COMM_CONNECT and MPI_COMM_ACCEPT to couple them.

User interface
PALM offers a flexible interface that allows for adding userspecific calculations and code extensions.Also, the data output of user-defined quantities, such as 2-D/3-D data as well as time series, vertical profiles and spectra can be accomplished in a convenient manner.The implementation of such user-defined code is realized in the form of subroutine calls, which are made at several places in the model code.These subroutines have predefined names.Some of the entry points for the subroutine calls are shown in Fig. 10.Their basic versions are a part of the default model code and labeled as user_ *** .f90.These basic versions perform no actions and thus act as pure templates.For example, the subroutine user_init.f90reads By default, quantities in the time series and horizontally averaged vertical profile data output always refer to the total model domain (see also Sect. 4.7).The user interface, however, allows for defining up to nine user-defined (horizontal) subdomains for which the output of time series and profiles is automatically added to the output data.Besides the output of profiles and time series for user-defined horizontal domains, PALM offers a very flexible masked data output, controlled by a set of NAMELIST parameters.This feature allows us to output quantities at different mask locations, e.g., 3-D volume data or 2-D cross sections of arbitrary extension within the model domain, or 0-D or 1-D data at any position and of any amount.

Model operation
The compilation and execution of PALM is controlled via a Unix shell scripts named mbuild and mrun, using bash/ksh syntax.mbuild compiles the default code using the Unix make mechanism.Compiler options, including Cpreprocessor directives and required library paths (e.g., for netCDF or FFT), are given in a configuration file (default name .mrun.config).The configuration file allows for setting different compilers and options in separate blocks.The compiled source code (object files) is stored in a socalled depository folder (one folder for each option block).mrun takes care of the compilation (main program and user interface files only) and job submission/execution of PALM, including the handling of I/O files.The mrun command has a number of options to control the program execution.The execution is also controlled by the configuration file, which provides machine-and user-specific settings such as compiler options and library paths (see above), and I/O file locations.Basically, mrun performs the following tasks in sequential order: 1. create a unique temporary working directory for the job, 2. copy input files and user-defined code required for the job to the temporary directory, 3. copy pre-compiled PALM routines to the temporary directory, 4. compile the main program using the precompiled object files and the user code, 5. execute the program, 6. copy the model output files from the temporary directory to a directory specified by the user, 7. delete the temporary working directory.
Since each job runs in a unique temporary directory (see task 1), several jobs can run at the same time without interfering with each other.The I/O files are handled (tasks 3 and 6) via so-called file connection statements, which allow us to manage these files in a flexible way and to keep them in a well-organized folder structure.A typical file connection statement for an input file reads where the first column gives the local filename in the temporary working directory that must correspond to the filename in the OPEN statement in the PALM source code.The second column provides a file attribute (where in means that it is an input file), and the third column is the activating string that defines whether this file connection statement is carried out in the respective job.The fourth column gives the folder name where the permanent (input) file is provided by the user.Finally, the sixth column gives the suffix of the permanent file.
The full name of the permanent file results from the folder name, the suffix, and the value of the mrun option -d, which defines the so-called basename of all files handled by mrun; e.g., the mrun call mrun -d example_cbl -r "d3#"... defines the filename to be ~/palm/current_version/JOBS/INPUT/ example_cbl_p3d and which will be copied to PARIN in the temporary working directory (task 2) due to the setting of the activation string with the option -r.Besides, it is possible to organize jobs using the string $fname in the folder name column of the connection statement: Here, the value of $fname is given by the -d option during the mrun call (here example_cbl) and all job files can be stored accordingly.
The mrun script never replaces or overwrites existing files.New so-called cycle numbers are created instead.For example, the file example_cbl_d3d (3-D data) has been created within a first model run.Then a second call of mrun and subsequent model execution will create a new file, named example_cbl_d3d.1,etc.
While some I/O operates on single files only (e.g., output of CPU measurements), other data (e.g., restart data) I/O is done by each core separately.In such cases, filenames provided by the file connection statements are interpreted as directory names.Each core then opens a file, named _###### in the respective directory, where the hashes stand for a sixdigit integer, declaring the rank of the MPI process in the MPI communicator in PALM.
Each simulation setup can be complemented by a separate set of user interface routines that replace the template files in the default code at compile time (see task 2).In this way, PALM executables will be dynamically created for each setup, based on the same default code, but with unique user code extensions.This also has the advantage, that it is generally possible to update the PALM version without the need of adapting own user-defined code.User interface routines for different setups can be stored in different folders, which are accessed by mrun using the basename mechanism as for I/O file described above.
At the beginning of task 4, various checks are performed on the parameter files and the provided user interface.Therewith, illegal model settings are trapped and reported to the user with a unique error message identifier.Moreover, several runtime and netCDF errors are captured by PALM in this way.A comprehensive online database provides additional information on each error message identifier (see Sect. 6).
Furthermore, mrun can be used to generate batch jobs on local and remote hosts, and it also controls the automatic generation of restart jobs/job chains.For convenience, an optional graphical user interface has been developed as a wrapper for mrun, called mrunGUI, providing an intuitive access to the mrun script (see Fig. 13).

Data handling
Due to the enormous number of data that come along with computationally expensive LES, the data handling plays a key role for the performance of LES models and for data analysis during post-processing.PALM is optimized to pursue the strategy of performing data operations to a great extent online during the simulation instead of to postpone these operations to the post-processing.In this way, the data output (e.g., of huge 4-D data, or temporal averages) can be significantly reduced.In order to allow the users to perform their own calculations during runtime, the user interface offers a wide range of possibilities, e.g., for defining user-defined output quantities (see Sect. 4.5).
PALM allows data output for different quantities as time series (horizontally averaged), vertical profiles, 2-D cross sections, 3-D volume data, and masked data (see Sect. 4.5).All data output files are in netCDF format, which can be processed by different public domain and commercial software.NetCDF data can also be easily read from Fortran programs, provided that a netCDF library is available.The netCDF libraries currently support three different binary formats for netCDF files: classic, 64-bit offset, and netCDF-4.The latter was introduced in netCDF version 4.0 and is based on the HDF57 data format.PALM is able to handle all three netCDF formats and also supports parallel I/O for netCDF-4.
For visualization of the netCDF data generated by PALM, several NCAR Command Language (NCL)8 scripts are available that allow a quick overview of the simulation data.For advanced visualizations, we have developed a tool that converts PALM data into the vdf data format of the VAPOR9 open-source software (Clyne et al., 2007).Animations using PALM data and VAPOR have been recently published by Maronga et al. (2013a), Knoop et al. (2014), andKanani et al. (2014a, b).

Past and current research fields
PALM has been applied for numerous boundary-layer research studies over the years.For example, coherent structures in the convective ABL have been simulated by Raasch and Franke (2011) (dust devil-like vortices; see also visualization, Maronga et al., 2013a) and under neutral conditions at a forest edge by Kanani et al. (2014c) and Kanani-Sühring and Raasch (2015) (using the canopy model).Classic vertical profiles of temperature, humidity, fluxes, structure parameters, and variances, as well as horizontal cross sections and turbulence spectra for the convective ABL, were shown, e.g., by Maronga and Raasch (2013) and Maronga et al. (2013b).Moreover, Hellsten and Zilitinkevich (2013) used PALM to investigate the role of convective structures and background turbulence in the ABL.The model has also been applied for the stable boundary layer in the scope of an LES intercomparison (Beare et al., 2006).
The possibility of using Cartesian topography as a surface boundary condition (see Sect. 2.5.4) has facilitated the simulation of the urban boundary layer and study of the flow around buildings (first validation against wind tunnel data from Martinuzzi and Tropea (1993) in Letzel et al. (2008)).The research fields ranged from development of better urban parametrization schemes to the investigation of the ventilation at pedestrian level in densely built-up cities.The flow around street canyons and idealized buildings has been the subject of several studies (e.g., Inagaki et al., 2011;Abd Razak et al., 2013;Park et al., 2012;Park and Baik, 2013;Yaghoobian et al., 2014).With increasing computational resources, it also has become possible to use PALM to simulate entire city quarters (Letzel et al., 2012;Kanda et al., 2013).Lately, PALM has also been used to simulate the entire city of Macau with a model domain of about 6 km × 5 km and a fine spacing of only 1 m (see Keck et al., 2012;Knoop et al., 2014).In addition to applications for the urban boundary layer, the topography option was used to study atmospheric Kármán vortex streets (Heinze et al., 2012).
Investigations on cloud-topped boundary layers have been another core area of the research with PALM.Cloudy boundary layers have been simulated using bulk cloud microphysics for cold-air outbreaks by Gryschka et al. (2008) as well as for the analysis of second-order budgets in cloudtopped boundary layers for BOMEX and DYCOMS-II 10 (see Stevens et al., 2005) experiments by Heinze et al. (2015).Recently, the embedded LCM has been employed for studying the effect of turbulence on the droplet dynamics and growth (Lee et al., 2014;Riechelmann et al., 2015), and for investigating the entrainment (of aerosols) at the edges of cumulus clouds (Hoffmann et al., 2015(Hoffmann et al., , 2014)).

Current and future developments
The most serious model modification in the near future will be related to the change from the currently used incom- 10 The Second Dynamics and Chemistry of Marine Stratocumulus field study pressible conservation equations (see Sect. 2.1) to an anelastic approximation.In the anelastic equations the density of the fluid can vary with height, so that it will be possible to study both shallow and deep convection and therefore will also allow a better representation of larger-scale processes.This change will be accompanied by adding the ice phase for clouds.In the long term, we plan to add the option of a fully compressible code.On the one hand, this will render the pressure solver unnecessary and thus help to overcome the transposition bottleneck for core numbers > 100 000 (see, e.g., Fig. 12a).On the other hand, the high propagation velocity of sound waves requires the implementation of a time-splitting algorithm with a considerably smaller time step than in the incompressible system (see, e.g., Wicker and Skamarock, 2002).
The research group of Matthias Mauder at the Karlsruhe Institute of Technology is currently working on a vertical grid nesting of PALM similar to Sullivan et al. (1986).The selfnesting will allow using very high grid resolutions within the atmospheric surface layer, and relatively low resolutions above, which would reduce the computational load for investigations of the surface layer by up to 90 %.After sufficient validity checks, the nesting technique will be incorporated into the default code.Also, a full 3-D two-way self-nesting of PALM is currently being developed by Antti Hellsten at the Finnish Meteorological Institute.
Furthermore, wind turbine parametrizations developed by the research group of Detlev Heinemann at the University of Oldenburg will be included in the default code in the near future.
The rapid radiative transfer model (for global models) (RRTMG, Clough et al., 2005) has been recently coupled to PALM to allow a better representation of radiative effects in clouds and during nighttime.In order to allow feedback between radiative effects and the surface/soil, a land surface model (LSM) has already been implemented.The LSM is a modified version of the Tiled European Centre for Medium-Range Weather Forecasts Scheme for Surface Exchanges over Land (TESSEL, van den Hurk et al., 2000;Balsamo et al., 2009) and the derivative implemented in DALES.It consists of a solver for the energy balance and a four-layer soil scheme, taking into account soil properties and vegetation.Both the RRTMG implementation and the LSM will be part of the next PALM release.
In order to allow for a sufficient representation of SGS turbulence when using relatively coarse meshes, we intend to implement the dynamic Smagorinsky turbulence closure model after Esau (2004).
Finally, we plan to add an option to use viscous topography for urban LES and complex terrain after Mason and Sykes (1978), where topography in PALM will be represented by grid volumes with infinite viscosity instead of using solid elements.Unlike the present implementation, where grid volumes can either be set to topography or fluid, sloping surfaces will be better represented by adjusting the viscosity of the respective grid volumes.

Future perspectives
At the moment, LES remains a pure research tool, which can be used to tackle fundamental and initial research questions, and that often requires the world's largest supercomputers.In the mid term (next 5-10 years), however, further increasing capacities of supercomputers and alternative hardware, such as multiple GPUs and the Intel MIC coprocessor computer architecture, might alter the situation.
At present we are porting the PALM code to use it on multiple Nvidia GPUs.Instead of using GPU programming models such as OpenCL11 or CUDA12 , which requires re-writing of the existing code (see, e.g., Schalkwijk et al., 2012, who have ported the DALES code), we have chosen the new Ope-nACC13 programming model.Like OpenMP, OpenACC is based on directives that are placed, e.g., in front of loops and interpreted as comment lines by standard compilers.This allows us to use the same code on any kind of hardware, avoiding redundant model development in completely different branches.In order to minimize the time-consuming data transfer between the host (CPU) memory and the device (GPU) memory, almost the complete PALM code is run on the GPU and data are only transferred for I/O purposes.PALM 4.0 is able to run on a single GPU, but only some basic PALM features have been ported so far (FFT Poisson solver, dry prognostic equations).This version has been selected to be part of the SPECACCEL benchmark14 .Multiple-GPU usage is currently implemented using so-called CUDA-aware MPI implementations, which allow us to send data from the GPU memory directly to the network adapter without staging through the host memory.
Within the foreseeable future, the LES technique will become a rewarding alternative for operational forecasts, particularly of local near-surface high-risk conditions such as strong wind gust, dense fog, or pollutant dispersion in urban environments.End-users will be airport operators, city planners, and consultants that currently rely on information from mesoscale models.LES might also be employed for improving the nowcast of convective shower cells.Moreover, the site assessment, which usually involves long-term measurements, is currently a major expense factor during planning of wind parks.The usage of LES might shorten this procedure significantly and thus reduce costs.In order to enable such applications, however, it will be necessary to achieve a highly optimized parallelization of the model.This is particularly true as the number of processors of supercomputer clusters will further increase.
While past LES research has mainly focused on the convective and neutral boundary layer, we observe increasing interest in the stable regime (e.g., Stoll and Porté-Agel, 2008;Zhou and Chow, 2014).This trend will continue in the future and allow more rigorous investigations of the stable and very stable boundary layer, where the largest eddies are only of a size of a few meters.This trend is surely linked to increasing computational power and hence the possibility of using fine enough grid spacings in LES of 2 m and below.Also, the transition periods in the afternoon and early morning will be the object of future research (e.g., Edwards et al., 2014).The optimization and scalability of PALM and future developments like the vertical self-nesting of PALM or the multiple GPU adaptation that will be available in the near future will support the usage of PALM for such applications.

Code availability
The PALM code is freely available 15 and distributed under the GNU General Public License v3 16 .For code management, versioning and revision control, the PALM group runs an Apache Subversion 17 (svn) server at IMUK.The PALM code can be downloaded via the svn server, which is also integrated in a web-based project management and B. Maronga et al.: The Parallelized Large-Eddy Simulation Model bug-tracking system using the software Trac18 .In this way, PALM users can use the web interface to browse through the code, view recent code modifications, and to submit bug reports via a ticketing system directly to the code developers.Furthermore, a model documentation, a detailed user manual as well as an online tutorial are available on the Trac server and are constantly kept up to date by the PALM group.Code updates and development are generally reserved to the PALM group in order to keep the code structure clean, consistent, and uniform.However, we encourage researchers to contact us for collaborative code development that might be suitable to enter the default PALM code.We also appreciate suggestions for future PALM developments.

Summary
In this technical overview paper, we described the current version 4.0 of the well-established LES model PALM that has been developed at Leibniz Universität, Hannover, Germany.PALM has been successfully applied over the last 15 years for a variety of boundary-layer research questions related to both turbulence in the atmospheric and oceanic boundary layer.The aim of this paper was to create a detailed, yet condensed, reference work of the technical realization and special features of PALM.
It was shown that the model is highly optimized for use on massively parallel computer architectures, showing a high performance on up to 50 000 processor cores.Owing to the high scalability, the model is suitable for carrying out computationally expensive simulations for large-domain and very high grid resolutions.Moreover, PALM features embedded models, namely a LPM/LCM for simulating passive particles as well as explicit cloud droplets using the concept of superdroplets.Alternatively, a two-moment microphysics scheme is implemented for studying boundary-layer clouds.A simple canopy model allows for studying the effect of vegetation on the boundary layer.Furthermore, a Cartesian topography is implemented that is most useful for simulations of the urban boundary layer.A surface coupling can be used that allows us to resolve feedback processes between the atmospheric and oceanic versions of PALM.
Furthermore, we gave an overview of the technical realization.This included the general Fortran code structure, the structure of the Lagrangian particles, which requires special treatment, as well as parallelization and optimization on supercomputer clusters and novel hardware and techniques such as GPGPU.
We also described planned model developments, such as the change to an anelastic approximation that will allow us to simulate deep convection, to include self-nesting of the model, and a full coupling of PALM with land surface and radiation models.
Finally, we would like to encourage interested researchers in both the atmospheric and oceanic boundary-layer communities to try out PALM.The model can be freely downloaded from http://palm.muk.uni-hannover.de and used under the GNU GPL.The PALM web page does not only provide the model code and full documentation, it also offers an extensive tutorial section allowing a quick introduction to the model usage.

Figure 1 .
Figure 1.The PALM logo introduced in version 4.0.
closure includes a prognostic equation for the SGS-TKE:

Figure 3 .
Figure 3. Schematic figure of the turbulence recycling method used for generation of turbulent inflow.The configuration represents exemplary conditions with a built-up analysis area (brown surface) and an open water recycling area (blue surface).The blue arrow indicates the flow direction.

Figure 4 .
Figure 4. Sketch of the 2.5-D implementation of topography using the mask method (here for w).The yellow and red lines represent the limits of the arrays nzb_w_inner and nzb_w_outer as described in Sect.4.3, respectively.

Figure 5 .
Figure 5. Snapshot of the absolute value of the 3-D rotation vector of the velocity field (red to white colors) for a simulation of the city of Macau, including a newly built-up artificial island (left).Buildings are displayed in blue.A neutrally stratified flow was simulated with the mean flow direction from the upper left to the bottom right, i.e., coming from the open sea and flowing from the artificial island to the city of Macau.The figure shows only a subregion of the simulation domain that spanned a horizontal model domain of about 6.1 × 2.0 × 1 km 3 , and with an equidistant grid spacing of 8 m.The copyright for the underlying satellite image is held by Cnes/Spot Image, Digitalglobe.For more details, see the associated animation(Knoop et al., 2014).
contains a snapshot from such a benchmark simulation, where three continuous days were simulated.The figure shows the simulated clouds including precipitation events on 26 April 2013 during a frontal passage.Moreover, cloud microphysics have been recently used for investigations of shadow effects of shallow convective clouds on the ABL and their feedback on the cloud field (not yet published).

Figure 6 .
Figure6.Snapshot of the cloud field from a PALM run for three continuous days of the HD(CP) 2 Observational Prototype Experiment.Shown is the 3-D field of q c (white to gray colors) as well as rainwater (q r > 0, blue) on 26 April 2013.The simulation had a grid spacing of 50 m on a 50 × 50 km 2 domain.Large-scale advective tendencies for θ l and q were taken from COSMOE-DE (regional model of the German Meteorological Service, DWD) analyses.The copyright for the underlying satellite image is held by Cnes/Spot Image, Digitalglobe.

Figure 7 .
Figure 7. Illustration of (a) the collision of a super-droplet with a super-droplet smaller in radius and (b) internal collisions of a single super-droplet.Blue (red) circles indicate super-droplets before (after) collision.Weighting factor (A) and bulk mass (m) are denoted in arbitrary units.The radius of the colored circle indicates the volume-averaged radius of droplets represented by the superdroplet.

Figure 8 .
Figure 8. Distribution of droplets (colored dots) inside a shallow cumulus cloud simulated with PALM.The figure shows a vertical cross section through the 3-D cloud.The color and size indicate the droplet's radius.The cloud has been triggered by a bubble of warm air (similar to Hoffmann et al., 2015).A grid spacing of 20 m was used and about 225 million particles were simulated in total.

Figure 9 .
Figure 9. Snapshot of the absolute value of the 3-D rotation vector of the velocity field above a forest canopy downstream of a grassland-to-forest transition (forest volume marked by green isosurface).Pink and yellow colors illustrate strong and weak turbulence, respectively.A neutrally stratified open-channel flow was simulated with the mean flow direction from left to right, i.e., perpendicular to the windward forest edge, using an equidistant grid spacing of 3 m.The figure shows only a subregion of the simulation domain (2 × 1 × 0.4 km 3 ).

B.
Maronga et al.: The Parallelized Large-Eddy Simulation Model processor, including ghost point layers (nbgp = 3 by default) for data exchange between the processors (see also Sect.4.4): u(nzb:nzt+1,nysg:nyng,nxlg:nxrg), with nzb and nzt being the domain bounds of the bottom and top of the model.The lateral subdomain bounds (including ghost layers) are given by nysg = nys -nbgp, nyng = nyn + nbgp, nxlg = nxl -nbgp, nxrg = nxr + nbgp,with nys, nyn, nxl, and nxr being the true subdomain bounds in the south, north, west and east directions, respectively.For optimization, most of the 3-D variables are declared as pointers, e.g., for u and v:
TYPE particle_grid_type TYPE(particle_type), DIMENSION(:), & ALLOCATABLE :: particles END TYPE particle_grid_typeNote that the individual particle features are still accessible as components of particles.An allocatable threedimensional array of particle_grid_type is defined: 4): A. 0 ≤ k < nzb_w_inner, grid points within obstacles or in the ground that are excluded from calculations, B. nzb_w_inner ≤ k < nzb_w_outer, grid points next to vertical walls, where wall-bounded code is executed, C. k = nzb_w_inner = nzb_w_outer, grid points next to horizontal walls, where wall-bounded code is executed, and D. all other k, grid points in free fluid.

Figure 12 .
Figure 12.Scalability of PALM 4.0 on the Cray XC40 supercomputer of HLRN.Simulations were performed with a computational grid of (a) 2160 3 and (b) 4320 3 grid points (Intel-Ivy Bridge CPUs).(a) shows data for up to 11 520 PEs with cache (red lines) and vector (blue lines) optimization and overlapping during the computation (FFT and tri-diagonal equation solver, see Sect.4.4) enabled (dashed green lines).Measurement data are shown for the total CPU time (crosses), the prognostic equations (circles), and for the pressure solver (boxes).(b) shows data for up to 43 200 PEs and with both cache optimization and overlapping enabled.Measurement data are shown for the total CPU time (gray line), pressure solver (blue line), prognostic equations (red line), as well as the MPI calls MPI_ALLTOALL (brown line) and MPI_SENDRCV (purple line).
and can be extended according to the needs of the user.

Table 1 .
List of general model parameters.

Table 2 .
List of general symbols.

Table 3 .
List of ocean model parameters.

Table 4 .
List of cloud physics parameters and symbols.

Table 5 .
List of LPM symbols and parameters.

Table 6 .
List of canopy model parameters and symbols.
ϕ kg m −3 s −1 , kg kg −3 s −1 Canopy tendency for scalar quantities (s, q) significantly damped.A stationary state of the wind profiles can thus be provided much faster in the 3-D model.The arrays of the 3-D variables are then initialized with the (stationary) solution of the 1-D model.These variables are u i in neutral stratification, where inertial oscillations can persist for several days in case that non-balanced profiles are used for initialization.By employing the embedded computationally inexpensive 1-D model with a Reynolds-averagebased turbulence parametrization, these oscillations can be Geosci.Model Dev., 8, 2515-2551, 2015 www.geosci-model-dev.net/8/2515/2015/