Geospatial input data for the PALM model system 6.0: model requirements, data sources, and processing

The PALM model system 6.0 is designed to simulate micro- and mesoscale flow dynamics in realistic urban environments. The simulation results can be very valuable for various urban applications, for example to develop and improve mitigation strategies related to heat stress or air pollution. For the accurate modelling of urban environments, realistic boundary conditions need to be considered for the atmosphere, the local environment, and the soil. The local environment with its geospatial components is described in the static driver of the model and follows a standardized format. The main input parameters describe surface type, buildings and vegetation. Depending on the desired simulation scenario and the available data, the local environment can be described at different levels of detail. To compile a complete static driver describing a whole city, various data sources are used, including remote sensing, municipal data collections and open data such as OpenStreetMap. This manuscript shows how input data sets for three German cities were derived. Based on these data sets, the static driver for PALM can be generated. As the collection and preparation of input data sets is tedious, prospective research aims at the development of a semi-automated processing chain to support users in formatting their geospatial data.


Introduction
Nowadays, computational fluid dynamics models are increasingly used to simulate the atmospheric flow within urban environments, e.g., to develop and improve mitigation strategies for heat stress (e.g. Sharma et al., 2018) or air pollution scenarios (e.g. Kurppa et al., 2018). In order to draw a realistic picture of the thermodynamic and dynamic conditions within urban environments, it is required to consider and sufficiently represent all the relevant physics on the micro- and mesoscale, as well as realistic boundary conditions to reflect the real-world conditions. Besides realistic initial and boundary conditions for [...]

Input data requirements by PALM
The geospatial input data for PALM is organised hierarchically, with a set of minimum requirements and further optional input data, depending on the objective of the simulation and available input data. This section gives a description of the input data requirements of PALM in the different situations as well as a short description of the required parameters.

Requirements and hierarchy
All geospatial input data is provided by the user in a netCDF driver file (hereafter referred to as static driver) that comprises all static (i.e., time-invariant) spatial information as well as metadata according to the so-called PALM input data standard (PIDS, see Appendix A). The PIDS inherits most of the netCDF Climate and Forecast Metadata Conventions Version 1.7 (CF-1.7) and therefore also conforms to the conventions of the Cooperative Ocean/Atmosphere Research Data Service (COARDS). Depending on the setup (e.g., only dynamic flow or fully thermodynamic simulation with interactive surfaces), there is a minimum set of mandatory variables and several optional ones that need to be included in the static driver.
The initialization in PALM follows a multi-step approach, depending on the given level of detail (LOD) of each variable as provided in the static input file. In absence of a static driver, i.e., the lowest level of detail, LOD0, a horizontally homogeneous surface is initialized based on settings using Fortran NAMELIST parameters, e.g., homogeneously vegetated surfaces and surface properties in the land-surface model (see Maronga et al., 2019). In LOD1, surface information is passed to PALM via two-dimensional fields in the static driver. Table 12 gives an overview of all LOD1 fields that can be read by PALM. For simulations without thermodynamics, i.e., when no interactive surface schemes are used, only the fields zt (terrain height) and buildings_2d (building height) are used for initialization and at least one of these fields must be provided. Additionally, the field for the building identifier (ID) building_id must be set when zt is used in order to guarantee a correct mapping of buildings on the terrain (see Sect. 5.2 for details). As the static driver contains rasterized data, information about objects that extend over several grid volumes is lost. By using an ID field this information can be retained. Note that building_id is thus also needed when the building-based indoor model of PALM is switched on (see Maronga et al., 2019). For cases with interactive surfaces, each surface element is classified according to its treatment, i.e., as default type (non-interactive), land-surface type or urban type (building). In setups without interactive surfaces, all surface elements are classified as default type. In setups with interactive surfaces, a surface classification using the fields vegetation_type, water_type, pavement_type, and building_type is utilised (see Sect. 2.6). Currently, each surface pixel (y, x) must be assigned to one of the aforementioned types.
In the future, PALM will also allow a tile approach so that multiple types can be present in one grid box, which will be particularly useful when using coarser grid spacings (> 10 m), where neglecting sub-pixel heterogeneity is no longer adequate. The tile approach will be realized by specifying the individual portions via the field surface_fraction, which is already recognized by PALM.
By setting the surface types, all required parameters for the surface treatment are automatically set to default values. Note that pavement- and vegetation-type surfaces require the setting of soil_type at the respective pixels. When using the surface classification, a default albedo type is automatically set for each pixel depending on the chosen surface classification.
This can, however, be overwritten using the optional field albedo_type. Tables A1-A7 in the Appendix give an overview of the classifications used and the parameters automatically set when using LOD1.
Based on the LOD1 classification of each surface pixel, the static driver allows overwriting all or selected parameters that were automatically set by the LOD1 input data (for example roughness lengths, surface emissivity, etc., see Tables A1-A7 in the Appendix). For each *_type field in LOD1 there is thus a respective *_pars field, representing LOD2 data (see Table 13). Note that LOD2 can only be used when LOD1 data has been specified as well. The *_pars fields can then contain fill values except for those locations where the data should be overwritten by LOD2 input data. Additionally, LOD2 offers the NC_BYTE field buildings_3d, which can be used to specify three-dimensional building structures including overhanging structures, thoroughfares, and bridges (see Sect. 2.5). Unlike for the *_pars fields, the LOD1 data (i.e., buildings_2d) is not used if LOD2 data (i.e., buildings_3d) is present. Furthermore, the field root_fraction can be set in order to specify a different vertical root distribution in the soil model of the parameterized vegetation in the land surface model. While LOD2 is limited to a localized setting of individual surface or material properties based on location (y, x) only, LOD3 and LOD4 settings (see Tables 14 and 15) allow an even more detailed specification of building parameters. Note that in LOD4, the input data no longer depends on the rasterized PALM grid, but is arranged in a one-dimensional array of size ns, where ns is the number of surface elements in the model domain. For each surface element, the user then has to specify the position of the surface element in the PALM domain space, i.e., (z(1 : ns), y(1 : ns), x(1 : ns)), as well as the orientation of surface elements in terms of azimuth and zenith angles (azimuth(1 : ns) and zenith(1 : ns), respectively) in one-dimensional fields.
Additionally, three-dimensional fields of leaf-area density, LAD (lad), and basal-area density, BAD (bad), as well as root fractions (root_fraction_resolved) and a tree ID (tree_id) can be used to set up resolved-scale plant canopies (see Sect. 5.3).
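The interplay between LOD1 defaults and LOD2 overrides described above can be illustrated with a minimal sketch for a single pixel. The fill value, the parameter names (z0, emissivity) and the numbers in the default table are purely illustrative assumptions; the actual defaults are those listed in the appendix tables, and this is not PALM's own implementation.

```python
# Hypothetical fill value; the real one is defined by the static driver file.
FILL = -9999.0

# Illustrative (made-up) LOD1 defaults per vegetation_type.
DEFAULTS = {
    1: {"z0": 0.005, "emissivity": 0.94},
    3: {"z0": 0.10, "emissivity": 0.95},
}


def resolve_parameters(vegetation_type, vegetation_pars):
    """Return the parameters of one pixel: LOD1 defaults, overwritten by LOD2."""
    # LOD1: start from the defaults implied by the surface type.
    resolved = dict(DEFAULTS[vegetation_type])
    # LOD2: only entries that are not fill values replace a default.
    for name, value in vegetation_pars.items():
        if value != FILL:
            resolved[name] = value
    return resolved


pixel = resolve_parameters(3, {"z0": 0.25, "emissivity": FILL})
print(pixel)
```

In this sketch the roughness length is taken from the LOD2 entry, while the emissivity keeps its LOD1 default because the corresponding LOD2 entry is a fill value.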

Geo-referencing
Various model components such as the radiation parametrization, the representation of the Coriolis force or geo-referencing of model output require information about the geo-location of the grid cells of PALM. Therefore, the static input file must contain information about the longitude and latitude, as well as the Easting and Northing UTM coordinates of the lower-left corner of the model domain. Furthermore, the reference height of the lowest model grid point as well as the rotation angle of the model domain must be provided, which is especially important to set up virtual measurement positions and trajectories within the model according to 'real-world' measurements (Maronga et al., 2019). The required coordinate information must be given as global attributes in the NetCDF file.

Terrain height
To consider effects of elevation changes on the flow, the terrain height zt can be provided for each discrete (y, x)-location in the model. Data gaps leading to fill values are forbidden. In case zt is not provided, the land surface is set up at z = 0 m.
The terrain height zt can be provided in absolute values, i.e., in meters above sea level, or in relative heights where, e.g., its minimum value is already subtracted. If absolute values are used, PALM will itself subtract the minimum value within the domain to save computational grid points (no computations are needed within the soil). At this point we note that the original terrain height might be further processed and slightly modified by PALM to fulfill certain requirements, which is described in detail in Sect.
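The subtraction of the domain minimum described above can be sketched as follows. This is a minimal illustration, not PALM's own routine; the function name and the use of None as a stand-in for a fill value are assumptions made for this example.

```python
def to_relative_heights(zt):
    """Subtract the domain minimum from a 2D terrain field given as zt[y][x]."""
    # Fill values are forbidden for zt; None stands in for a fill value here.
    if any(value is None for row in zt for value in row):
        raise ValueError("zt must not contain fill values")
    z_min = min(min(row) for row in zt)
    return [[value - z_min for value in row] for row in zt]


zt_absolute = [[34.0, 35.5], [36.0, 41.0]]  # metres above sea level
zt_relative = to_relative_heights(zt_absolute)
print(zt_relative)  # [[0.0, 1.5], [2.0, 7.0]]
```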

Surface classification
In order to parameterize atmosphere-surface interactions, PALM needs to solve the energy balance at physical surfaces. For this, several physical surface parameters such as the heat capacity, roughness, albedo, emissivity, information about vegetation, etc., must be known. To allow for proper LOD1 initialization of material parameters and surface properties via pre-defined lists, PALM classifies all horizontal and vertical surfaces in the model according to their general type, e.g., whether it is a building or a vegetation surface. PALM considers four different types of surfaces: building, vegetation, pavement and water surfaces. The surfaces are classified in a two-step approach. In a first step, grid points are flagged as atmosphere, building or terrain grid points. Surfaces which belong to a building grid point are automatically flagged as building surfaces, while surfaces which belong to a terrain grid point are flagged as land surfaces. In a second step, surfaces are further specified according to their respective type, which enables proper LOD1 initialization with pre-defined lists for material and surface properties. For this reason, input of building_type, vegetation_type, pavement_type and water_type is required. At each (y, x)-location at least one of these types must have a non-missing value, so that each surface element can be classified appropriately, either as pavement, vegetation, or water. It is also required that the given *_type matches the general classification into building and land surface, i.e., at locations where buildings (see Sect. 2.5) are defined, building_type must also be defined, while at land surfaces at least one of vegetation_type, pavement_type or water_type must be defined.
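The consistency requirements stated above lend themselves to a simple per-pixel check before the static driver is written. The following sketch is only an illustration under stated assumptions: the function name is hypothetical, None stands in for a fill value, and the flag is_building represents the result of the first classification step.

```python
def check_pixel(is_building, building_type, vegetation_type,
                pavement_type, water_type):
    """Return a list of problems for one (y, x) pixel; None marks a fill value."""
    problems = []
    if is_building:
        # Building grid points need a building_type.
        if building_type is None:
            problems.append("building pixel without building_type")
    else:
        # Land surfaces need at least one of the three land-surface types.
        land_types = (vegetation_type, pavement_type, water_type)
        if all(t is None for t in land_types):
            problems.append("land pixel without vegetation/pavement/water type")
    return problems


print(check_pixel(True, None, None, None, None))
print(check_pixel(False, None, 3, None, None))
```

Running such a check over all pixels before writing the file makes gaps in the classification visible early, when they are still cheap to fix in the source layers.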

Buildings
Information on the location and height of buildings can be provided as two-dimensional building heights (buildings_2d, LOD1) or as a three-dimensional integer array (buildings_3d, LOD2), where each building (non-building) grid point is masked by 1 (0). At locations where no buildings are located, buildings_2d may contain fill values, while buildings_3d must not contain any fill values. In the LOD1 case, buildings are always mounted on the Earth's surface and overhanging structures such as tunnels or bridges are not allowed, while in the LOD2 case overhanging obstacles are also allowed. In both cases, building information is given relative to the terrain height and buildings are mapped onto the top of the terrain during model initialization, which is described in detail in Sect. 5.2. At this point we note that PALM can also consider bridges, which can be input as three-dimensional building structures but require a special treatment as further discussed in Sect. 5.2. To distinguish between single buildings, e.g., in order to map them accordingly onto the underlying terrain (see Sect. 5.2), or to compute the energy demand of single buildings (Maronga et al., 2019), each building has a unique identification number (building_id) that must be given in the static input file at each (y, x)-location where buildings_2d or buildings_3d is defined.
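The relation between the two building representations can be sketched by converting a buildings_2d height field into a buildings_3d-like 0/1 mask. This is a minimal sketch with assumptions: the function name is hypothetical, a uniform vertical grid spacing dz is assumed, and the rule that a grid cell counts as building when its centre lies at or below the roof height is an illustrative choice that need not match PALM's own discretization.

```python
def heights_to_mask(buildings_2d, dz, nz):
    """Convert buildings_2d[y][x] heights (None = fill) to a 0/1 mask [z][y][x]."""
    ny, nx = len(buildings_2d), len(buildings_2d[0])
    mask = [[[0] * nx for _ in range(ny)] for _ in range(nz)]
    for y in range(ny):
        for x in range(nx):
            height = buildings_2d[y][x]
            if height is None:
                continue  # no building: the mask stays 0, never a fill value
            for k in range(nz):
                # Assumption: flag a cell when its centre is below the roof.
                if (k + 0.5) * dz <= height:
                    mask[k][y][x] = 1
    return mask


mask = heights_to_mask([[None, 10.0]], dz=5.0, nz=3)
print(mask)  # [[[0, 1]], [[0, 1]], [[0, 0]]]
```

A mask built this way cannot represent overhanging structures; those require setting individual cells of the three-dimensional array directly.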
Further, in case the energy balance for building surfaces should be solved (Resler et al., 2017; Maronga et al., 2019), information on the type of the buildings must be provided. To solve the energy balance at building surfaces appropriately, various wall material and surface properties must be known, e.g., wall thicknesses, heat capacities and conductivities, window and wall fractions, albedo, etc., which depend on the individual construction parameters and the current state of building restoration. As much of this information is often unknown, buildings are classified into characteristic types in order to use default parameters. For this, buildings are classified according to their year of construction and their general usage (residential or office building). PALM provides lists of wall material and surface parameters for six building types: 1 - residential buildings built before 1950, 2 - residential buildings built between 1951 and 2000, 3 - residential buildings built after 2001, 4 - office buildings built before 1950, 5 - office buildings built between 1951 and 2000, and 6 - office buildings built after 2001. In addition, building_type = 7 is used exclusively to identify bridges and to distinguish them from other three-dimensional building structures. The respective building_type must be provided for each discrete (y, x)-location where buildings_2d or buildings_3d is defined. Even though building parameters are difficult to aggregate in practice, PALM nevertheless allows different types to be prescribed within a single building, e.g., to consider building extensions with different physical properties or different usage in case such information is available. In addition, to modify wall material and surface parameters at different (y, x)-locations or even at single surface elements, building_pars or building_surface_pars, respectively, can be optionally provided.
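The classification into building types can be written as a small lookup. The sketch below is illustrative only: the function name and arguments are hypothetical, and because the class limits quoted above leave the years 1950 and 2001 unassigned, the sketch assumes that 1950 falls into the first period and 2001 into the third.

```python
def building_type(year_built, is_office, is_bridge=False):
    """Map construction year and usage to a building type index (1-7)."""
    if is_bridge:
        return 7  # type 7 is reserved for bridges
    # Assumption: 1950 belongs to the first period, 2001 to the third.
    if year_built <= 1950:
        period = 0
    elif year_built <= 2000:
        period = 1
    else:
        period = 2
    # Residential types are 1-3, office types are 4-6.
    return 1 + period + (3 if is_office else 0)


print(building_type(1975, is_office=False))  # 2
print(building_type(2010, is_office=True))   # 6
```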

Land surfaces
Besides the general classification via pre-defined parameter lists for each land-surface type, physical surface parameters can be further specified with vegetation_pars, water_pars or pavement_pars, which can be optionally provided, as explained in Sect. 2.1.

Vegetation
PALM distinguishes between parameterized vegetation that is not resolved by the numerical grid and thus considered flat (e.g., short grass), and tall vegetation that can be partially resolved by the numerical grid, depending on the grid spacing used (e.g., shrubs or trees). Parameterized vegetation is considered within the energy balance solver for land surfaces, where the given vegetation_type defines the physical properties at the respective surface element. These can optionally be specified in more detail by vegetation_pars, which may contain missing values for single parameters and locations. The respective properties are only updated and customized where vegetation_pars contains non-missing values, i.e., it is allowed to provide parameters only at locations where these are available. At parameterized vegetation surfaces, additional information concerning the root-area-density distribution (root_area_dens_s) within the soil can be optionally provided. If it is not provided, it is taken from bulk parameter lists defined by the given vegetation_type.
In contrast to parameterized vegetation, resolved vegetation directly accounts for a sink term in the momentum equations (e.g. Kanani-Sühring and Raasch, 2015) and directly affects its surroundings via shading and three-dimensional reflections (Resler et al., 2017). To consider these effects in the model, information about the leaf-area density (LAD) within the respective grid volumes is required and can be input via lad, which is mapped on top of the underlying terrain. The leaf-area density in [...]
[...] To further specify water surface parameters, water_pars can be optionally provided.

Soil classification
To consider the interaction of the land surface with the underlying soil at vegetation and pavement surfaces, a soil_type must be given at grid cells that are classified as vegetation or pavement surfaces, defining a list of default physical parameters for pre-defined soil types. The soil_type can be given at different levels of detail. For LOD1, soil_type must be provided for each (y, x)-location where vegetation or pavement is defined, assuming that soil properties are vertically homogeneous, while for LOD2 soil_type is given for each (z_soil, y, x)-location, with z_soil being the depths of the soil layers, in order to consider variations of soil properties also in the vertical direction. To further customize physical soil parameters, soil_pars can be optionally provided either as LOD3 to provide vertically homogeneous parameters, or as LOD4 to provide vertically heterogeneous parameters, i.e., each soil layer at each relevant surface element can be given individual physical properties. The field soil_pars may contain missing values and is only used to update the physical soil parameters at locations and for parameters that are non-missing, i.e., it is allowed to provide only single parameters at locations where this information is available.

Surface albedo
Information concerning the albedo for the different surfaces is already given within the pre-defined parameter lists for building and land surfaces. However, more detailed information concerning the albedo_type (predefined list of broadband and spectral albedos for direct and diffuse radiation) or broadband / spectral surface albedos at each (y, x)-location (albedo_pars) can be provided optionally.
Masson et al. (2020) review various data sources for urban climate models at meso- and micro-scale. The requirement of a spatial resolution of 1-10 m for building-resolving simulations, being a key focus of PALM, reduces the available sources of data significantly. The most important sources are remote sensing, governmental/municipal data and open data, as they allow area-wide and automated pre-processing, while field surveys and manual mapping are only a practicable option for small areas of interest, e.g. of one building block in a city. Often, a combination of different sources is required to achieve a consistent coverage with detailed information for the entire area of interest.
The observation of the characteristics and dynamics of the Earth's surface by means of remote sensing has become increasingly important in recent years. In general, remote sensing approaches take advantage of the fact that material- or object-specific interactions occur between the surface and land cover type on the one hand, and the electromagnetic radiation interacting with them on the other hand. This specific spectral signature or back-scattering pattern can then effectively be used to identify and discriminate different surface and material types. Active imaging systems such as radar or laser scanners carry their individual radiation sources (Baghdadi and Zribi, 2016). The intensity and pattern of the backscattering then allows mapping the position, type and, in case of laser scanners, height of surfaces and objects. This can be used to create digital surface models, e.g., based on the radar satellite TerraSAR-X at a global scale (Rizzoli et al., 2017) or at local scale using airborne LiDAR systems (Yan et al., 2015). Optical remote sensing makes use of the reflected radiation of the sun. There is a broad range of systems available, mounted on satellite platforms as well as airborne and UAV-mounted sensors. The selection of the sensor used depends on spectral characteristics, spatial resolution, availability for the area of interest and costs. Typical mapping tasks carried out with optical remote sensing are land cover mapping (e.g. Khatami et al., 2016; Wulder et al., 2018) and vegetation characterization (e.g. Verrelst et al., 2015). As the PALM model requires high spatial resolution for performing building-resolving simulations, the free and open Sentinel-2 satellite data is of interest as well as the data of commercial satellite constellations like RapidEye or WorldView. Additionally, false color airborne imagery, with its very high spatial resolution, would be preferable if available for the right time of year.
Especially in developed countries, public authorities and agencies routinely collect a vast amount of geo-spatial data sets.
The following focuses on the situation in Germany, because the selected study areas for the model development are located here. In Germany, available official data is hosted at different levels of agencies and departments (municipal, federal state, state, cadastral office, etc.). The accessibility of the data differs between the federal states and municipalities. In some federal states, such as Berlin, Hamburg, Thuringia or North Rhine-Westphalia, the data is easily accessible and downloadable and available through the Open Data Licence. These data sets are also regularly updated. The possibility to use additional official data depends on the purpose and costs. Municipalities interested in the resulting micro-climate simulations usually provide their streets, public open spaces and water bodies. ATKIS and ALKIS are regularly updated every one to three years (depending on the land use category). The municipal parks and open spaces departments host the data of the public green spaces and tree register. The latter is usually only available for trees on public ground. Information about trees and green spaces on private property has to be derived from additional data sources. If a tree register is available, it provides comprehensive information on tree species, age, height, and sometimes also crown and stem diameter. Building data is provided in form of 3D building models in level of detail LOD1 (block buildings without exact roofs) or LOD2 (more detailed with roof parts). Whereas LOD1 data sets are available for the whole of Germany, LOD2 data is currently only available for some federal states. Nevertheless, a Germany-wide coverage is planned for the end of 2019 (Arbeitsgemeinschaft der Vermessungsverwaltungen der Länder der Bundesrepublik Deutschland, a, b).
A standardized data format for 3D city models is City Geography Markup Language (CityGML, Open Geospatial Consortium (2012)), an XML-based data format that can be used to describe the city in 3D at different levels of detail. Digital terrain models, surface models or LiDAR data as well as aerial images are generally available at the departments of geo-information or land survey administration at the federal state level. Aerial images are updated every 2 to 5 years to monitor the green volume development. For this purpose the images have to include the near infrared band. The acquisition dates range, depending on their primary purpose, from early spring to summer, and the images thus show minimal or dense broadleaf cover. Only the summer images present the phenological state needed to detect the tree canopy and with that the leaf area index (LAI) and leaf area density (LAD). Soil data is available at the municipal level or at the state level in different scales from 1:10,000 to 1:200,000. The municipal data cannot provide the full information for a model parameterization, so that additional data acquisitions and/or data fusion is needed.
Surprisingly, municipalities (at least in Germany) usually do not systematically collect spatially detailed information on the road network and pavement types. This gap can be closed using Volunteered Geographical Information from the OpenStreetMap (OSM) project. One caveat of such crowd-driven data collection is that anybody can add any features and tags they think relevant, so no homogeneous data quality, completeness and adherence to a single standard can be guaranteed (Quinn and Bull, 2019). However, Haklay (2010) and Graser et al. (2015) show that at least in western Europe OSM has a data quality on par with governmental sources. OSM can be utilized to add missing information to the government data, e.g., about road type, pavement type, bridges, pedestrian crossing points, and water bodies.

Geodata can be stored in raster format, with a value for each raster cell or pixel. GeoTIFF is a common format supported by all geo-processing software, but NetCDF also supports geo-spatial information. Spatial data in vector format uses the locations of points, lines and polygons with optionally attached attribute tables containing additional information on each spatial object. A commonly used vector format is the ESRI Shapefile. Governmental data are often in vector format, as there are many attributes that describe an object. Remote sensing data are mainly raster data, as such data is recorded by the sensor in a regular grid.
This section introduces an exemplary strategy and workflow for the selection and pre-processing of the input data sources for PALM 6.0. This is demonstrated for three cities in Germany with varying availability of input data: Berlin, Stuttgart and Hamburg. Despite the variety of data sources, it was aimed to automate the pre-processing for each layer as much as possible to ensure replicability and to handle the vast volume of data (0.5-1 TB per city, depending on target resolution and city area). This resulted in a collection of pre-processing scripts to which adaptations have been made for each of the three cities.
The municipal data of Berlin, including the aerial imagery and 3D city model, was retrieved from the Berlin Geoportal FIS Broker. The municipal data of Hamburg, including the aerial imagery and 3D city model, was retrieved from the Transparenzportal of the governmental offices in Hamburg. The municipal data of Stuttgart, including the aerial imagery and 3D city model, was provided by the Landeshauptstadt Stuttgart for use in the [UC]² project (Scherer et al., 2019). Other sources of data are indicated directly in the text.

Terrain height
Active remote sensing systems are valuable sources to generate digital surface models. For the layers of Berlin, Stuttgart and Hamburg, products of two different active sensors are combined. Within the municipality boundaries, the terrain height is directly retrieved from the 3D city model's LOD0 data derived from airborne LiDAR data as provided by each of the municipalities. As this data set ends at the municipal boundaries, a satellite-based, yet coarser data set is added to provide terrain height for the surrounding areas, as PALM always requires a rectangular area. In this case, the 30 m SRTM digital elevation model (DEM) was used. It is derived from the Shuttle Radar Topography Mission (SRTM; Farr et al., 2007). For Stuttgart and Hamburg, the SRTM data set first had to be transformed from the global lat/lon geoid to the local German geoid.
Subsequently, the SRTM DEM was clipped and resampled to the study areas with 1 m spatial resolution. Finally, the local terrain model and the SRTM terrain model were merged, with the local terrain model as primary source. A feathering distance of 100 pixels was assigned for borders of the local terrain model to smooth any abrupt changes in height between the two data sets. The final terrain height data set for each of the three cities is shown in Fig. 1.
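The merging of the two terrain models with a feathered border can be sketched as below. This is not the processing script used for the three cities: the function name is hypothetical, the linear weighting and the 4-neighbour distance to the nearest pixel without local data are assumptions chosen for illustration, and None marks pixels outside the local coverage.

```python
from collections import deque


def feather_merge(local, coarse, feather):
    """Blend local[y][x] (None outside coverage) into coarse[y][x]."""
    ny, nx = len(local), len(local[0])
    # Breadth-first search: 4-neighbour distance (in pixels) of every pixel
    # to the nearest pixel without local data.
    dist = [[None] * nx for _ in range(ny)]
    queue = deque()
    for y in range(ny):
        for x in range(nx):
            if local[y][x] is None:
                dist[y][x] = 0
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            yy, xx = y + dy, x + dx
            if 0 <= yy < ny and 0 <= xx < nx and dist[yy][xx] is None:
                dist[yy][xx] = dist[y][x] + 1
                queue.append((yy, xx))
    merged = []
    for y in range(ny):
        row = []
        for x in range(nx):
            if local[y][x] is None:
                row.append(coarse[y][x])  # outside coverage: coarse model only
                continue
            d = dist[y][x]
            # Weight of the local model grows linearly from the border inward;
            # d is None only if the local model covers the whole domain.
            w = 1.0 if d is None else min(1.0, d / feather)
            row.append(w * local[y][x] + (1.0 - w) * coarse[y][x])
        merged.append(row)
    return merged


local_dem = [[None, 10.0, 10.0, 10.0, 10.0]]
coarse_dem = [[20.0, 20.0, 20.0, 20.0, 20.0]]
merged_dem = feather_merge(local_dem, coarse_dem, feather=4)
print(merged_dem)  # [[20.0, 17.5, 15.0, 12.5, 10.0]]
```

In the example the 10 m jump between the two models is spread over four pixels instead of occurring at a single cell boundary.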

Surface classification
PALM differentiates between building and land surface grid cells, where the land surface grid cells must consist of vegetation, pavements or water bodies (see Sect. 2.4). The first task within the surface classification is to map these four classes (buildings, vegetation, pavements and water bodies). The sections below describe how the corresponding maps are prepared so that they can be written out into the static driver file. As soon as they are available, possible gaps, which usually result from combining data sets of different sources or from rasterization, need to be filled, so that every pixel is covered by at least one of the core classes (see Sect. 2.1). To achieve this, the layers were first ranked according to their spatial reliability, e.g., the building layer was preferred over the (often coarser) vegetation layer. Secondly, extra buffered secondary input layers were generated where possible and used to fill in their primary layer for pixels where none of the primary layers or previously filled layers had valid data. For example, this was necessary for roads, where exact information on roadside parking was not available and thus the actual paved surface is wider than what would be expected from the road width. If there were still holes after all the filling iterations, they were filled with a prevalent reasonable value like bare soil.
After the general surface classification is done, unique IDs for each of the buildings and bridges are generated to mark which pixels belong to the same object to support processing in PALM (see Sect. 5.2).
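The ranking-based gap filling described above can be sketched as a first-valid-layer-wins rule. The function name, the layer names and the fallback class are illustrative assumptions; buffered secondary layers would simply be appended to the ranked list after the primary layers.

```python
def fill_gaps(ranked_layers, fallback):
    """Pick, per pixel, the first layer with valid data.

    ranked_layers: list of (name, grid) pairs, most reliable first, where
    grid[y][x] is None if the layer has no data at that pixel.
    """
    first_grid = ranked_layers[0][1]
    ny, nx = len(first_grid), len(first_grid[0])
    # Start from the fallback class so that no pixel is left unclassified.
    result = [[fallback] * nx for _ in range(ny)]
    for y in range(ny):
        for x in range(nx):
            for name, grid in ranked_layers:
                if grid[y][x] is not None:
                    result[y][x] = name
                    break  # higher-ranked layers take precedence
    return result


buildings = [[1, None, None]]
roads_buffered = [[None, 1, None]]
vegetation = [[1, None, None]]
classes = fill_gaps(
    [("building", buildings), ("pavement", roads_buffered),
     ("vegetation", vegetation)],
    fallback="bare_soil",
)
print(classes)  # [['building', 'pavement', 'bare_soil']]
```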

Buildings
For all the cities the building height, building type and building IDs were derived.

Building height
For realistic simulation results of both the flow and the thermodynamic interaction with the urban canopy, it is essential to have the spatially resolved and correct building height. In Germany, municipalities have 3D building outlines in LOD1 (block model) or LOD2, which contains differentiated roof structures and therefore allows spatially explicit height calculation for each pixel. For Hamburg and Berlin, LOD2 data was available as CityGML (Open Geospatial Consortium, 2012) data (Berlin: https://fbinter.stadt-berlin.de/fb/feed/senstadt/a_lod2; Hamburg: http://suche.transparenz.hamburg.de/dataset/3d-stadtmodell-lod2-de-hamburg4?forceWeb=true), while in Stuttgart LOD2 building height data was provided as a 3D triangulated irregular network (TIN). The developed approach to calculate the building height is a two-step approach. In a first step, the 2D coordinates of all pixel centroids inside a single polygon as well as the 3D bounding box of the polygon are calculated using the algorithm from the GDAL library (GDAL/OGR contributors, 2019), which can cope with complex building geometries including inner courtyards. If the polygon is of single height, which is (nearly) always the case for floor polygons, this single height is used for all pixels. Otherwise, for each centroid coordinate the 3D intersection between its vertical line and the plane of the polygon is calculated to get the height of the building at this position. The special cases in which the vertical line is inside or parallel to the plane (building walls directly passing through pixel centroids) are filtered. The calculated height values are capped to the z-range of the bounding box. This is necessary to compensate for rounding errors in nearly vertical planes, which could lead to single intersections wrongly being near infinite. The minimal and maximal height intersection is stored for each pixel.
This approach works on the assumption that all the single polygons are planar polygons as defined in the CityGML standard (Open Geospatial Consortium, 2012), where all points of a polygon are in the same plane. However, for some single buildings in Berlin this assumption proved wrong as roof planes included single points from the walls of floors. These polygon errors were corrected where possible and otherwise removed from further calculations. Once all polygons have been processed, the building height is calculated as the difference between the maximum and minimum intersection. In a second iteration the same approach is repeated for all pixels that intersect the boundary of the polygons, but where the centroid is outside the polygon. However, these height values are only used for pixels where no building height was calculated in the first iteration. This two-step approach guarantees that the extrapolation (with capping at the z-range of the polygon) to coordinates outside the building footprint is only performed if no other polygon contains the centroid of this pixel. Therefore a higher building which slightly intersects the corner of a pixel will not interfere with lower buildings that cover the pixel, while still removing a lot of single-pixel holes with the other base layers (water, vegetation, pavement). This approach worked well for the cities of Hamburg and Berlin, where some broken polygons had to be fixed beforehand. However, in Stuttgart not all buildings had a closed floor polygon. Therefore in this case the terrain height was used as the lower boundary of the buildings. An example of the building height from the CityGML data of Berlin is given in Fig. 2 for the Reichstag building.
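The intersection of a vertical line through a pixel centroid with a planar polygon, including the capping and the filtering of vertical planes described above, can be sketched as follows. This is an illustrative reconstruction, not the code used in the study: the function name, the tolerance eps and the choice of three polygon vertices to define the plane are assumptions.

```python
def plane_height(p0, p1, p2, x, y, z_min, z_max, eps=1e-9):
    """Height of the plane through p0, p1, p2 at (x, y), capped to [z_min, z_max].

    Returns None for vertical planes, where a vertical line has no single
    intersection point with the plane.
    """
    ux, uy, uz = (p1[i] - p0[i] for i in range(3))
    vx, vy, vz = (p2[i] - p0[i] for i in range(3))
    # Normal vector n = u x v of the plane.
    nx = uy * vz - uz * vy
    ny = uz * vx - ux * vz
    nz = ux * vy - uy * vx
    if abs(nz) < eps:
        return None  # wall-like plane: filtered out
    # Solve n . (P - p0) = 0 for the z component of P = (x, y, z).
    z = p0[2] - (nx * (x - p0[0]) + ny * (y - p0[1])) / nz
    # Cap to the z-range of the polygon's bounding box.
    return max(z_min, min(z_max, z))


roof = ((0.0, 0.0, 10.0), (10.0, 0.0, 10.0), (0.0, 10.0, 15.0))
inside = plane_height(*roof, x=5.0, y=5.0, z_min=10.0, z_max=15.0)
capped = plane_height(*roof, x=5.0, y=20.0, z_min=10.0, z_max=15.0)
wall = plane_height((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 0.0, 10.0),
                    x=5.0, y=0.0, z_min=0.0, z_max=10.0)
print(inside, capped, wall)  # 12.5 15.0 None
```

The building height of a pixel would then follow as the difference between the largest and smallest such intersection over all polygons of the building.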

Building type
The building type used in PALM is defined through a combination of building use and the age of the building (see A2). In Germany, the municipalities often maintain a building use data base, e.g., in the ALKIS data sets. Usually this data is provided at building block level and therefore often contains mixed uses. For Stuttgart and Hamburg this data set was used to distinguish between residential and other building use (Hamburg: http://suche.transparenz.hamburg.de/dataset/alkis-ausgewahlte-daten-hamburg6?forceWeb=true). The lookup table is given in Table B8. The cities of Hamburg and Stuttgart additionally maintain a data base documenting the age of the buildings. This allowed the building type to be assigned with quite high reliability. For Berlin, a combined data set is available, in which building blocks are categorised by use and building construction period at the same time (https://fbinter.stadt-berlin.de/fb/wfs/geometry/senstadt/re_isu5). The look-up table to translate this information to building types is listed in Table B7. The building type maps for Stuttgart, Berlin and Hamburg are presented in Fig. 3. Note, however, that in Germany there is no cadastral information on restoration and heat insulation actions for individual buildings, so that the age of building construction used for building classification is often a rather poor proxy for the thermodynamic properties of buildings.

Vegetation
PALM can handle very detailed information on the vegetation, as outlined in Sect. 2.6.1. In this study, the vegetation type was determined as well as vegetation on roofs and several characteristics of trees. As an area wide approach satellite or airborne imagery provides accurate information on the location of the vegetation as well as some vegetation characteristics, but not all.
Fortunately, in Germany a large amount of information is available from municipal data that cannot be retrieved with remote sensing data alone. Also, open data such as OSM and other citizen science projects can provide valuable information on the urban vegetation. Therefore, these sources are all combined to derive input data for PALM that is as complete as possible.

Vegetation type
For the vegetation type layer, municipal data was used in the three demo cities, including ALKIS (Berlin) and the Biotope Cadastre (Hamburg). For Stuttgart, such municipal data was not available, thus OSM was used as the main source here. Subsequently, gaps were filled with data from Corine Land Cover (CLC; European Union, 2017), which was especially the case for the areas outside the municipal borders. Missing data and gaps in the layer between vegetation and other features are filled using aerial colour-infrared (CIR) images, with a threshold on the NDVI to differentiate between grass and trees.
For the different data sources lookup tables have been created to map the classes of OSM 11 (Table B2), CLC (Table B2)

Vegetation on roofs
Intensive and extensive green roofs are detected using municipal colour-infrared (CIR) orthoimages with 10 to 20 cm resolution in combination with the building footprints for the cities of Berlin and Stuttgart. The percentage of green roof vegetation is aggregated for each building roof pixel. The mostly very extensive green vegetation on roofs is detected by analysing the Normalized Difference Vegetation Index (Rouse et al., 1974):

$$\mathrm{NDVI} = \frac{\rho_{nir} - \rho_{red}}{\rho_{nir} + \rho_{red}}$$

where $\rho_{nir}$ is the reflectance in the near-infrared part of the spectrum and $\rho_{red}$ the reflectance in the red part of the spectrum. In Hamburg, no near-infrared aerial data was available at the time to create the green roof layer.
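As an illustration of the aggregation step, the following Python sketch computes the NDVI for the high-resolution CIR sub-pixels that fall into one roof grid cell and returns the percentage classified as vegetated. The function names and the NDVI threshold of 0.3 are assumptions made for this example; the threshold used in this study is not stated here.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pair of reflectances."""
    denom = nir + red
    return 0.0 if denom == 0 else (nir - red) / denom


def green_roof_percentage(nir_pixels, red_pixels, threshold=0.3):
    """Percentage of CIR sub-pixels in one roof grid cell with NDVI > threshold.

    threshold is a hypothetical value chosen for illustration only.
    """
    if not nir_pixels:
        return 0.0
    green = sum(1 for n, r in zip(nir_pixels, red_pixels)
                if ndvi(n, r) > threshold)
    return 100.0 * green / len(nir_pixels)
```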

To resolve tall vegetation (i.e., trees), a range of additional parameters has to be specified that support the generation of leaf area density (see Sect. 5.3). In this study, we aimed at deriving tree height, crown diameter, trunk diameter, tree type and tree species, as well as leaf area density, which is described in Sect. 4.4.5. While tree height and crown diameter could be derived from LiDAR data (Fassnacht et al., 2016), the other parameters are very difficult to acquire without extensive field surveys. Fortunately, many German cities have so-called tree cadastres, in which they store exactly these characteristics to support the maintenance of public trees. Such data sets were available for Berlin, Hamburg and Stuttgart, although the Stuttgart data set only included tree species. Please note that in these municipal data sets only public trees (e.g., along public roads) are included.
Private trees, e.g., in gardens and public parks, are missing. If no additional sources are available (e.g., as described in Sect. 4.4.4), the uncertainty in representing real-world conditions in PALM increases.
To prepare the data for PALM, look-up tables for tree type and species were created, which contain all species and types of trees recorded in the three cities. A class number was assigned to each type and species and then joined to the attribute table of the tree cadastre Shapefile. Varying spellings of the same type or species were taken into account and assigned the same value. For the attributes age, height, trunk diameter and crown size, all numbers were checked for plausibility and corrected if obviously wrong (typos, wrong unit). The resulting Shapefiles were converted to raster (GeoTIFF) and then to NetCDF. Fig. 6 shows exemplary tree type and age maps for a subset of Berlin.
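The look-up and plausibility steps can be sketched as follows. All table entries, class numbers and the 60 m height limit are hypothetical values chosen for illustration, and the rule that implausibly large heights are interpreted as centimetres is an assumption of this sketch, not the procedure used for the three cities.

```python
# Hypothetical class numbers and spelling variants (illustration only)
SPECIES_CLASS = {"acer platanoides": 1, "tilia cordata": 2}
SPELLING_VARIANTS = {
    "acer plantanoides": "acer platanoides",
    "tilia cordata mill.": "tilia cordata",
}


def species_class(name, unknown=0):
    """Map a species name to its class number, tolerating spelling variants."""
    key = " ".join(name.strip().lower().split())  # normalise case and spaces
    key = SPELLING_VARIANTS.get(key, key)
    return SPECIES_CLASS.get(key, unknown)


def plausible_height(value, max_height=60.0):
    """Return a tree height in metres, or None if it cannot be made plausible."""
    try:
        height = float(value)
    except (TypeError, ValueError):
        return None
    if height <= 0:
        return None
    if height > max_height:
        height /= 100.0  # assumption: value was recorded in centimetres
    return height if height <= max_height else None
```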

Vegetation patch
Instead of providing the properties of single trees, it is also possible to provide the information on high vegetation in an area-wide manner, as vegetation patches. This is practical, as the tree data sets that were available only cover the public trees. Information on all other trees needs to come from another source. Suitable sources are LiDAR data or, to some extent, (governmental) forestry data. Not all cities in Germany have LiDAR data sets, but for Berlin we could use a LiDAR-based data set as well as forestry data (see Fig. 7).

Leaf area index
The LAI is an important parameter for the generation of the LAD, which is described in Sect. 5.3. As the LAI varies largely over the phenological cycle, remote sensing is the most suitable data source. Field measurements on the ground only sample single trees, but with area-wide remote sensing data an estimation of the LAI of a larger area is possible. Typical approaches for LAI estimation make use of vegetation indices in combination with empirical relationships between a vegetation index and LAI. As only multispectral remote sensing imagery was available for this study, an NDVI-based method was selected, making use of Sentinel-2 optical satellite data. Depending on the study area, up to three image granules had to be combined to create a complete coverage of the city area. This is only possible if cloud-free granules of the same or close dates are available. Using the Timescan processing chain (Esch et al., 2018), the NDVI was derived for all Sentinel-2 scenes. For each date range, an NDVI mosaic image of the study area was created using GDAL tools (GDAL/OGR contributors, 2019). Then the LAI is calculated using an IDL algorithm for each study area and date. For this, an empirical relationship between NDVI and LAI is used as documented by Wang et al. (2005) for deciduous forest. All non-vegetation pixels are set to 0 (vegetation mask). As the spatial resolution of Sentinel-2 is 10 m, the required resolution of 1 m is reached by resampling the LAI map using a bilinear resampling method. For a subset of Hamburg, the estimated LAI is presented in Fig. 8 for three seasons.
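The masking and resampling steps can be sketched in Python as follows. This is not the IDL algorithm used in this study. The empirical NDVI-LAI relationship is deliberately not reproduced; instead, it is passed in as a user-supplied function `ndvi_to_lai` (a hypothetical parameter). The bilinear resampling uses pixel-centre coordinates with clamping at the borders, which is one common convention and an assumption here.

```python
def lai_from_ndvi(ndvi_grid, veg_mask, ndvi_to_lai):
    """Apply an NDVI-LAI relationship; non-vegetation pixels are set to 0."""
    return [[ndvi_to_lai(v) if is_veg else 0.0
             for v, is_veg in zip(ndvi_row, mask_row)]
            for ndvi_row, mask_row in zip(ndvi_grid, veg_mask)]


def bilinear_upsample(grid, factor):
    """Resample a 2D grid to a resolution finer by an integer factor."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for i in range(rows * factor):
        # Position of the fine pixel centre in coarse pixel coordinates
        y = min(max((i + 0.5) / factor - 0.5, 0.0), rows - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, rows - 1)
        wy = y - y0
        row = []
        for j in range(cols * factor):
            x = min(max((j + 0.5) / factor - 0.5, 0.0), cols - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, cols - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

Resampling from 10 m to 1 m corresponds to `factor=10` in this sketch.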

Pavements
Airborne hyperspectral data would be a well-suited source for the pavement layer. Such spatially and spectrally detailed data would allow a differentiated classification of urban surface materials (van der Linden et al., 2019; Roessner et al., 2001).
However, due to its experimental nature, hyperspectral data is rarely available for whole cities. Therefore, OSM data was used instead. OSM contributors not only mapped road features, but often also indicated the surface materials. As the contributors do not apply homogeneous labels, a lookup table was created to map all the materials listed in OSM for the test cities to the PALM pavement types. If no surface material is indicated, default materials are assumed for each road type (Table B5). Using another look-up table, the materials were matched to the pavement types listed in the PIDS. As the roads in OSM are line features, each road is buffered with its width or, if not available, a default width for that type of road (Tables B5 and B6). After rasterization, the data set is checked for gaps between pavement type, vegetation type, buildings and water. Gaps are filled with the road pavement type by applying a larger buffer (3 × the listed width) on the road lines. An example of the resulting pavement type raster map is shown in Fig. 9.
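The look-up logic described above can be sketched in Python. The table contents, default widths and the function name `pavement_and_width` are hypothetical; the mappings actually used are those in Tables B5 and B6, and the pavement class numbers are defined in the PIDS.

```python
# Hypothetical excerpts of the look-up tables (illustration only)
SURFACE_TO_PAVEMENT = {
    "asphalt": "asphalt",
    "concrete": "concrete",
    "paving_stones": "paving stones",
    "cobblestone": "cobblestone",
    "gravel": "gravel",
}
# road type -> (default pavement, default width in m); values are assumptions
ROAD_DEFAULTS = {
    "motorway": ("asphalt", 12.0),
    "residential": ("asphalt", 6.0),
    "footway": ("paving stones", 2.0),
}
FALLBACK = ("asphalt", 5.0)


def pavement_and_width(tags):
    """Derive pavement label and road width from a dict of OSM tags."""
    default_pavement, default_width = ROAD_DEFAULTS.get(
        tags.get("highway"), FALLBACK)
    surface = str(tags.get("surface", "")).strip().lower()
    pavement = SURFACE_TO_PAVEMENT.get(surface, default_pavement)
    try:
        width = float(tags["width"])
    except (KeyError, TypeError, ValueError):
        width = default_width  # width missing or not a plain number
    return pavement, width
```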

Street type and street crossings
For the street types and street crossings, data from OSM is used. Street types directly use the classes specified in OSM and are assigned to the road grid cells. If multiple road types cover a pixel, the highest class is assigned. Thus, a motorway has precedence over a primary road, etc. A street crossing flag is assigned to all parts of the streets that are marked in OSM as street crossing. As this label is a point feature, all grid cells within a buffer of 15 m around each crossing point are flagged as crossing.
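Both rules can be sketched in a few lines of Python. The ranking tuple is an assumed excerpt of the OSM road hierarchy, and testing the distance from each grid-cell centre to the crossing point is an assumption about how the 15 m buffer is rasterized.

```python
import math

# Assumed excerpt of the OSM road hierarchy, lowest to highest class
STREET_RANK = ("residential", "tertiary", "secondary",
               "primary", "trunk", "motorway")


def dominant_street_type(types):
    """Return the highest-ranked street type covering a grid cell."""
    return max(types, key=STREET_RANK.index)


def flag_crossings(ny, nx, crossings, dx=1.0, radius=15.0):
    """Flag grid cells whose centre lies within radius of a crossing point.

    crossings: list of (x, y) coordinates in metres relative to the
    lower-left corner of the grid; dx is the grid spacing in metres.
    """
    flags = [[False] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            cx, cy = (i + 0.5) * dx, (j + 0.5) * dx
            flags[j][i] = any(math.hypot(cx - px, cy - py) <= radius
                              for px, py in crossings)
    return flags
```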

Water bodies
Multispectral remote sensing is a suitable tool to map water bodies (Ma et al., 2019), but the spatial resolution of most satellite data is not sufficient for the high resolution required for building-resolving simulations. Also, aerial images usually do not provide enough (and calibrated) spectral bands to distinguish smaller water bodies like fountains or rivulets. Therefore, OSM was used as the primary source for the demo cities in this case as well. Unfortunately, OSM turned out to be incomplete regarding water bodies. Therefore, the data sets were merged with CLC data for Stuttgart and with ALKIS and the Biotope Cadastre in Hamburg.
Look-up tables were created to assign a PALM water class to each water feature in the different data sets (see Table B9 and B11 had to be added manually. The final water type maps for the cities of Stuttgart, Berlin and Hamburg are presented in Fig. 10.

Soils
As soil data is difficult to acquire, especially at resolutions finer than 10 m, a horizontally and vertically homogeneous soil type distribution (with soil_type = 1, coarse soil texture) is assumed in this study, i.e. the physical properties of the soil are identical all over the model domain. Further information on the initial state of the soil moisture and temperature at each pixel can be given as LOD0 via Fortran Namelist input, or as LOD1 input given in the dynamic input file (Maronga et al., 2019).
The respective soil information can, e.g., be taken from mesoscale models such as COSMO or WRF, which will be described in a separate follow-up paper.

5
In this section we discuss the PALM static driver generator, the generation of three-dimensional vegetation data in terms of LAD and basal area density (BAD) fields from two-dimensional information as part of the static driver, as well as the internal topography processing, which is required to ensure that all PALM requirements on the terrain data are met. Note that the netCDF interface routine in PALM has undergone several improvements since the official release of PALM 6.0. In the following, we will thus describe the status quo for PALM 6.0 in revision 4311.

Using the PALM static driver generator
In order to enable the user to create static drivers for complex scenarios, the Python 3.0-based pre-processing tool palm_csd (short for: PALM create static driver) is shipped with PALM. The tool comes with a comprehensive library of netCDF functions and utility routines that can also easily be plugged into user-specific Python codes, and which take care of the correct formatting of static driver files that comply with PALM's netCDF interface. palm_csd itself, however, is a wrapping and compiling tool, which compiles static drivers based on already processed and rasterized geospatial data in netCDF format, but it cannot process other geospatial file formats (e.g., GeoTIFF or Shapefile). At the moment, it is thus up to the user to process such data manually and provide palm_csd with PIDS-conform NetCDF files. Currently, input data for palm_csd is available for the cities of Berlin, Hamburg, and Stuttgart in Germany, for which input data was processed based on the data sources outlined in Sect. 4, but users are free to provide their own data to be processed by palm_csd. Note that while data for Berlin and Hamburg is freely available to the general public, data for Stuttgart is restricted to use within the [UC]² project. During the pre-processing of the data for Berlin, Hamburg and Stuttgart, the aim was to automate the pre-processing steps as much as possible by implementing the geo-processing in scripts and reducing manual processing in GIS software. In the next phase of [UC]² it is planned to develop a pre-processing tool that will support users in generating the input data in PALM-conform formats.

palm_csd is steered via a configuration file in which input files, basic settings, and default values are defined. Once this configuration file is set up, users can generate their own static driver files that include correct metadata and possibly geo-referencing (depending on suitable input data) for PALM and that will also be written to PALM's output data for post-processing and visualization.
We plan to extend palm_csd for generic and academic setups as well as with a graphical user interface in the near future. Moreover, we plan to implement a comprehensive checking routine to ensure compatibility with PALM, which is currently done within PALM itself.

Internal topography processing
During the initialization of PALM, the provided topography data, encompassing terrain height and buildings, is further processed and may be slightly modified, e.g., to fulfill numerical requirements or to reduce the use of computational resources.
The model surface in PALM is internally defined at $z = 0$ m. Therefore, in a first step, PALM internally computes the relative terrain height $z_t = z_t - z_{t,\min}$, where $z_{t,\min}$ is the minimum terrain height occurring within the model domain. Thus, the minimum $z_t$ coincides with the model surface at $z = 0$ m and the first vertical grid level has at least one grid point that lies within the atmosphere. For instance, if $z_t$ were given in meters above sea level and used without any further processing, many grid points might lie below the Earth's surface, which would waste computational resources without providing any additional value. In case of a nested simulation setup with a root domain and various child domains, $z_{t,\min}$ is calculated as the minimum terrain height over all domains, in order to have the same reference height for all model domains and to avoid artificially induced elevation changes at the domain borders between the parent and the child models.
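The reference-height computation for a nested setup can be illustrated with a short Python sketch. It is not PALM's (Fortran) implementation; the function name `relative_terrain` and the representation of domains as a dictionary of 2D lists are assumptions made for this example.

```python
def relative_terrain(domains):
    """Subtract the minimum terrain height over all domains.

    domains: dict mapping a domain name to a 2D list of terrain heights.
    Returns the shifted terrain per domain and the common reference height.
    """
    z_min = min(min(min(row) for row in zt) for zt in domains.values())
    shifted = {name: [[z - z_min for z in row] for row in zt]
               for name, zt in domains.items()}
    return shifted, z_min
```

Because the same `z_min` is used for the root and all child domains, identical locations keep identical relative heights in every domain.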
In the following, $z_t(y, x)$ is projected onto the discrete grid, and all grid points located below $z_t(y, x)$ are flagged as terrain, as illustrated in Fig. 11 by the dashed black line.
In a second step, buildings are mapped on top of the discrete terrain, which is illustrated schematically in Fig. 11. Especially when the underlying terrain is not flat but elevation changes occur below a building, roof shapes should be maintained, so that buildings cannot simply be mapped on top of $z_t$. Hence, the underlying terrain below a single building (which is identified by its building_id) is padded up to the level of the highest $z_t(y, x)$ within the building-covered area with the respective building_id: $z_t(y, x) = \max(z_t(y, x))_{\mathrm{ID}}$, i.e., the terrain below the building is flattened (see the hashed areas in Fig. 11). This guarantees that building and roof shapes are maintained even on steep slopes. However, an exception is made for bridges (identified by building_type = 7), where buildings_3d is directly mapped on top of $z_t$. Flattening the terrain below the bridge to the highest terrain height (often the top of the levee) would otherwise introduce barrier-like topography structures.
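The flattening rule can be sketched in Python as follows. This is an illustration of the rule stated above rather than PALM's internal implementation; the function name and the use of `None` to mark cells without a building are assumptions.

```python
def flatten_below_buildings(zt, building_id, building_type, bridge_type=7):
    """Pad terrain below each building up to its highest covered terrain.

    zt, building_id, building_type: 2D lists of equal shape; building_id is
    None where no building is present. Bridge cells keep their terrain height.
    """
    highest = {}
    for j, id_row in enumerate(building_id):
        for i, bid in enumerate(id_row):
            if bid is None or building_type[j][i] == bridge_type:
                continue
            highest[bid] = max(highest.get(bid, zt[j][i]), zt[j][i])
    # Cells without an entry in `highest` (no building, bridges) are unchanged
    return [[highest.get(bid, z) for bid, z in zip(id_row, z_row)]
            for id_row, z_row in zip(building_id, zt)]
```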
While buildings are mapped onto the terrain, grid points that lie within buildings and below terrain are internally flagged, in order to classify building or land surfaces during the surface initialization (see Sect. 2.4). The padded grid points below buildings will not be flagged as building but as land surfaces, while these artificially introduced vertical land surfaces will be initialized using the given vegetation_type or pavement_type at the adjacent grid cell.
After the topography is finally projected onto the discrete grid, it may contain single cavities or chimney-like holes that are only resolved by one grid point. Due to numerical issues, such one-grid-point cavities must be filtered. In many cases these filtered cavities are building courtyards that are resolved by only one grid point. In this case, the courtyard grid point, which might originally have been given, e.g., a vegetation_type, is internally flagged and reset to a building grid point, while it obtains building_type, building_id and, if available, building_pars from the nearby building grid point. Hence, we filter such one-grid-point cavities during the model initialization, meaning that small differences might occur between the final building and terrain geometry in the model and the one provided in the static driver.

Generation of three-dimensional leaf area density and basal area density fields
When using PALM at very high resolution on the order of 1 m, vegetation like tall shrubs or trees cannot be represented by common parameterizations that assume the vegetation canopy to be flat and represented, e.g., by a roughness length. Under such conditions, PALM employs a plant canopy model in which high vegetation can be represented in terms of three-dimensional LAD fields. As geospatial data usually does not yield any three-dimensional information, three-dimensional LAD and BAD fields must be estimated from two-dimensional data and other data sources. In order to allow for a pseudo-automated generation of LAD and BAD fields, palm_csd comes with two different routines for creating vegetation canopies: a routine for single trees as often found in urban environments, and a routine for creating vegetation canopies like forests and parks. In the following we will outline the basics of both routines. Note, however, that both routines are still at an experimental stage and will be further developed and evaluated in the near future. We thus describe the status quo of these routines.

Generation of leaf area density and basal area density fields for single trees
Single trees, whose growth is seldom affected by other trees or obstacles, can be characterized in terms of three-dimensional LAD and BAD fields by a limited number of parameters. In palm_csd these are the maximum tree height, crown diameter, crown shape, trunk diameter, height of the maximum LAD value, and the aspect ratio of tree crown diameter to tree crown height (see Figs. 12 and 13). In German cities, several of these parameters are available from tree cadastral register data. For example, for Berlin more than 400 000 municipal trees are collected in a publicly available database including information about tree species, tree height, crown diameter, stand age, and trunk diameter. The single-tree canopy generator in palm_csd is called for each individual tree and the following information is passed to the generator: location (y, x) of the tree centre, tree type (i.e., genus), tree height, LAI, crown diameter, and trunk diameter at breast height. If one or more of these parameters is not provided, a default value from a look-up table (see Table 16) is used, which was generated by averaging each tree parameter for each tree type in the Berlin tree database. This look-up table also includes default values for the tree shape, the ratio of crown height to crown width, LAI values for summer and winter time, and the height of the LAD maximum. Note that for some of the latter parameters (crown height to width ratio, LAI, height of LAD maximum) only dummy values are currently available, and more effort will be needed to fill this table with reasonable data. The tree generator allows for six different tree shapes, which are shown in Fig. 13 and which cover most of the commonly observed shapes of single trees. The generation of a three-dimensional LAD volume then consists of two steps. First, the volume covered with leaves is determined based on the shape, crown diameter, tree height, and the ratio of crown height to width.
Second, the three-dimensional LAD field is created using an exponentially increasing LAD towards the outer shell of the foliage. This approach is based on the empirical finding that sunlight is absorbed when entering the foliage, resulting in a decreasing production of leaves towards the interior of the crown.
The calculation of three-dimensional BAD fields is available as an interim solution, in which the BAD field is calculated from the given trunk diameter, which is taken as constant up to the center of the tree crown. At the moment, the canopy generator only allows each grid volume to be treated as either (impermeable) stem or no stem. The representation of grid volumes partially covered by trunks is thus not yet possible. BAD values within the crown canopy are calculated such that BAD increases towards the center of the crown. An example of both LAD and BAD fields for an idealized spherically shaped tree is shown in Fig. 14. Note, however, that PALM currently only supports LAD, so that only the foliage is read from the static driver. The import of BAD data will be realized in the near future.

Generation of leaf area density fields for tree stands
In many cases, information on individual trees is not available, or tree stands (e.g., forests) have to be represented as a three-dimensional canopy. This is commonly realized by treating each column (y, x) separately and using normalized LAD profiles that are representative of homogeneous canopies. In palm_csd the method of Markkanen et al. (2003) is used, which is based on a vertical LAD distribution that is derived from a given LAI field as well as two parameters α and β, which can be varied by the user to represent different types of tree stands. Additionally, a two-dimensional vegetation height field can be prescribed (if available) in order to take into account varying tree heights within the canopy stand. If information on LAI and vegetation height is not available, the user has to provide default values instead. Using this method it is possible to generate idealized vegetation canopies in terms of LAD fields, but it provides no means to derive BAD information. In the future we plan to use a method similar to that described in Bohrer et al. (2007) to create BAD fields by synthetically localizing tree trunks.
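A column-wise LAD profile of this kind can be sketched in Python. The sketch assumes that the vertical distribution has the shape of a beta probability density in the normalized height z/h, scaled so that its vertical integral equals the prescribed LAI; evaluating the profile at grid-cell centres and the example values of α and β are further assumptions, so the result is not guaranteed to match the palm_csd output.

```python
def lad_profile(lai, height, dz, alpha, beta):
    """Vertical LAD profile for one column (y, x).

    lai: leaf area index of the column, height: vegetation height in m,
    dz: vertical grid spacing in m, alpha/beta: shape parameters.
    Returns LAD values at the cell centres below the canopy top.
    """
    n = int(round(height / dz))
    weights = []
    for k in range(n):
        s = (k + 0.5) * dz / height  # normalized height of the cell centre
        weights.append(s ** (alpha - 1.0) * (1.0 - s) ** (beta - 1.0))
    norm = sum(weights) * dz  # ensures that sum(LAD) * dz equals the LAI
    return [lai * w / norm for w in weights]
```

With, for example, α = 5 and β = 3 the LAD maximum lies in the upper part of the canopy, at about two thirds of the vegetation height.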

In the previous sections, the input data requirements of PALM were described, and it was demonstrated how this data can be prepared and which steps are carried out in palm_csd to set up the static driver so that all input data is ready for PALM. PALM comes with a framework that enables microclimate simulations for real-world urban environments. Different levels of detail can be provided to PALM. If the model is run with interactive building and land surfaces, a minimum of seven spatial parameters is required: soil type, building height, building id, building type, vegetation type, pavement type and water type. Each of these parameters can optionally be specified in more detail, based on available data. As becomes clear from Sect. 3 and 4, a vast amount of data exists, but rarely in exactly the format required by urban microclimate models. The building type is a good example. The combination of building use and building age yields the PALM building types. However, it needs to be analysed whether this is the most accurate representation of the energetic properties of the building, as the building age often does not include any information on renovation and modernisation of the building, which have a large effect on its energetic properties.
Selecting and acquiring suitable data sets is a major task that should be weighed against the resources available to pre-process the input data and the desired level of detail of the PALM simulations. Additionally, the varying quality of different data sources results in different uncertainties of the input parameters. There are uncertainties resulting from the spatial resolution (e.g., the ability to distinguish small objects), but there can also be mapping or labeling errors and omissions. How these uncertainties propagate into the simulation results needs to be investigated in more detail. To support users in deciding which parameters are worth the effort of acquiring and preparing more detailed information, sensitivity analyses of the input data sets are planned.
As the pre-processing of the input data is tedious, the aim is to develop a processing chain that supports users in formatting their GIS data (e.g., Shapefiles, GeoTIFF, WFS, etc.) into a NetCDF file following the requirements of PALM. To support model users in their data acquisition, a data base with freely available geospatial data for the mandatory set of PALM parameters is also envisaged. This will provide users with a starting point for running PALM simulations. The primary target will be Germany, but Europe-wide or global data can also be included, as far as the data sources allow this.

Code availability
The PALM model system is distributed under the GNU General Public License v3 (http://www.gnu.org/copyleft/gpl.html).
The model source, documentation, user manual, and online tutorial are freely available and can be downloaded from http://palm-model.org. The pre-processing tool palm_csd to prepare and create a PALM static driver is shipped with PALM and is available under https://doi.org/10.25835/0041607.

Data availability
In the supplements, a sample static driver is available for a small area near Ernst-Reuter-Platz in Berlin, Germany, with 1 m spatial resolution. The static driver is prepared for a winter scenario with leafless deciduous trees. The model domain is 256 × 256 m² in the horizontal directions.