MeteoIO 2 . 4 . 2 : a preprocessing library for meteorological data

Using numerical models which require large meteorological data sets is sometimes difficult and problems can often be traced back to the Input/Output functionality. Complex models are usually developed by the environmental sciences community with a focus on the core modelling issues. As a consequence, the I/O routines that are costly to properly implement are often error-prone, lacking flexibility and robustness. With the increasing use of such models in operational applications, this situation ceases to be simply uncomfortable and becomes a major issue. The MeteoIO library has been designed for the specific needs of numerical models that require meteorological data. The whole task of data preprocessing has been delegated to this library, namely retrieving, filtering and resampling the data if necessary as well as providing spatial interpolations and parametrizations. The focus has been to design an Application Programming Interface (API) that (i) provides a uniform interface to meteorological data in the models; (ii) hides the complexity of the processing taking place; and (iii) guarantees a robust behaviour in case of format errors, erroneous or missing data. Moreover, in an operational context, this error handling should avoid unnecessary interruptions in the simulation process. A strong emphasis has been put on simplicity and modularity in order to make it extremely easy to support new data formats or protocols and to allow contributors with diverse backgrounds to participate. This library is also regularly evaluated for computing performance and further optimized where necessary. Finally, it is released under an Open Source license and is available at http://models.slf.ch/p/meteoio. This paper gives an overview of the MeteoIO library from the point of view of conceptual design, architecture, features and computational performance. A scientific evaluation of the produced results is not given here since the scientific algorithms that are used have already been published elsewhere.


Background
Users of numerical models for environmental sciences must handle the meteorological forcing data with care, since they have a very direct impact on the simulation's results.The 5 forcing data come from a wide variety of sources, such as files following a specific format, databases hosting data from meteorological networks or web services distributing data sets.A significant time investment is necessary to retrieve the data, look for potentially invalid data points and filter them 10 out, sometimes correcting the data for various effects and finally converting them to a format and units that the numerical model supports.These steps are both time intensive and error prone and usually cumbersome for new users (similarly to what has been observed for Machine Learning, Kotsiantis 15 et al., 2006).
From the point of view of the model developer, handling input data is usually a necessary but unpleasant side of model development that distracts from working on the core features of the model.As a consequence developers tend to 20 spend minimal effort on these aspects.Throughout the history of the model, more pre-processing routines will usually be added to the code in order to handle data-related problems as they arise.Moreover, supporting new data formats and/or protocols for specific projects, requires modifying the 25 code by either adding conditional compilation directives or tweaking the current routines.This means that the data reading and preprocessing routines will often be of low quality, lacking robustness and efficiency as well as flexibility, exacerbating the troubles met by the users in preparing their data 30 for the model.
A few libraries or software already tackle these issues, for example the SAFRAN preprocessor of the CROCUS snow model (Durand et al., 1993), the PREVAH preprocessor of the PREVAH hydrological model (Viviroli et al., 2009) or the MicroMet model (Liston and Elder, 2006).However these projects are often very tightly linked with a specific model and infrastructure and are typically not able to operate outside this specific context.They often lack flexibility for example requiring their users to convert their data to a specific 40 file format, by hard coding the processing steps for each meteorological parameter or by requiring to be run through a specific interactive interface.They also often rely on a specific input and/or output sampling rate and can not deal with fully arbitrary sampling rates.MeteoIO aims to overcome these limitations and to be a general purpose preprocessor that different models can easily integrate.

Data quality
A most important aspect of data preprocessing is the filtering of data based on their perceived quality.The aim of filtering data is to remove the mismatch between the view of the realworld system that can be inferred from the data and the view that can be obtained by directly observing the real-world system (Wand and Wang, 1996).We focus on two data quality dimensions: accuracy and consistency.

55
We define accuracy as "the recorded value is in conformity with the actual value" (Ballou and Pazer, 1985).Inaccuracies occur because of a sensor failure (the sensor itself fails to operate properly), because of the conditions of the immediate surroundings of the sensor (the sensor conditions do not reflect the local conditions, such as a frozen anemometer) or because of physical limitations of the sensor (such as precipitation undercatch).
We define consistency in a physical sense, that a data set should obey the physical laws of nature.Practically, the time evolution of a physical parameter as well as the interactions between different physical parameters must obey the laws of nature.

Design goals
In order to help the users of numerical models consuming meteorological data and reduce their need for support, we developed new meteorological data reading routines and invested significant efforts in improving the overall usability by working on several dimensions of the ergonomic criteria (Scapin and Bastien, 1997), adapting them according to the needs of a data preprocessing library: -Guidance: providing a clear structure to the user; -Grouping/distinction of items: so the user sees which items are related; -Consistency: adapt and follow some rules regarding 80 the naming, syntax and handling of input parameters; -Workload: focusing on the tasks that the user wants to accomplish; -Minimal actions: limit as much as possible the 85 number of steps for each tasks; -Explicit control: let the user explicitly define the tasks that have to be performed; -Error management: helping the user detect and recover from errors; 90 -Error protection: handle all possible user input errors; -Quality of error messages: provide clear and relevant error messages.
We also identified two distinct usage scenarios:

95
Research usage.The end user runs the model multiple times on the same data, with some highly tuned parameters in order to produce a simulation for a paper or project.The emphasis is put on flexibility and configurability (Scapin and Bastien, 1997).

100
Operational usage.The model is run fully or partially unattended for producing regular outputs.Once configured, the simulations' setup remains the same for an extended period of time.The emphasis is put on robustness and stability.
We decided to tackle both scenarios with the same soft-105 ware package and ended up with the following goals: -Isolate the data reading routines from the rest of the model; -Implement robust data handling with relevant error messages for the end user; -Implement a "best effort" approach with reasonable fallback strategies in order to interrupt the simulation process only in the most severe cases; -Let the end user configure the whole data reading and 120 preprocessing in a configuration file that can be saved for archiving or later use.models, we decided to implement this software package as a library named MeteoIO.We chose the C++ language in order to benefit from the object oriented model as well as good performance and relatively easy interfacing with other programming languages.We also decided to invest a significant effort in documenting the software package both for the end users and for developers who would like to integrate it into their own models.More architectural principles are laid out in the sections below while the implementation details are given in Sects.3 and 4. 135

Actors
The question of proper role assignment (Yu and Mylopoulos, 1994), or finding out who should decide, is central to the development of MeteoIO: carefully choosing if the end user, the model relying on MeteoIO or MeteoIO itself is the 140 appropriate actor to take a specific decision has been a recurring question in the general design.For example when temporally resampling data, the method should be chosen by the end user while the sampling rate is given by the numerical model and the implementation details and error handling 145 belong to MeteoIO.

Dependencies
When complex software packages grow, they often require more and more external dependencies (as third party software libraries or third party tools).When new features are 150 added, it is natural to try to build on achievements of the community and not "reinvent the wheel".However this also has some drawbacks: these third party components must be present on the end user's computer; 155 these components need to be properly located when compiling or installing the software package; these components have their own evolution, release schedule and platform support.
Therefore, as relying more on external components reduces the core development effort, it significantly increases the integration effort.One must then carefully balance these two costs and choose the solution that will yield the least long term effort.
Estimating that a complex integration issue represents 165 a few days of work and a non negligible maintenance effort, core MeteoIO features that were feasible to implement within a few days were redeveloped instead of integrating existing solutions.For the more peripheral features (like output plugins) we decided to rely on the most basic libraries at hand, disregarding convenient wrappers which would introduce yet another dependency, and to give the user the possibility to decide which features to enable at compile time.Accordingly, MeteoIO requires no dependencies by default when it would have required more than fifteen if no such mitigation strategy 175 had been taken.A handful of dependencies can be activated when enabling all the optional features.

Manager/worker architecture
Many tasks have been implemented as a manager/worker architecture: a manager offers a high level interface to the 180 task (filtering, temporal interpolation, . . . ) while a worker implements the low level, MeteoIO-agnostic core processing.The manager class implements the necessary features to efficiently convert MeteoIO-centric concepts and objects to generic, low level data ideal for processing.All of the heavily 185 specialized programming concepts (object factories, method pointers, etc) and their actual implementations are therefore hidden from both the high level calls and the low level processing.This architecture balances the needs of the casual developer using the library (relying on very simple, high level 190 calls) as well as the casual developer expanding the library by contributing some core processing modules (data sources, data filters, etc).Although this approach might seem inefficient (by adding extra steps in the data processing), it has contributed to the 195 performance gains (as shown in Sect.5.2) by making it possible to rely on standard, optimized routines.

Flexibility
Hard-coding the data preprocessing in the source code is an easy possibility but requires that the user recompiles his 200 model when configuring the data preprocessing.In order to avoid this and thus offer more flexibility, all adjustable parameters are configured in a text file following the more or less standard INI ASCII format.This makes it possible to configure the preprocessing simply by editing a text file, 205 copying configuration sections between different simulations and potentially providing a graphical user interface to help the user configure his simulation (Bavay and Egger, 2014).Moreover, instead of having to keep multiple files representing the data at various intermediate processing stage along-210 side a textual description of the processing steps that have been applied, it is possible to simply archive the raw data and the configuration file that then acts as a representation of the preprocessing workflow.This greatly simplifies the data traceability for a given simulation.

215
For clarity, each step of the data reading, preprocessing and writing is described in its own section in the configuration file.There is no central repository or validation of the keys to be found in this file, leaving each processing component free to manage its own configuration keys.On the other been provided by the user but will not be used by any component.
No assumptions are made about the sampling rate of the data read or the data required by the caller.It is assumed that the input data can be sampled at any rate, including irregular sampling and can be resampled to any timestamp, as requested by the caller.Moreover any data field can be nodata at any time.This means that a given data set might contain for example precipitation sampled once a day, temperatures sampled twice an hour and snow height irregularly sampled.Practically, this prevents us from using standard signal processing algorithms for resampling data, because these commonly assume a constant sampling rate and require that all timestamps have a value.

Modularity
A key to the long term success of a software package is the modularity of its internal components.The choice of an object oriented language (C++) has helped tremendously to build modular elements that are then combined to complete 240 the task.The data storage classes are built on top of one another (through inheritance or by integrating one class as a member of another one) while the data path management is mostly built as a manager that links all the necessary components.A strong emphasis has been put on encapsulation by answering, for each new class, the following question: How should the caller interact with this object in an ideal world?Then the necessary implementation has been developed from this point of view, adding "non-ideal" bindings only when necessary for practical reasons.

Promoting interdisciplinary contributions
Modularity, by striving to define each data processing in a very generic way and by making each one independent of the others, presents external contributors with a far less intimidating context to contribute.The manager/worker approach shown in Sect.2.3 also facilitates keeping the modules that are good candidates for third party contributions simple and generic.Some templates providing a skeleton of what should be implemented are also provided alongside documentation on how to practically contribute with a short 260 list of points to follow for each kind of contribution (data plug-in, processing element, temporal or spatial interpolation).

Coding standards and methodology
The project started in late 2008 and is currently comprised of more than 52 000 lines, contributed by twelve contributors.95 % of the code originates from the two main contributors.The code mostly follows the kernel coding style as well as the recommendations given by (Rouson et al., 2011), giving the priority to code clarity.Literate programming is used with the doxygen tool (van Heesch, 2008).
Coding quality is enforced by requesting all committed code to pass the most stringent compiler warnings (all possible warnings on gcc) including the compliance checks with recommended best practices for C++ (Meyers, 1992).The 275 code currently compiles on Linux, Windows, OS X and Android.
The development methodology is mostly based on Extreme Programming (Beck and Andres, 2004) with short development cycles of limited scope, architectural flexibility 280 and evolutions, frequent code reviews and daily integration testing.The daily integration testing has been implemented with ctest (Martin and Hoffman, 2007), validating the core features of MeteoIO and recording the run time for each test.This shows performance regressions alongside feature 285 regressions.Regular static analysis is performed using Cppcheck (Marjamäki, 2013) and less regularly with Flawfinder (Wheeler, 2013) to detect potential security flaws.Regular leak checks and profiling is performed relying on the Valgrind instrumentation framework (Seward et al., 2013; 290 Nethercote and Seward, 2007).
The code has also been optimized to benefit from single instruction, multiple data (SIMD) capabilities when feasible and some kind of universal serialization has been implemented to ease the use of MeteoIO objects within a paral-295 lel application: each storage class implements the redirection operators, serializing and deserializing to/from a standard iostream object.This object is then passed to the parallelization toolkit or library (such as MPI, the Message Passing Interface) as a pure C structure through a very simple 300 wrapper in the calling application.

Data structures
All data classes rely on the Standard Template Library (STL) (Musser et al., 2001) to a large extent that is available on all C++ compilers and may provide some low level optimiza-305 tions while being quite high level.The design of the STL is also consistent and therefore a good model to follow: the data classes in MeteoIO follow the naming scheme and logic of the STL whereever possible, making them easier to use and remember by a developer who has some experience with the 310 STL.They have been designed around the following specific requirements: -Offer high level features for handling meteorological data and related data.Using them should make the calling code simpler.

315
-Implement a standard and consistent interface.Their interface must be obvious to the caller.
-Implement them in a robust and efficient way.Using them should make the calling code more robust and faster.
The range of high level features has been defined according to the needs of models relying on MeteoIO as well as in terms of completeness.When appropriate and unambiguous the arithmetic operators and comparison operators have been implemented.Each internal design decision has been based on careful benchmarking.
Great care has been taken to ensure that the implemented functionality behaves as expected.Of specific concern is that corner cases (or even plain invalid calls) should never produce a wrong result but strive to produce the expected result or return an exception.A silent failure would lead to possibly erroneous results in the user application and must therefore be avoided at all cost.

Configuration
In order to automate the process of reading parameters from the end user configuration file, a specific class has been created to manage configuration parameters.The Config class stores the configuration options as a key-value couple of strings in a map.The key is built by prefixing the actual key with the section it belongs to.When calling a getter to read a parameter from the Config object, it converts data types on the fly through templates.It also offers several convenience methods, such as the possibility of requesting all keys matching a (simple) pattern or all values whose keys match a (simple) pattern.

Dates
The Date class stores the GMT Julian day (including the time) alongside the timezone information (because leap seconds are not supported, the reference is defined as being GMT instead of UTC).The Julian day is stored in double precision which is enough for one second resolution while keeping dates arithmetic and comparison operators efficient.The conversion to and from split values is done according to (Fliegel and van Flandern, 1968).The conversion to and from various other time representations as well as various formatted time strings and rounding is implemented.

Geographic coordinates
The geographic coordinates are converted and stored as latitude, longitude and altitude in WGS84 by the Coords class.This allows an easy conversion to and from various Cartesian geographic coordinates systems with a moderate loss of precision (on the order of one meter) that is still compatible with their use for meteorological data.Two different strategies have been implemented for dealing with the coordinate conversions: -Relying on a the proj4 third party library (pro, 2013).
This enables to support all coordinate systems but brings an external dependency.
-Implementing the conversion to and from latitude/longitude.This does not bring any external depen-370 dency but requires some specific (although usually limited) development.
Therefore the coordinate systems that are most commonly used by MeteoIO's users have been reimplemented (currently the Swiss CH1903 coordinates, UTM and UPS Hager et al.,375 1989) and seldom used coordinate systems are supported by the third party library.It is also possible to define a local coordinate system that uses a reference point as origin and computes easting and northing from this point using either the law of cosine or the Vincenty algorithm (Vincenty, 1977) 380 for distance calculations.These algorithms are also part of the API and thus available to the developer.

Meteorological data sets
The meteorological data are centered around the concept of a weather station: one or more meteorological parameters (in 385 the MeteoData class) measured at one location (this location can change in time).The station has coordinates (including an elevation) and often a name or identifier associated with it as well as a slope and azimuth (all belonging to the Station-Data class).For each timestamp, a predefined set of meteoro-390 logical parameters has been defined and parameters that are not available receive a nodata value.This set can be extended by defining additional parameters that will then be handled the same way as the fixed parameters.Some basic merging strategies have been implemented in order to merge measure-395 ments from close stations (for example when a set of instruments belongs to a given measuring network and another set, installed on the same mast belongs to another network).
A static map does the mapping between predefined meteorological parameters (defined as an enum) and an index.

400
A vector of strings stores a similar mapping between the predefined meteorological parameters' names as strings and the same index (i.e. a vector of names).Finally a vector of doubles (data vector) stores the actual data for each meteorological parameter, according to the index defined in the static 405 map or names vector.When an extra parameter is added, an new entry is created in the names vector as well as a new entry in the data vector at the same index.The total number of defined meteorological parameters is updated, making it possible to access a given meteorological field either by in-410 dex (looping between zero and the total number of fields), by name (as string) or by predefined name (as enum).Methods to retrieve an index from a string or a string from an index (or enum) are also available.

415
Grids have been implemented for one dimensional to four dimensional data as templates in the Array classes in order to accommodate different data types.They are based on the standard vector container and define the appropriate access helper methods (retrieving the minimum, maximum or mean value of the data contained in the grid, for example) and standard arithmetic operators between grids and between a grid and a scalar.A geolocalized version has been implemented in the GridObject classes that brings about added safety in the calling code by making it possible to check that two grids refer to the same domain before using them.

Digital elevation model
A special type of two dimensional grid (based on the Grid2DObject class) has been designed to contain digital elevation model (DEM) data.This DEMObject class automatically computes the slope, azimuth and curvature as well as the surface normal vectors.It lets the developer choose between different algorithms: maximum downhill slope (Dunn and Hickey, 1998), four neighbours algorithm (Fleming and Hoffer, 1979) or two triangle method (Corripio, 2003) with an eight-neighbour algorithm for border cells (Horn, 1981).The azimuth is always computed using (Hodgson, 1998).The two triangle method has been rewritten in order to be centered on the actual cell instead of node-centered, thus working with a local 3×3 grid centered around the pixel of interest instead of 2 × 2. The normals are also computed as well as the curvatures, using the method of (Liston and Elder, 2006).
The evaluation of the local slope relies on the eight immediate neighbours of the current cell.Because this offers only a limited number of meaningful combinations for computing the slope, some more recent slope calculation algorithms that have been explored are actually exactly equivalent to the previously listed algorithms.In order to transparently handle the special cases represented by the borders (including cells bordering holes in the digital elevation model), a 3 × 3 grid is filled with the content of the cells surrounding the current cell.Cells that cannot be accessed (because they don't exist in the DEM) are replaced by nodata values.Then each slope algorithm works on this subgrid and implements workarounds if some required cells are set to nodata in order to be able to provide a value for each pixel that it received.This makes the handling of special cases very generic and computationally efficient.
Various useful methods for working with a DEM are also implemented, for example the possibility to compute the horizon of a grid cell or the terrain following distance between two points.a vector containing all the meteorological data that could be provided at the requested date, grouped by stations with their metadata.Each parameter either contains nodata or the preprocessed value following the configuration by the end user.
A model requiring spatially interpolated values will use 475 the call shown in listing 2. This call returns a grid filled with the spatially interpolated parameter as specified by meteoparam at the requested date over the provided DEM.If the grid could not be filled according to the requirements provided by the user, the grid will be empty and the call will return false.

480
In the background, within MeteoIO, the process of providing the forcing data to the numerical model according to the constraints specified by the user has been split into several steps (see Fig. 4): 1. getting the raw data; Practically, the raw data is read by the IOHandler component through a system of plug-ins.These plug-ins are low level implementations providing access to specific data sources and can easily be developed by a casual devel-495 oper.The data is read in bulk, between two timestamps as defined by the BufferedIOHandler that implements a raw data buffer in order to prevent having to read data out of the data source for the next caller's query.This buffer is then given for filtering and resampling to the MeteoProces-500 sor.This will first filter (and correct) the whole buffer (by passing it to the ProcessingStack) since benchmarks have shown that processing the whole buffer at once is less costly than processing individually each time steps as they are requested.The MeteoProcessor then temporally interpolates 505 the data to the requested time step (if necessary) by calling the Meteo1DInterpolator.A last resort stage is provided by the DataGenerator that attempts to generate the potentially missing data (if the data could not be temporally interpolated) using parametrizations.

510
Finally, the data is either returned as such or spatially interpolated using the Meteo2DInterpolator.The whole process is transparently managed by the IOManager that remains the visible side of the library for requesting meteorological data.The IOManager offers a high level interface as well as some 515 configuration options, allowing for example to skip some of the processing stages.The caller can nevertheless decide to manually call some of these components since they expose a developer-friendly, high level API.

Data reading
All the necessary adaptations for reading data from a specific data source are handled by a specifically construed plug-in for the respective data source.The interface exposed by the plug-ins is very simple and their tasks very focused: they must be able to read the data for a specific time interval or a specific parameter (for gridded data) and fill the MeteoIO data structures, converting the units to International System of Units (SI).Similarly, they must be able to receive some MeteoIO data structures and write them out.Several helper functions and classes are available to simplify the process.

530
This makes it possible for a casual developer to readily develop his own plug-in, supporting his own data source, with very little overhead.
In its current version MeteoIO includes plug-ins for reading and/or writing time series and/or grids from Oracle and 535 PostgreSQL databases, the Global Sensor Network (GSN) REST API (Michel et al., 2009), Cosmo XML (cos, 2013), GRIB, NetCDF, ARC ASCII, ARPS, GRASS, PGM, PNG, GEOtop, Snowpack and Alpine3D native file formats and a few others.

540
The proper plug-in for the user-configured data source is instantiated by the IOHandler that handles raw data reading.Usually, the IOHandler is itself called by the Buffere-dIOHandler in order to buffer the data for subsequent reads.The BufferedIOHandler is most often called with a single 545 timestamp argument, computes an appropriate time interval and calls IOHandler with this time interval, filling its internal buffer.

Data processing
IOManager utilises the methods exposed by the MeteoPro-550 cessor.This is a high level interface that transparently encloses both the data processing and the resampling stages.These two stages are handled by the ProcessingStack and the Meteo1DInterpolator, respectively.
The ProcessingStack reads the user configured filters and 555 processing elements and builds a stack of ProcessingBlock objects for each meteorological parameter and in the order declared by the end user.The time series are then passed to each individual ProcessingBlock, each block being one specific filter or processing implementation.These have been divided into three categories: processing elements; filters; filters with windowing.
The last two categories stem purely from implementation considerations: filtering a data point based on a whole data window yields different requirements than filtering a data point independently of the data series.Filters represent a form of processing where data points are either kept or rejected.The processing elements on the other hand alter the 570 value of one or more data points.Filters are used to detect and reject invalid data while processing elements are used to correct the data (for instance, correcting a precipitation input for undercatch or a temperature sensor for a lack of ventilation).These processing elements can also be used for sensi-575 tivity studies, by adding an offset or multiplying by a given factor.
As shown in Fig. 6, each meteorological parameter is associated with a ProcessingStack object that contains a vector of ProcessingElement objects (generated through an object 580 factory).Each ProcessingElement object implements a specific data processing algorithm.The meteorological parameters mapping to their ProcessingStack is done in a standard map object.

585
Filters are used to detect and reject invalid data and therefore either keep or reject data points but don't modify their value.Often an optional keyword "soft" has been defined that gives some flexibility to the filter.The following filters have been implemented: 590 min, max, min_max.These filters reject out of range values or reset them to the closest bound if "soft" is defined; rate.This filters out data points if the rate of change is larger than a given value.Both a positive and a negative rate of change can be defined, for example for a different snow  (Goring and Nikora, 2002); unheated rain gauge.This removes precipitation inputs that don't seem physical.The criteria that is used is that for precipitation to really occur, the air and surface temperatures 605 must be at most three degrees apart and relative humidity must be greater than 50 %.This filter is used to remove invalid measurements from snow melting in an unheated rain gauge after a snow storm.

Processing elements 610
Processing elements represent processing that alters the value of one or more data points, usually to correct the data.The following processing elements have been implemented: mean, median or wind average.Averages over a userspecified period.The period is defined as a minimum dura-615 tion and a minimum number of points.The window centering can be specified, either left, center or right.The wind averaging performs the averaging on the wind vector; Exponential or Weighted Moving Average.Smooths the data either with an Exponential or Weighted Moving Average (EMA, WMA respectively) smoothing algorithm; 2 poles, low pass Butterworth.Low pass filter according to (Butterworth, 1930); add, mult, suppr.This makes it possible to add an offset or multiply by a given factor (constant or either hourly, daily or monthly and provided in a file), for sensitivity studies or climate change scenarios or totally delete a given meteorological parameter; unventillated temperature sensor correction.Corrects a temperature measurement for the radiative heating on an unventilated sensor, according to (Nakamura and Mahrt, 2005) or (Huwald et al., 2009); undercatch.Several corrections are offered for precipitation undercatch, either following (Hamon, 1972;Førland and Institutt, 1996) or following the WMO corrections (Goodison et al., 1997).Overall, the correction coefficients for fifteen different rain gauges have been implemented.Since the WMO corrections were not available for shielded Hellmann rain gauges, a fit has been computed based on published data (Wagner, 2009;Daqing et al., 1999).The correction for the Japanese RT-3 rain gauges has been implemented following Yokoyama et al. (2003).It is also possible to specify fixed correction coefficients for snow and mixed precipitation; precipitation distribution.The precipitation sum can be distributed over preceding timesteps.This is useful for example when daily sums of precipitation are written at the end of the day in an otherwise hourly data set.
The data window can also be configured by the end user: by default the data is centered around the requested data point.But it is also possible to force the data window to be left or right centered.An extra option "soft" allows the data window to be centered as specified by the end user if applicable or to shift the window according to a "best effort" strategy if the data don't permit the requested centering.

Resampling
If the timestamp requested by the caller is not present in the data (either it has been filtered out or it was not present from the beginning), temporal interpolations will be performed.The Meteo1DInterpolator is responsible for calling a temporal interpolation method for each meteorological parameter as configured by the end user.The end user chooses between the following methods of temporal interpolation for each meteorological parameter separately: no interpolation.If data exists for the requested timestamp it will be returned or remain nodata otherwise; nearest neighbour.The closest data point in the raw data that is not nodata is returned; linear.The value is linearly interpolated between the two closest data points; accumulation.The raw data is accumulated over the period provided as argument; daily solar sum.The potential solar radiation is generated as to match the daily sum as provided in the input data.
These methods must be able to both downsample and upsample according to the needs (except the daily solar sum).

675
These methods take a time series as argument and a timestamp and return the interpolated value for a given meteorological parameter.The ability to support an arbitrary and variable sampling rate for both the input and output data prevents the utilisation of well known signal analysis algo-680 rithms.Moreover some meteorological parameters require a specific processing, such as precipitation that must be accumulated over a given period.The following approach has therefore been implemented (see in Fig. 7): for each requested data point, if the exact timestamp cannot be found 685 or in case of reaccumulation, the index where the new point should be inserted will be sought first.Then the previous valid point is sought within a user-configured search distance.The next valid point is then sought within the userconfigured search distance from the first point.Then the re-690 sampling strategy (nearest neighbour, linear or reaccumulation) uses these points to generate the resampled value.Other resampling algorithms may be implemented by the user that would use more data points.
When no previous or next point can be found, the resam-695 pling extrapolates the requested value by looking at more valid data points respectively before or after the previously found valid points.Because of the significantly increased risk of generating a grossly out of bound value, this behaviour must be explicitly enabled by the end user.

Data generators
In order to be able to return a value for a given timestamp there must be enough data available in the original data source.This data has to pass the filters set up by the end user and may then be used for resampling.In case that 705 data is absent or filtered out there is still a stage of last resort: the data can be generated by a parametrization relying on other parameters.The end user configures a list of algorithms for each meteorological parameter.These algorithms are implemented as classes inheriting from the GeneratorAl-710 gorithms.The DataGenerator class acts as their high level interface.The algorithms range from very basic, such as assigning a constant value, to quite elaborate.For instance the measured incoming solar radiation is compared to the potential solar radiation resulting in a solar index.The solar index 715 is used in a parametrization to compute a cloud cover that is given to another parametrization to compute a long wave radiation.The GeneratorAlgorithms receive a set of meteorological parameters for one point and one timestamp.The DataGenerator walks through the user configured list of generators, in the order of their declaration by the end user, until a valid value can be returned.The returned value is inserted into the data set and either returned to the caller or used for spatial interpolations.
The following generators have been implemented: standard pressure.Generates a standard pressure that only depends on the elevation; constant.Generates a constant value as provided by the user; sinusoidal.Generates a value with sinusoidal variation, either on a daily or a yearly period.The minimum and maximum values are given as arguments as well as the position of the first minimum; relative humidity.Generates a relative humidity value from either dew point temperature or specific humidity; clearsky_lw.Generates a clear sky incoming long wave radiation, choosing between several parametrizations (Brutsaert, 1975;Dilley and O'brien, 1998;Prata, 1996;Clark and Allen, 1978;Tang et al., 2004;Idso, 1981); allsky_lw.Generates an incoming long wave radiation based on cloudiness.If there is no cloudiness available, it will be parametrized from the solar index (the ratio between measured incoming short wave radiation and potential radiation, Iqbal, 1983) according to Kasten and Czeplak (1980).
If no incoming short wave radiation is available but a reflected short wave radiation is available, a snow albedo of 0.85 will be assumed for measured snow heights greater than 10 cm and a grass albedo of 0.23 otherwise.If no measured snow height is available, a constant 0.5 albedo will be assumed.It is possible to chose between several parametrizations (Unsworth and Monteith, 1975;Omstedt, 1990;Crawford and Duchon, 1999;Konzelmann et al., 1994); potential radiation.Generate an incoming short wave radiation (or reflected short wave radiation) from a measured long wave radiation using a reciprocal Unsworth generator.

Spatial interpolations
If the caller requests spatial grids filled with a specific parameter, two cases may arise: either the data plug-in reads the data as grids and can directly return the proper grid or it reads the data as point measurements.In this case, the data must be spatially interpolated.The end user configures a list of potential algorithms and sets the respective arguments to use for each meteorological parameter.
The Meteo2DInterpolator reads the user configuration and evaluates for each parameter and at each time step which algorithm should be used for the current time step, using a simple heuristic provided by the interpolation algorithm itself.Of course, relying on simple heuristics for determining which algorithm should be used does not guarantee that the best result will be attained but should nonetheless suffice most of the time.This implies a trade-off between accuracy (selecting the absolutly best method) and efficiency (not spending too much time selecting a method that most probably is the one determined by the heuristic).The objective is to ensure robust execution despite the vast diversity of con-ditions.The number of available data points often eminently influences the applicability of a given algorithm and without the flexibility to define fall-back algorithms frequent disruptions of the process in an operational scenario might ensue.

780
Most spatial interpolations are performed using a trend/residuals approach: the point measurements are first detrended in elevation, then the residuals are spatially interpolated and for each pixel of the resulting grid the elevation trend back is applied.Of course, the user can 785 specify an algorithm that does not include detrending.
The following spatial interpolations have been implemented: filling the domain with a constant value (using the average of all stations); 790 filling the domain with a constant value with a lapse rate (assuming the average value occurs at the average of the elevations); filling the domain with a standard pressure that only depends on the elevation at each cell;

795
spatially interpolating the dew point temperature before converting it back to a relative humidity at each cell as in Liston and Elder (2006); spatially interpolating the atmospheric emissivity before converting it back to an incoming long wave ra-800 diation at each cell; inverse distance weighting (IDW) with or without a lapse rate; spatially interpolating the wind speed and correcting it at each point depending on the local curvature as in 805 Ryan (1977); spatially interpolating the wind speed and correcting it at each point depending on the local curvature as in Liston and Elder (2006); spatially interpolating the precipitation, then pushing 810 the precipitation down the steep slopes as in Spence and Bavay (2013); ordinary kriging with or without a lapse rate as in Goovaerts (1997) with variogram models as in Cressie (1992);

815
spatially interpolating the precipitation and correcting it at each point depending on the topographic wind exposure as in Winstral et al. (2002); loading user-supplied grids; finally, it is also possible to activate a "pass-through" 820 method that simply returns a grid filled with nodata.
Relying on the fall-back mechanism described above it is, for example, possible to configure the spatial interpolations to read user-supplied grids for some specific time steps, reverting to ordinary kriging with a lapse rate if enough stations can provide data and no user-supplied grids are available for this time step, reverting to filling the grid with the measurements from a single station with a standardized lapse rate if nothing else can be done.Everything happens transparently from the point of view of the caller.

Lapse rates
Due to the fact that for many meteorological parameters the altitudinal lapse rates are a dominant factor in mountainous areas, properly handling them is of utmost importance for spatial interpolations.This becomes a real issue for fully au-835 tomated simulations: it is possible that some outliers significantly degrade the computed lapse rate or that no real lapse rate can be found in the data.Therefore the following process is used to determine the lapse rate: 1. the lapse rate is computed; 840 2. if the lapse rate's correlation coefficient is better than a 0.7 threshold, the determined lapse rate will be used as such; 3. if this is not the case, the point that degrades the correlation coefficient the most will be sought: for each point, the correlation coefficient is computed without this point.The point whose exclusion leads to the highest correlation coefficient is suppressed from the data set for this meteorological parameter and at this time step; 4. if the correlation coefficient after excluding the point determined at 3 is better than the 0.7 threshold, the determined lapse rate will be used as such, otherwise the process will loop back to point 3.
The process runs until at most 15 % of the original data set points have been suppressed or when the total number of points falls to four, in order to keep a reasonable number of points in the data set.This is illustrated in Fig. 8: the initial set of points has a correlation coefficient that is lower than the threshold, leading to the removal of the three points in the right hand side panel, resulting in a coefficient above the threshold.
Finally, most of the spatial interpolations algorithms offer their own fall-back for the lapse rate: it is often possible to manually specify a lapse rate to be used when the data-driven lapse rate has a correlation coefficient that remains less than 865 the 0.7 threshold.

Grid rescaling
Rescaling gridded meteorological data to a different resolution is often necessary for reading a grid (and bringing it in line with the DEM grid) or for writing a grid out (for exam-870 ple, as a graphical output).Since meteorological parameters at the newly created grid points mostly depend on their immediate neighbours and in order to keep the computational costs low, standard image processing techniques have been used: the rescaling can either be done by applying the nearest 875 neighbour, bi-linear or cubic B-spline algorithms.These algorithms are very efficient and appropriate for rescaling grids to a higher resolution without any matching DEM since no gradient correction will be performed.

Miscellaneous utilities 880
In order to provide common algorithms to the various components, several classes have been designed that implement well known algorithms.These classes have been implemented in quite a generic way, striving for readability, stability -no surprising behaviour -and acceptable performance.

885
A basic set of monodimensional statistical algorithms have been implemented as they are often required by the filters or the spatial interpolation methods.These are completed by a least square regression solver that can be used on any statistical model by inheriting from a base class and implementing 890 the model itself.This required a basic set of arithmetic matrix operations, also required for kriging.The Matrix class strives to remain as close as possible to the standard mathematical notation and implements all the basic operations: addition, subtraction, multiplication, determinant, transpo-895 sition.The generic inversion is implemented by first performing the LU factorization (using the Doolittle algorithm Duff et al., 1986) and then backward and forward solving of LU × A −1 = I (Press et al., 1992).This represents a good balance between complexity and efficiency since more ad-900 vanced methods provide benefits only for very large matrices.For the case of tridiagonal matrices, the Thomas algorithm is offered (Thomas, 1949).
In order to isolate platform specific code, several classes and functions have been implemented: functions dealing with 905 file and path handling, such as checking if a file name is valid, if a file exists, the copying of files, extracting a path or an extension and microsecond resolution timers.The timers are offered for benchmarking purposes with a resolution of up to 1 ns with very low overhead.

910
Finally, as required by several filters and data generators, a set of algorithms for computing atmospheric and solar properties have been implemented.The solar position is computed with the Meeus algorithm (Meeus, 1998) and the potential radiation according to Iqbal (1983).Reprojection 915 functions (between beam, horizontal and slope) are also offered alongside.

Optimizations
In order to optimize the algorithms based on distances, such as inverse distance weighting, it has been necessary to opti-mize the computation of expressions such as 1/ √ x.This has been achieved through a fast inverse square root approximation implementation (Lomont, 2003) that has been shown to give at most 1.7 % relative error and deliver at least a four times speed up.Similarly, a method for fast computation of cubic roots has been implemented based on a single iteration Halley's method with a bit hack approximation providing the seed (Lancaster, 1942) and a fast computation of powers based on bit hacks and exponentiation by squaring (Montgomery, 1987).These are grouped in a specific namespace and header file alongside other numerical optimizations (Hastings et al., 1955).

Benchmarks
Several numerical models developed by different institutions rely on MeteoIO for their I/O needs.Several specialized ap-935 plications (mostly web services) have also been developed in different countries based on MeteoIO.It is also used regularly for several warning systems and research projects around the world.Such applications include the Common Information Platform for Natural Hazards GIN (gin, 2014), plications, some benchmarks are presented in this section.These have been conducted on a recent octo-core computer powered by a 64 bits Intel Core i7 processor (3612QM) equipped with 8 GB of RAM.The processor runs between 1.2 and 3.1 GHz and reaches a CPU Mark of 6834 (http:// www.cpubenchmark.net/).The benchmarks have been compiled by the GNU Compiler Collection (GCC) version 4.7.2 both for C++, C and Fortran.

Ease of extension
In order to check if it is really easy for third parties to con-955 tribute to MeteoIO, a test was set up asking participants to develop a basic filter.The filter that had to be developed is a simple filter on the incoming long wave radiation, rejecting all data outside min σT 4 and max σT 4 .
Once their system was properly configured (and checked by running a simple test), the participants were provided with a sheet with instructions and questions and asked to implement the required filter following the official documentation, working alone and without assistance.Ten participants took the test, including eight PhD students.The participants use 965 computers for their daily work (mostly using Matlab or R) with only four participants having a previous experience in C or C++.In order to better discriminate between the overhead (ie integrating one's development within MeteoIO) and the intrinsic complexity of the required processing (i.e. the logic of the filter that had to be implemented), the participants were asked to first write an empty filter and then to implement the logic of the filter.The results show that for an average user (the median of the results), writing, compiling and testing an empty filter re-975 quires 40 min while implementing and testing the real filter requires 50 min.Since only a limited number of users did participate in this test, this tends to show a worst case scenario by being overly sensitive to specific issues: one user spent quite a lot of time trying to make his test work, only to real-980 ize that he was not testing with his latest changes, another one used a wrong test dataset, etc Based on the response of the test users themselves, the initial programming abilities were not really a major factor in their achievements but mostly the ability to follow the step by step instructions.

Meteorological data processing benchmarks
Reading meteorological data stored in an ASCII file bears a significant overhead.The file needs to be read line by line, each line needs to be split up based on a predefined delimiter, the values need to be converted from strings to their re-990 spective data types and the data need to be stored in memory for further processing.A comparative illustration of different programming environments and their performance in completing the aforementioned task for a 873 kB file containing hourly data for one station and one year is given in Fig. 9.

995
The GNU compilers gcc, g++ and gfortran were used to obtain the benchmark executables.Clearly C++ and MeteoIO, which is programmed in C++ and utilises the GNU STL and streams implementations, show the same performance.The efficient dynamic memory management gives C the over-1000 all advantage, whereas Fortran95 (static) shows good performance for parsing values to doubles with the drawback, that the exact layout and size of the file need to be known at compile time.Allowing these properties to be dynamic, slows down the performance.Apart from only reading the 1005 data, MeteoIO performs a unit conversion and finally stores the data in MeteoData objects which are then used for further processing and exposed to the user.
Figure 10 illustrates the performance gain in the course of 3 years of MeteoIO development when resampling hourly 1010 data for one station to 20 min.Data is read from a 11 Mb SMET ASCII file that contains hourly measurements of 11 parameters for a period of 12 years for one weather station.Contrary to the other benchmarks, this benchmark has been conducted on a 2006 laptop powered by a 32 bits Intel Core 1015 Duo processor (T2300) that represents the lower end of what could still be found at the workplace.The most significant performance gain was achieved between versions 2.1.1 and 2.1.2following the redesign of the core MeteoData class, representing all measured parameters of one station at one 1020 moment in time.Since MeteoData objects are copied and instantiated during all processing steps focusing on the performance of the copy constructor yielded a spectacular per-formance boost.Further improvements leading up to version 2.1.3are mainly comprised of an efficient use of file pointers regarding I/O and a redesign of the processing capabilities, namely reducing the amount of copies necessary when dealing with series of data points during filtering and resampling.Optimizations in all parts of the code bring about a constant improvement of the MeteoIO performance albeit a significant increase of features and requirements.The strategy throughout development is to write correct code following best practice design rules, to then profile it using static and dynamic analysis tools as layed out in 2.7 and to optimize where significant improvements can be expected based on the results of the profiling.

Spatial interpolations benchmarks
Unsurprisingly, most of the spatial interpolation algorithms scale as O(n).However, since there is some overhead (constructing the required spatial interpolator, setting the grid metadata, gathering the necessary data) it is interesting to see how the real world scalability is.To this effect, the "passthrough" interpolation has been used that fills the grid with nodata by calling the optimized STL methods on the underlying data container.Different spatial interpolations have been benchmarked for different grid sizes, ranging from one cell to 25 million cells.Two scenarios have been used: one providing seven meteorological stations as input and one providing fourteen meteorological stations as input, each station providing two months of hourly data in a 130 kb file.
The results are shown in Fig. 11.The linear behaviour starts to be visible after around 0.3 ms which would then be the total overhead for spatial interpolations.This overhead also depends on the chosen algorithm: for example the simple pass-through has a very low 0.1 ms overhead (there is nothing to prepare before filling the grid) to 0.4 ms for ordinary kriging with fourteen stations (the necessary matrices have to be computed with the station data before filling the grid).
One can also witness the effect of STL optimizations: the pass-through interpolation fills the whole grid with the same constant value, relying on the STL to perform the task.On the other hand, the CST interpolation fills the grid with a constant value but only for cells that have an elevation in their associated DEM, therefore not relying on an STL method for doing it.This makes it 3.5 times slower.When using the same method but with detrending, three passes through the grid are required (detrending, filling the grid, retrending) leading to a factor two slowdown as seen in Fig. 11.When using an inverse distance weighting, the distance has to be computed for each pixel.This depends on the number of stations (thus the difference between IDW for seven or fourteen stations) but this also significantly slows down the processing (despite using a fast approximation for calculating the distance).This costs another factor five compared to a simple constant fill.Finally, the ordinary kriging requires to fill and invert a matrix of dimension N stations × N stations and then to perform a matrix multiplication for each pixel.This leads to a larger overhead (visible for small grids that exhibit a nonlinear behaviour depending on the number of stations) and 1080 another ten times slowdown compared to IDW_LAPSE.

Usage scenarios benchmarks
Although the MeteoIO library offers various components of interest to numerical models, its primary usage is to read, preprocess and potentially spatially interpolate meteorologi-1085 cal data.Therefore, a benchmark has been set up with various scenarios, all based on fifteen stations providing hourly data for fourteen years (representing 14 Mb on average).The data is read from files, all parameters are checked for min/max, the precipitation is corrected for undercatch, the snow height 1090 is filtered for its rate of change, then all parameters are linearly interpolated if necessary while the precipitation is reaccumulated.The spatial interpolations rely on IDW for some parameters, WINSTRAL for the precipitation and LISTON for the wind direction.The following scenarios have been 1095 defined: reading the hourly data necessary to simulate one season either at the beginning of the file or at the end, reading the hourly data required to simulate the full period contained in the files and reading and spatially interpolating hourly data once per hour for a day over a 6500 cells DEM.

1100
These scenarios have been further split in two: only one station providing the data or fifteen stations providing the data.For the spatial interpolations, since it made little sense to benchmark very specific algorithms able to handle only one station, then seven and fifteen stations have been used.

1105
The results are presented in Fig. 12. First, this shows that when the data is extracted as time series, the preprocessing time is negligible compared to what would be a realistic model run time.Working on raw data and preprocessing the data on the fly thus does not introduce any performance 1110 penalty.Then the run time scales linearly both with the number of stations and with the total duration, although the position of the data in the file has a direct impact on the performances.For the case of the spatial interpolations, the scaling is still linear with the number of stations but the performance 1115 impact is much higher; however this should still remain much lower than most models timestep duration.

Code availability
The MeteoIO library is available under the GNU Lesser General Public License v3.0 (LGPL v3) on http://models.1120 slf.ch/p/meteoio/ both as source code (from the source version control system or as packages) or as precompiled binaries for various platforms.Stable releases are announced on https://freecode.com/projects/meteoio.
The documentation must be generated from the source 1125 code or is available as html in the precompiled packages.The documentation for the last stable release is available online at http://models.slf.ch/docserver/meteoio/html/index.html.Detailed installation instructions are available at http://models.slf.ch/p/meteoio/page/Getting-started/.
code as well as to peek into the data that is sent to the core numerical routines.This has also lead to fruitful developments in the preprocessing stage much beyond what was originally performed on the numerical models.A careful design made it possible for casual users to easily contribute to data filters or parametrizations.This ease of contribution to MeteoIO make it a great test bed for new preprocessing methods with a direct link to actual numerical models.A contributor with little or no previous C++ experience can contribute simple algorithms with a relatively minor time investment.In terms of 1145 performance, continuous benchmarking and profiling have lead to major improvements and keep the preprocessing computational costs well balanced compared to the data acquisition costs.Today, the MeteoIO library offers great flexibility, reliabil-1150 ity and performance and has been adopted by several models for their I/O needs.These models have all benefited from the shared developments in MeteoIO and as such offer an increased range of application and an increased robustness in regard to their forcing data.

110---
Allow the data model to be easily expanded (data model scalability); Make it possible to easily change the data source (format and/or protocol) without any change in the model code itself; 115 Preprocess the data on the fly; 235 250 345 flow overview 465 At the core of MeteoIO lies the process of getting for a specific time step either a set of meteorological data or a set of spatially interpolated meteorological data.The model using MeteoIO for getting its meteorological time series relies on the very simple call given in listing 1.This call returns 470 485 2. filtering and correcting the data; 3. temporally interpolating (or resampling) the data if necessary; 4. generating data from parametrizations if everything else failed; 490 5. spatially interpolating the data if requested.

595
accumulation and snow ablation rate; standard deviation.All values outside of ŷ ± 3σ are removed; median absolute deviation.All values outside ŷ ± 3σ MAD are removed; 600 Tukey.Spike detection following 700 985

Figure 5 .
Figure 5. Meteorological data reading and processing workflow.The classes marked API are designed to be called by the user and the classes marked EXP are designed to be expanded by the user.

Figure 11 .
Figure11.Benchmarks of some spatial interpolation algorithms for various grid sizes for seven input stations (plain lines) and fourteen input stations (dotted lines).
Listing 1. MeteoIO call used by models to request all available meteorological time series for a given time step.MeteoIO call used by models to request spatially interpolated parameters for a given time step.Isolation of the data reading and preprocessing routines from the numerical model.Figure 2. Manager/worker architecture; very often the interface and the manager are implemented in the same class, the interface being the public interface and the manager being the private implementation.