Authors' response to referee comments: MLAir (v1.0) - a tool to enable fast and flexible machine learning on air data time series

With MLAir (Machine Learning on Air data) we created a software environment that simplifies and accelerates the exploration of new machine learning (ML) models, specifically shallow and deep neural networks, for the analysis and forecasting of meteorological and air quality time series. Thereby MLAir is not developed as an abstract workflow, but hand in hand with actual scientific questions. It thus addresses scientists with either a meteorological or an ML background. Due to their relative ease of use and spectacular results in other application areas, neural networks and other ML methods are gaining enormous momentum also in the weather and air quality research communities. Even though there are already many books and tutorials describing how to conduct an ML experiment, there are many stumbling blocks for a newcomer. In contrast, people familiar with ML concepts and technology often have difficulties understanding the nature of atmospheric data. With MLAir we have addressed a number of these pitfalls so that it becomes easier for scientists of both domains to rapidly start off their ML application. MLAir has been developed in such a way that it is easy to use and is designed from the very beginning as a standalone, fully functional experiment. Due to its flexible, modular code base, code modifications are easy and personal experiment schedules can be quickly derived. The package also includes a set of simple validation tools to facilitate the evaluation of ML results using standard meteorological statistics. MLAir can easily be ported onto different computing environments from desktop workstations to high-end supercomputers with or without graphics processing units (GPU).


General statement
We would like to thank the two referees for their review of our manuscript and the many very helpful comments and suggestions for improvement. We would also like to thank Christoph Knote for being the topical editor and for his intensive search for referees. In this document, we will address all the referees' comments one by one and present the modifications made to the manuscript. For better clarity, our responses and changes in the text that refer to Referee #1's comments are highlighted in blue. For responses and amendments related to Referee #2's comments, the highlighting is in red. Other adjustments without a direct relation to one of the referees are highlighted in magenta. Line numbers given in this document always refer to the peer-reviewed document and not to the corrected version.
Answer to Anonymous Referee #1
"The manuscript describes a library which facilitates the development of end-to-end neural network workflows (though the name suggests it may support other ML algorithms) for time series forecasting (mostly focused on air quality).
Though the use case is fairly narrow in scope, the architecture of MLAir and the various features to validate the models using standard meteorological metrics is very interesting. MLAir's design also allows for development of reproducible ML pipelines which is essential if ML techniques are to be more widely used by the climate science community.
Overall, the aims and implementation of MLAir serve as a good template for future development of such frameworks and I consider it a valuable contribution to the geoscientific community.
That being said, it seems to me that certain parts of the manuscript could be improved to enhance readability."

Major Comments
1. "It would be helpful if the abstract clarified that MLAir is focused on neural networks and not any other kind of ML algorithm."
We have added that MLAir focuses on shallow and deep neural networks.
"With MLAir (Machine Learning on Air data) we created a software environment that simplifies and accelerates the exploration of new machine learning (ML) models, specifically shallow and deep neural networks, for the analysis and forecasting of meteorological and air quality time series." (l.1)
2. "It would really help if the manuscript used different styles for conceptually different things. For instance, libraries (e.g., TensorFlow) are italicized, and so are class names (e.g., Data Handler). It would be preferable if the typewriter font is used to strengthen the correspondence with the figures (DataHandler). If TensorFlow is italicized, why is MLAir not italicized?
If names like Data Handler refer to class names, then it is unclear why the names are split. If it describes the function performed by a piece of code, then it is not clear why it is italicized. Similar issues are found in the caption of Figure 1. Are Experiment Setup, Preprocessing, Model Setup class names or verbs? Is Hyperparameter a class name or terminology from ML? If it is the latter, why is it italicized, and why if it is the former, then why is it not mentioned in Line 91? Similarly, the text used in the various bubbles in Figure 1 itself could benefit from such formatting (is Run Environment a class name or a simply a description?).
As it stands, the manuscript's treatment of different terminologies is too confusing for the reader to precisely understand what each term stands for. The manuscript would also benefit from a sentence or two stating the typographical choices made by the authors."
We have to agree that our text styling was not clearly chosen and, further, was not consistently implemented throughout the manuscript. Therefore, we thank Reviewer #1 for this comment and the proposed solution. We have agreed on the following. Frameworks, including MLAir, are italicised. For code elements such as class names and variables, the typewriter font is used. All other expressions, e.g. those that describe a class but do not explicitly name it, will not be highlighted in the text at all.
We have updated all corresponding text passages according to the convention described above.
We have also added a short statement at the end of the introduction that explains the meaning of these styles.
3. "Section 2.1 would benefit from a definition of the main components -Run Environment, Data Handler, Model Class [...] explained (previously 3.2 to 3.4). In the newly created chapter 3, the order corresponds exactly to the former sections 2.2 to 2.4.
4. "To ensure the a reader with no background in ML is able to relate to the design of MLAir, a brief introduction to a typical ML workflow (test, train, validation, epoch, hyperparameter, etc.,) would be really helpful in Section 2.1."
We have added two paragraphs at the beginning of section 2 that briefly explain ML and what an ML workflow could look like.
5. "In Section 2.2, window_history_size is not defined anywhere. Therefore, what it does and how it might impact network architecture (Line 128-129) is not clear."
We have adapted the manuscript in two places regarding this. First, we have referenced more precisely in l. 122 what the parameter window_history_size stands for.
"Therefore, we need to adjust the parameter window_history_size for the former and stations for the latter in the run call."
We have also added the following sentences to l. 129 in order to answer the question of how exactly the architecture is influenced.
"This is made possible since the model class in MLAir queries the shape of the input variables and adapts the architecture of the input layer accordingly. Naturally, this procedure does not make perfect sense for every model, as it only affects the first layer of the model. In case the shape of the input data changes to a large extent, it is advisable to adapt the entire model as well."
6. "Line 142: The term epoch is used without defining it anywhere. Definitions and motivations for such terms could be added to the brief description of a typical ML workflow that was mentioned previously."
We have added a short explanation for the term epoch in the new short introduction to ML.
7. "Line 220: If the skill score is defined as a ratio of a metric like MSE, how does one obtain positive and negative skill?"
We have expressed ourselves in a misleading way at this point. The skill score depends on the ratio but is not directly defined as the ratio. We have now split this information into two sentences and described it more precisely.
"For the comparison, we use a skill score S, which is naturally defined as the performance of a new forecast compared to a competitive reference with respect to a statistical metric (Murphy and Daan, 1985). Applying the mean squared error [...] extrapolate out-of-sample. If the skill score is now calculated in comparison to the original prediction, it is to be expected that it will be negative because the information of the sampled variable has been lost. Conversely, the greater this drop, the stronger is the impact of this input variable on the prediction:
"In addition to the statistical model evaluation, MLAir also allows to assess the importance of individual input variables through bootstrapping of individual input variables. For this, the time series of each individual input variable is resampled n times (with replacement) and then fed to the trained network. By resampling a single input variable, its temporal information is disturbed, but the general frequency distribution is preserved. The latter is important because it ensures that the model is provided only with values from a known range and does not extrapolate out-of-sample. Afterwards, the skill scores of the bootstrapped predictions are calculated using the original forecast as reference. If an input variable is important to achieve a good model forecast, it will thus show up with a large negative skill score in the bootstrap skill score plot (Fig. 13). Input variables that show an overly negative skill score during bootstrapping have a stronger influence on the prediction than input variables with a small negative skill score. In case the bootstrapped skill score even reaches the positive value domain, this could be an indication that the examined variable has no influence on the prediction at all. The result of this approach applied to all input variables is presented in PlotBootstrapSkillScore (Fig. 13).
A more detailed description of this approach is given in Kleinert et al. (2021)."
9. "Figure 12: I might have missed something, but it is unclear what AI, BI, CI, CASEI etc., are. Furthermore, it is unclear what "terms" and "contributing terms" (Line 226) are."
We have slightly adjusted the sentence in l.226 and included a brief description of the climatological skill scores in the caption of Fig. 12.
"... and summarized as a full-detail box-and-whiskers plot over all stations and forecasts with all contributing terms (Fig. 12), and as simplified version showing the skill score only (not shown)." (l.226)
"Climatological skill scores (CASE I to IV) and related terms of the decomposition as proposed in Murphy (1988). Skill scores and terms are shown separately for all forecast steps (dark to light blue). In brief, CASE I to IV describe a comparison with climatological reference values evaluated on the test data. CASE I is the comparison of the forecast with a single mean value formed on the training and validation data and CASE II with the (multi-value) monthly mean.
The climatological references for CASE III and IV are, analogous to CASE I and II, the single and the multi-value mean, however, on the test data. CASE I to IV are calculated from the terms AI to CIV. For more detailed explanations of the cases, we refer to Murphy (1988)." (Fig. 12)
"This can be applied during training by using the extreme_values parameter, which defines a threshold value at which a value is considered extreme. Training samples with target values that exceed this limit are then used a second time in each epoch. It is also possible to enter more than one value for the parameter. In this case, samples with values that exceed several limits are duplicated according to the number of limits exceeded."
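To illustrate the duplication of extreme samples described in the last quoted passage, the following minimal sketch repeats each training sample once for every threshold that its target value exceeds. It is not MLAir code: the function name `oversample_extremes`, the plain-list data layout, and the choice to compare only against the upper tail are assumptions made for this illustration. Only the parameter name `extreme_values` is taken from the text.

```python
def oversample_extremes(samples, targets, extreme_values):
    """Return samples and targets with extreme cases repeated.

    Every sample is kept once. It is added one more time for each
    threshold in `extreme_values` that its target value exceeds.
    Assumption: only exceedance on the upper tail is considered.
    """
    # Accept a single threshold as well as a list of thresholds.
    if isinstance(extreme_values, (list, tuple)):
        thresholds = list(extreme_values)
    else:
        thresholds = [extreme_values]

    out_samples, out_targets = [], []
    for sample, target in zip(samples, targets):
        n_exceeded = sum(1 for limit in thresholds if target > limit)
        for _ in range(1 + n_exceeded):
            out_samples.append(sample)
            out_targets.append(target)
    return out_samples, out_targets


# Hypothetical toy data: sample labels and their target values.
samples = ["a", "b", "c", "d"]
targets = [0.2, 1.5, 2.7, 0.9]
x, y = oversample_extremes(samples, targets, extreme_values=[1.0, 2.0])
print(x)
```

With the two thresholds 1.0 and 2.0, sample "b" exceeds one limit and is used twice, while sample "c" exceeds both limits and is used three times.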

Minor Comments
1. "Line 58: It also allows to deploy -> It also allows deploying"
Corrected
2. "Line 59: use of GPUs . I think it is proper to acknowledge that usage of GPUs is due to the underlying tensorflow library"
We added the proposed acknowledgement.
"It also allows deploying typical optimization techniques in ML workflows, and offers further technical features like the use of graphics processing units (GPU) due to the underlying ML library."
3. "Line 60: Concurrent to a simple usage with low barriers for ML-callow scientists,: Not sure what is meant by this sentence."
We have rephrased the sentence.
"MLAir is suitable for ML beginners by its simple usage, but also offers high customization potential for advanced ML users and can therefore be employed in real-world applications."
4. "Line 84: as many customization -> To facilitate customization(?)"
We have rephrased the beginning of the corresponding sentence to be clearer.

"In order to enable a wide range of adaptations but also to support the users sufficiently, MLAir had to be designed as an end-to-end workflow comprising all required steps of the time series forecasting task."
5. "Line 83: Why is Workflow capitalized?"
We had originally capitalized all nouns. We have now lowercased all nouns in headings.
6. "Figure 1: exemplary: exemplary usually means commendable. I don't think that is what the authors meant."
We have changed the beginning to "An exemplary example implementation of a little model using ..." (l.417)
7. "Figure 6: exemplary measurement stations: Same as above. This applies to all other places where exemplary has been used."
Done
8. "Line 110: daily aggregated: does this mean daily mean? daily max? It is unclear how the aggregation is done."
We have added the reference to a table with more details on the aggregation because it depends on the variable.
"In the default configuration, 21-year time series of nine variables from five stations are retrieved with a daily aggregated resolution (see Table 3 for details on aggregation)." (l.109)
9. "Line 168: Beside the file, which contain the model -> Besides the file, which contains the model"
Done
10. "Line 181: intended to add own graphics in MLAir -> intended to add custom graphics in MLAir"
Done
11. "Line 185: major waters -> major water bodies"
Done
12. "Line 191: are meant to give an insight into -> are meant to provide insight into"
Done
13. "Line 195: each month separately as box-and-whisker -> each month separately as a box-and-whisker diagram"
Done
14. "Line 205: marginal distribution is shown as histogram (lite grey) -> marginal distribution is shown as a histogram (light grey)"
Done
15. "Line 242: independent on the OS -> independent of the OS"
Done
16. "Line 350: Since the spatial dependency of two distinct stations may variegate related ->Since the spatial dependency of two distinct stations may vary(?) related"
Yes, replaced by vary
17. "Line 433: how to implement a ML model -> how to implement an ML model"
Done, also updated in l.4 and l.6.
18. "Line 436: is to assume an independent and identically distribution and therefore augment and randomly shuffle data to produce a larger number of input samples with a broader variety. -> is to assume independent and identically distributed data and therefore augment and randomly shuffle the data to produce a larger number of input samples with a broader variety. Consider rewriting this sentence to enhance readability."
We have split and rephrased the long sentence into two parts for better understanding.
"A popular technique in ML, especially in the image recognition field, is to augment and randomly shuffle data to produce a larger number of input samples with a broader variety. This method requires independent and identically distributed data."
19. "Line 442: To address this issue, MLAir allows to place more -> To address this issue, MLAir allows placing more"
Done

Answer to Anonymous Referee #2
"Through this manuscript, a new library specialized in Neural Networks for air quality forecasting is presented. In general terms, it is well written, the objectives are clear, and main features are correctly established.
Given that Neural Networks are gaining importance and widening their public, it is of high interest developing new tools that facilitate the intersection between this concrete field and any other area. In that sense, this work is an example of transparency and reproducibility, especially important in the Artificial Intelligence domain."

Comments
1. "The title is at some degree confusing. Using the term "Machine Learning" might lead to confusion, as the framework is solely based on deep learning."
We prefer to keep the present title as MLAir can in principle be extended to other machine learning methods. However, addressing also comments by reviewer #1, we have added text in the abstract, the beginning of section 2 and in the limitations chapter to point out that the framework currently supports only shallow and deep neural networks.
2. "It feels like section 2 and 3 could be exchanged: before explaining how the framework is used, it might be interesting to know how the framework works. Also, the description of the framework is quite scarce and for a non-computer scientist reader it could be hard to follow."

Following this comment, we have reordered chapters 2 and 3 to make MLAir easier to understand. To do this, we have integrated the entire section 3 into chapter 2. Sections 2.2 to 2.4, on the other hand, have been moved to chapter 3. In the new chapter 2, the programming language is now discussed first (previously 3.1), then the general design is briefly explained (previously 2.1) and then the concepts of run modules, model class and data handler are explained (previously 3.2 to 3.4). In the newly created chapter 3, the order corresponds exactly to the former sections 2.2 to 2.4.
3. "Given that the manuscript aims to improve synergies between air quality experts/ practitioners and deep learning methodologies, some kind of introduction to deep learning main points is recommendable."
We added a brief general introduction to machine learning and the typical ML workflow to section 2. A more comprehensive introduction to deep learning would clearly go beyond the scope of this paper and there is ample good learning material available on the internet.
4. "Some terms are not described before being used. Again, as in Section 2 the functioning and experimentation procedure is presented without any previous explanation of the general framework, sometimes it is confusing."
This issue has been resolved by the restructuring related to comment 2.
5. "Grammar and typo errors should be carefully checked throughout the entire manuscript."
We have double-checked the manuscript for typos and grammar errors, but as non-native speakers we are aware that our [...]

Abstract.
With MLAir (Machine Learning on Air data) we created a software environment that simplifies and accelerates the exploration of new machine learning (ML) models, specifically shallow and deep neural networks, for the analysis and forecasting of meteorological and air quality time series. Thereby MLAir is not developed as an abstract workflow, but hand in hand with actual scientific questions. It thus addresses scientists with either a meteorological or an ML background. Due to their relative ease of use and spectacular results in other application areas, neural networks and other ML methods are gaining enormous momentum also in the weather and air quality research communities. Even though there are already many books and tutorials describing how to conduct an ML experiment, there are many stumbling blocks for a newcomer. In contrast, people familiar with ML concepts and technology often have difficulties understanding the nature of atmospheric data. With MLAir we have addressed a number of these pitfalls so that it becomes easier for scientists of both domains to rapidly start off their ML application. MLAir has been developed in such a way that it is easy to use and is designed from the very beginning as a standalone, fully functional experiment. Due to its flexible, modular code base, code modifications are easy and personal experiment schedules can be quickly derived. The package also includes a set of simple validation tools to facilitate the evaluation of ML results using standard meteorological statistics. MLAir can easily be ported onto different computing environments from desktop workstations to high-end supercomputers with or without graphics processing units (GPU).

Introduction
In times of rising awareness of air quality and climate issues, the investigation of air quality and weather phenomena is moving into high focus. Trace substances such as ozone, nitrogen oxides or particulate matter pose a serious health hazard to humans, animals and nature (Cohen et al., 2005; Bentayeb et al., 2015; World Health Organization, 2013; Lefohn et al., 2018; Mills et al., 2018; US Environmental Protection Agency, 2020). Accordingly, the analysis and prediction of air quality are of great [...] in many countries and has become a multi-million dollar industry, creating and selling specialized data products for many different target groups.
These days, forecasts of weather (as a common generic term for atmospheric chemistry, air quality, and meteorology) are generally made with the help of so-called Eulerian grid point models. This type of model, which solves physical (and chemical) equations, operates on grid structures. In fact, however, local observations of weather and air quality are strongly influenced by the immediate environment. For instance, it is quite difficult for atmospheric chemistry models to represent very small-scale problems due to the limited grid resolution of these models and other limitations. Consequently, both global models and so-called small-scale models, whose grid resolution is still on the order of a kilometre and thus rather coarse in comparison to local-scale phenomena in the vicinity of a measurement site, show a high uncertainty of the results (cf. Vautard, 2012; Brunner et al., 2015). To enhance the model output, downscaling methods that focus on the individual point measurements at weather and air quality monitoring stations are applied, allowing local effects to be taken into account. Unfortunately, these methods, being optimized for specific locations, cannot be generalized to other regions and need to be re-trained for each measurement site.

In a complementary way to traditional downscaling techniques like linear regression and other statistical methods, the use of machine learning (ML) is a promising approach to predict point observations. Methods such as neural networks are able to recognize and reproduce underlying and complex relationships in data sets. Especially driven by computer vision and speech recognition, technologies like convolutional neural networks (CNN, Lecun et al., 1998) or recurrent network variations such as long short-term memory (LSTM, Hochreiter and Schmidhuber, 1997) [...]
Although the scientific areas of ML and meteorology have existed for many years, combining both disciplines is still a formidable challenge, because scientists from these areas do not speak the same language. Meteorologists are used to building models on the basis of physical equations and empirical relationships from field experiments, and they evaluate their models with data.

In contrast, ML scientists use data to build their models and evaluate them either with additional independent data or physical constraints. This elementary difference can lead to misinterpretation of studies and results so that, for example, the ability of the network to generalize is misjudged. Another frequent problem of published studies on ML approaches to weather forecasting is an incomplete reporting of ML parameters, hyperparameters and data preparation steps that are key to comprehend and reproduce the work that was done. As shown by Musgrave et al. (2020), these issues are not limited to meteorological applications of ML.
To further advance the application of ML in the meteorological area, easily accessible solutions to run and document ML experiments together with readily available and fully documented benchmark data sets are urgently needed (cf. Schultz et al., 2021, forthcoming). Such solutions need to be understandable by both the ML and meteorological communities and help both sides to prevent unconscious blunders. A well-designed workflow embedded in a meteorological and ML related environment [...]
In this paper, we present a new framework to enable fast and flexible Machine Learning on Air data time series (MLAir). Fast means that MLAir is distributed as a full end-to-end framework and is thereby simple to deploy. It also allows deploying typical optimization techniques in ML workflows, and offers further technical features like the use of graphics processing units (GPU) due to the underlying ML library. MLAir is suitable for ML beginners by its simple usage, but also offers high customization potential for advanced ML users and can therefore be employed in real-world applications. For example, more complex model architectures can be easily integrated. ML experts who want to explore weather data will find MLAir helpful as it enforces certain standards of the meteorological community. For example, its data preparation step acknowledges the auto-correlation which is typically seen in meteorological time series, and its validation package reports skill scores, i.e. improvement of the forecast compared to reference models such as persistence and climatology. From a software design perspective, MLAir has been developed according to state-of-the-art software development practices.
This work is structured as follows. Section 2 introduces MLAir by expounding the general design behind the MLAir workflow. We also share a few more general points about ML and what a typical workflow looks like. This is followed by section 3, which shows three application examples to allow the reader to get a general understanding of the tool. Furthermore, we show how the results of an experiment conducted by MLAir are structured and which statistical analysis is applied. Section 4 goes further into the configuration options of an experiment and details on customization. Section 5 delineates the limitations of MLAir and discusses for which applications the tool might not be suitable. Finally, section 6 concludes with an overview and outlook on planned developments for the future.
To improve the readability of the manuscript, we use the following typographical conventions.
Frameworks are highlighted in italics and typewriter font is used for code elements such as class names or variables. Other expressions that, for example, describe a class but do not explicitly name it, are not highlighted at all in the text. Last but not least, we would like to mention that MLAir is an open source project and contributions from all communities are welcome.

MLAir workflow and design
ML in general is the application of a learning algorithm to a data set, which generates a model. During the so-called training process, the model learns patterns in the data set with the aid of the learning algorithm. Afterwards, this model can be applied to new data. Since there is a large number of such learning algorithms and also an arbitrarily large number of different ML models, it is generally not possible to determine in advance which model will deliver the best results under which configuration. Therefore, the optimal setting must be found by trial and error.
ML experiments often follow similar patterns. First, data must be obtained, cleaned if necessary, and finally put into a suitable format (preprocessing). Next, an ML model is selected and configured (model setup). Then the learning algorithm can optimize the model under the selected settings on the data (training). This is an iterative procedure; a single iteration is called an epoch. The accuracy of the model is then evaluated (validation). If the results are still not satisfactory, the experiment is continued with other settings or a new model and the process starts again from the beginning. For further details on ML, we refer to Bishop (2006) and Goodfellow et al. (2016), but would also like to point out that there is a large amount of further introductory literature and freely available blog entries and videos, and that the books mentioned here are only two of many options out there.
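As a rough illustration of the pattern described above, the following sketch walks through preprocessing, model setup, training over several epochs, and validation for a deliberately tiny linear model in plain Python. It is not MLAir code. All function names, the synthetic data, the chronological 80/20 split, and the hyperparameter values are assumptions chosen for this illustration.

```python
import random


def preprocess(raw):
    """Preprocessing: drop incomplete records, then split chronologically."""
    clean = [(x, y) for x, y in raw if x is not None and y is not None]
    split = int(0.8 * len(clean))
    return clean[:split], clean[split:]


def mse(model, data):
    """Validation metric: mean squared error of the linear model (w, b)."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(model, data, learning_rate, epochs):
    """Training: one full-batch gradient descent step per epoch."""
    w, b = model
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Synthetic "observations" following y = 2x + 1 plus small noise.
random.seed(0)
raw = [(i / 10, 2.0 * i / 10 + 1.0 + random.gauss(0, 0.05)) for i in range(50)]
raw.insert(3, (None, 1.0))  # one incomplete record to be cleaned away

train_data, val_data = preprocess(raw)           # preprocessing
model = (0.0, 0.0)                               # model setup: w = b = 0
error_before = mse(model, val_data)
model = train(model, train_data, learning_rate=0.05, epochs=1000)  # training
error_after = mse(model, val_data)               # validation
print(f"validation MSE before: {error_before:.3f}, after: {error_after:.5f}")
```

If the validation error were still unsatisfactory, one would change the settings (here `learning_rate` or `epochs`) or the model and repeat the loop, which is the trial-and-error cycle described in the text.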
The overall goal of designing MLAir was to create a ready-to-run ML application for the task of forecasting weather and air quality time series. The tool should allow many customization options to enable users to easily create a custom ML workflow, while at the same time it should support users in executing ML experiments properly and evaluating their results according to accepted standards of the meteorological community. At this point, it is pertinent to recall that MLAir's current focus is on neural networks.
In this section we present the general concepts on which MLAir is based. We first comment on the choice of the underlying programming language and the packages and frameworks used (section 2.1). We then focus on the design considerations and choices and introduce the general workflow of MLAir (section 2.2). Thereafter we explain how the concepts of run modules (section 2.3), model class (section 2.4) and data handler (section 2.5) were conceived and how these modules interact with each other. More detailed information on, for example, how to adapt these modules can be found in the corresponding subsection of the later section 4.

Coding language
As the underlying coding language, python (Python Software Foundation, 2018, release 3.6.8) was used for two major reasons. First, python is largely independent of the operating system and does not need to be compiled before a run. python is flexible enough to handle different tasks such as loading data from the web, training the ML model or plotting. Numerical operations can be executed quite efficiently because they are usually performed by highly optimized and compiled mathematical libraries.
Furthermore, because of its popularity in science and economics, python has a huge variety of freely available packages to use. Secondly, python is currently the language of choice in the ML community (Elliott, 2019) and has well-developed, easy-to-use frameworks like TensorFlow (Abadi et al., 2015) or PyTorch (Paszke et al., 2019), which are state-of-the-art tools to work on ML problems. Due to the presence of such compiled frameworks, using python causes, for instance, no performance loss during training, which is the biggest part of the ML workflow.
Concerning the ML framework, Keras (Chollet et al., 2015, release 2.2.4) was chosen for the ML parts using TensorFlow (release 1.13.1) as back-end. Keras is a framework that abstracts functionality out of its back-end by providing a simpler syntax and implementation. For advanced model architectures and features it is still possible to implement parts or even the entire model in native TensorFlow while still using the Keras front-end for training. Furthermore, TensorFlow has GPU support for training acceleration if a GPU device is available on the running system. [...] pandas is an open source tool to analyse and manipulate data primarily designed for tabular data. xarray, which was inspired by pandas, is designed to work with multi-dimensional arrays as simply and efficiently as possible. xarray is based on the off-the-shelf python package for scientific computing NumPy [...]

Design of the MLAir workflow
In order to enable a wide range of adaptations but also to support the users sufficiently, MLAir had to be designed as an end-to-end workflow comprising all required steps of the time series forecasting task. The workflow of MLAir is controlled by a run environment, which provides a central data store, performs logging and ensures the orderly execution of a sequence of individual stages. Different workflows can be defined and executed under the umbrella of this environment. The standard MLAir workflow (described in section 2.3) contains a sequence of typical steps for ML experiments as indicated by Fig. 1: experiment setup, preprocessing, model setup, training, and postprocessing.

Besides the run environment, the experiment setup plays a very important role. During experiment setup, all customization and configuration modules, like the model class (section 2.4), data handler (section 2.5), or hyperparameters, are collected and made available to MLAir. Later in the ongoing workflow, these modules are then queried, e.g. the hyperparameters are used in training whereas the data handler is responsible for an accurate use of the data and therefore already used in the preprocessing.
Apart from this default workflow, it is also possible to define completely new stages and integrate them into a custom MLAir workflow (see section 4.8).

Run modules
MLAir models the ML workflow as a sequence of self-contained stages called run modules. Each run module handles distinct tasks whose calculations or results are usually required for all subsequent stages. At run time, all run modules can exchange information through a temporary data store. The run modules are executed sequentially, each one starting upon successful termination of its predecessor. Advanced workflow concepts such as conditional execution of run modules are not implemented in this version of MLAir. Also, run modules cannot be run in parallel, although a single run module can very well execute parallel code. In the default setup (cf. Fig. 1), the MLAir workflow consists of the following run modules:
-Run Environment: The run module RunEnvironment is the base class for all other run modules. By wrapping the RunEnvironment class around all run modules, parameters are tracked, the workflow logging is centralized, and the temporary data store is initialized. After each run module and at the end of the experiment, RunEnvironment guarantees a smooth (experiment) closure by providing supplementary information on stage execution and parameter access from the data store.
-Experiment Setup: The initial stage of MLAir, which sets up the experiment workflow, is called ExperimentSetup. Parameters that are not customized are filled with default settings and stored for the experiment workflow. Furthermore, all local paths for the experiment itself and for the data are created during experiment setup.
-Preprocessing: During the run module PreProcessing, MLAir loads all required data and carries out typical ML preparation steps to have the data ready to use for training. If the DefaultDataHandler is used, this step includes downloading or loading of (locally stored) data, data transformation and interpolation. Finally, data are split into the subsets for training, validation, and testing.

-Model Setup: The ModelSetup run module builds the raw ML model implemented as a model class (see section 2.4), sets Keras and TensorFlow callbacks and checkpoints for the training, and finally compiles the model. Additionally, if using a pre-trained model, the weights of this model are loaded during this stage.
-Training: During the course of the Training run module, training and validation data are distributed according to the parameter batch_size to properly feed the ML model. Right after, the actual training starts. After each epoch of training, the model performance is evaluated on validation data. If the performance improved compared to previous epochs, the model is stored as best_model. In this way, the final model is the best training model according to validation performance.
-Postprocessing: In the final stage, PostProcessing, the trained model is statistically evaluated on the test data set.
For comparison, MLAir provides two additional forecasts: first, an ordinary multi-linear least squares fit trained on the same data as the ML model, and second, a persistence forecast, where observations of the past represent the forecast for the next steps within the prediction horizon. For daily data, the persistence forecast takes the last observation of each sample and holds it for all forecast steps. Skill scores based on the model training and evaluation metric are calculated for all forecasts and compared with climatological statistics. The evaluation results are saved as publication-ready graphics.
Furthermore, a bootstrapping technique is used to evaluate the importance of each input feature. More details on the statistical analysis that is carried out can be found in section 3.3. Finally, a simple geographical overview map containing all stations is created for convenience.
Ideally this predefined default workflow should meet the requirements for an entire end-to-end ML workflow on station-wise observational data. Nevertheless, MLAir provides options to customize the workflow according to the application needs (see section 4.8).
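The sequential execution of run modules around a shared data store can be pictured with the following minimal sketch. It is not MLAir's implementation: the names DataStore, run_workflow and the three stage functions are hypothetical and only illustrate the idea that each stage reads from and writes to a common store and that a stage starts only after its predecessor has finished.

```python
from typing import Any, Callable, Dict, List


class DataStore:
    """Hypothetical stand-in for a temporary data store shared by all stages."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def set(self, name: str, value: Any) -> None:
        self._data[name] = value

    def get(self, name: str) -> Any:
        return self._data[name]


def experiment_setup(store: DataStore) -> None:
    # Fill parameters that were not customized with (toy) default values.
    store.set("batch_size", 2)
    store.set("raw_data", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])


def preprocessing(store: DataStore) -> None:
    # Split the data along the time axis into a training and a test subset.
    data = store.get("raw_data")
    store.set("train", data[:4])
    store.set("test", data[4:])


def training(store: DataStore) -> None:
    # Distribute the training data into batches of size batch_size.
    size = store.get("batch_size")
    train = store.get("train")
    store.set("batches", [train[i:i + size] for i in range(0, len(train), size)])


def run_workflow(stages: List[Callable[[DataStore], None]]) -> DataStore:
    """Run all stages one after another; an exception stops the workflow."""
    store = DataStore()
    for stage in stages:
        stage(store)  # the next stage starts only if this one returns
    return store


store = run_workflow([experiment_setup, preprocessing, training])
print(store.get("batches"))
```

In this toy setting, a custom workflow would simply be a different list of stage functions passed to run_workflow.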

Model Class
In order to ensure a proper functioning of ML models, MLAir uses a model class, so that all models are created according to the same scheme. Inheriting from the AbstractModelClass guarantees a correct handling during the workflow. The model class is designed to follow an easy plug-and-play behaviour so that within this security mechanism, it is possible to create highly customized models with the frameworks Keras and TensorFlow. We know that wrapping such a class around each ML model is slightly more complicated, but by requiring the user to build their models in the style of a model class, the model structure can be documented more easily and there is less potential for errors when interacting with MLAir. More details on the model class can be found in section 4.5.

Data handler
In analogy to the model class, the data handler organizes all operations related to data retrieval, preparation and provision of data samples. A data handler is created for each station automatically, and MLAir takes care of the iteration across all stations. To ensure a smooth integration into MLAir, each data handler must also follow certain rules. As with the creation of a model, it is not necessary to modify MLAir's source code. Instead, the AbstractDataHandler class provides guidance on which methods a data handler needs to interact smoothly with the workflow.

By default, MLAir uses the DefaultDataHandler. It accesses data from JOIN as demonstrated in section 3.1. A detailed description of how to use this data handler can be found in section 4.4. However, if a different data source or structure is used for an experiment, the DefaultDataHandler must be replaced by a custom data handler based on the AbstractDataHandler. Simply put, such a custom handler requires methods for creating itself at runtime and methods that return the inputs and outputs. Partitioning according to the batch size or suchlike is then handled by MLAir at the appropriate moment and does not need to be integrated into the custom data handler. Further information about custom data handlers follows in section 4.3.

Conducting an experiment with MLAir
Before we dive deeper into available features and the actual implementation, we show three basic examples of MLAir usage to demonstrate the underlying ideas and concepts and how first modifications can be made (section 3.1). In section 3.2, we then explain how the output of an MLAir experiment is structured and which graphics are created. Finally, we briefly touch on the statistical part of the model evaluation (section 3.3).

Running first experiments with MLAir
To install MLAir, the program can be downloaded as described in the Code availability section and the python library dependencies should be installed from the requirements file. To test the installation, MLAir can be run in a default configuration with no extra arguments (see Fig. 2). These two commands will execute the workflow depicted in Fig. 1. This will perform an ML forecasting experiment of daily maximum ground-level ozone concentrations using a simple feed-forward neural network based on seven input variables consisting of preceding trace gas concentrations of ozone and nitrogen dioxide, and the values of temperature, humidity, wind speed, cloud cover, and the planetary boundary layer height (see Table 3 for details on aggregation). The retrieved data are stored locally to save time on the next execution (the data extraction can of course be configured as described in section 4.4). It is also possible to replace the DefaultDataHandler with a self-made data handler to use other data sources or read in different data structures. An introduction to this is given in section 2.5.
After preprocessing these data, splitting them into training, validation, and test data, and converting them to an xarray and NumPy format (details in section 2.1), MLAir creates a new vanilla feed-forward neural network and starts to train it. Finally, the results are evaluated according to meteorological standards and a default set of plots is created. The trained model, all results and forecasts, the experiment parameters and log files, as well as the default plots are pooled in a folder in the current working directory. Thus, in its default configuration, MLAir performs a meaningful meteorological ML experiment, which can serve as a benchmark for further developments and as a baseline for more sophisticated ML architectures.
In the second example (Fig. 3), we expand the number of preceding time steps used as model inputs to provide more contextual information to the vanilla model. Furthermore, we use a different set of observational stations. Therefore, we need to adjust the parameter window_history_size for the former and stations for the latter in the run call. At first glance, the output of the experiment run is quite similar to the earlier example. However, there are a couple of aspects in this second experiment which we would like to point out. First, the DefaultDataHandler keeps track of data available locally and thus reduces the overhead of reloading data from the web if this is not necessary. Therefore, no new data were downloaded for one of the stations (DEBW107), because these data had already been stored locally in our first experiment. Of course, the DefaultDataHandler can be forced to reload all data from its source if needed (see section 4.1). The second key aspect to highlight here is that the parameter window_history_size could be changed and the network was trained anew without any problem even though this change affects the shape of the input data and thus the neural network architecture. This is possible because the model class in MLAir queries the shape of the input variables and adapts the architecture of the input layer accordingly. Naturally, this procedure does not make perfect sense for every model, as it only affects the first layer of the model.

In case the shape of the input data changes to a large extent, it is advisable to adapt the entire model as well. Concerning the network output, the second experiment overwrites all results from the first run, because without an explicit setting of the file path, MLAir always uses the same sandbox directory called testrun_network. In a real-world sequence of experiments, we recommend always specifying a new experiment path with a reasonably descriptive name (details on the experiment path in section 4.1).

The third example in this section demonstrates the activation of a partial workflow, namely a re-evaluation of a previously trained neural network. We want to rerun the evaluation part with a different set of stations to perform an independent validation. This partial workflow would also be employed if the model is supposed to run in production. As we replace the stations for the new evaluation, we need to create a new testing set, but we want to skip the model creation and training steps. Hence, the parameters create_new_model and train_model are set to False (see Fig. 4). With this setup, the model is loaded from the local file path and the evaluation is performed on the newly provided stations. By combining the stations from the second and third experiment in the stations parameter, the model can be evaluated at all of these stations together. In this setting, MLAir will fail to execute the evaluation if parameters pertinent to preprocessing or model compilation changed compared to the training run.
It is also possible to continue training of an already trained model. If the train_model parameter is set to True, training will be resumed at the last epoch reached, if this epoch number is lower than the final epoch setting. Use cases for this are either an experiment interruption (for example due to exceeding the wall clock time limit on batch systems) or the desire to extend the training if the optimal network weights have not been found yet. Further details on training resumption can be found in section 4.9.
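The interplay of keeping the best model according to validation performance and resuming training at the last epoch reached can be illustrated with a small, self-contained sketch. It does not use Keras or MLAir; the function run_epochs, the state dictionary and the list of validation losses are hypothetical placeholders for the real training loop.

```python
from typing import Dict, List, Optional


def run_epochs(val_losses: List[float], epochs: int,
               state: Optional[Dict] = None) -> Dict:
    """Process epochs up to `epochs`, continuing after the last epoch in `state`."""
    if state is None:
        state = {"last_epoch": 0, "best_epoch": None, "best_loss": float("inf")}
    else:
        state = dict(state)  # do not modify the caller's state
    for epoch in range(state["last_epoch"] + 1, epochs + 1):
        loss = val_losses[epoch - 1]  # stand-in for one epoch of training + validation
        if loss < state["best_loss"]:
            # Validation performance improved: remember this epoch as best model.
            state["best_loss"] = loss
            state["best_epoch"] = epoch
        state["last_epoch"] = epoch
    return state


# Toy validation losses for five epochs.
losses = [0.9, 0.7, 0.8, 0.6, 0.65]

interrupted = run_epochs(losses, epochs=3)              # run stops after epoch 3
resumed = run_epochs(losses, epochs=5, state=interrupted)  # continues with epoch 4
unchanged = run_epochs(losses, epochs=5, state=resumed)    # nothing left to train
print(interrupted, resumed)
```

If the stored epoch number already equals the final epoch setting, as in the last call, no further epoch is executed and the state is returned unchanged.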

Results of an experiment
All results of an experiment are stored in the directory, which is defined during the experiment setup stage (see section 4.1).
The subdirectory structure is created at the beginning of the experiment. There is no automatic deletion of files in case of aborted runs so that the information that is generated up to the program termination can be inspected to find potential errors or to check on a successful initialization of the model, etc. Fig. 5 shows the output file structure. The content of each directory is as follows:
-The logging folder contains information about the execution of the experiment. In addition to the console output, MLAir also stores messages on the debugging level, which give a better understanding of the internal program sequence. MLAir has a tracking functionality, which can be used to trace which data have been stored in and pulled from the central data store. In combination with the corresponding tracking plot that is created automatically at the very end of each experiment, it allows visual tracking of which parameters have an effect on which stage. This functionality is most interesting for developers who make modifications to the source code and want to ensure that their changes do not break the data flow.
-The folder model contains everything that is related to the trained model. Besides the file which contains the model itself (stored in the binary hierarchical data format HDF5, Koranne, 2011), there is also an overview graphic of the model architecture and all callbacks, for example from the learning rate. If a training is not started from the beginning but is either continued or applied to a pre-trained model, all necessary information like the model or required callbacks must be stored in this subfolder.
-The plots directory contains all graphics that are created during an experiment. Which graphics are to be created in postprocessing can be determined using the plot_list parameter in the experiment setup. In addition, MLAir automatically generates monitoring plots, for instance of the evolution of the loss during training.
As described in the last bullet, all plots which are created during an MLAir experiment can be found in the subfolder plots.
By default, all available plot types are created. By explicitly naming individual graphics in the plot_list parameter, it is possible to override this behaviour and specify which graphics are created during postprocessing. Additional plots are created to monitor the training behaviour. These graphics are always created when a training session is carried out. Most of the plots which are created in the course of postprocessing are publication-ready graphics with complete legend and a resolution of 500 dpi. If it is intended to add custom graphics in MLAir, these graphics can be added to the workflow by attaching an additional run module (see section 4.8) including the graphic creation methods.
A general overview of the underlying data can be obtained with the graphics PlotStationMap and PlotAvailability.
PlotStationMap (Fig. 6) marks the geographical position of the used stations on a plain map with a land-sea mask, country boundaries and major water bodies. The data availability chart created by PlotAvailability (Fig. 7) indicates the time periods for which preprocessed data for each measuring station are available. The index data availability also shows whether a station with measurements is available at all for a point in time. In addition, the three subsets for training, validation and testing are highlighted in different colours.
The monitoring graphics show the course of the loss function as well as the error depending on the epoch for the training and validation data (cf. Fig. 8). In addition, the error of the best model state with respect to the validation data is shown in the heading. If the learning rate is modified during the course of the experiment, another plot is created to show its development.
These monitoring graphics are kept as simple as possible and are meant to provide insight into the training process. The underlying data are always stored in the JavaScript Object Notation format (.json, ISO Central Secretary, 2017) in the subfolder model and can therefore be used for customized plots.

Through the graphs PlotMonthlySummary and PlotTimeSeries it is possible to review the forecast of the ML model. The PlotMonthlySummary (see Fig. 9) summarizes, as its name suggests, all predictions of the model covering all stations but considering each month separately as a box-and-whisker diagram. With this graph it is possible to get a general overview of the distribution of the predicted values compared to the distribution of the observed values for each month. Besides, the exact course of the time series compared to the observation can be viewed in the PlotTimeSeries (not included as figure). However, since this plot has to scale according to the length of the time series, it should be noted that this last-mentioned graph is kept very simple and is not really suitable for publication.

Statistical analysis of results
A central element of MLAir is the statistical evaluation of the results according to state-of-the-art methods used in meteorology.
To obtain specific information on the forecasting model, we treat forecasts and observations as random variables. Therefore, the  (Murphy and Winkler, 1987). Following Murphy et al. (1989), the marginal distribution is shown as a histogram (light grey), while the conditional distribution is shown as percentiles in different line styles. By using PlotConditionalQuantiles, MLAir automatically creates plots for the entire test period (Fig. 10) and, as is common in meteorology, separated by seasons.

In order to assess the genuine added value of a new forecasting model, it is essential to take other existing forecasting models into account instead of reporting only metrics related to the observation. In MLAir we implemented three types of basic reference forecasts: i) a persistence forecast, ii) an ordinary least squares model and iii) four climatological forecasts.
The persistence forecast is based on the last observed time step, which is then used as a prediction for all lead times. The ordinary least squares model serves as a linear competitor and is derived from the same data the model was trained with. For the climatological references, we follow Murphy (1988), who defined single- and multi-valued climatological references based on different time scales. We refer the reader to Murphy (1988) for an in-depth discussion on the climatological reference.
Note that this kind of persistence and also the climatological forecast might not be applicable for all temporal resolutions and therefore may need adjustment. We think here, for example, of a clear diurnal pattern in temperature, for which a persistence of successive observations would not provide a good forecast. In this context, a reference forecast based on the observation of the previous day at the same time might be more suitable.
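The difference between the two kinds of persistence reference discussed here can be made concrete with a short sketch. Both functions below are hypothetical illustrations and not part of MLAir; the toy series assumes hourly values with a perfectly repeating diurnal cycle.

```python
from typing import List


def persistence_forecast(history: List[float], lead_times: int) -> List[float]:
    """Repeat the last observation for every lead time."""
    return [history[-1]] * lead_times


def same_time_previous_day(history: List[float], lead_times: int,
                           steps_per_day: int = 24) -> List[float]:
    """Use the values observed one day earlier at the same time of day."""
    if len(history) < steps_per_day:
        raise ValueError("need at least one full day of history")
    if lead_times > steps_per_day:
        raise ValueError("lead time must not exceed one day in this sketch")
    n = len(history)
    # Lead time k targets time t+k; one day earlier is index n - steps_per_day + k - 1.
    return [history[n - steps_per_day + k - 1] for k in range(1, lead_times + 1)]


# Two days of hourly toy data whose value equals the hour of the day (0..23).
history = [float(hour % 24) for hour in range(48)]

plain = persistence_forecast(history, lead_times=3)
diurnal = same_time_previous_day(history, lead_times=3)
print(plain, diurnal)
```

For this idealized series, the plain persistence holds the late-evening value for the following hours, whereas the previous-day reference reproduces the values of the early morning hours of the next day.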
For the comparison, we use a skill score S, which is naturally defined as the performance of a new forecast compared to a competitive reference with respect to a statistical metric (Murphy and Daan, 1985). Applying the mean squared error as the statistical metric, such a skill score S reduces to unity minus the ratio of the error of the forecast to the error of the reference. A positive skill score can be interpreted as the percentage of improvement of the new model forecast in comparison to the reference. On the other hand, a negative skill score denotes that the forecast of interest is worse than the reference forecast. Consequently, a value of zero denotes that both forecasts perform equally (Murphy, 1988).
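With the mean squared error (MSE) as metric, the skill score described here reads S = 1 - MSE(forecast) / MSE(reference). The following minimal sketch computes it for toy numbers; the function names are our own choice for illustration and do not correspond to MLAir's internal implementation.

```python
from typing import Sequence


def mean_squared_error(forecast: Sequence[float], observation: Sequence[float]) -> float:
    """Average of the squared differences between forecast and observation."""
    if len(forecast) != len(observation) or len(forecast) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((f - o) ** 2 for f, o in zip(forecast, observation)) / len(forecast)


def skill_score(forecast: Sequence[float], reference: Sequence[float],
                observation: Sequence[float]) -> float:
    """Skill of `forecast` relative to `reference`: 1 - MSE_forecast / MSE_reference."""
    mse_reference = mean_squared_error(reference, observation)
    if mse_reference == 0:
        raise ValueError("skill score is undefined for a perfect reference")
    return 1.0 - mean_squared_error(forecast, observation) / mse_reference


observation = [2.0, 4.0, 6.0]
model = [2.0, 5.0, 6.0]        # toy model forecast, MSE = 1/3
reference = [4.0, 4.0, 4.0]    # toy reference forecast, MSE = 8/3

score = skill_score(model, reference, observation)
print(round(score, 3))
```

A forecast that is identical to the reference yields a score of zero, and a forecast with a larger error than the reference yields a negative score.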
The PlotCompetitiveSkillScore (Fig. 11) includes the comparison between the trained model, the persistence and the ordinary least squares regression. The climatological skill scores are calculated separately for each forecast step (lead time) and summarized as a full-detail box-and-whiskers plot over all stations and forecasts (Fig. 12), and as a simplified version showing the skill score only (not shown) using PlotClimatologicalSkillScore.
In addition to the statistical model evaluation, MLAir also allows assessing the importance of individual input variables through bootstrapping. For this, the time series of each individual input variable is resampled n times (with replacement) and then fed to the trained network. By resampling a single input variable, its temporal information is disturbed, but the general frequency distribution is preserved. The latter is important because it ensures that the model is provided only with values from a known range and does not extrapolate out-of-sample. Afterwards, the skill scores of the bootstrapped predictions are calculated using the original forecast as reference. Input variables that show a strongly negative skill score during bootstrapping have a stronger influence on the prediction than input variables with a small negative skill score. In case the bootstrapped skill score even reaches the positive value domain, this could be an indication that the examined variable has no influence on the prediction at all. The result of this approach applied to all input variables is presented in

Configuration of experiment, data handler, and model class in the MLAir workflow
Besides the workflow adjustments already described, MLAir offers a large number of configuration options. Instead of defining parameters at different locations inside the code, all parameters are set centrally in the experiment setup. In this section, we describe all parameters that can be modified and the authors' choices for default settings when using the default workflow of MLAir.

Host system and processing units
The MLAir workflow can be adjusted to the hosting system. For that, the local paths for experiment and data are adjustable (see Table 1 for all options). Both paths are deliberately kept separate. This has the advantage that the same data can be used multiple times for different experiment setups if stored outside the experiment path. In contrast to the data path, all created plots and forecasts are saved in the experiment_path by default, but this can be adjusted through the plot_path and forecast_path parameters.
Concerning the processing units, MLAir supports both central processing units (CPU) and GPUs. Due to their bandwidth optimization and efficiency on matrix operations, GPUs have become popular for ML applications (cf. Krizhevsky et al., 2012). Currently, the sample models implemented in MLAir are based on TensorFlow v1.13.1, which has distinct branches: the tensorflow-1.13.1 package for CPU computation and the tensorflow-gpu-1.13.1 package for GPU devices.
Depending on the operating system, the user needs to install the appropriate library if using TensorFlow releases 1.15 and older (TensorFlow, 2020). Apart from this installation issue, MLAir is able to detect and handle both TensorFlow versions during run time. An MLAir version to support TensorFlow v2 is planned for the future (see section 5).

In the course of preprocessing, the data are prepared to allow immediate use in training and evaluation without further preparation. In addition to the general data acquisition and formatting, which will be discussed in sections 4.3 and 4.4, preprocessing also handles the splitting into training, validation, and test data. All parameters discussed in this section are listed in Table 2.
Data are split into subsets along the temporal axis and by station, separating a hold-out data set (called test data) from the data that are used for training (training data) and model tuning (validation data). For each subset, a {train,val,test}_start and {train,val,test}_end date not exceeding the overall time span (see section 4.4) can be set. Additionally, for each subset it is possible to define a minimal number of available samples per station, {train,val,test}_min_length, to remove very short time series that potentially cause misleading results, especially in the validation and test phase. A spatial split of the data is achieved by assigning each station to one of the three subsets of data. The parameter fraction_of_training determines the ratio between hold-out data and data for training and validation, where the latter two are always split with a ratio of 80 % to 20 %, which is a typical choice for these subsets.
To achieve absolute statistical data subset independence, data should ideally be split along both the temporal and the spatial dimension. Since the spatial dependency of two distinct stations may vary related to weather regimes or season and time of day (Wilks, 2011), a spatial and temporal division of the data might be useful, as otherwise a trained model can presumably lead to over-confident results. On the other hand, by applying a spatial split in combination with a temporal division, the amount of utilizable data can drop massively. In MLAir, it is therefore up to the user to split data either along the temporal dimension only or along both dimensions by using the use_all_stations_on_all_data_sets parameter.
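As an illustration of the temporal split, the sketch below selects the samples of one station that fall between a start and an end date and discards the subset if it contains fewer samples than a minimal length. The helper select_period is hypothetical; treating the end date as inclusive is an assumption of this sketch and not a statement about MLAir's behaviour.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

Series = List[Tuple[date, float]]


def select_period(series: Series, start: date, end: date, min_length: int) -> Series:
    """Keep samples with start <= day <= end; drop the subset if it is too short."""
    subset = [(day, value) for day, value in series if start <= day <= end]
    return subset if len(subset) >= min_length else []


# Ten daily toy values for a single station.
first_day = date(2010, 1, 1)
series = [(first_day + timedelta(days=i), float(i)) for i in range(10)]

# (start, end, min_length) per subset, mimicking {train,val,test}_start,
# {train,val,test}_end and {train,val,test}_min_length.
periods: Dict[str, Tuple[date, date, int]] = {
    "train": (date(2010, 1, 1), date(2010, 1, 6), 3),
    "val": (date(2010, 1, 7), date(2010, 1, 8), 1),
    "test": (date(2010, 1, 9), date(2010, 1, 10), 3),
}

subsets = {name: select_period(series, *bounds) for name, bounds in periods.items()}
sizes = {name: len(subset) for name, subset in subsets.items()}
print(sizes)
```

In this toy example the test period contains only two samples and is therefore removed by the minimal length criterion, which is the intended effect for very short time series.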

Custom data handler
The integration of a custom data handler into the MLAir workflow is done by inheritance from the AbstractDataHandler class and implementation of at least the __init__() method and the accessors get_X() and get_Y(). The custom data handler is added to the MLAir workflow as a parameter without initialization. At runtime, MLAir then queries all the required parameters of this custom data handler from its arguments and keyword arguments, loads them from the data store and finally calls the constructor. If data need to be downloaded or preprocessed, this should be executed inside the constructor.
It is sufficient to load the data in the accessor methods if the data can be used without conversion. Note that a data handler is only responsible for a single data origin; the iteration and the distribution into batches are taken care of by MLAir.
The accessor methods for input and target data form a clearly defined interface between MLAir and the custom data handler.
During training the data are needed as NumPy arrays, whereas for preprocessing and evaluation the data are partly used as xarray objects.
Therefore the accessor methods have the parameter as_numpy and should be able to return both formats. Furthermore, it is possible to use an individual upsampling technique for training. To activate this feature, the parameter upsampling can be enabled. If such a technique is not used and therefore not implemented, the parameter has no further effect.
Two other methods do not return a value in the default implementation and do not necessarily have to be adapted. With the method transformation it is possible to either define or calculate the transformation properties of the data handler before initialization. The returned properties are then applied to all subsets, namely training, validation and testing. Another supporting class method is get_coordinates. This method is currently used only for the map plot for geographical overview (see section 3.2). To feed the overview map, this method must return a dictionary with the geographical coordinates indicated by the keys lat and lon.
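The mechanism of passing a data handler class without initialization and constructing it at runtime from stored parameters can be sketched as follows. This is a simplified illustration under our own assumptions: ToyDataHandler does not inherit from MLAir's AbstractDataHandler, the accessors omit the as_numpy and upsampling parameters, the data store is a plain dictionary, and build_from_store is a hypothetical helper.

```python
import inspect
from typing import Any, Dict, List


class ToyDataHandler:
    """Hypothetical handler for one station; not derived from MLAir classes."""

    def __init__(self, station: str, window_history_size: int = 2):
        self.station = station
        self.window_history_size = window_history_size
        # Downloading or preprocessing would happen here in the constructor.
        self._values = [1.0, 2.0, 3.0, 4.0, 5.0]

    def get_X(self) -> List[List[float]]:
        """Return input windows of window_history_size consecutive values."""
        w = self.window_history_size
        return [self._values[i:i + w] for i in range(len(self._values) - w)]

    def get_Y(self) -> List[float]:
        """Return the value that follows each input window."""
        return self._values[self.window_history_size:]


def build_from_store(handler_class: type, store: Dict[str, Any]):
    """Collect constructor arguments by name from the store and instantiate."""
    kwargs = {}
    parameters = inspect.signature(handler_class.__init__).parameters
    for name, param in parameters.items():
        if name == "self":
            continue
        if name in store:
            kwargs[name] = store[name]
        elif param.default is inspect.Parameter.empty:
            raise KeyError(f"required parameter '{name}' not found in store")
    return handler_class(**kwargs)


# The class itself, not an instance, is handed over together with the store.
store = {"station": "DEBW107", "window_history_size": 3, "batch_size": 32}
handler = build_from_store(ToyDataHandler, store)
print(handler.get_X(), handler.get_Y())
```

Entries of the store that the constructor does not ask for, such as batch_size here, are simply ignored, which mirrors the statement that batching is not the responsibility of the data handler.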

Default data handler
In this section we describe in detail a concrete implementation of a data handler, namely the DefaultDataHandler, which uses data from the JOIN interface.

Regarding the data handling and preprocessing, several parameters can be set to control the choice of inputs, size of data, etc. in the data handler (see Table 3). First, the underlying raw data must be loaded from the web. The current version of the DefaultDataHandler is configured for use with the REST API of the JOIN interface (Schultz et al., 2017c). Alternatively, data could already be available on the local machine in the directory data_path, e.g. from a previous experiment run. Additionally, a user can force MLAir to load fresh data from the web by enabling the overwrite_local_data parameter. According to the design structure of a data handler, data are handled separately for each observational station indicated by its ID. By default, the DefaultDataHandler uses all German air quality stations provided by the German Environment Agency (Umweltbundesamt, UBA) that are indicated as "background" stations according to the European Environmental Agency (EEA) Airbase classification (European Parliament and Council of the European Union, 2008). Using the stations parameter, a user-defined data collection can be created. To filter the stations, the parameters network and station_type can be used as described in the documentation of JOIN (Schultz et al., 2017c).
For the DefaultDataHandler, it is recommended to specify at least the number of preceding time steps to use for a single input sample (window_history_size), if and which interpolation should be used (interpolation_method), if and how many missing values are allowed to be filled by interpolation (limit_nan_fill), and how many time steps the forecast model should predict (window_lead_time).
Regarding the data content itself, each requested variable must be added to the variables list and be part of the statistics_per_var dictionary together with a proper statistic abbreviation (see documentation of Schultz et al., 2017c). If not provided, both parameters are chosen from a standard set of variables and statistics. Regarding the target variable, similar actions are required. First, target variables are defined in target_var, and second, the target variable must also be part of the statistics_per_var parameter. Note that the JOIN REST API calculates these statistics online from hourly values, thereby taking into account a minimum data coverage criterion. Finally, the overall time span the data shall cover can be defined via start and end, and the temporal resolution of the data is set with sampling. At this point, we want to refer to section 5, where we discuss the temporal resolution currently available.

The idea of using model classes was already motivated in section 2.4. Here, we show more details on the implementation and customization.
To achieve the goal of an easy plug-and-play behaviour, each ML model implemented in MLAir must inherit from the AbstractModelClass, and the methods set_model and set_compile_options must be overwritten for the custom model. Inside set_model, the entire model from inputs to outputs is created. It has to be ensured that the model is compatible with Keras so that it can be compiled. MLAir supports both the functional and the sequential Keras application programming interface. For details on how to create a model with Keras, we refer to the official Keras documentation (Chollet et al., 2015). All options for the model compilation should be set in the set_compile_options method. This method should at least include information on the training algorithm (optimizer) and the loss that measures performance during training and for which the model is optimized (loss). Users can add other compile options like the learning rate (learning_rate), metrics to report additional, merely informative performance metrics, or options regarding the weighting such as loss_weights, sample_weight_mode or weighted_metrics. Finally, methods that are not part of Keras or TensorFlow, like customized loss functions or self-made model extensions, must be added as so-called custom_objects to the model so that Keras can properly use them. For that, it is necessary to call the set_custom_objects method with all custom objects as key-value pairs. See also the official Keras documentation for further general information on custom objects.
An example implementation of a small model using a single convolution and three fully connected layers is shown in Fig. 14. By inheriting from the AbstractModelClass (l. 9), invoking its constructor (l. 15), defining the set_model (l. 25-35) and set_compile_options (l. 37-41) methods, and calling both of these methods (l. 21-22), the custom model is immediately usable in MLAir. Additionally, the loss is added to the custom objects (l. 23). This last step would not be necessary in this case, because an error function incorporated in Keras is used (l. 2 / 40). It is added nevertheless to demonstrate how to use a customized loss.
For another example we refer to Kleinert et al. (2021), who used extensions to the standard Keras library in their workflow. So-called inception blocks (cf. Szegedy et al., 2015) and a modification of the two-dimensional padding layers were implemented as Keras layers and could be used in the model afterwards. In such a case it is important to add the corresponding classes to the custom_objects, as mentioned above.
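The inheritance contract described in this section can be sketched without Keras. The following stand-in only mimics the role of the AbstractModelClass: ModelClassSketch, TinyModel, check_compile_options and the list of layer names are hypothetical and do not reproduce MLAir or Keras code. The method names set_model, set_compile_options and set_custom_objects are taken from the text.

```python
from abc import ABC, abstractmethod


class ModelClassSketch(ABC):
    """Simplified stand-in for an abstract model class (not MLAir code)."""

    REQUIRED_COMPILE_OPTIONS = ("optimizer", "loss")

    def __init__(self, input_shape, output_shape):
        self.input_shape = input_shape
        self.output_shape = output_shape
        self.model = None
        self.compile_options = {}
        self.custom_objects = {}

    @abstractmethod
    def set_model(self):
        """Build the entire model from inputs to outputs."""

    @abstractmethod
    def set_compile_options(self):
        """Define at least the optimizer and the loss."""

    def set_custom_objects(self, **kwargs):
        # Custom objects are registered as key-value pairs.
        self.custom_objects.update(kwargs)

    def check_compile_options(self):
        # Hypothetical check that the minimum compile options are present.
        missing = [k for k in self.REQUIRED_COMPILE_OPTIONS
                   if k not in self.compile_options]
        if missing:
            raise ValueError(f"missing compile options: {missing}")


class TinyModel(ModelClassSketch):
    def __init__(self, input_shape, output_shape):
        super().__init__(input_shape, output_shape)
        self.set_model()
        self.set_compile_options()
        self.set_custom_objects(loss=self.compile_options["loss"])
        self.check_compile_options()

    def set_model(self):
        # Placeholder: layer names instead of a real Keras model.
        self.model = ["conv1x1", "dense16", "dense16", f"dense{self.output_shape}"]

    def set_compile_options(self):
        self.compile_options = {"optimizer": "sgd", "loss": "mean_squared_error"}


model = TinyModel(input_shape=(7, 1, 9), output_shape=4)
```

A subclass that omits one of the two abstract methods cannot be instantiated, which is how the sketch enforces that both methods are overwritten.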

Training
With the parameters train_model and create_new_model, a halted or interrupted training can be resumed (or extended), or the training can be skipped entirely if the model was already trained before. Most parameters to set for the training stage are related to hyperparameter tuning (cf. Table 4). Firstly, the batch_size can be set. Furthermore, the number of epochs to train needs to be adjusted. Last but not least, the model itself must be provided to MLAir, including additional hyperparameters like the learning_rate, the algorithm to train the model (optimizer), and the loss function to measure model performance. For more details on how to implement an ML model properly we refer to section 4.5.
Due to its application focus on meteorological time series and therefore on solving a regression problem, MLAir offers a particular handling of training data. A popular technique in ML, especially in the image recognition field, is to augment and randomly shuffle data to produce a larger number of input samples with a broader variety. This method requires independent and identically distributed data. For meteorological applications, these techniques should be carefully selected, because most data are autocorrelated and not statistically independent (see also Schultz et al., 2021, forthcoming). To avoid generating over-confident forecasts, train and test data are split into blocks so that little or no overlap remains between the datasets. Another common problem in ML, not only in the meteorological context, is the natural under-representation of extreme values, i.e. an imbalance problem. To address this issue, MLAir allows placing more emphasis on such data points.
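The block-wise split can be sketched as follows. This is a minimal illustration, assuming daily time stamps and two split dates; the function split_in_blocks, the additional validation subset and the chosen dates are assumptions and do not reproduce MLAir's implementation.

```python
from datetime import date, timedelta


def split_in_blocks(days, train_end, val_end):
    """Assign each day to one contiguous block instead of sampling randomly."""
    subsets = {"train": [], "val": [], "test": []}
    for day in days:
        if day <= train_end:
            subsets["train"].append(day)
        elif day <= val_end:
            subsets["val"].append(day)
        else:
            subsets["test"].append(day)
    return subsets


days = [date(2010, 1, 1) + timedelta(days=i) for i in range(10)]
subsets = split_in_blocks(days, train_end=date(2010, 1, 6), val_end=date(2010, 1, 8))
```

Because every subset covers a contiguous period, autocorrelated neighbouring samples do not end up on both sides of the train/test boundary, except directly at the block edges.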
The weighting of data samples is conducted by an over-representation of values that can be considered as extreme regarding the deviation from a mean state in the output space. This can be applied during training by using the extreme_values parameter, which defines a threshold value at which a value is considered extreme. Training samples with target values that exceed this limit are then used a second time in each epoch. It is also possible to enter more than one value for the parameter.

In this case, samples with values that exceed several limits are duplicated according to the number of limits exceeded. For positively skewed distributions, it could be helpful to apply this over-representation only on the right tail of the distribution (extremes_on_right_tail_only). Furthermore, it is possible to shuffle data within, and only within, the training subset randomly by enabling permute_data.
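The duplication rule can be written down in a few lines. The sketch below assumes that the target values are already standardized, so that the deviation from the mean state is simply the value itself; the function oversample_extremes is hypothetical and only mirrors the behaviour described for extreme_values and extremes_on_right_tail_only.

```python
def oversample_extremes(targets, thresholds, right_tail_only=False):
    """Return sample indices; extremes are repeated once per exceeded threshold."""
    indices = []
    for i, value in enumerate(targets):
        # On the right tail only positive deviations count as extreme.
        magnitude = value if right_tail_only else abs(value)
        copies = 1 + sum(magnitude > limit for limit in thresholds)
        indices.extend([i] * copies)
    return indices


targets = [0.1, 1.2, -1.5, 2.3]  # assumed to be standardized target values
both_tails = oversample_extremes(targets, thresholds=[1.0, 2.0])
right_tail = oversample_extremes(targets, thresholds=[1.0, 2.0], right_tail_only=True)
```

With a single threshold, an extreme sample appears twice per epoch; with two thresholds, a sample exceeding both appears three times.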

The configuration of the ML model validation is related to the postprocessing stage. As mentioned in section 2.3, in the default configuration there are three major validation steps undertaken after each run besides the creation of graphics: First, the trained model is compared with the two reference models, a simple linear regression and a persistence prediction. Second, these models are compared with climatological statistics. Lastly, the influence of each input variable is estimated by a bootstrap procedure.
Because it is time-consuming or may be irrelevant for a custom workflow, the calculation of the input variable sensitivity can be skipped, and the graphics creation part can be shortened. To perform the sensitivity study, the parameter evaluate_bootstraps must be enabled, and number_of_bootstraps defines how many samples shall be drawn for the evaluation (cf. Table 5). If such a sensitivity study was already performed and the training stage was skipped, the create_new_bootstraps parameter should be set to False to reuse already preprocessed samples if possible. Regarding the creation of graphics, the parameter plot_list can be adjusted. If not specified, a default selection of graphics is generated. When using plot_list, each graphic to be drawn must be specified individually. More details about all possible graphics have already been provided in sections 3.2 and 3.3. In the current version, the validation as part of MLAir's default postprocessing stage cannot be easily extended, but it is still possible to append another run module to the workflow to perform an additional individual validation.
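The idea behind a bootstrap-based sensitivity estimate can be sketched as follows. This is a generic illustration, not MLAir's procedure: the function bootstrap_sensitivity, the use of the mean squared error and the toy model are assumptions. One input variable at a time is replaced by values resampled with replacement, and the resulting increase in error is taken as a measure of its influence.

```python
import random


def mse(predictions, observations):
    return sum((p - o) ** 2 for p, o in zip(predictions, observations)) / len(observations)


def bootstrap_sensitivity(model, inputs, observations, n_boots=20, seed=0):
    """Mean increase in MSE when one input variable is replaced by resampled values."""
    rng = random.Random(seed)
    reference = mse([model(row) for row in inputs], observations)
    influence = {}
    for var in inputs[0]:
        errors = []
        for _ in range(n_boots):
            # Draw values of this variable with replacement from all samples.
            resampled = [rng.choice(inputs)[var] for _ in inputs]
            perturbed = [{**row, var: value} for row, value in zip(inputs, resampled)]
            errors.append(mse([model(row) for row in perturbed], observations))
        influence[var] = sum(errors) / n_boots - reference
    return influence


# Toy setup: the "model" only uses variable "a" and ignores "b".
inputs = [{"a": float(i), "b": float(i % 3)} for i in range(30)]
observations = [2.0 * i for i in range(30)]
influence = bootstrap_sensitivity(lambda row: 2.0 * row["a"], inputs, observations)
```

In this toy setup the influence of "b" is zero because the model ignores it, whereas resampling "a" degrades the prediction.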

MLAir offers the possibility to define and execute a custom workflow for situations in which special calculations or data evaluations not available in the standard version are to be performed. For this purpose it is not necessary to modify the program code of MLAir; instead, user-defined run modules can be included in a new workflow. This is done analogously to the model class procedure, by inheriting from the base class RunEnvironment and programming the run module individually. Compared to the very simple examples from section 3, such a use of MLAir requires a slightly increased effort.

The implementation of the run module is done straightforwardly by a constructor method, which initializes the module and executes all desired calculation steps upon call. To execute the custom workflow, the MLAir Workflow class must be loaded and then each run module must be registered. The order in which the individual stages are added determines the execution sequence.
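The registration and execution order can be illustrated with a small stand-in. WorkflowSketch, SetupStage and CustomStage as written here are hypothetical and only mimic the behaviour described above; they are not the MLAir Workflow class or its run modules.

```python
class WorkflowSketch:
    """Simplified stand-in: stages are stored and executed in the order added."""

    def __init__(self):
        self._registry = []

    def add(self, stage, **kwargs):
        self._registry.append((stage, kwargs))

    def run(self):
        for stage, kwargs in self._registry:
            # Each stage performs its work inside its constructor.
            stage(**kwargs)


executed = []  # records the execution order for this illustration


class SetupStage:
    def __init__(self, epochs):
        executed.append(("SetupStage", epochs))


class CustomStage:
    def __init__(self, test_string):
        executed.append(("CustomStage", test_string))


workflow = WorkflowSketch()
workflow.add(SetupStage, epochs=128)
workflow.add(CustomStage, test_string="Hello World")
workflow.run()
```

Nothing is executed when a stage is added; all stages run in registration order once run is called.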
As custom workflows will generally be necessary if a custom run module is to be defined, we briefly describe how the central data store mentioned in section 2.3 interacts with the workflow module. With the data store it is possible to share any kind of information from previous or subsequent stages. By invoking the constructor of the super class during the initialization of a custom run module, the data store is automatically connected with this module. Information can then be set or queried using the accessor methods get and set. For each saved information object a separate namespace called scope can be assigned.
If not specified, the object is always stored in the general scope. If the scope is specified, a separate sub-scope is created.

Information stored in this scope memory cannot be accessed from the general scope memory, but conversely all sub-scopes have access to the general scope. For example, more general objects can be set in the general scope, and objects specific to a sub-data set, such as test data, can be stored under the scope test. If some objects for the keyword test are retrieved from the data store, then for non-existent objects in the test namespace attributes from the general scope are used if available.
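The lookup rule for scopes can be sketched with a small dictionary-based store. ScopedStoreSketch is a hypothetical stand-in with a single level of sub-scopes; the method names get and set and the scope names general and test follow the text, everything else is an assumption.

```python
class ScopedStoreSketch:
    """Simplified stand-in for a data store with a general scope and sub-scopes."""

    GENERAL = "general"

    def __init__(self):
        self._data = {}

    def set(self, name, obj, scope=GENERAL):
        self._data[(scope, name)] = obj

    def get(self, name, scope=GENERAL):
        if (scope, name) in self._data:
            return self._data[(scope, name)]
        # Sub-scopes fall back to the general scope, but not the other way round.
        if scope != self.GENERAL and (self.GENERAL, name) in self._data:
            return self._data[(self.GENERAL, name)]
        raise KeyError(f"{name!r} not found in scope {scope!r}")


store = ScopedStoreSketch()
store.set("epochs", 128)                    # stored in the general scope
store.set("start", "2011-01-01", scope="test")  # only visible in scope test
epochs_seen_from_test = store.get("epochs", scope="test")
```

A call such as store.get("start") raises a KeyError in this sketch, because objects stored in a sub-scope are not visible from the general scope.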
An example for the implementation of a custom run module embedded in a custom workflow can be found in Fig. 15. The custom run module named CustomStage inherits from the base class RunEnvironment (l. 4) and calls its constructor (l. 8) on initialization. The CustomStage expects a single parameter (test_string, l. 7) that is used during the run method (l. 11-15). The run method first logs two information messages by using the test_string parameter (l. 12-13). Then it extracts the value of the parameter epochs (l. 14), which has been set in the ExperimentSetup (l. 21), from the data store and logs the value of this parameter too. To run this custom run module, it has to be included in a workflow. First, an empty workflow is created (l. 19) and then individual run modules are attached (l. 21-23). As a last step, this newly defined workflow is executed by calling the run method (l. 25).

How to continue an experiment?
There can be different reasons for the continuation of an experiment. First of all, by looking at the monitoring graphs, it could be discovered that training has not yet converged and the number of epochs should be increased. Instead of training a new

Limitations
Even though MLAir addresses a wide range of ML related problems and allows embedding of many different ML architectures and customized workflows, it is still no universal Swiss Army knife, but focuses on the application of neural networks for the task of station time series forecasting. In this section we explain the limitations of MLAir and why MLAir ends at these points.
Due to the scientifically oriented development of MLAir starting from a specific research question (Kleinert et al., 2021), MLAir could initially only use data from the REST API of JOIN. This binding has already been removed in the current version; however, the DefaultDataHandler still uses this data source. Furthermore, MLAir always expects a particular structure in the data and especially considers the data as a collection of time series data from various stations. We are currently investigating the possibility of integrating grid data, which could be taken from a weather model, and timeless data such as topography into the MLAir workflow, but cannot yet present any results on how easy such an integration would be.
While MLAir can technically handle data in different time resolutions, it has been tested primarily on daily aggregated data due to the specific science case which served as seed for its development. The use of different temporal resolutions was spot-checked and showed no obvious errors, but we cannot guarantee that the results will be meaningful if data in other temporal resolutions are used as inputs. In particular, most of the evaluation routines may not make sense for data in less than hourly or greater than daily resolution. Note also that MLAir does not perform explicit error checking or missing value handling. Such functionality must be implemented within the data handler. MLAir expects a ready-to-use data set without missing values provided by the data handler during training.
Another limitation is the choice of the underlying libraries and their versions. Due to the selection of TensorFlow as backend, it is not possible to use PyTorch or other frameworks in combination with MLAir. Specifically, MLAir was developed and tested with TensorFlow version 1.13.1, as the HPC systems on which our experiments are performed support this version. We have already tested MLAir occasionally with TensorFlow version 1.15 and could not find any errors. However, due to the lack of extensive testing, we cannot yet make any reliable statement about the functionality with newer versions such as 1.15 or 2.X. It is planned to implement an updated version of MLAir with the new TensorFlow version 2.X as soon as our systems support this version without any problems.

Summary
MLAir is an innovative software package intended to facilitate high-quality meteorological studies using ML. By providing an end-to-end solution based on a specific scientific workflow of time series prediction, MLAir enables ML experiments in this domain to be conducted transparently and reproducibly. Due to the plug-and-play behaviour it is straightforward to explore different model architectures and change various aspects of the workflow or model evaluation. Although MLAir focuses on neural networks, it should be possible to include other ML techniques. Since MLAir is based on a pure Python environment, it is highly portable. It has been tested on various computing systems from desktop workstations to high-end supercomputers.
MLAir is under continuous development. Further enhancements of the program are already planned and can be found in the issue tracker (see annex code availability). Ongoing developments concern the extension of the statistical evaluation methods, the graphical presentation of the results and the flawless support of temporal resolutions other than daily aggregated data.
Through further code refactoring, MLAir will become even more versatile as the decoupling of individual components is being pushed forward. In particular, it is planned to structure the data handling in a more modular way so that differently structured data sources can be connected and used without much effort. We invite the community of meteorological ML scientists to participate in the further development of MLAir through comments and contributions to code and documentation. A good starting point for contributions is the issue tracker of MLAir.
Even if MLAir cannot be the all-encompassing environment for every kind of meteorological ML problem, we hope that MLAir can serve as a blueprint for application developments in this field, as it seeks to combine best practices from ML with best practices of meteorological model evaluation and data preprocessing. MLAir is thus a contribution to strengthen the integration of the communities of ML and meteorology or air quality research.

PlotClimatologicalSkillScore. Skill scores and terms are shown separately for all forecast steps (dark to light blue). In brief, CASE I to IV describe a comparison with climatological reference values evaluated on the test data. CASE I is the comparison of the forecast with a single mean value formed on the training and validation data and CASE II with the (multi-value) monthly mean. The climatological references for CASE III and IV are, analogous to CASE I and II, the single and the multi-value mean, however, on the test data. CASE I to IV are calculated from the terms AI to CIV. For more detailed explanations of the cases, we refer to Murphy (1988).

Figure 14. Example how to create a custom ML model implemented as model class. MyCustomisedModel has a single 1x1 convolution layer followed by two fully connected layers with a neuron size of 16, and the number of forecast steps. The model itself is defined in the set_model method whereas compile options as the optimizer, loss and error metrics are defined in set_compile_options. Additionally for demonstration, the loss is added as custom object which is not required because a Keras built-in function is used as loss.

** These parameters are set in the model class.
*** As default, a vanilla feed-forward neural network architecture will be loaded for workflow testing. The usage of such a simple network for a real application is at least questionable.