Gil et al. present a model based on Deep Neural Networks for estimating HONO concentrations in urban environments, using measurements of classical atmospheric pollutants and meteorological variables as input. Because HONO measurements are rarely available, the authors argue that using estimated HONO as input in photochemical models improves the calculation of the OH production rate and of O3 concentrations. This is an interesting and valuable piece of work. However, there are a few issues that should be resolved before this manuscript can be published in Geoscientific Model Development.
My main concern is the way the performance of the RND v0.1 model is evaluated, and the conclusions and recommendations that are drawn from that evaluation. For performance testing, both the training set and the testing set are used. This is not correct: the training and validation data should only be used for model building, and the performance assessment should be based on the test data alone. It is found that the model performance is much better for the period used for training and validation than for the test data. This is of course not surprising and indicates clear limitations of the model (e.g. over-fitting). The model performance assessment needs to be changed accordingly.
The model performance was particularly poor for the test data from April 2019. The authors explain this by the fact that the conditions during April 2019 were different from the conditions covered by the training data. This points to another important aspect that is entirely neglected in the current manuscript: under what conditions can the RND v0.1 model be applied with the performance determined here? What happens when the model is applied to conditions that are not covered by the training data (i.e. meteorological conditions and/or atmospheric pollutant concentrations outside the range covered in the training data)? It is very likely that applications of the proposed DNN model at other locations and during other times of the year will face this situation. This issue needs to be addressed.
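One simple way to address this would be to report the range of each input variable in the training data and to flag any application case that falls outside it. The following is only a minimal sketch of such a check, not the authors' method; the function names and the example values are hypothetical.

```python
def training_range(rows):
    """Return per-variable minima and maxima of the training inputs."""
    columns = list(zip(*rows))  # transpose rows into one tuple per variable
    return [min(c) for c in columns], [max(c) for c in columns]


def outside_training_range(row, lo, hi):
    """True if at least one input lies outside the training range."""
    return any(x < l or x > h for x, l, h in zip(row, lo, hi))


# Hypothetical training inputs with two variables per observation
train = [[10.0, 5.0], [40.0, 15.0], [25.0, 10.0]]
lo, hi = training_range(train)

# Hypothetical new observations to which the model would be applied
new = [[20.0, 12.0], [55.0, 8.0], [12.0, 2.0]]
flags = [outside_training_range(row, lo, hi) for row in new]
```

A per-variable range check like this is deliberately crude: it does not detect unusual combinations of variables that are individually within range, so it gives a lower bound on how often the model is extrapolating.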
The authors say in the abstract and in the introduction that the RND v0.1 model is proposed for calculating HONO mixing ratios in highly polluted urban environments. In the results section, the model is described as being fit for application in any urban area (page 6, line 172). The conditions (in terms of air pollutant concentrations) under which RND v0.1 can be applied should be stated more clearly.
The paper is generally well written. However, there are quite a few small linguistic errors, such as missing articles (e.g. page 3, line 84; page 5, lines 139 and 140; page 6, line 159) and incorrect grammar (e.g. it should consistently be "training and validation" instead of "train and validation", and often "testing" instead of "test"). The manuscript should be carefully checked and corrected once more.
Page 2, lines 54-56: The authors write about "the" model and "this underestimation". It is unclear which model is meant; the text seems to refer to photochemical models in general. Please make this clear and revise accordingly.
Page 3, line 70, should be "including data collection" instead of "including collecting data".
Page 4, lines 95-97. The 10th and 90th percentile mixing ratios for the input variables are given. It is not stated what the time basis of these values is: are they hourly or daily values? The temporal resolution should be provided.
Page 4, line 102. The terminology "chemical and meteorological parameters" is not correct here. By the usual convention, the inputs are denoted as "variables" and not as "parameters". The parameters are their weights in a statistical model. Please change.
Page 4, lines 105-107. The authors write that wind direction "should" be converted and there "should" be no missing values. From the text it seems clear that the authors have converted the measured wind direction and they have removed observations with missing values. I think the authors should rephrase the text so that it is clear what data conversion and selection steps have been done.
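To illustrate the level of detail that would make these steps unambiguous, here is a minimal sketch of one common preprocessing choice. It assumes that wind direction is encoded as sine and cosine components and that incomplete observations are dropped; the manuscript does not state which conversion was actually used, so both the encoding and the function names are assumptions.

```python
import math


def wind_direction_to_components(direction_deg):
    """Encode a wind direction in degrees as (sin, cos).

    This keeps 359 degrees and 1 degree close together, which a raw
    0-360 value does not. Assumed encoding, not taken from the manuscript.
    """
    rad = math.radians(direction_deg % 360.0)
    return math.sin(rad), math.cos(rad)


def is_missing(value):
    """True for None or a floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete(records):
    """Keep only observations in which every variable has a value."""
    return [r for r in records if not any(is_missing(v) for v in r.values())]


# Hypothetical observations: wind direction (degrees) and temperature
records = [
    {"wd": 359.0, "temp": 12.1},
    {"wd": 90.0, "temp": None},
    {"wd": float("nan"), "temp": 9.4},
]
complete = drop_incomplete(records)
```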
Page 4, equation 1. I stumbled over the notation F1 and F2. It seems that these are simply the observed minimum and maximum of variable x. Why not denote F1 and F2 as x_min and x_max? This would probably be clearer.
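For example, if equation 1 is a standard min-max scaling (an assumption on my part; the exact form should be checked against the manuscript), it could be written with the suggested notation as:

```latex
x_{\mathrm{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

Here x_norm is a hypothetical name for the scaled variable, and x_min and x_max are the observed minimum and maximum of x.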