Deep-learning-driven simulations of boundary layer clouds over the Southern Great Plains
Yunyan Zhang
- Final revised paper (published on 27 Aug 2024)
- Preprint (discussion started on 11 Mar 2024)
Interactive discussion
Status: closed
RC1: 'Comment on gmd-2024-25', Anonymous Referee #1, 28 Apr 2024
Summary
This study uses high-quality observations to develop a machine-learning-based scheme for predicting land-coupled boundary layer cloud fraction at a single point location in Oklahoma in the United States. The scheme consists of three machine learning models, which are used in tandem to arrive at cloud fraction predictions at each hour of the day between 8 AM and 6 PM local time. Inputs to these models consist of morning radiosonde profiles of relative humidity, potential temperature, and horizontal winds, surface meteorological conditions from the hour preceding and the hour coinciding with the prediction time, and predictions from intermediate steps. The models achieve moderate success on this prediction problem, accurately predicting the cloud base, and approximately predicting the cloud top height and the cloud fraction at 10 levels between the predicted cloud base and cloud top. The cloud fraction is generally underestimated, and the cloud top height is generally overestimated below roughly 2 km and underestimated above.
The authors then move on to applying their ML models using data from two reanalysis datasets as input, instead of observations. The idea behind this is to illustrate the shortcomings of these reanalysis datasets in simulating boundary layer clouds at this particular site, and use the ML models as a way to estimate whether errors in simulating the clouds can be attributed to errors in predicting underlying meteorological variables versus the errors introduced by the parameterization scheme used. They conclude that errors in ERA5 can be attributed mainly to the cloud parameterization, while errors in MERRA-2 can be attributed to both errors in the meteorological fields and cloud parameterization.
To me the most interesting aspect of this study was the fact that it trained ML models purely on observations. This was facilitated by the uniquely extensive observations taken at the ARM SGP site. This is a strength in one sense in that the ground truth has strong credibility, but it is also a weakness in another in that it limits the applicability of the trained models (and general approach) to a single point location. The parameterization strategy is also less applicable to general circulation models (GCMs), since GCMs simulate vertical profiles of fields at all grid points at every timestep, so (unlike in the case limited by observations) there is no need to temporally separate vertical profiles from surface meteorological quantities. Nevertheless, it is useful to see that a machine learning model can be trained to predict observed boundary layer clouds better than existing physical parameterizations in models used to produce reanalysis data, at least in an isolated setting. For greater impact, a more generalizable model will be key, but that can be saved for discussion of future work. This study could be worth publishing after addressing some comments and questions.
General comments
- A cleaner and more complete description of the machine learning approach could be helpful. For instance I think the feature importance scores in Table 1, which are illustrated in a more interpretable way in Figure 3, could be replaced by some more metadata about the predictor and target fields (e.g. the short names could be accompanied by longer descriptions, including the data source; see e.g. Table 1 in Payami et al., 2024). The network structures I think could be described in the text. In addition some more details about the networks could be provided. What activation functions were used between the layers? What was the optimizer used to train the networks? What were the loss functions? What was the batch size used during training? What was the learning rate?
- The structure of the overall model is complicated. In particular, the way of separating the predictions of the top and bottom of the cloud layer from the cloud fraction within the cloud layer is unusual (as opposed to simply predicting a cloud fraction at a static set of vertical levels). How was this arrived at? In addition, how were the inputs chosen? There are many, and some could be considered redundant. For instance, I gather that the BLH_P, BLH_SH, and LCL inputs are derived from fields that overlap in part with other inputs; does omitting those and retraining lead to significant degradation in skill? Also, instead of month and local time, could something more physical like insolation be used, which would capture both effects and be better suited for generalizability? (See the sketch after this list.)
- It is acknowledged briefly as future work, but what challenges might be present in trying to apply this approach globally? One aspect that stands out is that we do not have such high-quality detailed observations of clouds and radiosonde profiles everywhere. How would one address that? Data-driven models typically struggle with generalization, so it is unlikely that the model trained for this specific location would be drop-in applicable in other synoptic regions without being exposed to more diverse training data.
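To make the insolation suggestion in the second comment above concrete, here is a minimal sketch (not the authors' code; the latitude value and the declination/hour-angle formulas are standard approximations) of collapsing month and local time into a single top-of-atmosphere insolation predictor:

```python
import numpy as np

S0 = 1361.0      # solar constant, W m-2
LAT_SGP = 36.6   # approximate latitude of the ARM SGP Central Facility, degrees N

def toa_insolation(day_of_year, local_solar_hour, lat_deg=LAT_SGP):
    """Approximate top-of-atmosphere insolation (W m-2) for a given day of year
    and local solar hour, replacing separate month and local-time inputs."""
    lat = np.deg2rad(lat_deg)
    # Solar declination (simple cosine approximation)
    decl = np.deg2rad(-23.44 * np.cos(2.0 * np.pi * (day_of_year + 10) / 365.0))
    # Hour angle: zero at solar noon, 15 degrees per hour
    hour_angle = np.deg2rad(15.0 * (local_solar_hour - 12.0))
    cos_zenith = (np.sin(lat) * np.sin(decl)
                  + np.cos(lat) * np.cos(decl) * np.cos(hour_angle))
    return S0 * np.maximum(cos_zenith, 0.0)

# Example: mid-June vs. mid-December at 10 LT
print(toa_insolation(166, 10.0), toa_insolation(349, 10.0))
```

A single predictor of this kind would carry both the seasonal and diurnal information and transfers naturally to other latitudes, which month and local time alone do not.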
Specific comments
Lines 26-28: this sentence is not clear. Should it be something like "Morning meteorological profiles are the initial conditions and then triggers for the formation of BLCs are identified from surface fields."?
Lines 47-48: "These clouds [...] are the critical part for weather prediction and climate modeling [...]." I might switch from "the critical part" to "a critical part," since clouds are not the only important feature to get right for weather or climate modeling.
Line 78: O'Gorman and Dwyer (2018) did not use observational data; they aimed to use ML to merely emulate (rather than improve upon) a convection scheme in an idealized model. Similarly neither did Gentine et al. (2018); they derived an ML parameterization of convection using data from a more expensive super-parameterized simulation. I think Zhang et al. (2021) is the only study cited here that can be said to have used observational data.
Lines 96-98: "By serving as the cloud parameterization in the reanalysis data, this model advanced the capability of low cloud simulations within reanalysis frameworks." I think I get what is being said here, but it is important to emphasize that this is an offline approach, meaning the clouds are predicted based on output data and not embedded in the simulations that produce the reanalysis data itself (thus they cannot affect things like the radiative heating rates and fluxes in the reanalysis data).
Lines 104-109: it might be helpful to emphasize—if I understand correctly—that while ARM SGP takes measurements of some fields at an array of locations across the general SGP region, radiosondes are only launched regularly at this one particular point location, and therefore this study pertains only to that spot. This is quite different from many ML studies, which use data from either reanalysis or climate model simulations for training; such data is not directly observed (and so can have its own internal biases) but is at least global in nature, without any missing data in time or space. Citing a paper like Sisterson et al. (2016) might be helpful for those who want more historical background on the SGP site.
Line 188: "Launched routinely at multiple times daily [...]" Can this be quantified in some way? E.g. approximately how many times per day is it done? Is the important aspect for this study that a radiosonde was launched roughly every morning? Is that at a particular time of day?
Lines 169-172: it could be helpful to note the purpose of this reanalysis data up front, contrasting it to the purpose of the observational data described earlier. As I understand it, the reanalysis data is mainly used as a way to illustrate how boundary layer clouds are misrepresented in common data sources and as a way to try to disentangle why that might be the case. Unlike the observational data, it is not used in any way to train the ML models.
Lines 196-197: "models are purpose-built to simulate the initiation, positioning, and vertical extent of BLCs." It might also be worth adding "at the SGP site," since these models would likely not be sufficient at other locations given the limitations of the training dataset.
Lines 212-216: "To represent the vertical structure of BLC, we equally segmented the cloud layer from the base to the top into ten levels. For each of these levels, our deep learning models calculate individual cloud fraction values." So the vertical position of the layers your models calculate cloud fraction for changes depending on the cloud base and cloud top? How would the cloud fraction network know what portion of the morning profiles was most relevant to the cloud fraction? Why was this more complicated model architecture chosen instead of simply skipping straight to predicting a cloud fraction at a static set of vertical levels?
Table 1: why is the trigger value an input to the other two models instead of just using the other two models only when the predicted trigger value is greater than 0.5? If I understand correctly, with the current approach there is no guarantee that the classification statistics presented in Figure 4 will be relevant in the full problem.
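For concreteness, a minimal sketch of the gating alternative described in this comment (the model objects, output shapes, and the 10-level fraction output are assumptions for illustration, not the authors' pipeline):

```python
import numpy as np

def predict_clouds(trigger_model, position_model, fraction_model, x, threshold=0.5):
    """Run the position and fraction models only where the trigger model
    predicts BLC occurrence, instead of passing the trigger value as a feature."""
    p_trigger = np.ravel(trigger_model.predict(x))   # assumed probability of BLC occurrence
    triggered = p_trigger >= threshold

    n = x.shape[0]
    cloud_base = np.full(n, np.nan)
    cloud_top = np.full(n, np.nan)
    cloud_fraction = np.zeros((n, 10))                # 10 levels between base and top

    if triggered.any():
        pos = position_model.predict(x[triggered])    # assumed shape (m, 2): base, top
        cloud_base[triggered] = pos[:, 0]
        cloud_top[triggered] = pos[:, 1]
        cloud_fraction[triggered] = fraction_model.predict(x[triggered])
    return cloud_base, cloud_top, cloud_fraction
```

With this kind of gating, the classification skill reported for the trigger model maps directly onto the full prediction problem.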
Line 227: what is the strength of the L2 regularization?
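For clarity, the quantity being asked for is the coefficient multiplying the squared-weight penalty, e.g. as it would be specified in a Keras layer (the value below is purely illustrative, not taken from the manuscript):

```python
from tensorflow.keras import layers, regularizers

# The "strength" is the lambda in loss += lambda * sum(w**2); 1e-4 is illustrative.
dense = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))
```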
Lines 244-246: "Additionally we incorporate datasets from 2017-2020 as part of our validation process, specifically focusing on data from the untrained period to assess the model's performance." If I understand correctly, this is your "test" dataset in ML parlance. Therefore I might rephrase this as "Additionally we save data from 2017-2020 for testing, specifically focusing on data from this untrained period to assess the model's performance."
Lines 246-248: "The training and validations are both using the more than 20-year BLC observations, as well as the ARMBE products." I'm not sure I totally follow this sentence, since the previous few sentences describe the training data / validation datasets as coming from 1998 - 2016 (which is less than 20 years) and the test dataset coming from 2017 - 2020 (which is also less than 20 years). In general I'm not sure what this sentence adds, since having data from these various sources for the time periods cited (which, yes, are in aggregate over 20 years) is already implied, so I think it could be removed.
Lines 285-287: for the morning profiles, which as I understand it are multiple input features each, I take it this permutation was done using all the profile values for a particular variable at once? This seems reasonable, but might be worth describing in the manuscript.
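A minimal sketch of such a grouped permutation (function and array names are hypothetical), in which every vertical level of one morning-profile variable is shuffled with the same permutation and the importance is read off as the increase in error:

```python
import numpy as np

def grouped_permutation_importance(model, X, y, group_cols, metric, n_repeats=10, seed=0):
    """Increase in error when all columns belonging to one input variable
    (e.g. every vertical level of the morning RH profile) are permuted together.
    `metric` is assumed to be an error measure such as mean squared error."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    increases = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        rows = rng.permutation(X.shape[0])
        X_perm[:, group_cols] = X[rows][:, group_cols]   # same shuffle for every level
        increases.append(metric(y, model.predict(X_perm)) - baseline)
    return float(np.mean(increases))
```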
Lines 312-320: it is a bit odd to describe these specific input parameters—and how they were derived—only at the moment when describing feature importance (and after discussing some sample model predictions). It would be better to describe this earlier when describing the structure of the different models, e.g. in Section 3.1.
Table 1: I'm not sure I see the value of presenting the precise numerical importance scores in addition to the bar chart in Figure 3 (I find the bar chart more interpretable).
Line 338: "to identify and simulate from surface meteorology." Should this also include a reference to the morning radiosonde inputs?
Line 357: "Table 2 complements the Figure 4" It seems Table 2 is completely redundant with Figure 4. I would probably keep Figure 4, since it includes slightly more information.
Figure 4: where are the F1 scores shown? From what I can tell, precision, recall, and accuracy are shown, but not F1 scores. Also I do not think it is important to explicitly show the performance within the training data. What matters most is the performance on the held out test data.
Lines 361-364: "The table highlights the model's robustness, with overall accuracy rates of 92.3% for the trained period and a slightly reduced but still substantial 89.2% for the untrained period." Given that the datasets are imbalanced (i.e. there are fewer occurrences than non-occurrences) the accuracy is perhaps not the best metric to highlight. The precision and recall are both reasonably high, and might be better to highlight. See discussion in this TensorFlow tutorial regarding classification of imbalanced data, in particular the note about the accuracy metric: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data.
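As an illustration, these metrics are straightforward to report alongside accuracy; a minimal scikit-learn sketch with placeholder arrays standing in for the observed and predicted occurrence on the test set:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder arrays, not the study's data
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 1])

for name, fn in [("accuracy", accuracy_score), ("precision", precision_score),
                 ("recall", recall_score), ("F1", f1_score)]:
    print(f"{name}: {fn(y_true, y_pred):.3f}")
```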
Figure 6: again I think showing the results on the held out dataset alone is standard and sufficient.
Technical corrections
Line 28: "offer" -> "offers"
Lines 41-42: "stratiforms and shallow cumuli" -> "stratiform and shallow cumulus types"
Line 53: "of land surface" -> "of the land surface"
Line 57: "simulating the boundary layer clouds" -> "simulating boundary layer clouds"
Line 88: "structural structure" -> "structure"
Line 90: "diurnal-varying" -> "diurnally varying"
Figure 6: "Independant" -> "Independent"
Figure 12: "attribute" -> "attributed"
Line 590: "Sesning" -> "Sensing"
References
Payami, M., Choi, Y., Salman, A. K., Mousavinezhad, S., Park, J., & Pouyaei, A. (2024). A 1D CNN-Based Emulator of CMAQ: Predicting NO2 Concentration over the Most Populated Urban Regions in Texas. Artificial Intelligence for the Earth Systems, 3(2). https://doi.org/10.1175/AIES-D-23-0055.1
Sisterson, D. L., Peppler, R. A., Cress, T. S., Lamb, P. J., & Turner, D. D. (2016). The ARM Southern Great Plains (SGP) Site. Meteorological Monographs, 57(1), 6.1-6.14. https://doi.org/10.1175/AMSMONOGRAPHS-D-16-0004.1
Citation: https://doi.org/10.5194/gmd-2024-25-RC1
AC1: 'Reply on RC1', Tianning Su, 12 Jun 2024
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2024-25/gmd-2024-25-AC1-supplement.pdf
RC2: 'Review of the original manuscript - MAJOR REVISIONS', Anonymous Referee #2, 04 May 2024
A deep-learning algorithm is presented that forecasts boundary-layer clouds based on morning soundings and surface fluxes. The network is trained and validated with observational data from ARM’s Southern Great Plains (SGP) site. The skill of the new model is analyzed in terms of cloud triggering, vertical structure, and cloud fraction. An attribution of forecast errors is undertaken based on the network model, which illustrates major factors influencing the representation of boundary-layer clouds in the context of this model.
The authors present a technically involved analysis using a broad scope of observational and modeling data, which are synthesized into a deep-learning neural-network model to represent boundary-layer clouds over ARM’s Southern Great Plains (SGP) site. The analysis is novel and suits the scope of GMD well. In some aspects, however, the quality of the manuscript does not hold up to the standards of the journal. This holds for (i) the methodological description (which in its present form would not allow the findings to be reproduced), (ii) the use of English language, and (iii) the contextualization of the work in the scope of existing models (mixed-layer models) for the problem considered. I therefore recommend that the editors reconsider the manuscript for review after the comments below have been addressed.
Cross Review
I agree with the comments raised by anonymous Reviewer #1; in particular, I subscribe to their first general comment demanding a cleaner and more complete description of the machine-learning procedure.
Major remarks
- The methodological description of the deep-learning model is rather obscure. Even after carefully reading the manuscript, it remains unclear what exactly is input to the model and what is the output. Does the model forecast an entire day or just a single instance? What is given in terms of the surface parameters – the evolution of fluxes up to the moment of forecast, or the value at this time? How exactly are the trigger, vertical position, and horizontal cloud fraction components of the model related? Shouldn’t the vertical position and horizontal fraction constrain each other from the perspective of available humidity? Or does one trigger the other – if so, in what direction: does a non-zero cloud fraction trigger the vertical positioning, or vice versa?
- Data representativity should be assessed and discussed. The entire analysis focuses on a single site. This comes with two major restrictions that need careful consideration:
  - The current findings are constrained to the surface configuration of the SGP site; as such the insight into cloud–surface coupling may be substantially limited.
  - When comparing to reanalysis data, it needs to be taken into account that the reanalysis is representative of a large area – in comparison to the local nature of the DNN model and observation data. What is the local heterogeneity of cloud fields in the area covered by an ERA5 or MERRA grid cell? In other words – to what extent is the current method to be understood as a downscaling rather than an improved parameterization? To a lesser, but possibly non-negligible (?), extent, this also applies to time-locality.
- Embedment in large-scale models would break the physical consistency of the reanalyzed state. While I agree that there is merit in the (local) DNN cloud representation vs. the large-scale reanalysis, I doubt that the “representation” of clouds can be improved by simply using the DNN output in the context of the reanalyses. We should not forget that the reanalysis produces a heavily constrained, physically consistent approximation to the observations. Simply changing the cloud representation would thus most likely deteriorate many other parameters, as it would break this consistency.
- Relation to mixed-layer (single-column) models – what is the added benefit of the rather complex DNN approach in comparison to simple low-order mixed-layer models, which also capture the daily evolution of the boundary layer given an initial state and a time series of surface fluxes? (A minimal sketch of such a model follows this list.) From a fundamental viewpoint, these models have a number of advantages over the DNN: they are (i) a lot cheaper to run, (ii) contain physical reasoning, and (iii) are available for many of the large-scale models and thus do not need to be implemented.
- Style and use of English language. The manuscript is full of syntactical and grammatical errors, which partly obscure the scientific content. It needs to be carefully checked by a language editor before it may be reconsidered for publication. I will list some recurring errors in my technical comments, but this list is by no means complete.
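To make the mixed-layer comparison in the fourth point above concrete, here is a minimal zero-order (encroachment) mixed-layer sketch of the class of low-order models referred to; the lapse rate, heat flux, and initial state are illustrative values, not taken from the manuscript:

```python
import numpy as np

def encroachment_bl(theta0=295.0, gamma=5e-3, flux_peak=0.15, hours=10, dt=60.0, h0=100.0):
    """Zero-order (encroachment) mixed-layer growth driven by a sinusoidal
    surface kinematic heat flux. gamma is the free-tropospheric lapse rate
    dtheta/dz (K m-1); flux_peak is the midday kinematic heat flux (K m s-1)."""
    n = int(hours * 3600 / dt)
    t = np.arange(n) * dt
    flux = flux_peak * np.maximum(np.sin(np.pi * t / (hours * 3600)), 0.0)
    h = np.empty(n)
    theta = np.empty(n)
    h[0], theta[0] = h0, theta0
    for i in range(1, n):
        dtheta = flux[i - 1] / h[i - 1] * dt        # mixed-layer warming
        theta[i] = theta[i - 1] + dtheta
        h[i] = h[i - 1] + dtheta / gamma            # growth into the stable layer above
    return t / 3600.0, h, theta

t_h, h, theta = encroachment_bl()
print(f"BL depth after {t_h[-1]:.0f} h: {h[-1]:.0f} m, mixed-layer theta: {theta[-1]:.1f} K")
```

Comparing the DNN against this kind of baseline (e.g. boundary-layer depth versus the LCL for cloud triggering) would help quantify the added value of the deep-learning approach.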
Minor remarks
- l. 86: ‘comprehensive data’ – what kind of data? Please use a more telling attribute!
- l.89 ‘By assimilating morning radiosonde observations’ – the procedure of deep learning is not really an assimilation (please check throughout the manuscript!)
- l. 91 ‘ […] uniquely positioned to unravel the complex initiation […]’ Even after careful assessment of the manuscript, I do not agree on this statement: First, the model is not uniquely positioned as other models exist that can cope with the processes in question (mixed-layer models, LES, etc.); Second, I do not agree that it unravels the initiation and evolution (which would correspond to a causal attribution.)
- Paragraph lines 82-92. At the end of this paragraph, it remains unclear what is training, input and output data for the DNN. This should be clarified here, at least qualitatively.
- Section 2: “Data and instruments” – the section title does not reflect the contents; it also includes the cloud detection algorithm and a regime classification.
- l. 149 Why does CBH need to align with the LIDAR-detected PBL top? A BLC can also be initiated far below the PBL top…
- l. 157/8 “typically lasting more than three hours”. I suppose this is a threshold for automatic detection, so is it three hours, or not? (“Typically” is rather confusing here…)
- l. 153-166 The classification is unclear to me. Is it done per day or per situation? For the stratiform cases, there is a three-hour threshold, but for the cumulus cases, there is no threshold in terms of duration (other than that the clouds emerge after local sunrise). So, how is it possible to characterize days then – these criteria could be evaluated separately for any situation, and certainly there are days in which regime shifts occur.
- Regarding normalization: The target data (cloud trigger, cloud vertical structure, and cloud fraction) is already normalized (binary [0,1], binary vector with elements in [0,1], real vector with elements in the range [0,1]). Why does normalization need to be applied here? In what sense would it help?
- Fig. 1 needs to be improved and is inconsistent with part of the main text.
- The RH profiles / morning SONDE is duplicated information and should be within the box entitled “input”. Are “profiles” the same as “soundings”? Or is there a difference?
- LT and MONTH should be combined into the azimuth angle, as this is the actual physical parameter that matters and is merely encoded by MONTH and LT.
- What is the difference between surface meteorology and surface fluxes? In my understanding, surface fluxes are part of the surface meteorological parameters…
- The vertical alignment reflecting the geometry of the boundary layer is not appropriate here and is confusing. A schematic focusing on input and output would help.
- If the models for cloud position, cloud fraction and trigger are independent entities, they should be reflected by separate hidden layers.
- The tag “cloud position” is potentially confusing here; I suggest rephrasing it as “cloud structure” or “vertical position”, as position alone might be mistaken for horizontal position.
- The relation between cloud trigger, cloud position, and cloud fraction should be made clearer; if I understand correctly, the trigger is part of the input for the cloud position and cloud fraction models, but this is not appropriately reflected in the schematic.
- What is meant by time indicators?
- Are the meteorological parameters / fluxes input as instantaneous values or daily time series? Correspondingly: is the output produced per time instant or rather per day?
- Tab. 1 / Text Why is data used if it contributes a negative feature importance?
- l. 338/339 “measurements” and “surface meteorology” – what exactly is meant here?
- l. 398 “Parameterization” – the DNN model has no parameterization for the cloud top.
- l. 491 / l. 559 (also compare major point #3) “a more accurate representation” – the DNN is employed here as an offline, a posteriori analysis tool. While you convincingly argue that the DNN has the skill to yield a better cloud field, it is misleading to talk about a “better representation”, as the DNN is run offline. In fact, the cloud field modified by the DNN is most likely inconsistent with the reanalysis! So there is no better representation of clouds by just applying the DNN.
- l. 577/8 “advancing our understanding of BLC dynamics” – this is not true. While we get improved cloud fields, the DNN tells little about the dynamics; in fact, the point of ML / deep learning is that we can have forecasts without understanding of the dynamics.
- l. 577/8 “improving the representation of low clouds” – see the point above for lines 491/559.
Technical comments / Typos
- l. 97: ‘this model’ – which model?
- l. 99: ‘we strive to narrow the gaps in boundary layer clouds’ – bad style, please rewrite!
- l. 139/140 Syntax incorrect.
- l. 147/148 please cite the data by their DOI (which is provided on the ARM website)
- l. 148 The abbreviation CBH for cloud base height is introduced too late; the phrase has been used before.
- l. 183 ‘a advanced’ – please correct!
- l. 581 What are “synoptic regions”?
Citation: https://doi.org/10.5194/gmd-2024-25-RC2
AC2: 'Reply on RC2', Tianning Su, 12 Jun 2024
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2024-25/gmd-2024-25-AC2-supplement.pdf