the Creative Commons Attribution 4.0 License.
Deep Learning Driven Simulations of Boundary Layer Cloud over the US Southern Great Plains
Abstract. This study developed a deep learning model to simulate the complex dynamics of boundary layer clouds (BLCs) over the US Southern Great Plains. Using over twenty years of extensive observations from the Atmospheric Radiation Measurement program for training and validation, the model diagnoses BLCs from the perspective of cloud-land coupling. Morning meteorological profiles serve as the initial conditions, and triggers for BLC formation are then identified from surface meteorology. The deep learning model offers accurate simulations of the convection initiation and cloud base of BLCs. In comparison with reanalysis data (i.e., ERA-5 and MERRA-2), it provides a notable improvement in the vertical structure of low clouds from a climatological perspective. The deep learning model can serve as a cloud parameterization and be extended to analyze stratiform and cumulus clouds within reanalysis frameworks, offering insights into improving the simulation of BLCs. By quantifying biases due to various meteorological factors and parameterizations, this deep learning-driven approach bridges the observational-modeling divide. Surface humidity and parameterization emerge as key limiting factors affecting the representation of BLCs in the reanalysis data. This deep learning approach holds promise for improving convection parameterization and advancing model diagnostics in weather forecasting and climate modelling.
Status: open (until 16 May 2024)
RC1: 'Comment on gmd-2024-25', Anonymous Referee #1, 28 Apr 2024
Summary
This study uses high quality observations to develop a machine-learning-based scheme for predicting land-coupled boundary layer cloud fraction at a single point location in Oklahoma in the United States. The scheme consists of three machine learning models, which are used in tandem to arrive at cloud fraction predictions at each hour of the day between 8 AM and 6 PM local time. Inputs to these models consist of morning radiosonde profiles of relative humidity, potential temperature, and horizontal winds, surface meteorological conditions from the hour preceding and hour coinciding with the prediction time, and predictions from intermediate steps. The models achieve moderate success on this prediction problem, accurately predicting the cloud base, and approximately predicting the cloud top height and cloud fraction at 10 levels between the predicted cloud base and cloud height. The cloud fraction is generally underestimated, and the cloud top height is generally overestimated below roughly 2 km and underestimated above.
The authors then move on to applying their ML models using data from two reanalysis datasets as input, instead of observations. The idea behind this is to illustrate the shortcomings of these reanalysis datasets in simulating boundary layer clouds at this particular site, and use the ML models as a way to estimate whether errors in simulating the clouds can be attributed to errors in predicting underlying meteorological variables versus the errors introduced by the parameterization scheme used. They conclude that errors in ERA5 can be attributed mainly to the cloud parameterization, while errors in MERRA-2 can be attributed to both errors in the meteorological fields and cloud parameterization.
To me the most interesting aspect of this study was the fact that it trained ML models purely on observations. This was facilitated by the uniquely extensive observations taken at the ARM SGP site. This is a strength in one sense in that the ground truth has strong credibility, but it is also a weakness in another in that it limits the applicability of the trained models (and general approach) to a single point location. The parameterization strategy is also less applicable to general circulation models (GCMs), since GCMs simulate vertical profiles of fields at all grid points at every timestep, so (unlike in the case limited by observations) there is no need to temporally separate vertical profiles from surface meteorological quantities. Nevertheless, it is useful to see that a machine learning model can be trained to predict observed boundary layer clouds better than existing physical parameterizations in models used to produce reanalysis data, at least in an isolated setting. For greater impact, a more generalizable model will be key, but that can be saved for discussion of future work. This study could be worth publishing after addressing some comments and questions.
General comments
- A cleaner and more complete description of the machine learning approach could be helpful. For instance I think the feature importance scores in Table 1, which are illustrated in a more interpretable way in Figure 3, could be replaced by some more metadata about the predictor and target fields (e.g. the short names could be accompanied by longer descriptions, including the data source; see e.g. Table 1 in Payami et al., 2024). The network structures I think could be described in the text. In addition some more details about the networks could be provided. What activation functions were used between the layers? What was the optimizer used to train the networks? What were the loss functions? What was the batch size used during training? What was the learning rate?
- The structure of the overall model is complicated. In particular the way of separating the predictions of the top and bottom of the cloud layer from the cloud fraction within the cloud layer is unusual (as opposed to simply predicting a cloud fraction at a static set of vertical levels). How was this arrived upon? In addition, how were the inputs chosen? There are many, and to some extent some could be considered redundant. For instance I gather that the BLH_P, BLH_SH, and LCL inputs are derived from fields that overlap in part with other inputs; does omitting those and retraining lead to significant degradation in skill? Also instead of month and local time, could something more physical like insolation be used, which would capture both effects, and be better suited for generalizability?
- It is acknowledged briefly as future work, but what challenges might be present in trying to apply this approach globally? One aspect that stands out is that we do not have such high-quality detailed observations of clouds and radiosonde profiles everywhere. How would one address that? Data-driven models typically struggle with generalization, so it is unlikely that the model trained for this specific location would be drop-in applicable in other synoptic regions without being exposed to more diverse training data.
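The second general comment suggests replacing the month and local-time inputs with a single physical predictor such as insolation. A minimal sketch of that idea, using the standard 23.45° solar-declination approximation (the latitude value and function name are illustrative assumptions, not taken from the manuscript):

```python
import math

def toa_insolation(lat_deg, day_of_year, solar_hour, s0=1361.0):
    """Top-of-atmosphere insolation (W m-2) from latitude, day of year,
    and local solar hour -- one physical predictor that folds together
    the 'month' and 'local time' inputs used by the authors."""
    # Cooper's approximation for solar declination (radians)
    decl = math.radians(23.45) * math.sin(2 * math.pi * (284 + day_of_year) / 365)
    # hour angle: 15 degrees per hour away from solar noon
    hour_angle = math.radians(15.0 * (solar_hour - 12.0))
    lat = math.radians(lat_deg)
    cos_zenith = (math.sin(lat) * math.sin(decl)
                  + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    # clamp to zero when the sun is below the horizon
    return s0 * max(0.0, cos_zenith)
```

For the SGP site (roughly 36.6°N), this yields higher values at noon in June than in December and zero at night, so the seasonal and diurnal cycles the month/time inputs encode are captured by one transferable quantity.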
Specific comments
Lines 26-28: this sentence is not clear. Should it be something like "Morning meteorological profiles are the initial conditions and then triggers for the formation of BLCs are identified from surface fields."?
Lines 47-48: "These clouds [...] are the critical part for weather prediction and climate modeling [...]." I might switch from "the critical part" to "a critical part," since clouds are not the only important feature to get right for weather or climate modeling.
Line 78: O'Gorman and Dwyer (2018) did not use observational data; they aimed to use ML to merely emulate (rather than improve upon) a convection scheme in an idealized model. Similarly neither did Gentine et al. (2018); they derived an ML parameterization of convection using data from a more expensive super-parameterized simulation. I think Zhang et al. (2021) is the only study cited here that can be said to have used observational data.
Lines 96-98: "By serving as the cloud parameterization in the reanalysis data, this model advanced the capability of low cloud simulations within reanalysis frameworks." I think I get what is being said here, but it is important to emphasize that this is an offline approach, meaning the clouds are predicted based on output data and not embedded in the simulations that produce the reanalysis data itself (thus they cannot affect things like the radiative heating rates and fluxes in the reanalysis data).
Lines 104-109: it might be helpful to emphasize (if I understand correctly) that while ARM SGP takes measurements of some fields at an array of locations across the general SGP region, they only launch radiosondes regularly at this one particular point location, and therefore this study pertains only to that spot. This is quite different than many ML studies which use either data from reanalysis or climate model simulations for training, which is not directly observed (i.e. so can have its own internal biases) but at least is global in nature, without any missing data in time or space. Citing a paper like Sisterson et al. (2016) might be helpful for those who want more historical background on the SGP site.
Line 188: "Launched routinely at multiple times daily [...]" Can this be quantified in some way? E.g. approximately how many times per day is it done? Is the important aspect for this study that a radiosonde was launched roughly every morning? Is that at a particular time of day?
Lines 169-172: it could be helpful to note the purpose of this reanalysis data up front, contrasting it to the purpose of the observational data described earlier. As I understand it, the reanalysis data is mainly used as a way to illustrate how boundary layer clouds are misrepresented in common data sources and as a way to try to disentangle why that might be the case. Unlike the observational data, it is not used in any way to train the ML models.
Lines 196-197: "models are purpose-built to simulate the initiation, positioning, and vertical extent of BLCs." It might also be worth adding "at the SGP site," since these models would likely not be sufficient at other locations given the limitations of the training dataset.
Lines 212-216: "To represent the vertical structure of BLC, we equally segmented the cloud layer from the base to the top into ten levels. For each of these levels, our deep learning models calculate individual cloud fraction values." So the vertical position of the layers your models calculate cloud fraction for changes depending on the cloud base and cloud top? How would the cloud fraction network know what portion of the morning profiles were most relevant to the cloud fraction? Why was this more complicated model architecture chosen instead of simply skipping straight to predicting a cloud fraction at a static set of vertical levels?
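The ten-level segmentation being questioned here can be made concrete with a short sketch (the numeric base and top heights are invented for illustration):

```python
def cloud_layer_levels(base, top, n=10):
    """n equally spaced heights from cloud base to cloud top (inclusive),
    mirroring the manuscript's ten-level segmentation of the cloud layer."""
    step = (top - base) / (n - 1)
    return [base + i * step for i in range(n)]

# e.g. predicted base 800 m, predicted top 2600 m
levels = cloud_layer_levels(800.0, 2600.0)
```

Because this vertical grid moves with each predicted base/top pair, the same output index of the cloud fraction network refers to different physical heights on different days, which is the crux of the referee's question about how the network relates its outputs to the fixed-height morning profiles.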
Table 1: why is the trigger value an input to the other two models instead of just using the other two models only when the predicted trigger value is greater than 0.5? If I understand correctly, with the current approach there is no guarantee that the classification statistics presented in Figure 4 will be relevant in the full problem.
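The alternative being suggested, gating the downstream models on the trigger classifier rather than feeding its probability in as an extra input, could be sketched as follows; the model callables and the dictionary return format are hypothetical, not the authors' interface:

```python
def predict_blc(trigger_model, base_top_model, fraction_model, features,
                threshold=0.5):
    """Run the base/top and cloud-fraction models only when the trigger
    classifier fires, so its test statistics apply directly to the full
    pipeline (all three model callables are hypothetical stand-ins)."""
    p_trigger = trigger_model(features)
    if p_trigger < threshold:
        return None  # no boundary layer cloud predicted for this hour
    base, top = base_top_model(features)
    fractions = fraction_model(features, base, top)
    return {"base": base, "top": top, "fractions": fractions}
```

With this structure, hours classified as non-triggering never reach the regression models, so the precision/recall of the trigger step translates one-to-one into how often the downstream predictions are issued.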
Line 227: what is the strength of the L2 regularization?
Lines 244-246: "Additionally we incorporate datasets from 2017-2020 as part of our validation process, specifically focusing on data from the untrained period to assess the model's performance." If I understand correctly, this is your "test" dataset in ML parlance. Therefore I might rephrase this as "Additionally we save data from 2017-2020 for testing, specifically focusing on data from this untrained period to assess the model's performance."
Lines 246-248: "The training and validations are both using the more than 20-year BLC observations, as well as the ARMBE products." I'm not sure I totally follow this sentence, since the previous few sentences describe the training data / validation datasets as coming from 1998 - 2016 (which is less than 20 years) and the test dataset coming from 2017 - 2020 (which is also less than 20 years). In general I'm not sure what this sentence adds, since having data from these various sources for the time periods cited (which, yes, are in aggregate over 20 years) is already implied, so I think it could be removed.
Lines 285-287: for the morning profiles, which as I understand it are multiple input features each, I take it this permutation was done using all the profile values for a particular variable at once? This seems reasonable, but might be worth describing in the manuscript.
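The grouped permutation the comment asks about can be illustrated with a small sketch: all columns belonging to one profile variable are permuted together (same donor row for every column in the group), and the importance is the resulting increase in error. The data and model here are synthetic placeholders, not the authors' setup:

```python
import random

def mse(model, X, y):
    return sum((model(x) - yi) ** 2 for x, yi in zip(X, y)) / len(y)

def grouped_permutation_importance(model, X, y, group, seed=0):
    """Permute all columns in `group` jointly across samples and report
    the increase in MSE over the unpermuted baseline."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    donor = list(range(len(X)))
    rng.shuffle(donor)  # one shuffled row index per sample
    X_perm = []
    for i, x in enumerate(X):
        row = list(x)
        for j in group:
            row[j] = X[donor[i]][j]  # same donor row for every column
        X_perm.append(row)
    return mse(model, X_perm, y) - baseline
```

Permuting a group the model actually uses inflates the error, while permuting an unused group leaves it unchanged, which is what makes the per-variable grouping interpretable for multi-level profile inputs.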
Lines 312-320: it is a bit odd to describe these specific input parameters (and how they were derived) only at the moment when describing feature importance (and after discussing some sample model predictions). It would be better to describe this earlier when describing the structure of the different models, e.g. in Section 3.1.
Table 1: I'm not sure I see the value of presenting the precise numerical importance scores in addition to the bar chart in Figure 3 (I find the bar chart more interpretable).
Line 338: "to identify and simulate from surface meteorology." Should this also include a reference to the morning radiosonde inputs?
Line 357: "Table 2 complements the Figure 4" It seems Table 2 is completely redundant with Figure 4. I would probably keep Figure 4, since it includes slightly more information.
Figure 4: where are the F1 scores shown? From what I can tell, precision, recall, and accuracy are shown, but not F1 scores. Also I do not think it is important to explicitly show the performance within the training data. What matters most is the performance on the held out test data.
Lines 361-364: "The table highlights the model's robustness, with overall accuracy rates of 92.3% for the trained period and a slightly reduced but still substantial 89.2% for the untrained period." Given that the datasets are imbalanced (i.e. there are fewer occurrences than non-occurrences) the accuracy is perhaps not the best metric to highlight. The precision and recall are both reasonably high, and might be better to highlight. See discussion in this TensorFlow tutorial regarding classification of imbalanced data, in particular the note about the accuracy metric: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data.
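The point about accuracy on imbalanced data can be demonstrated in a few lines: with a 90/10 class split (an invented ratio for illustration), a degenerate classifier that never predicts a cloud still scores 90% accuracy while detecting nothing.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Classification metrics from binary labels (1 = BLC occurrence)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy

# 90 non-occurrences, 10 occurrences; always predict "no cloud"
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
p, r, a = precision_recall_accuracy(y_true, y_pred)
# accuracy is 0.90 even though recall (and precision) are 0.0
```

This is why precision and recall, which both collapse to zero here, are the more informative headline numbers for the trigger classifier.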
Figure 6: again I think showing the results on the held out dataset alone is standard and sufficient.
Technical corrections
Line 28: "offer" -> "offers"
Lines 41-42: "stratiforms and shallow cumuli" -> "stratiform and shallow cumulus types"
Line 53: "of land surface" -> "of the land surface"
Line 57: "simulating the boundary layer clouds" -> "simulating boundary layer clouds"
Line 88: "structural structure" -> "structure"
Line 90: "diurnal-varying" -> "diurnally varying"
Figure 6: "Independant" -> "Independent"
Figure 12: "attribute" -> "attributed"
Line 590: "Sesning" -> "Sensing"
References
Payami, M., Choi, Y., Salman, A. K., Mousavinezhad, S., Park, J., & Pouyaei, A. (2024). A 1D CNN-Based Emulator of CMAQ: Predicting NO2 Concentration over the Most Populated Urban Regions in Texas. Artificial Intelligence for the Earth Systems, 3(2). https://doi.org/10.1175/AIES-D-23-0055.1
Sisterson, D. L., Peppler, R. A., Cress, T. S., Lamb, P. J., & Turner, D. D. (2016). The ARM Southern Great Plains (SGP) Site. Meteorological Monographs, 57(1), 6.1-6.14. https://doi.org/10.1175/AMSMONOGRAPHS-D-16-0004.1
Citation: https://doi.org/10.5194/gmd-2024-25-RC1