Articles | Volume 18, issue 23
https://doi.org/10.5194/gmd-18-9417-2025
© Author(s) 2025. This work is distributed under the Creative Commons Attribution 4.0 License.
Data clustering to optimise the representativity of observational data in air quality data assimilation: a case study with EURAD-IM (version 5.9.1 DA)
Download
- Final revised paper (published on 03 Dec 2025)
- Preprint (discussion started on 16 May 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-450', Anonymous Referee #1, 17 Jun 2025
- AC1: 'Reply on RC1', Alexander Hermanns, 08 Aug 2025
- RC2: 'Comment on egusphere-2025-450', Anonymous Referee #2, 18 Jun 2025
- AC2: 'Reply on RC2', Alexander Hermanns, 08 Aug 2025
- RC3: 'Comment on egusphere-2025-450', Anonymous Referee #3, 22 Jun 2025
- AC3: 'Reply on RC3', Alexander Hermanns, 08 Aug 2025
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
AR by Alexander Hermanns on behalf of the Authors (08 Aug 2025)
ED: Referee Nomination & Report Request started (23 Aug 2025) by Yongze Song
RR by Anonymous Referee #3 (18 Sep 2025)
RR by Anonymous Referee #1 (26 Sep 2025)
ED: Publish subject to technical corrections (15 Oct 2025) by Yongze Song
AR by Alexander Hermanns on behalf of the Authors (27 Oct 2025)
Alexander Hermanns et al.
GENERAL COMMENT
The manuscript describes a method to distribute observation time series over clusters, and to use this within the context of data assimilation. The method is applied with the regional air quality model EURAD-IM, which assimilates time series of surface observations to guide the model. Specifically, the clustering is used to sub-divide the observation time series into an "assimilation" subset (~70%, incorporated in the assimilation) and a "validation" subset (~30%, not incorporated). Ideally, the posterior comparison between the analyzed model state and the observations would give the same statistics over the "assimilation" and "validation" sets, but, as the manuscript also shows, the assimilation usually performs better over the "assimilation" set. The proposed clustering method brings the statistics over the "assimilation" and "validation" sets closer together, and is therefore of interest for all data-assimilation applications.
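To make the sub-division concrete, the following is a minimal sketch of a cluster-wise split into the two subsets. It is one possible reading of the idea, not the manuscript's implementation: the function name `split_by_cluster`, the station IDs, the cluster labels, and the random per-cluster selection are all illustrative assumptions.

```python
import random
from collections import defaultdict


def split_by_cluster(station_clusters, assim_fraction=0.7, seed=0):
    """Split station IDs into assimilation and validation lists, cluster by cluster.

    station_clusters: dict mapping a station ID to its cluster label.
    assim_fraction:   target share of each cluster assigned to assimilation
                      (0.7 mirrors the ~70%/~30% split mentioned in the review).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group the stations by cluster label.
    by_cluster = defaultdict(list)
    for station, cluster in station_clusters.items():
        by_cluster[cluster].append(station)

    assimilation, validation = [], []
    for cluster in sorted(by_cluster):
        stations = sorted(by_cluster[cluster])
        rng.shuffle(stations)
        # Every cluster contributes the same share to both subsets.
        n_assim = round(assim_fraction * len(stations))
        assimilation.extend(stations[:n_assim])
        validation.extend(stations[n_assim:])
    return assimilation, validation


# Hypothetical input: 3 clusters with 10 stations each.
station_clusters = {f"st{c}_{i}": c for c in range(3) for i in range(10)}
assim, valid = split_by_cluster(station_clusters)
```

Splitting within each cluster, rather than over the whole network at once, keeps every cluster represented in both subsets in roughly the same proportion.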
The clustering method is well described and easy to follow, also for readers without a background in clustering. The application is illustrated for the European air quality network. The maps in Figure 2 and the bar plots in Figure 3 are especially useful here, as they illustrate the result of the clustering and how it was achieved. The 8 clusters obtained with the KSC method show, for example, the soft borders between geographical regions, which could not be obtained by simply clustering countries. As described by the authors, the map obtained with KSC shares characteristics with climate zones, which gives confidence that the obtained clustering is also related to geographic properties.
The improvement in assimilation/validation (AV) statistics is illustrated by a comparison with CO and NO2 observations (Table 1). In general, the RMSE over the assimilation set increases while the RMSE over the validation set decreases, thus decreasing what is called here the AV-difference. This is a very important result and shows the usefulness of the method. However, as Table C1 shows, the AV-difference increases for most other considered species (SO2, PM2.5, and PM10) and, depending on the clustering method, also for O3. These species are rather important for air quality, and one could argue that they are even more important than CO. Therefore, the method does not yet seem immediately applicable in, for example, the CAMS assimilations in which EURAD-IM is included. Could the authors include a discussion on how the clustering could be improved such that the AV-difference is decreased for all chemical species? Are different features of the time series needed, for example based on rural/urban locations? Or should, for example, CO simply be excluded? For the current manuscript it is not necessary to add and evaluate new clustering configurations, but it would be useful to see some guidance for future work.
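For reference, the statistic discussed here can be sketched as follows. The sign convention (validation RMSE minus assimilation RMSE) and the names `rmse` and `av_difference` are assumptions made for illustration; the manuscript's exact definition may differ.

```python
import math


def rmse(model, obs):
    """Root-mean-square error between paired model and observed values."""
    if len(model) != len(obs) or not model:
        raise ValueError("model and obs must be non-empty and of equal length")
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(model))


def av_difference(assim_model, assim_obs, valid_model, valid_obs):
    """Assumed definition: RMSE over the validation set minus RMSE over the assimilation set."""
    return rmse(valid_model, valid_obs) - rmse(assim_model, assim_obs)


# Made-up numbers: the analysis matches the assimilated stations exactly
# and is off by 2 units at each validation station.
example = av_difference([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [2.0, 2.0], [0.0, 0.0])
```

Under this convention, a value near zero means the analysis fits withheld stations about as well as assimilated ones, which is the behaviour the clustering aims for.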
SPECIFIC COMMENT
Line 184: Could the method used by CAMS to distribute observations over the assimilation and validation sets be summarized here? An essential difference is discussed at lines 326-327; it might be useful to mention that earlier too.
The data processing requires many steps, for example outlier removal (lines 144-155), but also the removal of stations with extreme emission corrections in their vicinity (lines 216-221). It would be useful to summarize all selection criteria in, for example, a table, including the number (fraction) of removed stations.
The KSC clustering is applied using a location feature, which gives a result that collects stations in geographic regions (adjacent countries and/or regions within countries). The map in the right panel of Figure 2 shows that within such a cluster there are sometimes small regions with a different classification; for example, the Pyrenees are part of cluster 7. Would it make sense to add features based on these "exceptions", for example the altitude of a station?
SPELLING AND GRAMMAR
Lines 99 and 106: "k" should be "K" as in Figure 1?
Line 126: should be "... some objects, $F_m$, such that ..."
Line 225: should be a reference to Fig. A1?
Line 240: "the Alps"
Line 253: remove commas?
Line 304: ".. month .."