https://doi.org/10.5194/gmd-2023-185
Submitted as: methods for assessment of models | 20 Nov 2023
Status: this preprint was under review for the journal GMD. A final paper is not foreseen.

Clustering analysis of very large measurement and model datasets on high performance computing platforms

Colin J. Lee, Paul A. Makar, and Joana Soares

Abstract. Spatiotemporal clustering of data is an important analytical technique with many applications in air quality, including source identification, monitoring network analysis and airshed partitioning. Hierarchical agglomerative clustering is one such algorithm, where sets of input data are grouped by a chosen similarity metric, without requiring any a priori information about the final arrangement of clusters. Modern implementations of the algorithm have O(n² log(n)) computational complexity and O(n²) memory usage, where n is the number of initial clusters. This dependence can strain the resources of even very large individual computers as the number of initial clusters increases into the tens or hundreds of thousands, for example, to cluster all the points in an air-quality model’s simulation grid as part of airshed analysis (~10⁵ to 10⁶ time series to be clustered). Using two parallelization techniques – the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) – we have reduced the amount of wallclock time while increasing the memory available to a new hierarchical clustering program, by dividing up the program into blocks which are run on separate CPUs but communicate with each other to produce a single result. The new algorithm opens up new directions for large data analysis which had previously not been possible. Here we present a massively parallelized version of an agglomerative hierarchical clustering algorithm which is able to cluster an entire year of hourly regional air-quality model output (538×540 domain; 290,520 hourly concentration time series) in 12 hours 37 minutes of wallclock time, by spreading the computation across 8000 Intel® Xeon® Platinum 8830 CPU cores with a total of 2 TB of RAM. We then show how the new algorithm allows a new form of air-quality analysis to be carried out starting from air-quality model output.
We present maps of the different airsheds within the model domain, identifying equally unique regions for each chemical species. These regions can be used as an aid in determining the placement of surface air-quality monitors which gives the most representative sampling given a fixed number of monitors, or the number of monitors required for a given level of similarity between airsheds. We then demonstrate the new algorithm’s application towards source apportionment of very large observation data sets, through the analysis of a year of Canada’s hourly National Air Pollution Surveillance Program data, comprising 366,427 original observation vectors, a problem size that would be impossible with other source apportionment programs such as Positive Matrix Factorization.
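To give a feel for the O(n²) memory dependence mentioned in the abstract, the following back-of-envelope sketch counts the unordered pairs among n initial clusters and converts that count to bytes. The storage layout (one value per unordered pair) and the element sizes are assumptions made here for illustration; the abstract does not state how the program stores its dissimilarities, and the function names are hypothetical.

```python
# Back-of-envelope memory estimate for a pairwise dissimilarity matrix.
# Assumption: one stored value per unordered pair (a "condensed" upper
# triangle). The element sizes are illustrative, not taken from the preprint.


def condensed_pairs(n: int) -> int:
    """Number of unordered pairs among n initial clusters: n(n-1)/2."""
    return n * (n - 1) // 2


def matrix_bytes(n: int, bytes_per_value: int) -> int:
    """Bytes needed to hold one value per unordered pair."""
    return condensed_pairs(n) * bytes_per_value


if __name__ == "__main__":
    n_grid = 538 * 540  # model grid from the abstract: 290,520 time series
    for label, size in (("float32", 4), ("float64", 8)):
        gb = matrix_bytes(n_grid, size) / 1e9
        print(f"n = {n_grid}: {label} condensed matrix ~ {gb:.1f} GB")
```

Under these assumptions the pairwise matrix for the full model grid alone runs to hundreds of gigabytes, which is consistent with the abstract's point that a problem of this size strains a single machine and benefits from memory pooled across many nodes.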

This preprint has been withdrawn.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Interactive discussion

Status: closed

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on gmd-2023-185', Anonymous Referee #1, 03 Jan 2024
  • RC2: 'Comment on gmd-2023-185', Anonymous Referee #2, 21 Jan 2024
  • AC1: 'Comment on gmd-2023-185', Colin Lee, 23 Jan 2024
  • AC2: 'Final author response to reviewer comments', Colin Lee, 16 Feb 2024
    • EC1: 'Reply on AC2', Xiaomeng Huang, 17 Feb 2024


Viewed

Total article views: 728 (including HTML, PDF, and XML)
  • HTML: 548
  • PDF: 137
  • XML: 43
  • Total: 728
  • BibTeX: 47
  • EndNote: 45

Latest update: 23 Nov 2024

Short summary
Clustering is an analysis technique for finding similarities within datasets. We present a new implementation of the hierarchical clustering algorithm that is able to process much larger datasets than was previously possible, by spreading the program out over many connected computers in a high-performance computing system. We show airshed maps of a high-resolution regional model output domain, and find related air pollution profiles at monitoring stations separated by thousands of kilometers.
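As a small illustration of the general idea of hierarchical agglomerative clustering, the sketch below repeatedly merges the two closest clusters of short time series on a single process. It is not the authors' parallel implementation: the dissimilarity (1 minus Pearson correlation), the average-linkage rule, and all function names are assumptions chosen for this example, and the naive search shown here is far slower than the O(n² log(n)) implementations the abstract refers to.

```python
# Toy agglomerative clustering of short time series (naive, single process).
# Assumptions for illustration only: dissimilarity = 1 - Pearson correlation,
# average linkage, and no constant series (which would divide by zero).
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def agglomerate(series, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a sorted list of clusters, each a sorted list of indices
    into `series`.
    """
    n = len(series)
    clusters = [[i] for i in range(n)]
    # Pairwise dissimilarities between the original series (O(n^2) values).
    dist = {
        (i, j): 1.0 - pearson(series[i], series[j])
        for i in range(n)
        for j in range(i + 1, n)
    }

    def linkage(c1, c2):
        # Average linkage: mean dissimilarity over all cross-cluster pairs.
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Exhaustive search for the closest pair of current clusters.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


if __name__ == "__main__":
    # Two rising and two falling series (made-up values).
    demo = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4.5, 2]]
    print(agglomerate(demo, 2))
```

With the made-up series in the demo, the two rising series end up in one cluster and the two falling series in the other, since no prior information about the grouping is supplied beyond the desired number of clusters.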