Submitted as: methods for assessment of models
20 Nov 2023
Status: this preprint is currently under review for the journal GMD.

Clustering analysis of very large measurement and model datasets on high performance computing platforms

Colin J. Lee, Paul A. Makar, and Joana Soares

Abstract. Spatiotemporal clustering of data is an important analytical technique with many applications in air quality, including source identification, monitoring network analysis and airshed partitioning. Hierarchical agglomerative clustering is one such algorithm, where sets of input data are grouped by a chosen similarity metric, without requiring any a priori information about the final arrangement of clusters. Modern implementations of the algorithm have O(n² log n) computational complexity and O(n²) memory usage, where n is the number of initial clusters. This dependence can strain the resources of even very large individual computers as the number of initial clusters increases into the tens or hundreds of thousands, for example, to cluster all the points in an air-quality model’s simulation grid as part of airshed analysis (~10⁵ to 10⁶ time series to be clustered). Using two parallelization techniques – the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) – we have reduced the amount of wallclock time while increasing the memory available to a new hierarchical clustering program, by dividing up the program into blocks which are run on separate CPUs but communicate with each other to produce a single result. The new algorithm opens up new directions for large data analysis which had previously not been possible. Here we present a massively parallelized version of an agglomerative hierarchical clustering algorithm which is able to cluster an entire year of hourly regional air-quality model output (538×540 domain; 290,520 hourly concentration time series) in 12 hours 37 minutes of wallclock time, by spreading the computation across 8000 Intel® Xeon® Platinum 8830 CPU cores with a total of 2 TB of RAM. We then show how the new algorithm allows a new form of air-quality analysis to be carried out starting from air-quality model output.
We present maps of the different airsheds within the model domain, identifying equally unique regions for each chemical species. These regions can be used as an aid in determining the placement of surface air-quality monitors which gives the most representative sampling given a fixed number of monitors, or the number of monitors required for a given level of similarity between airsheds. We then demonstrate the new algorithm’s application towards source apportionment of very large observation data sets, through the analysis of a year of Canada’s hourly National Air Pollution Surveillance Program data, comprising 366,427 original observation vectors, a problem size that would be impossible with other source apportionment programs such as Positive Matrix Factorization.
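
To make the scaling argument in the abstract concrete, the following is a minimal serial sketch of agglomerative clustering in Python; it is not the authors' MPI/OpenMP program. Several things here are assumptions made for illustration: the function names, the use of 1 minus the Pearson correlation as the dissimilarity, average linkage, and 8-byte storage per matrix entry. The sketch precomputes all n(n-1)/2 pairwise dissimilarities, which is the O(n²) memory term, and then repeatedly merges the closest pair of clusters.

```python
from itertools import combinations
from math import sqrt


def condensed_matrix_bytes(n, bytes_per_value=8):
    """Bytes needed to hold the n*(n-1)/2 pairwise dissimilarities.

    bytes_per_value=8 (double precision) is an assumption, not a
    figure taken from the paper.
    """
    return n * (n - 1) // 2 * bytes_per_value


def correlation_distance(x, y):
    """1 - Pearson correlation (assumed metric; fails on constant series)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def agglomerate(series, n_clusters):
    """Naive average-linkage agglomerative clustering (serial, roughly O(n^3))."""
    # Precompute every pairwise dissimilarity: this is the O(n^2) memory cost.
    dist = {
        (i, j): correlation_distance(series[i], series[j])
        for i, j in combinations(range(len(series)), 2)
    }
    # Start with each time series in its own cluster.
    clusters = [[i] for i in range(len(series))]

    def linkage(a, b):
        # Average dissimilarity over all cross-cluster pairs.
        total = sum(dist[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy data: two rising and two falling series (hypothetical values).
toy_series = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.5, 2.0],
]
print(agglomerate(toy_series, 2))
print(condensed_matrix_bytes(290_520) / 1e9, "GB")
```

Under the 8-byte assumption, the dissimilarities for the 290,520 time series mentioned above would occupy roughly 338 GB before any working storage, which illustrates why a single machine is quickly strained. The naive merge loop here is slower than the O(n² log n) implementations the abstract refers to; it is kept simple for readability.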

Colin J. Lee et al.

Status: open (until 15 Jan 2024)





Latest update: 07 Dec 2023
Short summary
Clustering is an analysis technique for finding similarities within datasets. We present a new implementation of the hierarchical clustering algorithm that is able to process much larger datasets than was previously possible, by spreading the program out over many connected computers in a high-performance computing system. We show airshed maps of a high-resolution regional model output domain, and find related air pollution profiles at monitoring stations separated by thousands of kilometers.