Optimizing Parallel Clustering Throughput in Shared Memory

Michael Gowanlock, David M. Blair, Victor Pankratius

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

This article studies the optimization of parallel clustering throughput in the context of variant-based parallelism, which exploits commonalities and reuse among variant computations for multithreading scalability. This direction is motivated by challenging scientific applications where scientists have to execute multiple runs of clustering algorithms with different parameters to determine which ones best explain phenomena observed in empirical data. To make this process more efficient, we propose a novel set of optimizations to maximize the throughput of Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a frequently used algorithm for scientific data mining in astronomy, geoscience, and many other fields. Our approach executes multiple algorithm variants in parallel, computes clusters concurrently, and leverages heuristics to maximize the reuse of results from completed variants. As scientific datasets continue to grow, maximizing clustering throughput with our techniques may accelerate the search and identification of natural phenomena of interest with computational support, i.e., Computer-Aided Discovery. We present evaluations on a whole spectrum of datasets, such as geoscience data on space weather phenomena, astronomical data from the Sloan Digital Sky Survey on intermediate-redshift galaxies, as well as synthetic datasets to characterize performance properties. Selected results show a 1,115 percent performance improvement due to indexing tailored for variant-based clustering, and a 2,209 percent performance improvement when applying all of our proposed optimizations.
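
As a rough illustration of the general idea, the sketch below runs several DBSCAN parameter variants concurrently over one spatial index that is built once and shared. It is not the authors' implementation. The function names (`build_grid`, `neighbors`, `dbscan`, `run_variants`), the uniform grid index, the 2D points, and the Python thread pool are all assumptions made for this example, and the paper's tailored indexing and result-reuse heuristics are not reproduced.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from math import floor

NOISE = -1

def build_grid(points, cell):
    """Bucket point indices by grid cell; built once, shared by all variants."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(floor(x / cell), floor(y / cell))].append(i)
    return grid

def neighbors(points, grid, cell, i, eps):
    """Indices within eps of point i (including i). Valid while eps <= cell."""
    x, y = points[i]
    cx, cy = floor(x / cell), floor(y / cell)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            # .get() never inserts keys, so concurrent readers do not mutate the grid.
            for j in grid.get((cx + dx, cy + dy), ()):
                px, py = points[j]
                if (px - x) ** 2 + (py - y) ** 2 <= eps * eps:
                    found.append(j)
    return found

def dbscan(points, grid, cell, eps, min_pts):
    """Plain DBSCAN for one (eps, min_pts) variant; returns a label per point."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(points, grid, cell, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(points, grid, cell, j, eps)
            if len(nbrs) >= min_pts:   # core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels

def run_variants(points, variants):
    """Run every (eps, min_pts) variant concurrently against one shared index."""
    cell = max(eps for eps, _ in variants)   # one grid serves the largest eps
    grid = build_grid(points, cell)
    with ThreadPoolExecutor() as pool:
        futures = {v: pool.submit(dbscan, points, grid, cell, *v) for v in variants}
        return {v: f.result() for v, f in futures.items()}
```

Calling `run_variants(points, [(1.5, 3), (8.0, 3)])` returns a dictionary that maps each variant to its list of cluster labels. In CPython the global interpreter lock usually prevents threads from speeding up CPU-bound work like this, so the sketch shows only the structure (shared read-only index, independent variants), not the performance gains reported above.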

Original language: English (US)
Article number: 7865993
Pages (from-to): 2595-2607
Number of pages: 13
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 28
Issue number: 9
State: Published - Sep 1 2017
Externally published: Yes

Keywords

  • DBSCAN
  • Parallel clustering
  • clustering throughput
  • computer-aided discovery
  • data mining

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics
