• DocumentCode
    692894
  • Title

    Scalable parallel OPTICS data clustering using graph algorithmic techniques

  • Author

    Ali Patwary, Md Mostofa ; Palsetia, Diana ; Agrawal, Ankit ; Wei-keng Liao ; Manne, Fredrik ; Choudhary, Alok

  • Author_Institution
    Northwestern Univ., Evanston, IL, USA
  • fYear
    2013
  • fDate
    17-22 Nov. 2013
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and PRIM´s Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve a high parallelism for distributed cluster extraction. Using high dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to those given by the classical OPTICS algorithm.
  • Keywords
    data mining; data structures; parallel algorithms; pattern clustering; reachability analysis; trees (mathematics); MPI; OpenMP; POPTICS; PRIM minimum spanning tree algorithm; adjustable reachability distance thresholds; arbitrary-shaped clusters; data mining technique; disjoint-set data structure; distributed cluster extraction; distributed-memory machine; floating point numbers; graph algorithmic techniques; hierarchical density-based data clustering algorithm; high dimensional datasets; noise elimination; scalable parallel OPTICS algorithm; scalable parallel OPTICS data clustering; sequential data access order; shared-memory machine; Algorithm design and analysis; Clustering algorithms; Complexity theory; Data structures; Noise; Optics; Vegetation; Density-based clustering; Disjoint-set data structure; Minimum spanning tree; Union-Find algorithm;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for
  • Conference_Location
    Denver, CO
  • Print_ISBN
    978-1-4503-2378-9
  • Type

    conf

  • DOI
    10.1145/2503210.2503255
  • Filename
    6877482