• DocumentCode
    773419
  • Title
    Integrating K-means clustering with a relational DBMS using SQL
  • Author
    Ordonez, Carlos
  • Author_Institution
    Teradata, NCR Corp., San Diego, USA
  • Volume
    18
  • Issue
    2
  • fYear
    2006
  • Firstpage
    188
  • Lastpage
    201
  • Abstract
    Integrating data mining algorithms with a relational DBMS is an important problem for database programmers. We introduce three SQL implementations of the popular K-means clustering algorithm to integrate it with a relational DBMS: 1) a straightforward translation of K-means computations into SQL, 2) an optimized version based on improved data organization, efficient indexing, sufficient statistics, and rewritten queries, and 3) an incremental version that uses the optimized version as a building block with fast convergence and automated reseeding. We experimentally show the proposed K-means implementations work correctly and can cluster large data sets. We identify which K-means computations are more critical for performance. The optimized and incremental K-means implementations exhibit linear scalability. We compare K-means implementations in SQL and C++ with respect to speed and scalability and we also study the time to export data sets outside of the DBMS. Experiments show that SQL overhead is significant for small data sets, but relatively low for large data sets, whereas export times become a bottleneck for C++.
  • Keywords
    C++ language; SQL; data mining; database indexing; pattern clustering; query processing; relational databases; very large databases; C++; K-means clustering algorithm integration; data mining algorithm integration; data organization; database programmer; database query rewriting; large data set clustering; relational DBMS; Clustering algorithms; Computer languages; Convergence; Indexing; Partitioning algorithms; Programming profession; Scalability; Statistics; Clustering; K-means
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2006.31
  • Filename
    1563982
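
The abstract's first variant, a straightforward translation of K-means computations into SQL, can be illustrated with a minimal sketch. This is not the paper's code: the table names (`pts`, `centroids`, `dist`, `assign`), the two-dimensional layout, and the `kmeans_sql` helper are all hypothetical, and SQLite (via Python's standard `sqlite3` module) stands in for the relational DBMS. Each iteration computes squared Euclidean distances with a cross join, assigns every point to its nearest centroid, and recomputes the centroids with `AVG`.

```python
import sqlite3


def kmeans_sql(points, initial_centroids, iterations=10):
    """Run a naive 2-D K-means where every step is a SQL statement (illustrative only)."""
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    # Hypothetical schema: one row per point and one row per centroid.
    cur.executescript("""
        CREATE TABLE pts       (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL);
        CREATE TABLE centroids (j INTEGER PRIMARY KEY, x1 REAL, x2 REAL);
        CREATE TABLE dist      (i INTEGER, j INTEGER, d2 REAL);
        CREATE TABLE assign    (i INTEGER PRIMARY KEY, j INTEGER);
    """)
    cur.executemany("INSERT INTO pts VALUES (?, ?, ?)",
                    [(i, x1, x2) for i, (x1, x2) in enumerate(points)])
    cur.executemany("INSERT INTO centroids VALUES (?, ?, ?)",
                    [(j, x1, x2) for j, (x1, x2) in enumerate(initial_centroids)])

    for _ in range(iterations):
        cur.executescript("""
            -- Step 1: squared Euclidean distance from every point to every centroid.
            DELETE FROM dist;
            INSERT INTO dist
            SELECT p.i, c.j,
                   (p.x1 - c.x1) * (p.x1 - c.x1) + (p.x2 - c.x2) * (p.x2 - c.x2)
            FROM pts p CROSS JOIN centroids c;

            -- Step 2: nearest centroid per point (ties go to the lowest centroid id).
            DELETE FROM assign;
            INSERT INTO assign
            SELECT d.i, MIN(d.j)
            FROM dist d
            JOIN (SELECT i, MIN(d2) AS dmin FROM dist GROUP BY i) b
              ON d.i = b.i AND d.d2 = b.dmin
            GROUP BY d.i;

            -- Step 3: new centroids are per-cluster means.
            -- Simplification of this sketch: a cluster with no points is dropped.
            DELETE FROM centroids;
            INSERT INTO centroids
            SELECT a.j, AVG(p.x1), AVG(p.x2)
            FROM pts p JOIN assign a ON p.i = a.i
            GROUP BY a.j;
        """)

    result = cur.execute("SELECT j, x1, x2 FROM centroids ORDER BY j").fetchall()
    con.close()
    return result


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans_sql(data, [(0, 0), (10, 10)], iterations=5))
```

The sketch rebuilds the full distance table on every pass and uses a fixed iteration count instead of a convergence test. It does not attempt the data organization, indexing, sufficient statistics, or query rewriting that the abstract attributes to the optimized version, nor the incremental version's reseeding.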