• DocumentCode
    1245646
  • Title

    Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets

  • Author

    Koyutürk, Mehmet ; Grama, Ananth ; Ramakrishnan, Naren

  • Author_Institution
    Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
  • Volume
    17
  • Issue
    4
  • fYear
    2005
  • fDate
    4/1/2005 12:00:00 AM
  • Firstpage
    447
  • Lastpage
    461
  • Abstract
    This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which frequently arise in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing these data sets. The proposed framework, PROXIMUS, provides a technique for reducing large data sets into a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large data sets, and performance in terms of capability to represent data in a compact form and discovery and interpretation of interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that use of the compressed data for association rule mining provides excellent precision and recall values (above 90 percent) across a range of problem parameters while reducing the time required for analysis drastically. We also show excellent interpretability of the patterns discovered by PROXIMUS in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool for both preprocessing data before applying computationally expensive algorithms and directly extracting correlated patterns.
  • Keywords
    data compression; data mining; pattern classification; singular value decomposition; very large databases; association rule mining; data analysis; data classification; data clustering; discrete-attribute data sets; pattern discovery; Association rules; Clustering algorithms; Discrete wavelet transforms; Frequency; Pattern analysis; Runtime; Scalability; Clustering; classification; sparse, structured, and very large systems
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2005.55
  • Filename
    1401886