Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets

Author

Koyutürk, Mehmet ; Grama, Ananth ; Ramakrishnan, Naren

Author_Institution

Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA

Volume

17

Issue

4

fYear

2005

fDate

4/1/2005 12:00:00 AM

Firstpage

447

Lastpage

461

Abstract

This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which arise frequently in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing these data sets. The proposed framework, PROXIMUS, provides a technique for reducing large data sets to a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large data sets, and performance, measured by its ability to represent data in a compact form and to discover and interpret interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that using the compressed data for association rule mining provides excellent precision and recall values (above 90 percent) across a range of problem parameters while drastically reducing the time required for analysis. We also show excellent interpretability of the patterns discovered by PROXIMUS in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool both for preprocessing data before applying computationally expensive algorithms and for directly extracting correlated patterns.
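
To make the precision and recall figures concrete, here is a minimal sketch of how the two values could be computed when rules mined from compressed data are compared with rules mined from the original data. The function `rule_precision_recall`, the helper `rule`, and the toy rule sets are hypothetical illustrations and are not taken from the paper. Treating an empty rule set as a score of 1.0 is likewise an assumption.

```python
def rule_precision_recall(reference_rules, approx_rules):
    """Compare two rule sets (hypothetical helper, not from the paper).

    precision = |reference ∩ approx| / |approx|
    recall    = |reference ∩ approx| / |reference|
    """
    reference = set(reference_rules)
    approx = set(approx_rules)
    common = reference & approx
    # Assumption: an empty set on either side scores 1.0 rather than raising.
    precision = len(common) / len(approx) if approx else 1.0
    recall = len(common) / len(reference) if reference else 1.0
    return precision, recall


def rule(antecedent, consequent):
    """Represent a rule as a hashable (antecedent, consequent) pair of item sets."""
    return (frozenset(antecedent), frozenset(consequent))


# Toy data: rules mined from the original data and from a compressed version.
original = {rule("a", "b"), rule("ab", "c"), rule("c", "d"), rule("d", "a")}
compressed = {rule("a", "b"), rule("ab", "c"), rule("c", "d"), rule("b", "d")}

precision, recall = rule_precision_recall(original, compressed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the four rules agree, so both values are 0.75. The abstract reports values above 90 percent on real data sets.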

Keywords

data compression; data mining; pattern classification; singular value decomposition; very large databases; association rule mining; data analysis; data classification; data clustering; discrete-attribute data sets; pattern discovery; association rules; clustering algorithms; discrete wavelet transforms; frequency; pattern analysis; runtime; scalability; clustering; classification; sparse, structured, and very large systems

fLanguage

English

Journal_Title

Knowledge and Data Engineering, IEEE Transactions on

Publisher

IEEE

ISSN

1041-4347

Type

jour

DOI

10.1109/TKDE.2005.55

Filename

1401886