DocumentCode :
1245646
Title :
Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets
Author :
Koyutürk, Mehmet ; Grama, Ananth ; Ramakrishnan, Naren
Author_Institution :
Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
Volume :
17
Issue :
4
fYear :
2005
fDate :
4/1/2005
Firstpage :
447
Lastpage :
461
Abstract :
This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which frequently arise in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing these data sets. The proposed framework, PROXIMUS, provides a technique for reducing large data sets into a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large data sets, and performance in terms of capability to represent data in a compact form and discovery and interpretation of interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that use of the compressed data for association rule mining provides excellent precision and recall values (above 90 percent) across a range of problem parameters while reducing the time required for analysis drastically. We also show excellent interpretability of the patterns discovered by PROXIMUS in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool for both preprocessing data before applying computationally expensive algorithms and directly extracting correlated patterns.
Keywords :
data compression; data mining; pattern classification; singular value decomposition; very large databases; association rule mining; data analysis; data classification; data clustering; discrete-attribute data sets; pattern discovery; association rules; clustering algorithms; discrete wavelet transforms; frequency; pattern analysis; runtime; scalability; clustering; classification; sparse, structured, and very large systems
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2005.55
Filename :
1401886