Title :
Correlation preserving discretization
Author :
Mehta, Sameep ; Parthasarathy, Srinivasan ; Yang, Hui
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., USA
Abstract :
Discretization is a crucial preprocessing primitive for a variety of data warehousing and mining tasks. In this article we present a novel PCA-based unsupervised algorithm for the discretization of continuous attributes in multivariate datasets. The algorithm leverages the underlying correlation structure in the dataset to obtain the discrete intervals, and ensures that the inherent correlations are preserved. The approach also extends easily to datasets containing missing values. We demonstrate the efficacy of the approach on real datasets and as a preprocessing step for both classification and frequent item set mining tasks. We also show that the intervals are meaningful and can uncover hidden patterns in data.
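The abstract does not spell out the algorithm's steps, so the following is only a minimal sketch of the general idea of discretizing along principal components, not the authors' method. It assumes equal-frequency binning of the component scores; the function name `pca_discretize` and the parameters `n_components` and `n_bins` are hypothetical, and missing values are not handled.

```python
import numpy as np

def pca_discretize(X, n_components=1, n_bins=3):
    """Illustrative only: bin each row's principal-component scores.

    Not the published algorithm; equal-frequency binning is an assumption.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each attribute so PCA reflects the correlation structure.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant attributes
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    codes = np.empty(scores.shape, dtype=int)
    for j in range(scores.shape[1]):
        # Interior equal-frequency cut points along component j.
        cuts = np.quantile(scores[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        codes[:, j] = np.searchsorted(cuts, scores[:, j], side="right")
    return codes

# Two perfectly correlated attributes share a single set of intervals.
t = np.arange(9)
codes = pca_discretize(np.column_stack([t, 2 * t]), n_components=1, n_bins=3)
```

Because the sign of a principal direction is arbitrary, the interval labels may come out in ascending or descending order; only the grouping of rows is meaningful in this sketch.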
Keywords :
data mining; data warehouses; principal component analysis; PCA-based unsupervised algorithm; classification; correlation preserving discretization; correlation structure; data warehousing; frequent item set mining; missing data; multivariate dataset; unsupervised discretization; Classification algorithms; Classification tree analysis; Computer science; Data engineering; Data preprocessing; Decision trees; Discrete transforms; Itemsets; Warehousing
Conference_Title :
Data Mining, 2004. ICDM '04. Fourth IEEE International Conference on
Print_ISBN :
0-7695-2142-8
DOI :
10.1109/ICDM.2004.10007