DocumentCode :
1065971
Title :
An efficient subspace sampling framework for high-dimensional data reduction, selectivity estimation, and nearest-neighbor search
Author :
Aggarwal, Charu C.
Author_Institution :
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Volume :
16
Issue :
10
fYear :
2004
Firstpage :
1247
Lastpage :
1262
Abstract :
Data reduction can improve the storage, transfer time, and processing requirements of very large data sets. One of the challenges of designing effective data reduction techniques is to be able to preserve the ability to use the reduced format directly for a wide range of database and data mining applications. We propose the novel idea of hierarchical subspace sampling in order to create a reduced representation of the data. The method is naturally able to estimate the local implicit dimensionalities of each point very effectively and, thereby, create a variable dimensionality reduced representation of the data. Such a technique is very adaptive about adjusting its representation depending upon the behavior of the immediate locality of a data point. An important property of the subspace sampling technique is that the overall efficiency of compression improves with increasing database size. Because of its sampling approach, the procedure is extremely fast and scales linearly both with data set size and dimensionality. We propose new and effective solutions to problems such as selectivity estimation and approximate nearest-neighbor search. These are achieved by utilizing the locality specific subspace characteristics of the data which are revealed by the subspace sampling technique.
Keywords :
data reduction; data structures; estimation theory; tree searching; very large databases; data mining; data sets; high-dimensional data reduction; nearest-neighbor search; selectivity estimation; subspace sampling framework; Credit cards; Data compression; Data mining; Databases; Hardware; Indexing; Multidimensional systems; Nearest neighbor searches; Sampling methods; Singular value decomposition; Index Terms: High dimensions; dimensionality reduction; nearest-neighbor search; selectivity estimation
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2004.49
Filename :
1324632