DocumentCode
945839
Title
Distributed Nearest Neighbor-Based Condensation of Very Large Data Sets
Author
Angiulli, Fabrizio ; Folino, Gianluigi
Author_Institution
Univ. della Calabria, Rende
Volume
19
Issue
12
fYear
2007
Firstpage
1593
Lastpage
1606
Abstract
This work presents the parallel fast condensed nearest neighbor (PFCNN) rule, a distributed method for computing a consistent subset of a very large data set for the nearest neighbor classification rule. To cope with the communication overhead typical of distributed environments and to reduce memory requirements, different variants of the basic PFCNN method are introduced. The spatial cost, CPU cost, and communication overhead of all the algorithms are analyzed. Experiments on both synthetic and real very large data sets show that these methods can be profitably applied to enormous collections of data: they scale well and are memory-efficient, confirming the theoretical analysis, and they achieve noticeable data reduction and good classification accuracy. To the best of our knowledge, this is the first distributed algorithm for computing a training set consistent subset for the nearest neighbor rule.
Keywords
data mining; distributed algorithms; very large databases; distributed algorithm; distributed nearest neighbor-based condensation; nearest neighbor classification rule; noticeable data reduction; very large data sets; Clustering, classification, and association rules; Data mining; Distributed systems
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2007.190665
Filename
4358951