DocumentCode :
3408219
Title :
Massively parallel distributed feature extraction in textual data mining using HDDI™
Author :
Kuntraruk, Jirada ; Pottenger, William M.
Author_Institution :
Dept. of Comput. Sci. & Electr. Eng., Lehigh Univ., Bethlehem, PA, USA
fYear :
2001
fDate :
2001
Firstpage :
363
Lastpage :
370
Abstract :
One of the primary tasks in mining distributed textual data is feature extraction. The widespread digitization of information has created a wealth of data that requires novel approaches to feature extraction in a distributed environment. We propose a massively parallel model for feature extraction that employs unused cycles on networks of PCs/workstations in a highly distributed environment. We have developed an analytical model of the time and communication complexity of the feature extraction process in this environment based on feature extraction algorithms developed in our textual data mining research with HDDI™ (Hierarchical Distributed Dynamic Indexing). We show that speedups linear in the number of processors are achievable for applications involving reduction operations based on a novel, parallel pipelined model of execution. We are in the process of validating our analytical model with empirical observations based on the extraction of features from a large number of pages on the World Wide Web.
Keywords :
data mining; distributed algorithms; feature extraction; indexing; information resources; microcomputer applications; pipeline processing; text analysis; workstation clusters; HDDI; Hierarchical Distributed Dynamic Indexing; PC networks; World Wide Web pages; analytical model; communication complexity; distributed textual data mining; information digitization; linear speedup; massively parallel distributed feature extraction; parallel pipelined execution model; processor number; reduction operations; text mining; time complexity; unused cycles; workstation networks; Analytical models; Complexity theory; Computer science; Data engineering; Data mining; Distributed processing; Feature extraction; Information management; Personal communication networks; Workstations;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Distributed Computing, 2001. Proceedings. 10th IEEE International Symposium on
Conference_Location :
San Francisco, CA
ISSN :
1082-8907
Print_ISBN :
0-7695-1296-8
Type :
conf
DOI :
10.1109/HPDC.2001.945204
Filename :
945204