DocumentCode :
589261
Title :
Scalable Overlapping Co-clustering of Word-Document Data
Author :
Franca, F.O.D.
Author_Institution :
Center of Math., Comput. & Cognition (CMCC), Fed. Univ. of ABC (UFABC), Santo Andre, Brazil
Volume :
1
fYear :
2012
fDate :
12-15 Dec. 2012
Firstpage :
464
Lastpage :
467
Abstract :
Text clustering is used in a variety of applications, such as content-based recommendation, categorization, summarization, information retrieval, and automatic topic extraction. Since most pairs of documents share only a small percentage of words, the dataset representation tends to be very sparse, which calls for a similarity metric capable of partially matching a set of features. The technique known as co-clustering can find several clusters inside a dataset, with each cluster composed of only a subset of the objects and a subset of the features. In word-document data, this is useful for identifying clusters of documents pertaining to the same topic even though they share only a small fraction of words. This paper proposes a scalable co-clustering algorithm that uses the locality-sensitive hashing technique to find co-clusters of documents. The proposed algorithm is tested against other co-clustering and traditional algorithms on well-known datasets. The results show that this algorithm finds clusters more accurately than the other approaches while maintaining linear complexity.
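To make the idea in the abstract concrete, the following is a minimal sketch of how locality-sensitive hashing can group documents that share a subset of words. It is not the paper's algorithm: the abstract does not say which hash family is used, so MinHash with banding is assumed here, and the function names (`make_hash_params`, `minhash_signature`, `lsh_co_clusters`) and parameter defaults are hypothetical. Each bucket that receives two or more documents yields a candidate co-cluster made of those documents and the words they all share. Because a document can land in several buckets, the resulting co-clusters may overlap.

```python
import random
import zlib
from collections import defaultdict

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, modulus for the hash family


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(words, params):
    """MinHash signature of a non-empty set of words."""
    # crc32 gives a deterministic integer code per word (unlike built-in hash()).
    codes = [zlib.crc32(w.encode("utf-8")) for w in words]
    return tuple(min((a * c + b) % PRIME for c in codes) for a, b in params)


def lsh_co_clusters(docs, num_hashes=20, rows_per_band=2, seed=0):
    """Return candidate co-clusters as (document ids, shared words) pairs.

    docs maps a document id to a non-empty set of words. This is an
    illustrative assumption, not the procedure from the paper.
    """
    params = make_hash_params(num_hashes, seed)
    buckets = defaultdict(set)
    for doc_id, words in docs.items():
        sig = minhash_signature(words, params)
        # Banding: documents agreeing on every row of a band share a bucket.
        for start in range(0, num_hashes, rows_per_band):
            band_key = (start, sig[start:start + rows_per_band])
            buckets[band_key].add(doc_id)

    clusters = set()
    for members in buckets.values():
        if len(members) < 2:
            continue
        # The feature side of the co-cluster: words common to all members.
        shared = frozenset.intersection(*(frozenset(docs[d]) for d in members))
        if shared:
            clusters.add((frozenset(members), shared))
    return clusters
```

Each document is hashed a fixed number of times and each bucket is visited once, so the hashing pass grows linearly with the number of documents, which is the property that makes this family of techniques attractive for large sparse collections.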
Keywords :
data structures; pattern clustering; text analysis; dataset representation; locality-sensitive hashing technique; scalable overlapping coclustering; text clustering; word-document data clustering; Accuracy; Clustering algorithms; Complexity theory; Feature extraction; Machine learning; Mutual information; Text mining; co-clustering; hashing
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Machine Learning and Applications (ICMLA), 2012 11th International Conference on
Conference_Location :
Boca Raton, FL
Print_ISBN :
978-1-4673-4651-1
Type :
conf
DOI :
10.1109/ICMLA.2012.84
Filename :
6406666