DocumentCode :
1065987
Title :
Efficient phrase-based document indexing for Web document clustering
Author :
Hammouda, Khaled M. ; Kamel, Mohamed S.
Author_Institution :
Dept. of Syst. Design Eng., Waterloo Univ., Ont., Canada
Volume :
16
Issue :
10
fYear :
2004
Firstpage :
1279
Lastpage :
1296
Abstract :
Document clustering techniques mostly rely on single term analysis of the document data set, such as the vector space model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This article presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the document index graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
Keywords :
Internet; data mining; document handling; indexing; pattern clustering; Web document clustering; Web mining; automatic categorization; document data set; document index graph; document structure; pair-wise document similarity distribution; phrase matching; phrase-based document index model; search engine; single term analysis; vector space model; Artificial intelligence; Clustering algorithms; Clustering methods; Data models; Information retrieval; Taxonomy; Text mining; document clustering; document similarity; phrase-based indexing
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2004.58
Filename :
1324634