Title :
Phrase Ranking and Wikipedia Based Cluster Labeling
Author :
Chinthala, Pradyumna Reddy
Author_Institution :
Goa Campus, Dept. of Comput. Sci., BITS Pilani, Zuarinagar, India
Abstract :
Automatically labeling document clusters with words that indicate their topics is a relatively new and active research field. The most common approach, labeling with the most frequent words in the clusters, ends up using several words that are virtually void of descriptive power even after traditional stop words are eliminated. Another approach, labeling with the most anticipated words, often includes rather obscure results. We present Phrase Rank, a variation of the PageRank algorithm based on a relational graph representation of the content of web document collections. Phrase Rank separates phrases into classes and ranks discriminative phrases above ambiguous phrases, which in turn rank above common phrases. A set of important text features is thus first extracted from the cluster documents. We then use these features to extract cluster labels from external knowledge sources such as the pre-categorized knowledge of Wikipedia. We experiment with a test dataset to demonstrate the efficacy of the Phrase Rank algorithm.
Keywords :
Web sites; graph theory; pattern clustering; text analysis; Web document collections; Wikipedia based cluster labeling; active research field; cluster documents; document clusters; most anticipated words; most frequent words; page rank algorithm; phrase ranking; relational graph representation; text feature extraction; topics; Clustering algorithms; Electronic publishing; Encyclopedias; Games; Internet; Labeling; Cluster labeling; PageRank; Phrase ranking; Wikipedia;
Conference_Title :
Machine Intelligence and Research Advancement (ICMIRA), 2013 International Conference on
Conference_Location :
Katra
DOI :
10.1109/ICMIRA.2013.44