Phrase Ranking and Wikipedia Based Cluster Labeling

Author

Chinthala, Pradyumna Reddy

Author_Institution

Goa Campus, Dept. of Comput. Sci., BITS Pilani, Zuarinagar, India

fYear

2013

fDate

21-23 Dec. 2013

Firstpage

199

Lastpage

202

Abstract

Automatically labeling document clusters with words that indicate their topics is a relatively new and active research field. The most frequently used process, labeling with the most frequent words in the clusters, ends up using several words that are virtually void of descriptive power even after traditional stop words are eliminated. Another procedure, labeling with the most anticipated words, often includes rather obscure results. We present Phrase Rank, a variation of the PageRank algorithm based on a relational graph representation of the content of web document collections. Phrase Rank segregates phrases and ranks discriminative phrases above ambiguous phrases, which in turn rank above common phrases. Thus, a set of important text features is first extracted from the cluster documents. Further, we use these features to extract cluster labels from external knowledge sources such as the pre-categorized knowledge of Wikipedia. We experiment with a test dataset to demonstrate the efficacy of the Phrase Rank algorithm.

Keywords

Web sites; graph theory; pattern clustering; text analysis; Web document collections; Wikipedia based cluster labeling; active research field; cluster documents; document clusters; most anticipated words; most frequent words; page rank algorithm; phrase ranking; relational graph representation; text feature extraction; topics; Clustering algorithms; Electronic publishing; Encyclopedias; Games; Internet; Labeling; Cluster labeling; PageRank; Phrase ranking; Wikipedia;

fLanguage

English

Publisher

IEEE

Conference_Titel

Machine Intelligence and Research Advancement (ICMIRA), 2013 International Conference on

Conference_Location

Katra

Type

conf

DOI

10.1109/ICMIRA.2013.44

Filename

6918821