Title :
Topic Distillation and Clustering Algorithm Based on the Topology of Pages-Keywords
Author :
Deng, Jian-shuang ; Zheng, Qi-Lun ; Peng, Hong
Author_Institution :
Dept. of Comput. Sci., South China Univ. of Technol., Guangzhou
Abstract :
The HITS algorithm has been highly successful in Web link analysis. It is used to find authority pages and hub pages among search engine results, and it can also be used to discover Web communities. However, because HITS relies on the hyperlinks between pages, it is prone to topic drift. It also requires a set of pages as the base set for its computation and cannot be applied to plain text. This paper introduces a new algorithm, PK-TDC, which adopts the iterative idea of HITS. PK-TDC searches for authority pages and keywords on the pages-keywords topology, and clusters the pages by the keywords they contain. Experiments show that PK-TDC performs well at extracting topics and clustering, both on pages with hyperlinks and on plain text.
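The abstract does not give PK-TDC's update rules, so the following is only a minimal sketch of the general idea it describes: a HITS-style mutual reinforcement run on a bipartite pages-keywords graph instead of a hyperlink graph. The function names (`bipartite_hits`, `cluster_by_top_keyword`), the use of term counts as edge weights, the L2 normalisation, and the "assign each page to its strongest keyword" clustering rule are all assumptions made for illustration, not the authors' method.

```python
import math


def bipartite_hits(pages, iterations=50):
    """HITS-style iteration on a pages-keywords graph (illustrative only).

    pages: dict mapping page id -> dict of keyword -> weight (assumed here
    to be a term count). Returns (page_score, keyword_score).
    """
    keywords = sorted({k for kws in pages.values() for k in kws})
    page_score = {p: 1.0 for p in pages}
    kw_score = {k: 1.0 for k in keywords}

    for _ in range(iterations):
        # A keyword is important if important pages contain it.
        new_kw = {k: 0.0 for k in keywords}
        for p, kws in pages.items():
            for k, w in kws.items():
                new_kw[k] += w * page_score[p]
        norm = math.sqrt(sum(v * v for v in new_kw.values())) or 1.0
        kw_score = {k: v / norm for k, v in new_kw.items()}

        # A page is authoritative if it contains important keywords.
        new_page = {
            p: sum(w * kw_score[k] for k, w in kws.items())
            for p, kws in pages.items()
        }
        norm = math.sqrt(sum(v * v for v in new_page.values())) or 1.0
        page_score = {p: v / norm for p, v in new_page.items()}

    return page_score, kw_score


def cluster_by_top_keyword(pages, kw_score):
    """Group pages under the contained keyword with the highest weighted score.

    This grouping rule is a hypothetical stand-in for the clustering step.
    """
    clusters = {}
    for p, kws in pages.items():
        if not kws:
            continue  # a page with no keywords cannot be assigned
        best = max(kws, key=lambda k: kws[k] * kw_score[k])
        clusters.setdefault(best, []).append(p)
    return clusters


if __name__ == "__main__":
    # Toy input: keyword counts per document; no hyperlinks are needed.
    docs = {
        "a": {"web": 2, "search": 1},
        "b": {"web": 1, "search": 2},
        "c": {"cooking": 1},
    }
    page_score, kw_score = bipartite_hits(docs)
    print(page_score)
    print(cluster_by_top_keyword(docs, kw_score))
```

Because the graph is built from page content, the same iteration applies to plain-text documents, which is the property the abstract highlights over link-based HITS.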
Keywords :
Internet; classification; search engines; text analysis; Hits algorithm; PK-TDC algorithm; Web linking analysis; hyperlinks; pages-keyword topology; search engine; topic clustering algorithm; topic distillation algorithm; Algorithm design and analysis; Clustering algorithms; Couplings; Internet; Iterative algorithms; Joining processes; Machine learning; Machine learning algorithms; Search engines; Topology; Web page design; Web pages; Hits; community search; topic clustering; topic extracting;
Conference_Title :
Machine Learning and Cybernetics, 2006 International Conference on
Conference_Location :
Dalian, China
Print_ISBN :
1-4244-0061-9
DOI :
10.1109/ICMLC.2006.258833