Title :
Extracting topic maps from Web pages by Web link structure and content
Author :
Mase, Motohiro ; Yamada, Seiji ; Nitta, Katsumi
Author_Institution :
Dept. of Comput. Intell. & Syst. Sci., Tokyo Inst. of Technol., Tokyo
Abstract :
We propose a framework for extracting topic maps from a set of Web pages. We cluster the Web pages and extract topic map prototypes from the result. We add two points to an existing clustering method. The first is to merge only linked Web pages, which extracts the underlying relationships between the topics. The second is to introduce weighting based on the content similarity of the Web pages and the relevance between the topics of the pages. The relevance is based on the types of links with respect to the directories in the Web site structure and on the distance between the directories in which the pages are located. From the clustering results, we generate the topic map prototypes by treating the clusters as the topics, the edges as the associations, and the Web pages related to the topics as the occurrences. Finally, users complete the prototype by labeling the topics and associations and removing unnecessary items. We incrementally use the user's evaluation of the topic maps to judge whether a Web page is necessary or unnecessary and thereby reduce the number of unnecessary pages. We use relevance feedback with a Support Vector Machine (SVM) to judge the Web pages. In this paper, as a first step, we implemented the proposed clustering method and conducted experiments to evaluate its effectiveness in extracting topic map prototypes. We then discuss the effectiveness of the two additions by evaluating the extracted topic map prototypes.
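The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, the linkage rule, or the stopping criterion, so everything below is an assumption made for illustration: the weight is taken as cosine similarity of word counts multiplied by 1 / (1 + directory distance), link types are not modelled, merging is single-linkage, and the names `directory_distance`, `edge_weight`, `cluster_linked_pages`, and `threshold` are hypothetical.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def directory_distance(url_a: str, url_b: str) -> int:
    """Directory steps between the directories holding two pages (assumed measure)."""
    da = urlparse(url_a).path.split("/")[1:-1]
    db = urlparse(url_b).path.split("/")[1:-1]
    common = 0
    for x, y in zip(da, db):
        if x != y:
            break
        common += 1
    return (len(da) - common) + (len(db) - common)


def edge_weight(url_a, url_b, vectors) -> float:
    """Assumed weighting: content similarity damped by directory distance."""
    relevance = 1.0 / (1.0 + directory_distance(url_a, url_b))
    return cosine(vectors[url_a], vectors[url_b]) * relevance


def cluster_linked_pages(pages, links, threshold=0.1):
    """Merge clusters only along hyperlinks whose weight reaches the threshold.

    pages: dict mapping URL -> page text; links: iterable of (URL, URL) pairs.
    Returns (topics, associations): topics is a sorted list of URL lists
    (the occurrences of each topic), associations is a set of topic-index pairs.
    """
    vectors = {u: Counter(t.lower().split()) for u, t in pages.items()}
    cluster_of = {u: i for i, u in enumerate(pages)}
    weights = {}
    for a, b in links:
        if a in pages and b in pages and a != b:
            weights[frozenset((a, b))] = edge_weight(a, b, vectors)

    while True:
        # Pick the heaviest link that still joins two different clusters.
        best = None
        for key, w in weights.items():
            a, b = tuple(key)
            if cluster_of[a] != cluster_of[b] and w >= threshold:
                if best is None or w > best[0]:
                    best = (w, a, b)
        if best is None:
            break
        _, a, b = best
        old, new = cluster_of[b], cluster_of[a]
        for u in cluster_of:
            if cluster_of[u] == old:
                cluster_of[u] = new

    groups = {}
    for u, c in cluster_of.items():
        groups.setdefault(c, []).append(u)
    topics = sorted(sorted(g) for g in groups.values())
    index = {u: i for i, g in enumerate(topics) for u in g}
    # Links that cross cluster boundaries become associations between topics.
    associations = {
        frozenset((index[a], index[b]))
        for a, b in map(tuple, weights)
        if index[a] != index[b]
    }
    return topics, associations
```

In this sketch, unlinked pages are never merged however similar their content, which mirrors the first of the two additions; the directory-distance factor stands in for the second. Labeling the resulting topics and associations, and the SVM-based relevance feedback, are left out.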
Keywords :
Web sites; relevance feedback; support vector machines; Web link content; Web link structure; Web pages; clustering method; topic map extraction
Conference_Title :
Evolutionary Computation, 2008. CEC 2008. (IEEE World Congress on Computational Intelligence). IEEE Congress on
Conference_Location :
Hong Kong
Print_ISBN :
978-1-4244-1822-0
Electronic_ISBN :
978-1-4244-1823-7
DOI :
10.1109/CEC.2008.4630954