Title :
Finding a Web community by maximum flow algorithm with HITS score based capacity
Author :
Imafuji, Noriko ; Kitsuregawa, Masaru
Author_Institution :
Inst. of Ind. Sci., Univ. of Tokyo, Japan
Abstract :
We propose an edge capacity based on hub and authority scores and examine its effect on the maximum-flow method for extracting Web communities proposed by G. Flake et al. (2000). A Web community is a collection of Web pages that share a common (or related) topic. In recent years, various methods for finding Web communities have been proposed. G. Flake et al.'s method, which is based on a maximum flow algorithm, has a major advantage: "topic drift" does not easily occur. On the other hand, it assigns the same fixed capacity to every edge, which is one of the major causes of failure to obtain a proper Web community. Our approach, which uses a HITS score based edge capacity, effectively extracts Web pages that are well balanced in both global and local relations to the given seed node. We evaluated the effect experimentally on 20 randomly selected topics, using Web archives in Japan crawled in 2002. The results confirmed that the average precision rose by approximately 20%.
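The idea in the abstract can be illustrated with a small sketch: compute hub and authority scores, derive a per-edge capacity from them, and take the source side of a minimum cut as the community. The abstract does not state the capacity formula, so `hits_capacities` below (capacity = hub score of the linking page plus authority score of the linked page) is purely an assumption for illustration, as are all function names; this is not the authors' implementation.

```python
from collections import deque


def hits(edges, iters=50):
    """Hub and authority scores by power iteration (L2-normalised)."""
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority of n: sum of hub scores of pages linking to n.
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub of n: sum of authority scores of pages n links to.
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


def hits_capacities(edges):
    """ASSUMED capacity rule (not from the paper): hub(u) + authority(v)."""
    hub, auth = hits(edges)
    return {(u, v): hub[u] + auth[v] for u, v in edges}


def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp max flow; returns (flow value, source side of a min cut)."""
    residual = dict(cap)
    for u, v in cap:
        residual.setdefault((v, u), 0.0)  # reverse edges for the residual graph
    adj = {}
    for u, v in residual:
        adj.setdefault(u, set()).add(v)
    flow = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path left: nodes reachable from s form the cut side.
            return flow, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck
```

With a fixed capacity every edge counts equally toward the cut; passing the output of `hits_capacities` to `max_flow_min_cut` instead makes edges between strong hubs and strong authorities more expensive to cut, which is the general effect the abstract describes.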
Keywords :
Internet; Web sites; graph theory; information retrieval; HITS score based capacity; Web archives; Web community; Web pages; World Wide Web; authority scores; edge capacity; experiments; graphs; hub scores; maximum flow algorithm; Bipartite graph; Database systems; Performance evaluation
Conference_Title :
Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings. Eighth International Conference on
Conference_Location :
Kyoto, Japan
Print_ISBN :
0-7695-1895-8
DOI :
10.1109/DASFAA.2003.1192373