Title :
An incremental clustering crawler for community-limited search
Author :
Kim, Gye-Jeong ; Whang, Kyu-Young ; Kim, Min-Soo ; Lim, Hyo-Sang ; Lee, Ki-Hoon
Abstract :
We propose an incremental clustering crawler, a novel algorithm for finding communities for community-limited search in the Web. A Web community is a set of semantically related sites found through link-based clustering. The key idea of the proposed algorithm is to perform clustering incrementally while crawling is in progress. This algorithm does not need to crawl all the Web pages a priori, but needs to crawl only as many Web pages as are relevant to the clusters that are being formed. This ability to crawl on the fly is an important advantage since it is infeasible to crawl the entire set of Web pages in the world and since we often do not even know which Web pages or sites to crawl. Another advantage is that the time spent on clustering is reduced because at any time the clustering is performed on only the relevant Web pages collected thus far. An apparent disadvantage is that the resulting clusters are not optimal since the algorithm does not have all the crawled sites available at the time of clustering. Experiments show, however, that the achieved cluster quality is comparable to the optimal cluster quality which, in our experiments, is achieved using the minimum spanning tree clustering algorithm.
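The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea of clustering while crawling, not the authors' algorithm. It assumes a hypothetical `fetch_links(site)` callback that returns the sites a given site links to, a hypothetical `min_links` threshold that stands in for "relevance" to a cluster, and a union-find structure for merging clusters; the toy link graph and its `.example` site names are likewise invented for illustration.

```python
from collections import Counter, deque


def incremental_clustering_crawl(seeds, fetch_links, min_links=2, max_sites=100):
    """Toy sketch: merge sites into clusters while the crawl is in progress.

    Assumption (not from the paper): a target site is "relevant" when the
    site being crawled links to it at least `min_links` times.
    """
    parent = {}  # union-find forest over the sites seen so far

    def find(site):
        parent.setdefault(site, site)
        while parent[site] != site:
            parent[site] = parent[parent[site]]  # path halving
            site = parent[site]
        return site

    crawled = set()
    queued = set(seeds)
    frontier = deque(seeds)

    while frontier and len(crawled) < max_sites:
        site = frontier.popleft()
        crawled.add(site)
        find(site)  # register the site as its own cluster if it is new
        for target, count in Counter(fetch_links(site)).items():
            if target == site or count < min_links:
                continue  # weakly linked sites are never queued for crawling
            parent[find(site)] = find(target)  # cluster immediately, mid-crawl
            if target not in queued:
                queued.add(target)
                frontier.append(target)

    clusters = {}
    for site in crawled:
        clusters.setdefault(find(site), set()).add(site)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy link graph: each site maps to the sites it links to.
TOY_WEB = {
    "a.example": ["b.example", "b.example", "x.example"],
    "b.example": ["a.example", "a.example", "c.example", "c.example"],
    "c.example": [],
    "p.example": ["q.example", "q.example"],
    "q.example": [],
    "x.example": ["p.example", "p.example"],
}


def toy_fetch(site):
    return TOY_WEB.get(site, [])


clusters = incremental_clustering_crawl(["a.example", "p.example"], toy_fetch)
print(clusters)
```

In this toy run the weakly linked site `x.example` is never crawled, which illustrates the trade-off the abstract describes: fewer pages are fetched and clustered, but the clusters are formed without knowledge of links that a full crawl would have found.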
Keywords :
Web sites; pattern clustering; search engines; semantic Web; Web community; Web page; Web site; community-limited search; incremental clustering crawler; Clustering algorithms; Computer science; Crawlers; Information resources; Uniform resource locators; Web pages; Web search
Conference_Title :
Second International Conference on the Applications of Digital Information and Web Technologies, 2009 (ICADIWT '09)
Conference_Location :
London
Print_ISBN :
978-1-4244-4456-4
Electronic_ISBN :
978-1-4244-4457-1
DOI :
10.1109/ICADIWT.2009.5273940