DocumentCode :
3080830
Title :
A popularity-based URL ordering algorithm for crawlers
Author :
Chandramouli, Aravind ; Gauch, Susan ; Eno, Joshua
Author_Institution :
Univ. of Kansas, Lawrence, KS, USA
fYear :
2010
fDate :
13-15 May 2010
Firstpage :
556
Lastpage :
562
Abstract :
Uniform Resource Locator (URL) ordering algorithms are used by Web crawlers to determine the order in which to download pages from the Web. The current approaches for URL ordering based on link structure are expensive and/or miss many good pages, particularly in social network environments. In this paper, we present a novel URL ordering algorithm that exploits the access count information present in the Web logs on the individual Websites. In particular, we develop algorithms based on internal and external counts. By using this popularity information for URL ordering, we are able to retrieve high-quality pages earlier in the crawl. We perform our experiments on two data sets using the Web logs from university and CiteSeer Websites and, on these data sets, we achieve a statistically significant improvement in the ordering of the high-quality pages (as indicated by Google's PageRank) of 57.2% and 65.7% over that of a breadth-first search crawl.
Keywords :
Internet; Web sites; information retrieval; Web crawlers; Web logs; popularity-based URL ordering algorithm; social network; uniform resource locator; Crawlers; Decision support systems; Helium; Information retrieval; Social network services; Uniform resource locators; page ranking; social content; url ordering; web crawler
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Human System Interactions (HSI), 2010 3rd Conference on
Conference_Location :
Rzeszow
Print_ISBN :
978-1-4244-7560-5
Type :
conf
DOI :
10.1109/HSI.2010.5514512
Filename :
5514512