DocumentCode :
3245421
Title :
A Forwarding-Based Task Scheduling Algorithm for Distributed Web Crawling over DHTs
Author :
Xu, Xiao ; Zhang, Wei-Zhe ; Zhang, Hong-Li ; Fang, Bin-Xing ; Liu, Xin-Ran
Author_Institution :
Sch. of Comput. Sci. & Technol., Harbin Inst. of Technol., Harbin, China
fYear :
2009
fDate :
8-11 Dec. 2009
Firstpage :
854
Lastpage :
859
Abstract :
Distributed Web crawling (DWC) over DHTs has been proposed to address the bottlenecks of traditional Web crawling. The core of such a system is its fully distributed task scheduling mechanism, in which the crawlers are treated as peers and the crawlees are treated as resources maintained by the peers. A system model based on the content addressable network (CAN) can further optimize the scheduling mechanism by exploiting the network proximity of crawlers and crawlees. In this paper, we propose a new method for CAN that achieves load balancing in the CAN-based DWC system. In our simulations, the method not only maintains load balance among peers but also keeps the distance between peers and resources very short. The shortened peer-resource distance meets the need for short crawler-crawlee latencies.
Keywords :
Internet; cryptography; resource allocation; scheduling; content addressable network; crawler-crawlee latencies; distributed Web crawling; distributed hash tables; forwarding-based task scheduling algorithm; load balancing; network proximity; Computer networks; Computer science; Crawlers; Delay; Load management; Peer to peer computing; Processor scheduling; Robustness; Scalability; Scheduling algorithm; DHT; task scheduling
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on
Conference_Location :
Shenzhen
ISSN :
1521-9097
Print_ISBN :
978-1-4244-5788-5
Type :
conf
DOI :
10.1109/ICPADS.2009.29
Filename :
5395331