Title :
DSphere: A Source-Centric Approach to Crawling, Indexing and Searching the World Wide Web
Author :
Bamba, Bhuvan ; Liu, Ling ; Caverlee, James ; Padliya, Vaibhav ; Srivatsa, Mudhakar ; Bansal, Tushar ; Palekar, Mahesh ; Patrao, Joseph ; Li, Suiyang ; Singh, Ameek
Author_Institution :
Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA
Abstract :
We describe DSphere, a decentralized system for crawling, indexing, searching and ranking documents on the World Wide Web. Unlike most existing search technologies, which depend heavily on a page-centric view of the Web, we advocate a source-centric view and propose a decentralized architecture for crawling, indexing and searching the Web in a distributed, source-specific fashion. We develop a fully decentralized crawler in which each peer is responsible for crawling a specific set of documents, referred to as a source collection. Documents are ranked using link analysis. Traditional link analysis techniques suffer from problems such as slow refresh rates and vulnerability to Web spam. We propose a source-based link analysis approach that computes ranking scores quickly and accurately for all crawled documents.
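The source-centric idea in the abstract can be illustrated with a minimal sketch. It is not the DSphere implementation: the abstract does not define how a source is delimited, how source collections are mapped to peers, or which ranking formula is used. The sketch assumes that a source is a URL's host name, that peers are chosen by hashing the source name, and that sources are ranked with a standard PageRank-style power iteration. The names `source_of`, `assign_peer`, `build_source_graph` and `rank_sources` are hypothetical.

```python
import hashlib
from urllib.parse import urlparse


def source_of(url):
    """Assumption: a 'source' is identified by the URL's host name."""
    return urlparse(url).netloc.lower()


def assign_peer(source, num_peers):
    """Assumption: a source collection is mapped to a peer by hashing its name."""
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_peers


def build_source_graph(page_links):
    """Collapse page-level links (src_url, dst_url) into source-level edges."""
    graph = {}
    for src_url, dst_url in page_links:
        src, dst = source_of(src_url), source_of(dst_url)
        graph.setdefault(src, set())
        graph.setdefault(dst, set())
        if src != dst:  # links inside one source are ignored in this sketch
            graph[src].add(dst)
    return graph


def rank_sources(graph, damping=0.85, iterations=50):
    """Standard PageRank power iteration over the source graph (illustrative)."""
    n = len(graph)
    scores = {s: 1.0 / n for s in graph}
    for _ in range(iterations):
        # Score held by sources without outgoing edges is spread uniformly.
        dangling = sum(scores[s] for s in graph if not graph[s])
        new_scores = {s: (1 - damping) / n + damping * dangling / n for s in graph}
        for s, targets in graph.items():
            for t in targets:
                new_scores[t] += damping * scores[s] / len(targets)
        scores = new_scores
    return scores


if __name__ == "__main__":
    links = [
        ("http://a.example/1", "http://b.example/x"),
        ("http://a.example/2", "http://c.example/"),
        ("http://b.example/x", "http://c.example/y"),
        ("http://c.example/y", "http://c.example/z"),
    ]
    graph = build_source_graph(links)
    for source, score in sorted(rank_sources(graph).items()):
        print(source, "peer", assign_peer(source, 4), "score", round(score, 3))
```

Because many pages collapse into a single source node, the source graph is much smaller than the page graph, which is one plausible reason a source-level computation could refresh faster than a page-level one. The abstract itself does not give the mechanism.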
Keywords :
Internet; document handling; indexing; query formulation; DSphere; World Wide Web; decentralized system; document crawling; document indexing; document ranking; document searching; source-based link analysis; source-centric view; Crawlers; Educational institutions; Fault tolerance; Indexing; Protocols; Scalability; Service oriented architecture; Uniform resource locators; Unsolicited electronic mail; Web sites
Conference_Title :
Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on
Conference_Location :
Istanbul
Print_ISBN :
1-4244-0802-4
Electronic_ISBN :
1-4244-0803-2
DOI :
10.1109/ICDE.2007.369060