DocumentCode :
3397574
Title :
Building a specialized high performance web crawler
Author :
Vasile, Adrian-Ioan ; Pavaloiu, Bujor ; Dan Cristea, Paul
Author_Institution :
Biomed. Eng. Centre, Univ. Politeh. of Bucharest, Bucharest, Romania
fYear :
2013
fDate :
7-9 July 2013
Firstpage :
183
Lastpage :
186
Abstract :
In this paper, we describe the design of a specialized high-performance web crawler that runs in a decentralized fashion. It is specialized for scraping data from New Media web sites such as blogs, Twitter, and Facebook, which have grown exponentially in recent years. The crawler is designed to scale easily from a single node to hundreds or more, to be resilient against crashes and other events, to have low latency, to be polite, and to adapt to various situations. We discuss the architecture, performance bottlenecks, and proper crawling etiquette.
Keywords :
Web sites; search engines; crawling etiquette; high performance Web crawler; new media Web sites; Blogs; Crawlers; Feeds; Indexes; Media; Social network services; social media; web crawler
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Systems, Signals and Image Processing (IWSSIP), 2013 20th International Conference on
Conference_Location :
Bucharest
ISSN :
2157-8672
Print_ISBN :
978-1-4799-0941-4
Type :
conf
DOI :
10.1109/IWSSIP.2013.6623484
Filename :
6623484