DocumentCode
3397574
Title
Building a specialized high performance web crawler
Author
Vasile, Adrian-Ioan ; Pavaloiu, Bujor ; Dan Cristea, Paul
Author_Institution
Biomed. Eng. Centre, Univ. Politeh. of Bucharest, Bucharest, Romania
fYear
2013
fDate
7-9 July 2013
Firstpage
183
Lastpage
186
Abstract
In this paper, we describe the design of a specialized high-performance web crawler that runs in a decentralized fashion. It is specialized for scraping data from New Media web sites such as blogs, Twitter, and Facebook, which have grown exponentially in recent years. The crawler is designed to scale easily from a single node to hundreds or more, to be resilient against crashes and other events, to have low latency, to be polite, and to adapt to various situations. We discuss the architecture, performance bottlenecks, and proper crawling etiquette.
Keywords
Web sites; search engines; crawling etiquette; high performance Web crawler; new media Web sites; Blogs; Crawlers; Feeds; Indexes; Media; Social network services; social media; web crawler
fLanguage
English
Publisher
IEEE
Conference_Titel
Systems, Signals and Image Processing (IWSSIP), 2013 20th International Conference on
Conference_Location
Bucharest
ISSN
2157-8672
Print_ISBN
978-1-4799-0941-4
Type
conf
DOI
10.1109/IWSSIP.2013.6623484
Filename
6623484