Title :
Building a specialized high performance web crawler
Author :
Vasile, Adrian-Ioan ; Pavaloiu, Bujor ; Dan Cristea, Paul
Author_Institution :
Biomed. Eng. Centre, Univ. Politeh. of Bucharest, Bucharest, Romania
Abstract :
In this paper, we describe the design of a specialized high-performance web crawler that runs in a decentralized fashion. It is specialized for scraping data from New Media web sites such as blogs, Twitter and Facebook, which have grown exponentially in recent years. The crawler is designed to scale easily from a single node to hundreds or more, to be resilient against crashes and other events, to have low latency, to be polite, and to adapt to various situations. We discuss the architecture, performance bottlenecks and proper crawling etiquette.
Keywords :
Web sites; search engines; crawling etiquette; high performance Web crawler; new media Web sites; Blogs; Crawlers; Feeds; Indexes; Media; Social network services; social media; web crawler
Conference_Title :
Systems, Signals and Image Processing (IWSSIP), 2013 20th International Conference on
Conference_Location :
Bucharest
Print_ISBN :
978-1-4799-0941-4
DOI :
10.1109/IWSSIP.2013.6623484