• DocumentCode
    3397574
  • Title
    Building a specialized high performance web crawler
  • Author
    Vasile, Adrian-Ioan ; Pavaloiu, Bujor ; Dan Cristea, Paul
  • Author_Institution
    Biomed. Eng. Centre, Univ. Politeh. of Bucharest, Bucharest, Romania
  • fYear
    2013
  • fDate
    7-9 July 2013
  • Firstpage
    183
  • Lastpage
    186
  • Abstract
    In this paper, we describe the design of a specialized high-performance web crawler that runs in a decentralized fashion. It is specialized for scraping data from New Media web sites such as blogs, Twitter, and Facebook, a category that has grown exponentially in recent years. The crawler is designed to scale easily from a single node to hundreds or more, to be resilient against crashes and other events, to have low latency, to be polite, and to adapt to various situations. We discuss the architecture, performance bottlenecks, and proper crawling etiquette.
  • Keywords
    Web sites; search engines; crawling etiquette; high performance Web crawler; new media Web sites; Blogs; Crawlers; Feeds; Indexes; Media; Social network services; Web sites; social media; web crawler;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Signals and Image Processing (IWSSIP), 2013 20th International Conference on
  • Conference_Location
    Bucharest
  • ISSN
    2157-8672
  • Print_ISBN
    978-1-4799-0941-4
  • Type
    conf
  • DOI
    10.1109/IWSSIP.2013.6623484
  • Filename
    6623484