• DocumentCode
    3277550
  • Title
    Medium sized crawling made fast and easy through Lumbricus webis
  • Author
    Felicioli, Claudio ; Geraci, Filippo ; Pellegrini, Marco
  • Author_Institution
    Ist. di Inf. e Telematica, CNR, Pisa, Italy
  • Volume
    4
  • fYear
    2011
  • fDate
    10-13 July 2011
  • Firstpage
    1920
  • Lastpage
    1926
  • Abstract
    Web crawlers have become popular tools for collecting large portions of the web, which can then be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of the tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (L.webis for short), a modular crawling infrastructure built to mine data from the .it domain and the portions of the web reachable from it. The purpose of our crawler is to support the gathering of advanced statistics and the use of advanced analytic tools on the content of the Italian Web. This paper describes the architectural features of L.webis and its performance. L.webis can currently download a mid-sized ccTLD such as “.it” in about one week.
  • Keywords
    Internet; distributed processing; statistical analysis; Italian Web; Lumbricus webis; Web crawlers; distributed architecture; medium sized crawling; statistics analysis; structural analysis; Instruction sets; Random access memory; Resource management
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics (ICMLC), 2011 International Conference on
  • Conference_Location
    Guilin
  • ISSN
    2160-133X
  • Print_ISBN
    978-1-4577-0305-8
  • Type
    conf
  • DOI
    10.1109/ICMLC.2011.6016946
  • Filename
    6016946