• DocumentCode
    114113
  • Title
    Web page download scheduling policies for green web crawling
  • Author
    Hatzi, Vassiliki ; Cambazoglu, B. Barla ; Koutsopoulos, Iordanis
  • Author_Institution
    CERTH, Univ. of Thessaly, Volos, Greece
  • fYear
    2014
  • fDate
    17-19 Sept. 2014
  • Firstpage
    56
  • Lastpage
    60
  • Abstract
    A web crawler is responsible for discovering new web pages on the Web as well as for refreshing the content of already downloaded pages. During these operations, it can issue a huge number of page download requests to servers on the Web. These requests, in turn, increase the energy consumption of the servers, as hardware resources are used when serving the requested pages. This has the side effect of increasing the carbon footprint of the servers. In this work, we introduce the problem of green web crawling from a set of remote web servers, where the goal is to reduce the carbon footprint incurred by a large-scale web crawler. We consider a scenario where both the freshness of downloaded pages and the carbon emissions at remote servers need to be taken into account. We present various heuristics for prioritizing the page download requests as a means to study the relative importance of different parameters. We conduct experiments on a real data set covering a large server collection with two billion pages. The results indicate that the carbon footprint generated by a crawler during its external operations can be considerably reduced without compromising the freshness of pages. Our work provides design guidelines for large-scale commercial search engine companies, which need to comply with certain greenness regulations.
  • Keywords
    Internet; green computing; scheduling; search engines; Web page discovery; Web page download scheduling policies; carbon emissions; carbon footprint reduction; green Web crawling; greenness regulations; hardware resources; large-scale Web crawler; large-scale commercial search engine company; page download request; page freshness; remote Web servers; server carbon footprint; server energy consumption; Carbon dioxide; Crawlers; Green products; Indexes; Web pages; Web servers
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Software, Telecommunications and Computer Networks (SoftCOM), 2014 22nd International Conference on
  • Conference_Location
    Split
  • Type
    conf
  • DOI
    10.1109/SOFTCOM.2014.7039136
  • Filename
    7039136