• DocumentCode
    2124378
  • Title
    A Memory Efficient Approach for Crawling Language Specific Web: The Arabic Web as a Case Study
  • Author
    Ezzat, D. ; Abdeen, M. ; Tolba, M.F.
  • Author_Institution
    Fac. of Comput. & Inf. & Sci., Ain-Shams Univ., Cairo
  • fYear
    2009
  • fDate
    3-5 April 2009
  • Firstpage
    584
  • Lastpage
    587
  • Abstract
    Web crawlers are a significant component of Web search engines. They are responsible for making a local copy of Web pages and keeping this copy up to date by periodically refreshing the pages. The decision to refresh a Web page is a tradeoff between resource utilization and the freshness of the page content. There are various policies for when to perform a page refresh, and a major factor in determining the refresh policy is the change rate of a Web page. In this paper we address the problem of page refresh for the Arabic Web. We present a novel approach that improves re-crawl scheduling. The proposed technique modifies the information longevity approach to be more suitable for Arabic Web pages. This is done by extracting the Arabic content and excluding stop-list words and redundancies that might not contribute significantly to the meaning. This technique saves scarce memory space in a semantic Arabic Web search engine.
  • Keywords
    natural languages; resource allocation; scheduling; search engines; semantic Web; storage management; Arabic Web crawling language; Web page; memory space; resource utilization; semantic Web search engine; Crawlers; Curve fitting; Data mining; History; Information management; Memory management; Resource management; Search engines; Web pages; Web search; Arabic search engines; Refreshing web pages; Web crawling
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Management and Engineering, 2009. ICIME '09. International Conference on
  • Conference_Location
    Kuala Lumpur
  • Print_ISBN
    978-0-7695-3595-1
  • Type
    conf
  • DOI
    10.1109/ICIME.2009.105
  • Filename
    5077102