• DocumentCode
    2982645
  • Title
    Delimiting boundaries of a national Web in a globalized world, UAE case study
  • Author
    BenAbdelkader, Chiraz ; Sanver, Mostafa
  • Author_Institution
    Sch. of Eng. & Comput. Sci., New York Inst. of Technol., Abu Dhabi, United Arab Emirates
  • fYear
    2011
  • fDate
    19-22 Feb. 2011
  • Firstpage
    537
  • Lastpage
    540
  • Abstract
    In this paper, we address the problem of delimiting the boundaries of a specific national Web community. We contend that previous simple techniques, mostly based on IP range and language information, are no longer effective. In reality, the Web has undergone a globalization trend, and we can no longer assume a simple one-to-one mapping between where Web content is hosted, the language it is written in, the target community it is intended for, and its geographic location. We propose a two-stage Web page filtering (classification) method for this problem: (1) a pre-crawl filter designed to quickly prune out most of the irrelevant pages without downloading them, and (2) a post-crawl filter that prunes out (most of) the remaining ones via more detailed albeit time-consuming analysis. We discuss the proposed techniques in the context of the UAE national Web, and present results on Web crawl data collected during the period June-July 2010.
  • Keywords
    Internet; information filtering; UAE; Web page filtering; language information; national Web community; one-to-one mapping; post-crawl filter; pre-crawl filter; time consuming analysis; Communities; Crawlers; IP networks; Web pages; Web server; Graph theory; Hypertext systems; Web characterization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    GCC Conference and Exhibition (GCC), 2011 IEEE
  • Conference_Location
    Dubai
  • Print_ISBN
    978-1-61284-118-2
  • Type
    conf
  • DOI
    10.1109/IEEEGCC.2011.5752578
  • Filename
    5752578