• DocumentCode
    590178
  • Title

    Defense response of search engine websites to non cooperating crawlers

  • Author

    Dev Chandna, Rishabh ; Chaubey, P. ; Gupta, S.C.

  • Author_Institution
    Indian Inst. of Technol. (BHU), Varanasi, India
  • fYear
    2012
  • fDate
    Oct. 30 2012-Nov. 2 2012
  • Firstpage
    219
  • Lastpage
    223
  • Abstract
    Web crawlers that do not cooperate with robots.txt are unwanted by any website, as they can have a serious negative impact in terms of denial of service, privacy, and cost. Defense mechanisms such as the Automated Content Access Protocol, CAPTCHA, web crawler traps, and real-time bot detection have been proposed to protect websites from unwanted crawler access. However, the extent to which these mechanisms are applied in practice against such crawlers is not clearly known. In this paper we present an investigation carried out to gain insight into the defense mechanisms used by websites against robots.txt non-cooperating web crawlers. The investigation is limited to the search engine class of websites. MBot, a self-developed non-cooperating web crawler, is the primary tool used for the investigation. We find that search engine websites do have defense mechanisms to prevent non-cooperating crawler access. However, some of the investigated websites show no defensive response to MBot's access. The observed defense mechanisms are also found to be robust to changes in basic network and application parameters such as proxy, port number, user agent, and IP address.
  • Keywords
    Web sites; data privacy; information retrieval; search engines; MBot; Robots.txt noncooperating Web crawlers; Web crawler trap; automated content access protocol; captcha; defense mechanisms; defense response; real time bot detection; search engine Websites; self-developed non cooperating Web crawler; Communications technology; Decision support systems; Helium; defense mechanism; robots exclusion protocol; robots.txt; web crawler; website;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information and Communication Technologies (WICT), 2012 World Congress on
  • Conference_Location
    Trivandrum
  • Print_ISBN
    978-1-4673-4806-5
  • Type

    conf

  • DOI
    10.1109/WICT.2012.6409078
  • Filename
    6409078