• DocumentCode
    3436425
  • Title
    CINDI Robot: an Intelligent Web Crawler Based on Multi-level Inspection
  • Author
    Chen, Rui; Desai, Bipin C.; Zhou, Cong
  • Author_Institution
    Concordia Univ., Montreal
  • fYear
    2007
  • fDate
    6-8 Sept. 2007
  • Firstpage
    93
  • Lastpage
    101
  • Abstract
    With the explosion of the Web, focused Web crawlers are gaining attention. Focused Web crawlers aim to find Web pages related to a predefined topic. CINDI Robot is a focused Web crawler devoted to finding academic documents in computer science and software engineering. We propose a multi-level inspection scheme to discover relevant Web pages. In this scheme, the text features of the content contribute to the classification; furthermore, other Web characteristics, such as URL patterns and anchor text, assist the decision process. The experimental results demonstrate that this multi-level inspection method outperforms other traditional methods.
  • Keywords
    Internet; classification; indexing; information retrieval; online front-ends; CINDI robot; URL pattern; Web pages; World Wide Web; computer science documents; focused Web crawler; intelligent Web crawler; multilevel inspection; software engineering academic documents; Computer science; Crawlers; Inspection; Intelligent robots; Internet; Search engines; Software engineering; Statistical analysis; Uniform resource locators; Web pages; Naïve Bayes classifier; SVM classifier; focused web crawler; graph; multi-level inspection; revised context; tunneling
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database Engineering and Applications Symposium, 2007. IDEAS 2007. 11th International
  • Conference_Location
    Banff, Alta.
  • ISSN
    1098-8068
  • Print_ISBN
    978-0-7695-2947-9
  • Type
    conf
  • DOI
    10.1109/IDEAS.2007.4318093
  • Filename
    4318093