DocumentCode
3436425
Title
CINDI Robot: an Intelligent Web Crawler Based on Multi-level Inspection
Author
Chen, Rui ; Desai, Bipin C. ; Zhou, Cong
Author_Institution
Concordia Univ., Montreal
fYear
2007
fDate
6-8 Sept. 2007
Firstpage
93
Lastpage
101
Abstract
With the explosion of the Web, focused Web crawlers are gaining attention. Focused Web crawlers aim to find Web pages related to a predefined topic. CINDI Robot is a focused Web crawler devoted to finding academic documents in computer science and software engineering. We propose a multi-level inspection scheme to discover relevant Web pages. In this scheme, the text features of the content contribute to the classification; in addition, other Web characteristics, such as URL patterns and anchor text, assist the decision process. The experimental results demonstrate that this multi-level inspection method outperforms traditional methods.
Keywords
Internet; classification; indexing; information retrieval; online front-ends; CINDI robot; URL pattern; Web pages; World Wide Web; computer science documents; focused Web crawler; intelligent Web crawler; multilevel inspection; software engineering academic documents; Computer science; Crawlers; Inspection; Intelligent robots; Internet; Search engines; Software engineering; Statistical analysis; Uniform resource locators; Web pages; Naïve Bayes classifier; SVM classifier; focused web crawler; graph; multi-level inspection; revised context; tunneling
fLanguage
English
Publisher
IEEE
Conference_Titel
Database Engineering and Applications Symposium, 2007. IDEAS 2007. 11th International
Conference_Location
Banff, Alta.
ISSN
1098-8068
Print_ISBN
978-0-7695-2947-9
Type
conf
DOI
10.1109/IDEAS.2007.4318093
Filename
4318093