Advanced Deep Web Crawler Based on Dom

Author

Ma, Weicheng ; Chen, Xiuxia ; Shang, Wenqian

Author_Institution

Sch. of Comput., Commun. Univ. of China, Beijing, China

fYear

2012

fDate

23-26 June 2012

Firstpage

605

Lastpage

609

Abstract

A large amount of data today is stored only in the deep web. A review of existing work on deep web crawlers makes it evident that no perfect, or even complete, crawler for deep web data has yet been built. To meet the needs of deep web search, we have designed a new crawler structure that currently focuses on extracting data from forms, the most common type of deep web interface. Our crawler introduces several innovations, including a mainframe extracting module, an algorithm that uses improved Bayesian classification to distinguish different websites sharing the same URL, and extended support for handling AJAX forms. In addition, a DOM tree is used to make the analysis and processing of downloaded web pages easier and more visual.

Keywords

Bayes methods; Internet; Web sites; document handling; information retrieval; pattern classification; trees (mathematics); AJAX form; Bayesian classification; Dom Tree; Web pages; Website URL; advanced deep Web crawler; crawler structure; deep Web data; deep Web interface; deep Web search; form data extraction; mainframe extracting module; Bayesian methods; Crawlers; Data mining; Feature extraction; HTML; XML; AJAX; Deep Web; Form

fLanguage

English

Publisher

IEEE

Conference_Titel

Computational Sciences and Optimization (CSO), 2012 Fifth International Joint Conference on

Conference_Location

Harbin

Print_ISBN

978-1-4673-1365-0

Type

conf

DOI

10.1109/CSO.2012.138

Filename

6274799