• DocumentCode
    3063641
  • Title
    Advanced Deep Web Crawler Based on Dom
  • Author
    Ma, Weicheng ; Chen, Xiuxia ; Shang, Wenqian
  • Author_Institution
    Sch. of Comput., Commun. Univ. of China, Beijing, China
  • fYear
    2012
  • fDate
    23-26 June 2012
  • Firstpage
    605
  • Lastpage
    609
  • Abstract
    A large amount of data today is stored only in the deep web. Judging from prior work on deep web crawlers, it is evident that no perfect, or even complete, crawler for deep web data has yet been built. To meet the needs of deep web search, we have designed a new crawler structure, currently focused on extracting data from forms, the most common type of deep web interface. Our crawler introduces several innovations, such as the mainframe extracting module, an algorithm that uses improved Bayesian classification to distinguish different websites sharing the same URL, and extended support for handling AJAX forms. In addition, a DOM tree is used to make the analysis and processing of downloaded web pages easier and more visual.
  • Keywords
    Bayes methods; Internet; Web sites; document handling; information retrieval; pattern classification; trees (mathematics); AJAX form; Bayesian classification; Dom Tree; Web pages; Website URL; advanced deep Web crawler; crawler structure; deep Web data; deep Web interface; deep Web search; form data extraction; mainframe extracting module; Bayesian methods; Crawlers; Data mining; Feature extraction; HTML; Web pages; XML; AJAX; Deep Web; Dom Tree; Form;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Sciences and Optimization (CSO), 2012 Fifth International Joint Conference on
  • Conference_Location
    Harbin
  • Print_ISBN
    978-1-4673-1365-0
  • Type
    conf
  • DOI
    10.1109/CSO.2012.138
  • Filename
    6274799