• DocumentCode
    1801504
  • Title

    The study of web information extraction technology based on VietSpider

  • Author

    Gao Tao ; Wu Hongna

  • Author_Institution
    Beijing Inst. of Technol., Beijing, China
  • fYear
    2013
  • fDate
    26-28 July 2013
  • Firstpage
    8465
  • Lastpage
    8470
  • Abstract
    Network information extraction is currently an active and difficult topic in Web data mining. This paper introduces VietSpider, a new open-source information collection tool, covering its system structure, core technology, and a usage case. The authors also compare it with another approach, the Heritrix+HtmlParser combination, and analyze the advantages and disadvantages of the two methods to help users and researchers choose between them. Finally, the authors give a solution to the garbled-character ("messy code") problem that arises when acquiring Chinese-language information.
  • Keywords
    Internet; data acquisition; graphical user interfaces; information retrieval; public domain software; storage management; Chinese information acquisition; Heritrix+HtmlParser combination; VietSpider; Web data excavation area; Web information extraction technology; core technology; garbage problem; graphical interface; network information extraction technology; open source information collection tool; system structure; Crawlers; Data mining; Databases; Encoding; Information filters; Heritrix+HtmlParser; Messy code; VietSpider; Web information extraction;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Control Conference (CCC), 2013 32nd Chinese
  • Conference_Location
    Xi'an
  • Type

    conf

  • Filename
    6640939