  • DocumentCode
    177172
  • Title
    A novel method for extracting entity data from Deep Web precisely
  • Author
    Hai-tao Yu ; Jian-Yi Guo ; Zheng-Tao Yu ; Yan-Tuan Xian ; Xin Yan
  • Author_Institution
    Sch. of Inf. Eng. & Autom., Kunming Univ. of Sci. & Technol., Kunming, China
  • fYear
    2014
  • fDate
    May 31 2014-June 2 2014
  • Firstpage
    5049
  • Lastpage
    5053
  • Abstract
    To make better use of the information hidden in the Deep Web and to access its embedded entity data quickly and accurately, this paper presents a method for precisely extracting entity data from the Deep Web and designs an entity extraction system that performs the extraction automatically. First, a web crawler is designed around the characteristics of the Deep Web and used to collect resources from the Internet. Second, the collected web resources are preprocessed so that non-standard pages are normalized. Finally, the entity data is located and extracted: based on the hierarchy and layout features of the DOM tree, XPath is combined with regular expressions (RegExp) to locate the entity data, and the extracted entity attributes and attribute values are then stored. Experiments show that the method locates and extracts entity data from the Deep Web quickly and efficiently, and achieves higher accuracy.
  • Keywords
    Internet; document handling; information retrieval; DOM tree; RegExp; Web crawler; Web resources; XPath; attribute values; deep Web; document object model; embedded entity data access; entity attributes; entity data extraction; hidden information value; Crawlers; Data mining; Feature extraction; HTML; Standards; DOM; Deep Web; Entity Extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    The 26th Chinese Control and Decision Conference (2014 CCDC)
  • Conference_Location
    Changsha
  • Print_ISBN
    978-1-4799-3707-3
  • Type
    conf
  • DOI
    10.1109/CCDC.2014.6853078
  • Filename
    6853078
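
The abstract's final step (locating entity data with XPath over the DOM tree, then using RegExp to separate attributes from values) can be illustrated with a minimal sketch. This is not the authors' implementation: the page fragment, the `entity` class name, the `ROW_XPATH` expression, the `ROW_PATTERN` regular expression, and the `extract_entity` helper are all hypothetical, and the page is assumed to have already been normalized into well-formed markup by the preprocessing step. It uses only the limited XPath subset in Python's standard `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical result page, assumed to be already normalized (well-formed).
PAGE = """
<html><body>
  <div class="nav">Home | Search</div>
  <div class="entity">
    <ul>
      <li>Name: Example Hotel</li>
      <li>Phone: 0871-5551234</li>
      <li>Rating: 4.5</li>
    </ul>
  </div>
</body></html>
"""

# XPath (ElementTree subset): use the DOM hierarchy to reach the entity rows.
ROW_XPATH = ".//div[@class='entity']/ul/li"

# RegExp: split one row of text into an attribute name and an attribute value.
ROW_PATTERN = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def extract_entity(page):
    """Return a dict mapping attribute names to attribute values."""
    root = ET.fromstring(page.strip())
    entity = {}
    for node in root.findall(ROW_XPATH):
        text = "".join(node.itertext())
        match = ROW_PATTERN.match(text)
        if match:  # rows that do not look like "name: value" are skipped
            entity[match.group(1)] = match.group(2)
    return entity


if __name__ == "__main__":
    print(extract_entity(PAGE))
```

In this sketch the XPath expression narrows the search to the block that holds the entity, so the navigation text is never considered, and the regular expression handles the finer-grained split inside each node. Real Deep Web pages are rarely well-formed, so a tolerant HTML parser would likely be needed in place of `ET.fromstring`.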