A novel method for extracting entity data from Deep Web precisely

Author

Hai-tao Yu ; Jian-Yi Guo ; Zheng-Tao Yu ; Yan-Tuan Xian ; Xin Yan

Author_Institution

Sch. of Inf. Eng. & Autom., Kunming Univ. of Sci. & Technol., Kunming, China

fYear

2014

fDate

May 31 2014-June 2 2014

Firstpage

5049

Lastpage

5053

Abstract

To make better use of the information hidden in the Deep Web and to gain fast, accurate access to the entity data embedded in it, this paper presents a method for precisely extracting entity data from the Deep Web and designs an entity extraction system that extracts this data automatically. First, a web crawler is designed around the characteristics of the Deep Web and used to fetch resources from the Internet. Second, the fetched web resources are preprocessed so that non-standard pages are normalized. Finally, the entity data is located and extracted: based on the hierarchy and layout features of the DOM tree, XPath is combined with regular expressions (RegExp) to locate the entity data, and the extracted entity attributes and attribute values are then stored. Experiments show that the method locates and extracts entity data from the Deep Web quickly and efficiently, and achieves higher accuracy.
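
The locate-and-extract step described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the sample page, the XPath expression, the regular expression, and the function name `extract_entity` are all hypothetical, and the page is assumed to have already been normalized into well-formed markup by the preprocessing step. XPath narrows the search to the data region through the DOM hierarchy, and a regular expression then splits each located node into an attribute name and its value.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical result page, assumed to be already normalized (well-formed).
PAGE = """
<html>
  <body>
    <div class="nav">Home | Search</div>
    <div class="result">
      <table>
        <tr><td>Name: Kunming Lake Hotel</td></tr>
        <tr><td>Phone: 0871-5550100</td></tr>
        <tr><td>Price: 320 CNY</td></tr>
      </table>
    </div>
  </body>
</html>
"""

# Assumed XPath: uses the DOM hierarchy to reach the cells of the data region
# and skip layout blocks such as the navigation bar.
CELL_XPATH = ".//div[@class='result']/table/tr/td"

# Assumed pattern: "attribute: value" inside a single cell.
PAIR_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def extract_entity(page: str) -> dict:
    """Return the attribute/value pairs found in the located data region."""
    root = ET.fromstring(page.strip())
    entity = {}
    for cell in root.findall(CELL_XPATH):
        text = "".join(cell.itertext())
        match = PAIR_RE.match(text)
        if match:  # cells that do not look like "attribute: value" are skipped
            entity[match.group(1)] = match.group(2)
    return entity


entity = extract_entity(PAGE)
print(entity)
```

The sketch uses only the limited XPath subset of the Python standard library. Real Deep Web pages would likely need a more tolerant HTML parser and site-specific expressions, which the abstract does not specify.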

Keywords

Internet; document handling; information retrieval; DOM tree; RegExp; Web crawler; Web resources; XPath; attribute values; deep Web; document object model; embedded entity data access; entity attributes; entity data extraction; hidden information value; Crawlers; Data mining; Feature extraction; HTML; Standards; DOM; Deep Web; Entity Extraction

fLanguage

English

Publisher

IEEE

Conference_Titel

Control and Decision Conference (2014 CCDC), The 26th Chinese

Conference_Location

Changsha

Print_ISBN

978-1-4799-3707-3

Type

conf

DOI

10.1109/CCDC.2014.6853078

Filename

6853078