Web Data Extraction Based on Simple Tree Matching

Author

Wang, Hua ; Zhang, Yang

Author_Institution

Coll. of Inf. Eng., Northwest A&F Univ., Yangling, China

Volume

2

fYear

2010

fDate

14-15 Aug. 2010

Firstpage

15

Lastpage

18

Abstract

The volume of information on the Internet has grown exponentially, and Internet users are overwhelmed by it. Automatically extracting useful information from relevant pages, so as to provide users with a convenient and fast information query platform, is therefore an important problem. In this paper, we present a Web data extraction method based on the simple tree matching algorithm, which works by analyzing the structure and content of Web documents. Experimental results on Web data from several well-known websites show that the proposed method can effectively extract data records from similar Web pages, with an extraction precision of about 90%, and can meet the requirements of accurate data extraction in real-life applications.
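The abstract names simple tree matching without describing it. As a rough illustration only, the following is a minimal sketch of the textbook simple tree matching recurrence applied to tag trees. It is not the authors' implementation: the `Node` class, the function names, and the normalization used in `similarity` are assumptions made for this example, and the paper's handling of DOM content and XPath is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum top-down ordered matching."""
    # Trees whose roots carry different labels do not match at all.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j] = best matching between the first i subtrees of a
    # and the first j subtrees of b (dynamic programming over child order).
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i][j - 1], best[i - 1][j], best[i - 1][j - 1] + w)
    # The +1 counts the matched pair of roots.
    return best[m][n] + 1


def size(node: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(size(c) for c in node.children)


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched pairs divided by the average tree size."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two table rows that differ by one extra cell.
record_a = Node("tr", [Node("td", [Node("a")]), Node("td")])
record_b = Node("tr", [Node("td", [Node("a")]), Node("td"), Node("td")])
print(simple_tree_matching(record_a, record_b))  # 4
print(round(similarity(record_a, record_b), 3))  # 0.889
```

In a sketch like this, two subtrees would be treated as instances of the same data record when their similarity exceeds a chosen threshold; the threshold value is a tuning choice that the abstract does not state.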

Keywords

Web services; data mining; query processing; trees (mathematics); Internet; Web data extraction method; Web documents; Web pages; Web sites; information query platform; simple tree matching algorithm; Artificial intelligence; Books; Data mining; Feature extraction; HTML; Heuristic algorithms; Web pages; DOM; Information Extraction; Simple tree matching; XPath;

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Engineering (ICIE), 2010 WASE International Conference on

Conference_Location

Beidaihe, Hebei

Print_ISBN

978-1-4244-7506-3

Electronic_ISBN

978-1-4244-7507-0

Type

conf

DOI

10.1109/ICIE.2010.100

Filename

5571205