DocumentCode :
2028300
Title :
A novel approach for Web data extraction based on XML encoding
Author :
Nie, Tiezheng ; Shen, Derong ; Yu, Ge ; Shi, Zhong
Author_Institution :
Key Lab. of Med. Image Comput., Northeastern Univ., Shenyang, China
Volume :
5
fYear :
2010
fDate :
10-12 Aug. 2010
Firstpage :
2417
Lastpage :
2421
Abstract :
Extracting data from Web pages has been widely studied. In this paper, we present a novel approach that extracts data records from Web pages using XML encoding techniques. First, our approach converts a given Web data page into an XML document. Then, instead of using a DOM-based approach, we use an XML encoding model to transform the XML document into a linear sequence. Our algorithm identifies the data records of a Web page from this sequence, which avoids the complex matching between subtrees in the DOM model. Moreover, we address the problem of repetitive subparts in records and propose an algorithm for data alignment. Experimental results show that our approach can extract data records accurately from Web pages.
Keywords :
Web sites; XML; encoding; DOM-based approach; Web data extraction; Web pages; XML document; XML encoding; data alignment; linear sequence; subtrees; Algorithm design and analysis; Data mining; Feature extraction; HTML; Web data; data extraction
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on
Conference_Location :
Yantai, Shandong
Print_ISBN :
978-1-4244-5931-5
Type :
conf
DOI :
10.1109/FSKD.2010.5569297
Filename :
5569297