Title :
Structured and semantic data extraction from Web pages
Author :
Gan, Yong ; Zhang, Su-Zhi
Author_Institution :
Sch. of Electron. & Inf. Eng., Xi'an Jiaotong Univ., China
Abstract :
With the development of the Internet, the Web has become an invaluable information source. To use this information for more than human browsing, Web pages in HTML must be converted into a format meaningful to software programs. Wrappers are a useful technique for converting HTML documents into semantically meaningful XML files. In this paper, we propose a data extraction approach based on a user-predefined schema; it automatically generates a wrapper that extracts data from an HTML document and produces an XML document conforming to a given DTD. After the user defines the extraction data schema in the form of a DTD, the wrapper is generated automatically by an induction and learning algorithm. Experiments indicate that the approach can extract the required data from the source document with high accuracy.
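As a rough illustration of what a wrapper does, the sketch below maps a small HTML table onto XML elements named after a schema. It is not the paper's method: the extraction rules here are written by hand, whereas the paper induces them automatically. The schema (`books`, `book`, `title`, `price`), the `RULES` table, the class names `t` and `p`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical target schema, written as a DTD fragment:
#   <!ELEMENT books (book*)>
#   <!ELEMENT book (title, price)>
# Hypothetical hand-written rules: HTML class attribute -> schema element.
RULES = {"t": "title", "p": "price"}


class BookWrapper(HTMLParser):
    """Turns each <tr> into a <book> and each matching <td> into a field."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("books")
        self._record = None  # <book> element for the row being read
        self._field = None   # schema element currently collecting text

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "tr":
            self._record = ET.SubElement(self.root, "book")
        elif tag == "td" and self._record is not None and cls in RULES:
            self._field = ET.SubElement(self._record, RULES[cls])
            self._field.text = ""

    def handle_data(self, data):
        # Only text inside a cell matched by a rule is kept.
        if self._field is not None:
            self._field.text += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr":
            self._record = None


def extract(html):
    """Run the wrapper over an HTML string and return the XML as text."""
    wrapper = BookWrapper()
    wrapper.feed(html)
    wrapper.close()
    return ET.tostring(wrapper.root, encoding="unicode")


PAGE = '<table><tr><td class="t">Dune</td><td class="p">9.99</td></tr></table>'
print(extract(PAGE))
```

The sketch does not validate its output against the DTD; the Python standard library has no DTD validator, so that step would need an external tool.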
Keywords :
Internet; Web sites; XML; information retrieval; HTML document; Web pages; World Wide Web; XML document; XML files; document type definition; learning algorithm; semantic data extraction; software programs; structured data extraction; wrappers technique; Data mining; Distributed databases; HTML; Humans; Induction generators; Object oriented databases; Relational databases
Conference_Title :
Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on
Print_ISBN :
0-7803-8403-2
DOI :
10.1109/ICMLC.2004.1378533