DocumentCode :
2507603
Title :
Extracting structured data from Web pages (Poster)
Author :
Arasu, Arvind ; Garcia-Molina, Hector
Author_Institution :
Stanford Univ., CA, USA
fYear :
2003
fDate :
5-8 March 2003
Firstpage :
698
Abstract :
Many Web sites contain a large collection of "structured" Web pages. These pages encode data from an underlying structured source, and are typically generated dynamically. Our goal is to automatically extract structured data from such a collection of pages, without any human input such as manually generated rules or training sets. Extracting structured data gives us greater querying power over the data and is useful in information integration systems. Our approach consists of two stages. In the first stage, the unknown template used to create the pages is deduced. In the second stage, the deduced template is used to extract the values. We focus on the first stage since it is more challenging. The full version contains a formal definition of high occurrence correlation and our algorithm. We evaluated our approach on 9 real collections of pages.
Keywords :
Internet; Web sites; query formulation; query processing; Web pages; formal specification; high occurrence correlation; structured data extraction; unknown template deduction; Algorithm design and analysis; Atomic measurements; Books; Character generation; Data mining; Encoding; Humans
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering, 2003. Proceedings. 19th International Conference on
Print_ISBN :
0-7803-7665-X
Type :
conf
DOI :
10.1109/ICDE.2003.1260839
Filename :
1260839