Title :
Extracting structured data from Web pages (Poster)
Author :
Arasu, Arvind ; Garcia-Molina, Hector
Author_Institution :
Stanford Univ., CA, USA
Abstract :
Many Web sites contain a large collection of "structured" Web pages. These pages encode data from an underlying structured source and are typically generated dynamically. Our goal is to automatically extract structured data from such a collection of pages, without any human input such as manually generated rules or training sets. Extracting structured data gives us greater querying power over the data and is useful in information integration systems. Our approach consists of two stages. In the first stage, the unknown template used to create the pages is deduced. In the second stage, the deduced template is used to extract the values. We focus on the first stage since it is more challenging. The full version contains a formal definition of high occurrence correlation and our algorithm. We evaluated our approach on 9 real collections of pages.
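The two stages named in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes, purely for illustration, that a template token is one that occurs exactly once in every page and that template tokens appear in the same order on all pages. The function names `tokenize`, `deduce_template`, and `extract_values` and the sample pages are hypothetical.

```python
import re
from collections import Counter


def tokenize(page):
    # Split a page into HTML tags and runs of non-space text between tags.
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def deduce_template(pages):
    # Stage 1 (toy assumption): a token belongs to the template if it
    # occurs exactly once in every page of the collection.
    token_lists = [tokenize(p) for p in pages]
    counts = [Counter(tokens) for tokens in token_lists]
    return [tok for tok in token_lists[0]
            if all(c[tok] == 1 for c in counts)]


def extract_values(page, template):
    # Stage 2 (toy assumption): the values are the token runs that sit
    # between consecutive template tokens, read in page order.
    values, current = [], []
    remaining = list(template)
    for tok in tokenize(page):
        if remaining and tok == remaining[0]:
            remaining.pop(0)
            if current:
                values.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical two-page collection generated from one template.
sample_pages = [
    "<html><b>Title:</b> Databases <i>Price:</i> 30</html>",
    "<html><b>Title:</b> Web Mining <i>Price:</i> 45</html>",
]
sample_template = deduce_template(sample_pages)
for sample_page in sample_pages:
    print(extract_values(sample_page, sample_template))
```

A real collection would break the "exactly once in every page" assumption quickly (repeated tags, optional fields, lists of varying length), which is consistent with the abstract's remark that template deduction is the harder stage.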
Keywords :
Internet; Web sites; query formulation; query processing; Web pages; formal specification; high occurrence correlation; structured data extraction; unknown template deduction; Algorithm design and analysis; Atomic measurements; Books; Character generation; Data mining; Encoding; Humans
Conference_Title :
Data Engineering, 2003. Proceedings. 19th International Conference on
Print_ISBN :
0-7803-7665-X
DOI :
10.1109/ICDE.2003.1260839