Regional Information Center for Science and Technology - Extracting structured data from Web pages (Poster)

DocumentCode :

2507603

Title :

Extracting structured data from Web pages (Poster)

Author :

Arasu, Arvind ; Garcia-Molina, Hector

Author_Institution :

Stanford Univ., CA, USA

fYear :

2003

fDate :

5-8 March 2003

Firstpage :

698

Abstract :

Many Web sites contain a large collection of "structured" Web pages. These pages encode data from an underlying structured source, and are typically generated dynamically. Our goal is to automatically extract structured data from such a collection of pages, without any human input such as manually generated rules or training sets. Extracting structured data gives us greater querying power over the data and is useful in information integration systems. Our approach consists of two stages. In the first stage, the unknown template used to create the pages is deduced. In the second stage, the deduced template is used to extract the values. We focus on the first stage since it is more challenging. The full version contains a formal definition of high occurrence correlation and our algorithm. We evaluated our approach by considering 9 real collections of pages.

Keywords :

Internet; Web sites; query formulation; query processing; Web pages; formal specification; high occurrence correlation; structured data extraction; unknown template deduction; Algorithm design and analysis; Atomic measurements; Books; Character generation; Data mining; Encoding; Humans

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Data Engineering, 2003. Proceedings. 19th International Conference on

Print_ISBN :

0-7803-7665-X

Type :

conf

DOI :

10.1109/ICDE.2003.1260839

Filename :

1260839

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2507603