Regional Information Center for Science and Technology - Structured Data Extraction from the Web Based on Partial Tree Alignment

DocumentCode :

802521

Title :

Structured Data Extraction from the Web Based on Partial Tree Alignment

Author :

Zhai, Yanhong ; Liu, Bing

Author_Institution :

Dept. of Comput. Sci., Illinois Univ., Chicago, IL

Volume :

Issue :

Year :

2006

Firstpage :

1614

Lastpage :

1628

Abstract :

This paper studies the problem of structured data extraction from arbitrary Web pages. The objective of the proposed research is to automatically segment data records in a page, extract data items/fields from these records, and store the extracted data in a database. Existing methods addressing the problem can be classified into three categories. Methods in the first category provide languages that facilitate the construction of data extraction systems. Methods in the second category use machine learning techniques to learn wrappers (which are data extraction programs) from human-labeled examples. Manual labeling is time-consuming and hard to scale to a large number of sites on the Web. Methods in the third category are based on the idea of automatic pattern discovery. However, they usually need multiple pages that conform to a common schema as input. In this paper, we propose a novel and effective technique (called DEPTA) to perform the task of Web data extraction automatically. The method consists of two steps: 1) identifying individual records in a page and 2) aligning and extracting data items from the identified records. For step 1, a method based on visual information and tree matching is used to segment data records. For step 2, a novel partial alignment technique is proposed. This method aligns only those data items in a pair of records that can be aligned with certainty, making no commitment on the rest of the items. Experimental results obtained using a large number of Web pages from diverse domains show that the proposed two-step technique is highly effective.
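
The abstract names tree matching as the basis for comparing candidate records but gives no details. As a rough illustration only, the sketch below uses the classic "simple tree matching" dynamic program on ordered, labeled trees; the tuple-based tree representation, the function names, and the normalization by average tree size are assumptions made for this example and are not taken from the record.

```python
# Illustrative sketch only: a tree node is assumed to be a (tag, children) tuple.

def simple_tree_matching(a, b):
    """Return the size of a maximum matching between two ordered, labeled trees.

    Nodes may only match nodes with the same tag, matched children must keep
    their left-to-right order, and no node is matched across levels.
    """
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    # table[i][j] = best matching between the first i children of a
    # and the first j children of b
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return table[m][n] + 1  # +1 counts the matched roots


def tree_size(t):
    """Number of nodes in the tree."""
    return 1 + sum(tree_size(c) for c in t[1])


def normalized_similarity(a, b):
    """Matching size divided by the average tree size (assumed normalization)."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


# Two hypothetical record subtrees that differ by one optional item.
rec1 = ("tr", [("td", [("img", [])]), ("td", [("a", []), ("b", [])])])
rec2 = ("tr", [("td", [("img", [])]), ("td", [("a", [])])])

print(simple_tree_matching(rec1, rec2))             # 5
print(round(normalized_similarity(rec1, rec2), 3))  # 0.909
```

Under these assumptions, two subtrees whose normalized similarity exceeds a chosen threshold would be treated as instances of the same record type; the threshold itself is not stated in the abstract.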

Keywords :

Internet; database management systems; information retrieval; learning (artificial intelligence); storage management; tree data structures; Web mining; Web pages; automatic pattern discovery; data records; database; machine learning techniques; partial tree alignment; structured Web data extraction; tree matching; visual information; Books; Data mining; Databases; HTML; Humans; Labeling; Machine learning; Web data extraction; wrapper generation

Language :

English

Journal_Title :

Knowledge and Data Engineering, IEEE Transactions on

Publisher :

IEEE

ISSN :

1041-4347

Type :

jour

DOI :

10.1109/TKDE.2006.197

Filename :

1717419

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=802521