Regional Information Center for Science and Technology (RICeST) - Data Extraction from Web Tables: The Devil is in the Details

DocumentCode :

2143328

Title :

Data Extraction from Web Tables: The Devil is in the Details

Author :

Nagy, G. ; Seth, Sachin ; Dongpu Jin ; Embley, David W. ; Machado, S. ; Krishnamoorthy, Mohan

Author_Institution :

Electr., Comput., & Syst. Eng., Rensselaer Polytech. Inst., Troy, NY, USA

fYear :

2011

fDate :

18-21 Sept. 2011

Firstpage :

242

Lastpage :

246

Abstract :

We present a method based on header paths for efficient and complete extraction of labeled data from tables meant for humans. Although many table configurations yield to the proposed syntactic analysis, some require access to semantic knowledge. Clicking on one or two critical cells per table, through a simple interface, is sufficient to resolve most of these problem tables. Header paths, a purely syntactic representation of visual tables, can be transformed ("factored") into existing representations of structured data such as category trees, relational tables, and RDF triples. From a random sample of 200 web tables from ten large statistical web sites, we generated 376 relational tables and 34,110 subject-predicate-object RDF triples.
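As a rough illustration of the idea in the abstract, the sketch below shows how a table described by row and column header paths might be flattened into subject-predicate-object triples. It is not the authors' algorithm: the function name `header_paths_to_triples`, the predicate names (`rowPath`, `colPath`, `value`), the cell identifiers, and the toy table contents are all assumptions made for this example.

```python
from itertools import product

# Hypothetical toy table (placeholder values, not data from the paper).
# Each header path is the sequence of header labels leading to a row or column.
row_paths = [("Canada",), ("Mexico",)]
col_paths = [("Population", "2010"), ("Population", "2011")]
values = [[1, 2],
          [3, 4]]


def header_paths_to_triples(row_paths, col_paths, values):
    """Emit (subject, predicate, object) triples for every data cell.

    Assumed scheme: each cell gets a synthetic subject identifier, and its
    row path, column path, and value become three separate triples.
    """
    triples = []
    for (i, rp), (j, cp) in product(enumerate(row_paths), enumerate(col_paths)):
        subject = f"cell_r{i}_c{j}"  # hypothetical cell identifier
        triples.append((subject, "rowPath", "/".join(rp)))
        triples.append((subject, "colPath", "/".join(cp)))
        triples.append((subject, "value", values[i][j]))
    return triples


if __name__ == "__main__":
    for triple in header_paths_to_triples(row_paths, col_paths, values):
        print(triple)
```

Under this assumed scheme, a table with R row paths and C column paths yields 3 × R × C triples; the same path lists could equally be used as the key columns of a relational table.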

Keywords :

Web sites; data mining; semantic Web; Web tables; data extraction; header paths; relational tables; semantic knowledge; statistical Web sites; subject-predicate-object RDF triples; syntactic analysis; syntactic representation; Educational institutions; HTML; Indexing; Resource description framework; RDF; header-paths; relational table; visual table

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition (ICDAR), 2011 International Conference on

Conference_Location :

Beijing

ISSN :

1520-5363

Print_ISBN :

978-1-4577-1350-7

Electronic_ISBN :

1520-5363

Type :

conf

DOI :

10.1109/ICDAR.2011.57

Filename :

6065312

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2143328