DocumentCode :
153380
Title :
End-to-End Conversion of HTML Tables for Populating a Relational Database
Author :
Nagy, G. ; Embley, David W. ; Seth, Sharad
Author_Institution :
Rensselaer Polytech. Inst., Troy, NY, USA
fYear :
2014
fDate :
7-10 April 2014
Firstpage :
222
Lastpage :
226
Abstract :
Automating the conversion of human-readable HTML tables into machine-readable relational tables will enable end-user query processing of the millions of data tables found on the web. Theoretically sound and experimentally successful methods for index-based segmentation, extraction of category hierarchies, and construction of a canonical table suitable for direct input to a relational database are demonstrated on 200 heterogeneous web tables. The methods are scalable: the program generates the 198 Access-compatible CSV files in ~0.1 s per table (two of the 200 tables could not be indexed).
Keywords :
Internet; hypermedia markup languages; indexing; query processing; relational databases; Access compatible CSV files; World Wide Web; canonical table; category hierarchy extraction; end-to-end conversion; end-user query processing; heterogeneous Web tables; human-readable HTML tables; index-based segmentation; machine-readable relational tables; relational database; Educational institutions; HTML; Indexing; Layout; Text analysis; Wang category; canonical relational table; header cross-product; header factoring; table index; table segmentation;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on
Conference_Location :
Tours
Print_ISBN :
978-1-4799-3243-6
Type :
conf
DOI :
10.1109/DAS.2014.9
Filename :
6831002