DocumentCode
773405
Title
A scalable hybrid approach for extracting head components from Web tables
Author
Jung, Sung-Won ; Kwon, Hyuk-Chul
Author_Institution
Dept. of Comput. Sci. & Eng., Pusan Nat. Univ., South Korea
Volume
18
Issue
2
fYear
2006
Firstpage
174
Lastpage
187
Abstract
We present a preprocessing method that determines whether a table is meaningful, as a prerequisite for extracting information from tables on the Internet. A table is a valuable clue in text mining because it displays meaningful data in rows and columns. However, tables on the Internet are used for both knowledge structuring and document design. We were therefore interested in whether a table is meaningful in a way that relates to the structural information provided at the abstraction level of the table head. Accordingly, we: 1) investigated the types of tables present in HTML documents, 2) established the features that distinguish meaningful tables from others, 3) constructed a training data set using these features after filtering out obviously decorative tables, and 4) constructed a classification model using a decision tree. Based on these features, we also set up heuristics for extracting the table head from meaningful tables. We obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.
Keywords
Internet; data analysis; data mining; decision trees; hypermedia markup languages; pattern classification; table lookup; text analysis; F-measure; HTML document; Internet table; Web table; classification model; decision tree; document design; information extraction; knowledge structuring; table head component extraction; table mining; text mining; Abstracts; Classification tree analysis; Data mining; Decision trees; HTML; Information analysis; Internet; Natural languages; Text mining; Training data
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2006.19
Filename
1563981