• DocumentCode
    773405
  • Title
    A scalable hybrid approach for extracting head components from Web tables
  • Author
    Jung, Sung-Won ; Kwon, Hyuk-Chul
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Pusan Nat. Univ., South Korea
  • Volume
    18
  • Issue
    2
  • fYear
    2006
  • Firstpage
    174
  • Lastpage
    187
  • Abstract
    We have established a preprocessing method for determining the meaningfulness of a table to allow for information extraction from tables on the Internet. A table offers a preeminent clue in text mining because it contains meaningful data displayed in rows and columns. However, tables are used on the Internet for both knowledge structuring and document design. Therefore, we were interested in determining whether or not a table has meaningfulness that is related to the structural information provided at the abstraction level of the table head. Accordingly, we: 1) investigated the types of tables present in HTML documents, 2) established the features that distinguished meaningful tables from others, 3) constructed a training data set using the established features after having filtered any obvious decorative tables, and 4) constructed a classification model using a decision tree. Based on these features, we set up heuristics for table head extraction from meaningful tables, and obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.
  • Keywords
    Internet; data analysis; data mining; decision trees; hypermedia markup languages; pattern classification; table lookup; text analysis; F-measure; HTML document; Internet table; Web table; classification model; decision tree; document design; information extraction; knowledge structuring; table head component extraction; table mining; text mining; Abstracts; Classification tree analysis; HTML; Information analysis; Natural languages; Training data
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2006.19
  • Filename
    1563981
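
The abstract reports two kinds of scores: an F-measure for separating meaningful tables from decorative ones, and an accuracy for table head extraction. As a minimal sketch of how such scores are conventionally computed, the Python below may help. The function names and the counts are hypothetical and introduced only for illustration; they are not taken from the paper, and the balanced setting (beta = 1) is an assumption, since the abstract does not state which weighting was used.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Return the F-measure from true positives, false positives and false negatives.

    beta = 1.0 gives the balanced F1 score (harmonic mean of precision and recall).
    """
    if tp == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def accuracy(correct: int, total: int) -> float:
    """Return the fraction of items handled correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


if __name__ == "__main__":
    # Hypothetical counts for a meaningful-vs-decorative table classifier.
    tp, fp, fn = 90, 10, 10
    print(f"F-measure: {f_measure(tp, fp, fn):.3f}")

    # Hypothetical counts for head extraction on meaningful tables.
    print(f"Accuracy:  {accuracy(41, 50):.3f}")
```

The F-measure suits the classification step because it ignores true negatives, so a large pool of decorative tables does not inflate the score. Plain accuracy suits the extraction step, where each meaningful table is simply right or wrong.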