Title :
Learning the characteristics of critical cells from web tables
Author_Institution :
Rensselaer Polytech. Inst., Troy, NY, USA
Abstract :
Critical Cells (CCs) are identified to partition a web table into mutually exclusive regions of stub, column header, row header, data, and neutral cells. Every table cell (including titles and footnotes that lie outside the table proper but usually within the HTML table tags) is classified into one of six classes based on cell features extracted from the target cell and its eight neighbors. Changing the domain of maximization over posteriors results in the assignment of exactly four CCs to each table. The average number of interactions required for error-free table data extraction can be reduced by more than 75% by alternating between graphic interaction and auto-assignment.
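The two mechanisms named in the abstract, features drawn from a cell and its eight neighbors, and a change in the domain of maximization over posteriors, can be illustrated with a minimal sketch. This is one plausible reading, not the authors' implementation: the label names ("CC1" to "CC4", "other"), the function names, the padding value, and the toy posteriors are all assumptions made for illustration. The sketch contrasts choosing the best class for each cell with choosing the best cell for each CC class, which by construction yields exactly four assignments per table.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column); hypothetical cell address

def neighborhood(grid: List[List[str]], r: int, c: int,
                 pad: Optional[str] = None) -> List[Optional[str]]:
    """Return the 3x3 window around (r, c) in row-major order:
    the target cell plus its eight neighbors. Positions that fall
    outside the table are filled with `pad` (an assumed convention)."""
    rows, cols = len(grid), len(grid[0])
    window = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < rows and 0 <= cc < cols
            window.append(grid[rr][cc] if inside else pad)
    return window

def assign_per_cell(post: Dict[Cell, Dict[str, float]]) -> Dict[Cell, str]:
    """Maximize over classes for each cell. A table may end up with
    fewer or more than four cells labeled as CCs."""
    return {cell: max(p, key=p.get) for cell, p in post.items()}

def assign_per_class(post: Dict[Cell, Dict[str, float]],
                     cc_labels: List[str]) -> Dict[str, Cell]:
    """Maximize over cells for each CC class, so every CC label is
    assigned to exactly one cell. Ties and two labels landing on the
    same cell are not treated specially in this sketch."""
    return {lab: max(post, key=lambda cell: post[cell][lab])
            for lab in cc_labels}

# Toy posteriors for a 2x2 table (made-up numbers, not from the paper).
toy_post = {
    (0, 0): {"CC1": 0.4, "CC2": 0.1, "CC3": 0.1, "CC4": 0.1, "other": 0.3},
    (0, 1): {"CC1": 0.1, "CC2": 0.3, "CC3": 0.1, "CC4": 0.1, "other": 0.4},
    (1, 0): {"CC1": 0.1, "CC2": 0.1, "CC3": 0.3, "CC4": 0.1, "other": 0.4},
    (1, 1): {"CC1": 0.1, "CC2": 0.1, "CC3": 0.1, "CC4": 0.2, "other": 0.5},
}
CC_LABELS = ["CC1", "CC2", "CC3", "CC4"]
```

With these toy values, per-cell maximization labels only one cell as a CC, because "other" wins in the remaining three cells, while per-class maximization returns one cell for each of the four CC labels.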
Keywords :
Internet; feature extraction; learning (artificial intelligence); CC; Web tables; auto-assignment; cell-feature extraction; column header; critical cell characteristics; error-free table data extraction; graphic interaction; learning; neutral cells; row header; stub; Algorithm design and analysis; Data mining; Feature extraction; HTML; Training; Visualization;
Conference_Title :
Pattern Recognition (ICPR), 2012 21st International Conference on
Conference_Location :
Tsukuba
Print_ISBN :
978-1-4673-2216-4