  • DocumentCode
    2015956
  • Title
    Table Recognition and Understanding from PDF Files
  • Author
    Hassan, Tamir ; Baumgartner, Robert
  • Author_Institution
    Vienna University of Technology, Vienna
  • Volume
    2
  • fYear
    2007
  • fDate
    23-26 Sept. 2007
  • Firstpage
    1143
  • Lastpage
    1147
  • Abstract
    We propose a flexible method for detecting and understanding tables in PDF files, which is not reliant upon one particular feature being present, for example ruling lines or indentations, and is therefore applicable to a wide variety of visual presentations. We describe the steps required in transforming the low-level PDF instructions into text segments, lines and boxes on a page. We propose three different classifications for published tables, and develop methods to detect these tables and correctly identify their respective rows and columns. We also explain how to recognize spanning rows and columns, and multi-line rows. Experimental results show that our algorithm is effective in converting a wide variety of tabular presentations into HTML for information extraction purposes.
  • Keywords
    image recognition; image segmentation; text analysis; PDF file; low-level PDF instructions; multiline rows; published tables; table classifications; table recognition; text segment; visual presentation; Data mining; HTML; Humans; Image segmentation; Information systems; Manuals; Natural languages; Printing; Tagging; Wrapping
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
  • Conference_Location
    Curitiba, Paraná, Brazil
  • ISSN
    1520-5363
  • Print_ISBN
    978-0-7695-2822-9
  • Type
    conf
  • DOI
    10.1109/ICDAR.2007.4377094
  • Filename
    4377094