• DocumentCode
    2146573
  • Title
    Text Classification and Document Layout Analysis of Paper Fragments
  • Author
    Diem, Markus ; Kleber, Florian ; Sablatnig, Robert
  • Author_Institution
    Comput. Vision Lab., Vienna Univ. of Technol., Vienna, Austria
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    854
  • Lastpage
    858
  • Abstract
    In general, document image analysis methods are pre-processing steps for Optical Character Recognition (OCR) systems. In contrast, the proposed method aims at clustering document snippets, so that an automated clustering of documents can be performed. To this end, words are classified as printed text, manuscript, or noise, where the third class corrects falsely segmented background elements. Once the text elements are classified, a layout analysis groups the words into text lines and paragraphs. A back propagation of the class weights, which are assigned to each word in the first step, enables correcting wrong class labels. The proposed method shows promising results on a dataset consisting of document snippets with varying shapes, content, writing, and layout. In addition, the system is compared to page segmentation methods of the ICDAR 2009 Page Segmentation Competition.
  • Keywords
    document image processing; image classification; image segmentation; optical character recognition; pattern clustering; text analysis; back propagation; content writing; document image analysis methods; document layout analysis; document snippet clustering; manuscripts; optical character recognition system; paper fragment; printed text classification; Feature extraction; Image segmentation; Layout; Noise; Noise measurement; Optical character recognition software; Text analysis; layout analysis; local features; text classification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.175
  • Filename
    6065432