DocumentCode
2146573
Title
Text Classification and Document Layout Analysis of Paper Fragments
Author
Diem, Markus ; Kleber, Florian ; Sablatnig, Robert
Author_Institution
Comput. Vision Lab., Vienna Univ. of Technol., Vienna, Austria
fYear
2011
fDate
18-21 Sept. 2011
Firstpage
854
Lastpage
858
Abstract
In general, document image analysis methods are pre-processing steps for Optical Character Recognition (OCR) systems. In contrast, the proposed method aims at clustering document snippets so that documents can be clustered automatically. To this end, words are classified as printed text, manuscript, or noise, where the third class corrects falsely segmented background elements. Once the text elements are classified, a layout analysis groups words into text lines and paragraphs. A back propagation of the class weights, which are assigned to each word in the first step, allows wrong class labels to be corrected. The proposed method shows promising results on a dataset consisting of document snippets with varying shapes, content, writing, and layout. In addition, the system is compared to page segmentation methods of the ICDAR 2009 Page Segmentation Competition.
Keywords
document image processing; image classification; image segmentation; optical character recognition; pattern clustering; text analysis; back propagation; content writing; document image analysis methods; document layout analysis; document snippet clustering; manuscripts; optical character recognition system; paper fragment; printed text classification; Feature extraction; Image segmentation; Layout; Noise; Noise measurement; Optical character recognition software; Text analysis; layout analysis; local features; text classification;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location
Beijing
ISSN
1520-5363
Print_ISBN
978-1-4577-1350-7
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2011.175
Filename
6065432