DocumentCode :
2146573
Title :
Text Classification and Document Layout Analysis of Paper Fragments
Author :
Diem, Markus ; Kleber, Florian ; Sablatnig, Robert
Author_Institution :
Comput. Vision Lab., Vienna Univ. of Technol., Vienna, Austria
fYear :
2011
fDate :
18-21 Sept. 2011
Firstpage :
854
Lastpage :
858
Abstract :
In general, document image analysis methods are pre-processing steps for Optical Character Recognition (OCR) systems. In contrast, the proposed method aims at clustering document snippets, so that an automated clustering of documents can be performed. To this end, words are classified as printed text, manuscript, or noise, where the third class corrects falsely segmented background elements. Once the text elements have been classified, a layout analysis groups words into text lines and paragraphs. A back propagation of the class weights, assigned to each word in the first step, enables the correction of wrong class labels. The proposed method shows promising results on a dataset consisting of document snippets with varying shapes, content, writing, and layout. In addition, the system is compared to page segmentation methods of the ICDAR 2009 Page Segmentation Competition.
Keywords :
document image processing; image classification; image segmentation; optical character recognition; pattern clustering; text analysis; back propagation; content writing; document image analysis methods; document layout analysis; document snippet clustering; manuscripts; optical character recognition system; paper fragment; printed text classification; Feature extraction; Image segmentation; Layout; Noise; Noise measurement; Optical character recognition software; Text analysis; layout analysis; local features; text classification;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
ISSN :
1520-5363
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2011.175
Filename :
6065432