DocumentCode :
2489504
Title :
A robust technique for text extraction in mixed-type binary documents
Author :
Strouthopoulos, Charalambos ; Nikolaidis, Athanasios
Author_Institution :
Dept. of Inf. & Commun., Technol. Educ. Inst. of Serres, Serres
fYear :
2008
fDate :
8-11 Dec. 2008
Firstpage :
1
Lastpage :
4
Abstract :
Text extraction from mixed-type documents is a crucial preprocessing stage in applications such as OCR. Unlike most earlier approaches, the present work handles text of varying orientation and size. The technique first identifies marks by contour following, then applies a PCA (principal component analyzer) to determine the direction of the main axis of each mark. Next, a nearest-neighbor technique finds the shortest distances between marks, and a feature vector formed from the calculated mark dimensions and distances is fed into a SOFM (self-organizing feature map), which defines homogeneous mark clusters. The resulting cluster weights and variances are used to form a set of fuzzy rules, and a fuzzy classification scheme identifies each mark as a character or a non-character. The technique extracts text areas correctly and quickly in a variety of mixed-type documents.
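The two geometric steps named in the abstract (main-axis direction via PCA, and shortest distances between marks) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `main_axis_angle` and `nearest_neighbor_distances` are hypothetical, a mark is assumed to be given as a list of (x, y) pixel coordinates, and distances are assumed to be measured between mark centroids, which the abstract does not specify. For 2-D data the principal axis has a closed form, so no eigen-solver is needed.

```python
import math


def main_axis_angle(points):
    """Orientation (radians, in (-pi/2, pi/2]) of the first principal axis
    of a 2-D point set, from the closed-form eigenvector of its 2x2 covariance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    return 0.5 * math.atan2(2.0 * cxy, cxx - cyy)


def nearest_neighbor_distances(centroids):
    """For each centroid, the Euclidean distance to its closest other centroid.
    Assumes at least two centroids; brute force, O(n^2)."""
    result = []
    for i, (xi, yi) in enumerate(centroids):
        best = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(centroids)
            if j != i
        )
        result.append(best)
    return result


if __name__ == "__main__":
    # A hypothetical mark whose pixels lie along a 45-degree diagonal.
    diagonal_mark = [(0, 0), (1, 1), (2, 2), (3, 3)]
    print(math.degrees(main_axis_angle(diagonal_mark)))

    # Hypothetical centroids of three marks.
    print(nearest_neighbor_distances([(0, 0), (3, 4), (10, 0)]))
```

Per-mark values such as these (orientation, dimensions along the main axis, neighbor distances) are the kind of quantities the abstract describes as entering the feature vector; the SOFM clustering and fuzzy classification stages are not sketched here.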
Keywords :
document image processing; feature extraction; fuzzy reasoning; image classification; image segmentation; pattern clustering; principal component analysis; self-organising feature maps; text analysis; feature vector; fuzzy classification scheme; fuzzy rule set; homogeneous mark cluster; mixed-type binary document; nearest-neighbor technique; principal component analyzer; robust technique; self organizing feature map; text extraction; text orientation; Communications technology; Educational technology; Feature extraction; Fuzzy sets; Informatics; Karhunen-Loeve transforms; Nearest neighbor searches; Optical character recognition software; Principal component analysis; Robustness;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Pattern Recognition, 2008. ICPR 2008. 19th International Conference on
Conference_Location :
Tampa, FL
ISSN :
1051-4651
Print_ISBN :
978-1-4244-2174-9
Electronic_ISBN :
1051-4651
Type :
conf
DOI :
10.1109/ICPR.2008.4761820
Filename :
4761820