Document Content Extraction Using Automatically Discovered Features

Author

Wang, Sui-Yu ; Baird, Henry S. ; An, Chang

Author_Institution

Comput. Sci. & Eng. Dept., Lehigh Univ., Bethlehem, PA, USA

fYear

2009

Firstpage

1076

Lastpage

1080

Abstract

We report an automatic feature discovery method that achieves results comparable to a manually chosen, larger feature set on a document image content extraction problem: the location and segmentation of regions containing handwriting and machine-printed text in document images. This approach is a greedy forward selection algorithm that iteratively constructs one linear feature at a time. The algorithm finds error clusters in the current feature space, then projects one tight cluster into the null space of the feature mapping, where a new feature that helps to classify these errors can be discovered. We conducted experiments on 87 diverse test images. Four manually chosen linear features with an error rate of 16.2% were given to the algorithm; the algorithm then found an additional ten features; the composite 14 features achieve an error rate of 13.8%. This outperforms a feature set of size 14 chosen by principal component analysis (PCA) with an error rate of 15.4%. It also nearly matches the error rate of 13.6% achieved by twice as many manually chosen features. Thus our algorithm appears to compete with both the widely used PCA method and tedious and expensive trial-and-error manual exploration.
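
The null-space step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the current feature mapping is a matrix F whose rows are linear features, that one tight cluster of misclassified samples has already been found, and that the new feature is taken as the difference of the two class means after projection. That last choice is a stand-in picked here for simplicity, since the abstract does not say how the new feature is derived. All function and variable names are hypothetical.

```python
import numpy as np


def null_space_projector(F):
    """Return P = I - pinv(F) @ F, the orthogonal projector onto null(F).

    F has shape (k, d): k linear features over d-dimensional samples.
    """
    d = F.shape[1]
    return np.eye(d) - np.linalg.pinv(F) @ F


def discover_feature(F, X_err, y_err):
    """Propose one new linear feature from a cluster of misclassified samples.

    X_err: (n, d) samples in one error cluster; y_err: (n,) labels in {0, 1}.
    Assumption (not from the paper): the candidate feature is the difference
    of class means in the null space of F.
    """
    P = null_space_projector(F)
    Z = X_err @ P  # P is symmetric, so each row is a projected sample
    w = Z[y_err == 1].mean(axis=0) - Z[y_err == 0].mean(axis=0)
    norm = np.linalg.norm(w)
    if norm < 1e-12:
        return None  # the classes do not separate in the null space
    return w / norm


def extend_features(F, X_err, y_err):
    """One greedy step: append the discovered feature to F, if any."""
    w = discover_feature(F, X_err, y_err)
    return F if w is None else np.vstack([F, w])


if __name__ == "__main__":
    # Two existing features that ignore the third coordinate.
    F = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
    X_err = np.array([[0.0, 0.0, 1.0],
                      [0.0, 0.0, 3.0],
                      [1.0, 1.0, -1.0],
                      [2.0, 0.0, -2.0]])
    y_err = np.array([1, 1, 0, 0])
    print(extend_features(F, X_err, y_err))
```

Because the new direction lies in the null space of F, it is orthogonal to every existing feature, so it can only add information the current feature space cannot see. In the greedy loop the abstract describes, a step like this would be repeated, with the classifier retrained and the error clusters recomputed after each new feature.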

Keywords

document image processing; feature extraction; greedy algorithms; handwritten character recognition; image classification; image segmentation; iterative methods; principal component analysis; text analysis; PCA; automatic feature discovery; document image content extraction; feature mapping; greedy forward selection algorithm; handwriting-machine-printed text; image classification; iterative algorithm; principal component analysis; region location; region segmentation; Clustering algorithms; Error analysis; Filters; Handwriting recognition; Image analysis; Iterative algorithms; Null space; Principal component analysis; Testing; Text analysis;

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on

Conference_Location

Barcelona

ISSN

1520-5363

Print_ISBN

978-1-4244-4500-4

Electronic_ISBN

1520-5363

Type

conf

DOI

10.1109/ICDAR.2009.198

Filename

5277493