• DocumentCode
    1632692
  • Title

    Document Content Extraction Using Automatically Discovered Features

  • Author

    Wang, Sui-Yu ; Baird, Henry S. ; An, Chang

  • Author_Institution
    Comput. Sci. & Eng. Dept., Lehigh Univ., Bethlehem, PA, USA
  • fYear
    2009
  • Firstpage
    1076
  • Lastpage
    1080
  • Abstract
    We report an automatic feature discovery method that achieves results comparable to a manually chosen, larger feature set on a document image content extraction problem: the location and segmentation of regions containing handwriting and machine-printed text in document images. This approach is a greedy forward selection algorithm that iteratively constructs one linear feature at a time. The algorithm finds error clusters in the current feature space, then projects one tight cluster into the null space of the feature mapping, where a new feature that helps to classify these errors can be discovered. We conducted experiments on 87 diverse test images. Four manually chosen linear features with an error rate of 16.2% were given to the algorithm; the algorithm then found an additional ten features; the combined 14 features achieve an error rate of 13.8%. This outperforms a feature set of size 14 chosen by principal component analysis (PCA), which has an error rate of 15.4%. It also nearly matches the error rate of 13.6% achieved by twice as many manually chosen features. Thus our algorithm appears to compete with both the widely used PCA method and tedious and expensive trial-and-error manual exploration.
  • Keywords
    document image processing; feature extraction; greedy algorithms; handwritten character recognition; image classification; image segmentation; iterative methods; principal component analysis; text analysis; PCA; automatic feature discovery; document image content extraction; feature mapping; greedy forward selection algorithm; handwriting-machine-printed text; iterative algorithm; region location; region segmentation; Clustering algorithms; Error analysis; Filters; Handwriting recognition; Image analysis; Iterative algorithms; Null space; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
  • Conference_Location
    Barcelona
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4244-4500-4
  • Electronic_ISBN
    1520-5363
  • Type

    conf

  • DOI
    10.1109/ICDAR.2009.198
  • Filename
    5277493
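
The null-space step described in the abstract can be illustrated with a minimal sketch. It assumes the current features are linear (each one a weight vector applied to a sample), so the null space of the feature mapping is the orthogonal complement of the span of those weight vectors. The function names and the toy data are hypothetical, and `candidate_feature` uses a simple difference-of-means rule chosen only for illustration; the abstract does not say how the authors derive the new feature inside the null space.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(rows, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the feature vectors."""
    basis = []
    for r in rows:
        v = list(r)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        if norm > tol:  # skip vectors that are linearly dependent
            basis.append([vi / norm for vi in v])
    return basis

def project_to_null_space(features, x):
    """Remove from x every component the current linear features can see."""
    v = list(x)
    for q in orthonormalize(features):
        c = dot(v, q)
        v = [vi - c * qi for vi, qi in zip(v, q)]
    return v

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def candidate_feature(features, cluster_a, cluster_b):
    """Assumed rule (not from the paper): unit difference of the two
    class means, computed after projecting into the null space."""
    pa = [project_to_null_space(features, x) for x in cluster_a]
    pb = [project_to_null_space(features, x) for x in cluster_b]
    d = [a - b for a, b in zip(mean(pa), mean(pb))]
    norm = math.sqrt(dot(d, d))
    if norm == 0:
        return None  # the clusters coincide in the null space
    return [di / norm for di in d]

# Toy error cluster: both classes look identical to the single existing
# feature (first coordinate) but differ along the second coordinate.
features = [[1.0, 0.0, 0.0]]
cluster_a = [[0.5, 1.0, 0.0], [0.5, 1.2, 0.0]]
cluster_b = [[0.5, -1.0, 0.0], [0.5, -0.8, 0.0]]
new_feature = candidate_feature(features, cluster_a, cluster_b)
```

Because the candidate is built from projected samples, it is orthogonal to every existing feature, so it can only add information the current feature space lacks. In a greedy forward loop, the candidate would be appended to `features` and the classifier re-evaluated before searching for the next error cluster.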