• DocumentCode
    1585237
  • Title

    Encoding of modified X-Y trees for document classification

  • Author

    Cesarini, Francesca ; Lastri, Marco ; Marinai, Simone ; Soda, Giovanni

  • Author_Institution
    Dipt. di Sistemi e Inf., Firenze Univ., Florence, Italy
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    1131
  • Lastpage
    1136
  • Abstract
    Describes a method for classifying document images on the basis of their physical layout. The layout is described by means of a hierarchical description, the modified X-Y tree, that is derived from the classical X-Y tree segmentation algorithm by taking into account cuts along lines in addition to cuts along white spaces between blocks. In order to reduce problems due to noise and the skew of the input image, the modified X-Y tree is built on top of regions extracted by a commercial OCR package. The tree is afterwards coded into a fixed-size representation that takes into account occurrences of tree patterns in the tree representing the page. Lastly, this feature vector is fed to an artificial neural network that is trained to classify document images. The system is applied to the classification of documents belonging to digital libraries. Examples of classes taken into account are "title page", "index" and "regular page". Many tests have been carried out on a data set of more than 600 pages from an online digital library. These tests allowed us to conclude that the use of modified X-Y trees is advantageous with respect to the classical X-Y decomposition for this classification task.
  • Keywords
    digital libraries; document image processing; image classification; image coding; image segmentation; neural nets; optical character recognition; tree codes; tree data structures; OCR; X-Y decomposition; X-Y tree segmentation algorithm; artificial neural network training; cuts; document image classification; document physical layout; feature vector; fixed-size representation; hierarchical description; indexes; modified X-Y tree encoding; noise; online digital library; page classification; regular pages; skew; title pages; white spaces; Artificial neural networks; Classification tree analysis; Encoding; Image segmentation; Noise reduction; Optical character recognition software; Packaging; Software libraries; Testing; White spaces;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on
  • Conference_Location
    Seattle, WA
  • Print_ISBN
    0-7695-1263-1
  • Type

    conf

  • DOI
    10.1109/ICDAR.2001.953962
  • Filename
    953962
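
The abstract's step of coding a tree into a fixed-size representation based on occurrences of tree patterns can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' encoding: the node labels (`HS`/`VS` for cuts along white space, `HL`/`VL` for cuts along lines, `TEXT`/`IMAGE` for leaf regions), the choice of parent-child label pairs as the pattern vocabulary, and the normalization by total count are all hypothetical. The neural network classifier is not shown.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical label set (an assumption, not taken from the paper):
# HS/VS = horizontal/vertical cut along white space,
# HL/VL = horizontal/vertical cut along a line,
# TEXT/IMAGE = leaf regions.
LABELS = ("HS", "VS", "HL", "VL", "TEXT", "IMAGE")


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


# Assumed pattern vocabulary: every ordered (parent, child) label pair.
# Its size is fixed, so the vector length does not depend on the tree.
PATTERNS = [(p, c) for p in LABELS for c in LABELS]
INDEX = {pattern: i for i, pattern in enumerate(PATTERNS)}


def encode(root: Node) -> List[float]:
    """Count parent-child label patterns and normalize to relative frequencies."""
    counts = [0] * len(PATTERNS)
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            counts[INDEX[(node.label, child.label)]] += 1
            stack.append(child)
    total = sum(counts)
    if total == 0:  # single-node tree: no patterns to count
        return [0.0] * len(PATTERNS)
    return [c / total for c in counts]


# Toy page: a white-space cut separating a text block from a region
# that is itself split by a vertical line into text and an image.
page = Node("HS", [Node("TEXT"),
                   Node("VL", [Node("TEXT"), Node("IMAGE")])])
vec = encode(page)
```

Because the vocabulary is fixed in advance, pages with trees of different sizes and depths all map to vectors of the same length, which is what a classifier with a fixed input size needs.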