Encoding of modified X-Y trees for document classification

Author

Cesarini, Francesca ; Lastri, Marco ; Marinai, Simone ; Soda, Giovanni

Author_Institution

Dipt. di Sistemi e Inf., Firenze Univ., Florence, Italy

fYear

2001

fDate

2001

Firstpage

1131

Lastpage

1136

Abstract

This paper describes a method for classifying document images on the basis of their physical layout. The layout is represented by a hierarchical description, the modified X-Y tree, which is derived from the classical X-Y tree segmentation algorithm by taking into account cuts along lines in addition to cuts along white spaces between blocks. To reduce problems caused by noise and skew in the input image, the modified X-Y tree is built on top of regions extracted by a commercial OCR package. The tree is then encoded into a fixed-size representation based on the occurrences of tree patterns in the tree representing the page. Finally, this feature vector is fed to an artificial neural network that is trained to classify document images. The system is applied to the classification of documents belonging to digital libraries. Examples of the classes considered are "title page", "index" and "regular page". Tests were carried out on a data set of more than 600 pages from an online digital library. These tests led to the conclusion that modified X-Y trees are advantageous with respect to the classical X-Y decomposition for this classification task.
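The encoding step in the abstract (turning a variable-size tree into a fixed-size vector of tree-pattern occurrences) can be illustrated with a minimal sketch. It is not the authors' implementation: the node labels (`HS`, `VS` for cuts along white space, `HL`, `VL` for cuts along lines, `text` and `image` for leaves), the `Node` class, and the choice of parent-child label pairs as the "tree patterns" are all assumptions made for illustration, since the abstract does not specify which patterns are counted.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import product
from typing import List

# Hypothetical label sets (assumptions, not taken from the paper):
# cuts along white space (HS, VS), cuts along lines (HL, VL), and leaf types.
CUT_LABELS = ("HS", "VS", "HL", "VL")
LEAF_LABELS = ("text", "image")

# Fixed pattern vocabulary: every (parent cut label, child label) pair.
# Its length fixes the size of the feature vector for any page.
PATTERNS = list(product(CUT_LABELS, CUT_LABELS + LEAF_LABELS))


@dataclass
class Node:
    """One node of a layout tree: a cut (internal node) or a region (leaf)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def count_patterns(root: Node) -> Counter:
    """Count how often each (parent label, child label) pair occurs."""
    counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            counts[(node.label, child.label)] += 1
            stack.append(child)
    return counts


def encode(root: Node) -> List[float]:
    """Map a tree of any size to a vector of len(PATTERNS) relative frequencies."""
    counts = count_patterns(root)
    total = sum(counts.values())
    if total == 0:  # a single-leaf tree contains no parent-child pattern
        return [0.0] * len(PATTERNS)
    return [counts[p] / total for p in PATTERNS]


# A small page: a horizontal white-space cut separating a text block from
# a region that is itself split by a vertical line into text and an image.
page = Node("HS", [Node("text"), Node("VL", [Node("text"), Node("image")])])
vector = encode(page)
```

Because the vocabulary is fixed in advance, pages with very different trees all yield vectors of the same length, which is what allows them to be fed to a classifier with a fixed number of inputs, such as the neural network mentioned in the abstract.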

Keywords

digital libraries; document image processing; image classification; image coding; image segmentation; neural nets; optical character recognition; tree codes; tree data structures; OCR; X-Y decomposition; X-Y tree segmentation algorithm; artificial neural network training; cuts; document image classification; document physical layout; feature vector; fixed-size representation; hierarchical description; indexes; modified X-Y tree encoding; noise; online digital library; page classification; regular pages; skew; title pages; white spaces; Artificial neural networks; Classification tree analysis; Encoding; Image segmentation; Noise reduction; Optical character recognition software; Packaging; Software libraries; Testing; White spaces;

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on

Conference_Location

Seattle, WA

Print_ISBN

0-7695-1263-1

Type

conf

DOI

10.1109/ICDAR.2001.953962

Filename

953962