• DocumentCode
    3222350
  • Title

    Page layout analyser for multilingual Indian documents

  • Author

    Chaudhuri, A. Ray ; Mandal, A.K. ; Chaudhuri, B.B.

  • Author_Institution
    Comput. Vision & Pattern Recognition Unit, Indian Stat. Inst., Kolkata, India
  • fYear
    2002
  • fDate
    13-15 Dec. 2002
  • Firstpage
    24
  • Lastpage
    32
  • Abstract
    An advanced Optical Character Recognition (OCR) system is equipped with a page layout analyser module. It separates textual zones from non-textual zones. It identifies textual blocks in multicolumn documents and groups them into homogeneous regions in terms of geometric shape and spatial distribution. All existing OCR modules developed for various Indian scripts can handle only single-column text documents. In this paper, a page layout analyser that uses typical common features present in most of the Indian scripts is introduced. A simple compatibility criterion that allows various degrees of homogeneity is defined. The page analyser is robust in the sense that it can distinguish text regions from non-textual entities such as images, rulers, and noise due to smudges and poor paper quality. Test results are shown for the two most popular Indian scripts, Devnagari (Hindi) and Bangla.
  • Keywords
    optical character recognition; Bangla; Devnagari; Hindi; advanced Optical Character Recognition system; compatibility criterion; multicolumn documents; multilingual Indian documents; page layout analyser; single-column documents; textual blocks; textual zones; Character recognition; Geometrical optics; Optical character recognition software; Optical sensors; Robustness; Shape; Testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Language Engineering Conference, 2002. Proceedings
  • Print_ISBN
    0-7695-1885-0
  • Type

    conf

  • DOI
    10.1109/LEC.2002.1182287
  • Filename
    1182287