• DocumentCode
    3136279
  • Title
    Layout Analysis for Arabic Historical Document Images Using Machine Learning
  • Author
    Bukhari, Syed Saqib ; Breuel, Thomas M. ; Asi, Abedelkadir ; El-Sana, Jihad
  • Author_Institution
    Tech. Univ. of Kaiserslautern, Kaiserslautern, Germany
  • fYear
    2012
  • fDate
    18-20 Sept. 2012
  • Firstpage
    639
  • Lastpage
    644
  • Abstract
    Page layout analysis is a fundamental step in any document image understanding system. We introduce an approach that segments text appearing in page margins (a.k.a. side-note text) from manuscripts with complex layouts. Simple, discriminative features are extracted at the connected-component level, and robust feature vectors are then generated from them. A multilayer perceptron classifier is used to assign each connected component to the relevant text class. A voting scheme is then applied to refine the resulting segmentation and produce the final classification. In contrast to state-of-the-art segmentation approaches, this method is independent of block segmentation as well as pixel-level analysis. The proposed method has been trained and tested on a dataset that contains a variety of complex side-note layout formats, achieving a segmentation accuracy of about 95%.
  • Keywords
    document image processing; feature extraction; image classification; image segmentation; learning (artificial intelligence); natural languages; text analysis; Arabic historical document images; block segmentation; complex layout format; complex side-notes layout formats; connected components classification; connected-component level; discriminative feature extraction; document image understanding system; machine learning; manuscripts; multilayer perceptron classifier; page layout analysis; page margins; pixel level analysis; robust feature vectors generation; state-of-the-art segmentation approach; text class; text segments; voting scheme; Accuracy; Context; Feature extraction; Image segmentation; Layout; Shape; Training; historical manuscripts; layout analysis; machine learning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference on
  • Conference_Location
    Bari
  • Print_ISBN
    978-1-4673-2262-1
  • Type
    conf
  • DOI
    10.1109/ICFHR.2012.227
  • Filename
    6424468
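
The abstract mentions a voting scheme that refines the per-component classification but gives no details. The following is a minimal sketch of one plausible form of such a refinement, not the authors' method: each connected component is relabeled with the majority label among itself and its k nearest neighbours, measured between component centroids. The function name `refine_labels_by_voting`, the parameter `k`, the label strings, and the use of centroids are all assumptions made for illustration.

```python
from collections import Counter
from math import hypot


def refine_labels_by_voting(centroids, labels, k=3):
    """Hypothetical refinement step (an assumption, not the paper's rule).

    centroids: list of (x, y) centres of connected components.
    labels:    initial class label per component, e.g. "main" or "side".
    Returns a new list in which each component takes the majority label
    among itself and its k nearest neighbours. Ties keep the own label.
    """
    refined = []
    for i, (xi, yi) in enumerate(centroids):
        # Indices of the k components closest to component i (excluding i).
        neighbours = sorted(
            (j for j in range(len(centroids)) if j != i),
            key=lambda j: hypot(centroids[j][0] - xi, centroids[j][1] - yi),
        )[:k]

        # Count votes from the original labels, including the component's own.
        votes = Counter(labels[j] for j in neighbours)
        votes[labels[i]] += 1

        top = max(votes.values())
        winners = [label for label, count in votes.items() if count == top]
        refined.append(labels[i] if labels[i] in winners else winners[0])
    return refined


# Toy example: one component inside the main-text cluster was misclassified.
centroids = [(0, 0), (1, 0), (2, 0), (1, 1), (10, 0), (11, 0), (10, 1), (11, 1)]
labels = ["main", "main", "side", "main", "side", "side", "side", "side"]
refined = refine_labels_by_voting(centroids, labels, k=3)
```

Votes are counted from the initial labels rather than from labels already refined in the same pass, so the result does not depend on the order in which components are visited.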