• DocumentCode
    2278774
  • Title

    A language independent text segmentation technique based on naive bayes classifier

  • Author

    Bidgoli, A.M. ; Boraghi, M.

  • Author_Institution
    North Tehran Branch, Islamic Azad Univ., Tehran, Iran
  • fYear
    2010
  • fDate
    15-17 Dec. 2010
  • Firstpage
    11
  • Lastpage
    16
  • Abstract
    An important stage in an optical character recognition system is segmenting text components from non-text components of input images. In this paper, a machine learning technique based on a naive Bayes classifier is developed for text component segmentation. In the training stage, a simple procedure is used to generate a large collection of training data sets for training the classifier. A collection of manuscript and printed Persian and English pictorial images that have been manually separated is used for training. Post-processing is then applied to improve the segmentation results. Several representative document images scanned from handwritten and printed Persian, English, and Chinese documents are used to verify the effectiveness of the developed algorithm.
  • Keywords
    Bayes methods; character recognition; document image processing; image segmentation; learning (artificial intelligence); Chinese handwritings; English pictorial Images; Persian pictorial Images; document images; language independent text segmentation technique; machine learning technique; naive bayes classifier; nontext components; optical character recognition system; text components segmentation; training data sets; Classification algorithms; Equations; Image edge detection; Image segmentation; Mathematical model; Training; Training data; Documents Image Analyses; Naive Bayes Classifier; OCR; Text Segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Signal and Image Processing (ICSIP), 2010 International Conference on
  • Conference_Location
    Chennai
  • Print_ISBN
    978-1-4244-8595-6
  • Type

    conf

  • DOI
    10.1109/ICSIP.2010.5697433
  • Filename
    5697433