• DocumentCode
    3037190
  • Title

    Effectively recognizing broken characters in historical documents

  • Author

    Sumetphong, Chaivatna ; Tangwongsan, Supachai

  • Author_Institution
    Fac. of Inf. & Commun. Technol., Mahidol Univ., Bangkok, Thailand
  • Volume
    3
  • fYear
    2012
  • fDate
    25-27 May 2012
  • Firstpage
    104
  • Lastpage
    108
  • Abstract
    Historical documents, after being binarized, produce images that contain abundant broken pieces. The presence of these broken pieces complicates OCR and drastically reduces the overall recognition accuracy. We propose a highly effective approach to recognize the broken characters using a heuristic enumerative method to find the optimal set partition of the broken pieces. Each subset of the optimal partition is mapped to the best character pattern and the overall image is recognized. Results obtained after performing experiments on a Thai historical document and an American historical document are quite promising. Given the generality of the method, it may be applicable to different language scripts, provided that a properly trained classifier has been developed for that script and font.
  • Keywords
    document image processing; history; optical character recognition; American historical document; OCR; Thai historical document; broken character recognition; broken pieces; heuristic enumerative method; historical documents; language scripts; optimal partition subset; recognition accuracy; trained classifier; Accuracy; Character recognition; Hidden Markov models; Image segmentation; Optical character recognition software; Partitioning algorithms
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science and Automation Engineering (CSAE), 2012 IEEE International Conference on
  • Conference_Location
    Zhangjiajie
  • Print_ISBN
    978-1-4673-0088-9
  • Type

    conf

  • DOI
    10.1109/CSAE.2012.6272918
  • Filename
    6272918
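
The abstract describes grouping broken pieces into an optimal set partition and mapping each subset to its best character pattern. The record gives no algorithmic detail, so the following is only a minimal sketch of that general idea, not the authors' method: it enumerates every set partition exhaustively instead of using their heuristic, and it assumes pieces are indexed left to right. The names `set_partitions`, `recognize`, and `toy_classify`, the lookup-table classifier, and the sum-of-scores objective are all hypothetical choices made for illustration.

```python
from typing import Callable, FrozenSet, Iterator, List, Tuple

Classifier = Callable[[FrozenSet[int]], Tuple[str, float]]


def set_partitions(items: List[int]) -> Iterator[List[List[int]]]:
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Place `first` into each existing block in turn ...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ... or let it start a block of its own.
        yield [[first]] + partial


def recognize(pieces: List[int], classify: Classifier) -> Tuple[str, float]:
    """Return the text and score of the best-scoring partition.

    Assumption: a partition's quality is the sum of its blocks'
    classifier scores. The source does not state the real objective.
    """
    best_score = float("-inf")
    best_blocks: List[Tuple[List[int], str]] = []
    for partition in set_partitions(pieces):
        labeled = [(block, *classify(frozenset(block))) for block in partition]
        total = sum(score for _, _, score in labeled)
        if total > best_score:
            best_score = total
            best_blocks = [(block, label) for block, label, _ in labeled]
    # Read the characters in order of their leftmost piece.
    best_blocks.sort(key=lambda pair: min(pair[0]))
    return "".join(label for _, label in best_blocks), best_score


# Hypothetical stand-in for a trained classifier: pieces 0 and 1 together
# look like "A", pieces 2 and 3 together look like "B", anything else is
# unrecognized and scores zero.
_TOY_TABLE = {
    frozenset({0, 1}): ("A", 0.9),
    frozenset({2, 3}): ("B", 0.8),
}


def toy_classify(block: FrozenSet[int]) -> Tuple[str, float]:
    return _TOY_TABLE.get(block, ("?", 0.0))


if __name__ == "__main__":
    text, score = recognize([0, 1, 2, 3], toy_classify)
    print(text, round(score, 2))
```

The number of set partitions grows as the Bell numbers (15 partitions for 4 pieces, 52 for 5), so exhaustive enumeration is only workable for a handful of pieces, which is presumably why the paper relies on a heuristic.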