• DocumentCode
    3085222
  • Title

    Extraction of Arabic text from multilingual documents

  • Author

    Moalla, Ikram ; Elbaati, Abdelkarim ; Alimi, Adel M. ; Benhamadou, AbdelMajid

  • Author_Institution
    REsearch Group on Intelligent Machines, Univ. of Sfax, Tunisia
  • Volume
    4
  • fYear
    2002
  • fDate
    6-9 Oct. 2002
  • Abstract
    This paper describes the processing of multilingual (Arabic/Latin) documents extracted from Arabic scientific articles, whose pages contain Arabic lines that sometimes include one or more Latin words because those words have no exact equivalent in Arabic. To process these blocks, the Arabic text must be extracted from the multilingual blocks. We propose an original method to locate Latin words in heterogeneous blocks. The method is based on Arabic character recognition performed by template matching, which tests have shown to be efficient for discriminating between Arabic and Latin script. Segment prototypes are extracted from the main font styles used in the treated magazines. Word discrimination results approach 100% on 30 blocks containing a total of 478 words.
  • Keywords
    document image processing; image matching; optical character recognition; text analysis; Arabic character recognition; Arabic scientific articles; Arabic text extraction; Latin words; OCR; document analysis; font styles; heterogeneous blocks; multilingual documents; template matching; word discrimination; Character recognition; Displays; Feature extraction; Handwriting recognition; Machine intelligence; Optical character recognition software; Prototypes; Testing; Text analysis; Writing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Man and Cybernetics, 2002 IEEE International Conference on
  • ISSN
    1062-922X
  • Print_ISBN
    0-7803-7437-1
  • Type

    conf

  • DOI
    10.1109/ICSMC.2002.1173266
  • Filename
    1173266