Title :
Extraction of Arabic text from multilingual documents
Author :
Moalla, Ikram ; Elbaati, Abdelkarim ; Alimi, Adel M. ; Benhamadou, AbdelMajid
Author_Institution :
REsearch Group on Intelligent Machines, Univ. of Sfax, Tunisia
Abstract :
This paper describes the processing of multilingual (Arabic/Latin) documents extracted from Arabic scientific articles, whose pages contain Arabic lines that sometimes include one or more Latin words because those words have no exact equivalent in Arabic. To process these blocks, we need to extract the Arabic text from multilingual blocks. We propose an original method to locate Latin words in heterogeneous blocks. The method is based on Arabic character recognition performed by template matching, which tests have shown to be efficient for discriminating Arabic from Latin script. Segment prototypes are extracted from the main font styles used in the processed magazines. Word discrimination results approach 100% on 30 blocks containing a total of 478 words.
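The template-matching idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the similarity measure, the threshold, or the decision rule, so everything below is an assumption made for illustration: normalized cross-correlation as the score, a threshold of 0.8, segments already binarized and scaled to the prototype size, and the hypothetical names `ncc`, `best_match`, and `is_arabic_word`. It is not the authors' implementation.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equal-shape arrays, in [-1, 1]."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # A constant image has zero variance; treat it as matching nothing.
    return 0.0 if denom == 0 else float((a * b).sum() / denom)

def best_match(segment, prototypes):
    """Highest score of one segment against a list of equal-shape prototypes."""
    return max(ncc(segment, p) for p in prototypes)

def is_arabic_word(segments, arabic_prototypes, threshold=0.8):
    """Assumed rule: a word counts as Arabic only if every one of its
    segments matches some Arabic prototype at or above the threshold."""
    return all(best_match(s, arabic_prototypes) >= threshold for s in segments)

# Toy 2x2 "prototypes" standing in for Arabic segment templates.
prototypes = [np.array([[1, 0], [0, 1]]), np.array([[1, 1], [0, 0]])]
matching_word = [np.array([[1, 0], [0, 1]])]
non_matching_word = [np.array([[0, 1], [1, 0]])]
print(is_arabic_word(matching_word, prototypes))
print(is_arabic_word(non_matching_word, prototypes))
```

Under this assumed rule, a word whose segments fail to match any Arabic prototype would be flagged as a Latin candidate. A real system would first segment words and normalize segment sizes, steps the abstract mentions only in passing.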
Keywords :
document image processing; image matching; optical character recognition; text analysis; Arabic character recognition; Arabic scientific articles; Arabic text extraction; Latin words; OCR; document analysis; font styles; heterogeneous blocks; multilingual documents; template matching; word discrimination; Character recognition; Displays; Feature extraction; Handwriting recognition; Machine intelligence; Optical character recognition software; Prototypes; Testing; Text analysis; Writing;
Conference_Title :
Systems, Man and Cybernetics, 2002 IEEE International Conference on
Print_ISBN :
0-7803-7437-1
DOI :
10.1109/ICSMC.2002.1173266