Extraction of Arabic text from multilingual documents

Author

Moalla, Ikram ; Elbaati, Abdelkarim ; Alimi, Adel M. ; Benhamadou, AbdelMajid

Author_Institution

REsearch Group on Intelligent Machines, Univ. of Sfax, Tunisia

Volume

4

fYear

2002

fDate

6-9 Oct. 2002

Abstract

This paper describes the processing of multilingual (Arabic/Latin) documents extracted from Arabic scientific articles, whose pages contain Arabic lines that sometimes include one or more Latin words because those words have no exact equivalent in Arabic. To process these blocks, we need to extract the Arabic text from the multilingual blocks. We propose an original method to locate Latin words in heterogeneous blocks. The method is based on Arabic character recognition, performed by template matching, which tests have shown to be efficient for discriminating Arabic from Latin script. Segment prototypes are extracted from the main font styles used in the treated magazines. Word discrimination results approach 100% on 30 blocks containing a total of 478 words.
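
The abstract does not give the features, prototype set, or decision rule used, so the following is only an illustrative sketch of how template matching could separate Arabic from Latin words. It assumes segments and prototypes are small binary grids of equal size, uses normalized cross-correlation as the matching score, and labels a word Arabic when enough of its segments match an Arabic prototype. The function names (`ncc`, `best_match`, `classify_word`), the thresholds, and the toy 3x3 prototypes are all hypothetical and are not taken from the paper.

```python
from math import sqrt


def ncc(a, b):
    """Normalized cross-correlation between two equal-size 2D grids."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if len(xs) != len(ys):
        raise ValueError("segment and template must have the same size")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    # A constant grid has zero variance; treat it as matching nothing.
    return num / den if den else 0.0


def best_match(segment, prototypes):
    """Highest correlation of a segment against any prototype."""
    return max(ncc(segment, p) for p in prototypes)


def classify_word(segments, arabic_prototypes, threshold=0.8, min_fraction=0.5):
    """Label a word by the share of its segments that match Arabic prototypes.

    threshold and min_fraction are assumed values chosen for illustration.
    """
    if not segments:
        return "unknown"
    hits = sum(1 for s in segments if best_match(s, arabic_prototypes) >= threshold)
    return "arabic" if hits / len(segments) >= min_fraction else "latin"


# Toy prototypes standing in for segment shapes from the main font styles.
PROTO_A = [[0, 1, 0], [0, 1, 0], [1, 1, 1]]
PROTO_B = [[1, 1, 1], [0, 0, 1], [0, 0, 1]]
ARABIC_PROTOTYPES = [PROTO_A, PROTO_B]

# A segment shape that resembles neither prototype.
OTHER_SEGMENT = [[1, 0, 1], [1, 0, 1], [1, 0, 1]]

if __name__ == "__main__":
    print(classify_word([PROTO_A, PROTO_B], ARABIC_PROTOTYPES))
    print(classify_word([OTHER_SEGMENT], ARABIC_PROTOTYPES))
```

In this sketch, a word whose segments fail to match any Arabic prototype is treated as Latin, which mirrors the idea of locating Latin words through Arabic recognition rather than recognizing Latin script directly.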

Keywords

document image processing; image matching; optical character recognition; text analysis; Arabic character recognition; Arabic scientific articles; Arabic text extraction; Latin words; OCR; document analysis; font styles; heterogeneous blocks; multilingual documents; template matching; word discrimination; Character recognition; Displays; Feature extraction; Handwriting recognition; Machine intelligence; Optical character recognition software; Prototypes; Testing; Text analysis; Writing;

fLanguage

English

Publisher

IEEE

Conference_Titel

Systems, Man and Cybernetics, 2002 IEEE International Conference on

ISSN

1062-922X

Print_ISBN

0-7803-7437-1

Type

conf

DOI

10.1109/ICSMC.2002.1173266

Filename

1173266