Stochastic Arabic hybrid diacritizer

Author

Rashwan, Mohsen ; Attia, Mohamed ; Abdou, Sherif ; Rafea, Ahmed

Author_Institution

Dept. of Electron. & Electr. Commun., Cairo Univ., Cairo, Egypt

fYear

2009

fDate

24-27 Sept. 2009

Firstpage

1

Lastpage

8

Abstract

This paper introduces a two-layer stochastic system that automatically diacritizes raw Arabic text. The first layer determines the most likely diacritics by choosing the sequence of full-form Arabic word diacritizations with maximum marginal probability, using an A* lattice search algorithm and m-gram probability estimation. When full-form words are out-of-vocabulary (OOV), the system falls back on a second layer, which factorizes each Arabic word into its possible morphological constituents (prefix, root, pattern and suffix), then uses m-gram probability estimation and A* lattice search to select among the possible factorizations and obtain the most likely diacritization sequence. While the second layer has better coverage of possible Arabic forms, the first layer yields better disambiguation results for the same size of training corpus, especially for inferring syntactic (case-ending) diacritics. The presented hybrid system combines the advantages of both layers. The paper details the workings of both layers and the architecture of the hybrid system. The proposed system is compared with the best-performing system known to the authors, that of Habash et al., using their training and testing corpus. The word error rates are 5.5% for morphological diacritization and 9.4% for syntactic diacritization for Habash et al., against 3.1% for morphological diacritization and 9.4% for syntactic diacritization for the proposed system.
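
The two-layer idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several simplifications are assumed: the m-gram model is reduced to a bigram table, the search is uniform-cost search (A* with a zero heuristic, since the abstract does not give the heuristic), and the second layer is reduced to a hypothetical `factorize` callback that proposes candidates for OOV words. All names (`build_lattice`, `best_path`, `make_bigram`) and the transliterated toy data are invented for illustration.

```python
import heapq
import math


def make_bigram(table, floor=1e-6):
    """Build log P(cur | prev) from a toy table; unseen pairs get a floor."""
    def logprob(prev, cur):
        return math.log(table.get((prev, cur), floor))
    return logprob


def build_lattice(words, full_form_lexicon, factorize):
    """One candidate list per word; OOV words fall back to the second layer."""
    lattice = []
    for word in words:
        candidates = full_form_lexicon.get(word)
        if not candidates:
            # Hypothetical second layer: candidates from morphological
            # factorization; keep the raw word if nothing is proposed.
            candidates = factorize(word) or [word]
        lattice.append(candidates)
    return lattice


def best_path(lattice, bigram_logprob):
    """Lowest-cost candidate sequence under the bigram model.

    Uniform-cost search over the lattice (A* with a zero heuristic).
    Step costs are negative log probabilities, so they are never negative.
    """
    # Queue entries: (cost so far, lattice position, path as a tuple)
    queue = [(0.0, 0, ("<s>",))]
    settled = set()
    while queue:
        cost, pos, path = heapq.heappop(queue)
        if pos == len(lattice):
            return list(path[1:]), cost
        state = (pos, path[-1])  # sufficient state for a bigram model
        if state in settled:
            continue
        settled.add(state)
        for cand in lattice[pos]:
            step = -bigram_logprob(path[-1], cand)
            heapq.heappush(queue, (cost + step, pos + 1, path + (cand,)))
    return [], math.inf


# Toy data in Latin transliteration (invented, not from the paper).
LEXICON = {"ktb": ["kataba", "kutub"], "alwld": ["alwaladu", "alwalada"]}
BIGRAMS = make_bigram({
    ("<s>", "kataba"): 0.6, ("<s>", "kutub"): 0.4,
    ("kataba", "alwaladu"): 0.7, ("kataba", "alwalada"): 0.1,
    ("kutub", "alwaladu"): 0.2, ("kutub", "alwalada"): 0.3,
})

if __name__ == "__main__":
    lattice = build_lattice(["ktb", "alwld"], LEXICON, lambda w: [])
    print(best_path(lattice, BIGRAMS))
```

In this sketch the fallback only changes how candidates enter the lattice; in the system the abstract describes, the second layer runs its own m-gram estimation and search over morphological constituents.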

Keywords

learning (artificial intelligence); natural language processing; probability; search problems; stochastic processes; text analysis; A* lattice search algorithm; hybrid system; m-gram probability estimation; machine learning; maximum marginal probability; morphological constituent; morphological diacritization; out-of-vocabulary; stochastic Arabic hybrid diacritizer; syntactic diacritization; Computer science; Lattices; Morphology; Speech synthesis; Stochastic processes; Stochastic systems; System testing; Tagging; Training data; Vocabulary

fLanguage

English

Publisher

IEEE

Conference_Titel

Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on

Conference_Location

Dalian

Print_ISBN

978-1-4244-4538-7

Electronic_ISBN

978-1-4244-4540-0

Type

conf

DOI

10.1109/NLPKE.2009.5313742

Filename

5313742