DocumentCode
2259160
Title
Stochastic Arabic hybrid diacritizer
Author
Rashwan, Mohsen ; Attia, Mohamed ; Abdou, Sherif ; Abdou, S. ; Rafea, Ahmed
Author_Institution
Dept. of Electron. & Electr. Commun., Cairo Univ., Cairo, Egypt
fYear
2009
fDate
24-27 Sept. 2009
Firstpage
1
Lastpage
8
Abstract
This paper introduces a two-layer stochastic system that automatically diacritizes raw Arabic text. The first layer determines the most likely diacritics by choosing the sequence of full-form Arabic word diacritizations with maximum marginal probability, using an A* lattice search algorithm and m-gram probability estimation. When full-form words are out-of-vocabulary (OOV), the system falls back on a second layer, which factorizes each Arabic word into its possible morphological constituents (prefix, root, pattern and suffix) and then uses m-gram probability estimation and A* lattice search to select among the possible factorizations and obtain the most likely diacritization sequence. While the second layer has better coverage of possible Arabic forms, the first layer yields better disambiguation results for the same size of training corpora, especially for inferring syntactic (case-ending) diacritics. The presented hybrid system combines the advantages of both layers. The paper details the workings of both layers and the architecture of the hybrid system. The proposed system is compared with that of Habash et al., the best-performing system to our knowledge, using their training and testing corpus. The word error rates are 5.5% for morphological diacritization and 9.4% for syntactic diacritization for the system of Habash et al., versus 3.1% for morphological diacritization and 9.4% for syntactic diacritization for our system.
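The two-layer fallback described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dictionary `FULL_FORM`, the bigram table `BIGRAM`, the `FLOOR` smoothing constant, the transliterated toy words, and the `factorize` placeholder are all hypothetical, a bigram model stands in for the general m-gram model, and the search simply returns the highest-probability path through the lattice.

```python
import heapq
import math

# Hypothetical toy data: undiacritized word -> candidate diacritized full forms.
FULL_FORM = {
    "ktb": ["kataba", "kutub"],
    "Alwld": ["Alwaladu", "Alwalada"],
}

# Hypothetical bigram probabilities over diacritized forms; "<s>" marks the start.
BIGRAM = {
    ("<s>", "kataba"): 0.6,
    ("<s>", "kutub"): 0.4,
    ("kataba", "Alwaladu"): 0.7,
    ("kataba", "Alwalada"): 0.3,
    ("kutub", "Alwaladu"): 0.2,
    ("kutub", "Alwalada"): 0.8,
}
FLOOR = 1e-4  # crude stand-in for smoothing of unseen bigrams (assumption)


def factorize(word):
    """Placeholder for the second layer. A real system would enumerate
    prefix/root/pattern/suffix analyses; here one dummy candidate is returned."""
    return [word + "+?"]


def candidates(word):
    # Layer 1: full-form lookup. Layer 2 is used only when the word is OOV.
    return FULL_FORM.get(word) or factorize(word)


def cost(prev, cur):
    # Negative log-probability, so that lower cost means a more likely step.
    return -math.log(BIGRAM.get((prev, cur), FLOOR))


def best_sequence(words):
    """A* search over the candidate lattice, minimizing total bigram cost."""
    lattice = [candidates(w) for w in words]
    n = len(lattice)
    # Admissible heuristic: every remaining step costs at least min_step.
    min_step = -math.log(max(list(BIGRAM.values()) + [FLOOR]))
    # Frontier entries: (f = g + h, g, words consumed, previous form, path).
    frontier = [(n * min_step, 0.0, 0, "<s>", ())]
    closed = set()
    while frontier:
        _, g, i, prev, path = heapq.heappop(frontier)
        if i == n:
            return list(path)
        if (i, prev) in closed:
            continue
        closed.add((i, prev))
        for cand in lattice[i]:
            g2 = g + cost(prev, cand)
            h2 = (n - i - 1) * min_step
            heapq.heappush(frontier, (g2 + h2, g2, i + 1, cand, path + (cand,)))
    return []


print(best_sequence(["ktb", "Alwld"]))
print(best_sequence(["ktb", "xyz"]))  # "xyz" is OOV, so layer 2 supplies its candidate
```

With the toy tables above, an in-vocabulary word is resolved entirely by the full-form layer, while an OOV word receives its candidates from the placeholder morphological layer and is scored in the same lattice.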
Keywords
learning (artificial intelligence); natural language processing; probability; search problems; stochastic processes; text analysis; A* lattice search algorithm; hybrid system; m-gram probability estimation; machine learning; maximum marginal probability; morphological constituent; morphological diacritization; out-of-vocabulary; stochastic Arabic hybrid diacritizer; syntactic diacritization; text analysis; Computer science; Lattices; Morphology; Speech synthesis; Stochastic processes; Stochastic systems; System testing; Tagging; Training data; Vocabulary;
fLanguage
English
Publisher
IEEE
Conference_Titel
Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
Conference_Location
Dalian
Print_ISBN
978-1-4244-4538-7
Electronic_ISBN
978-1-4244-4540-0
Type
conf
DOI
10.1109/NLPKE.2009.5313742
Filename
5313742