• DocumentCode
    2259160
  • Title

    Stochastic Arabic hybrid diacritizer

  • Author

    Rashwan, Mohsen ; Attia, Mohamed ; Abdou, Sherif ; Rafea, Ahmed

  • Author_Institution
    Dept. of Electron. & Electr. Commun., Cairo Univ., Cairo, Egypt
  • fYear
    2009
  • fDate
    24-27 Sept. 2009
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    This paper introduces a two-layer stochastic system that automatically diacritizes raw Arabic text. The first layer determines the most likely diacritics by choosing the sequence of full-form Arabic word diacritizations with maximum marginal probability, using an A* lattice search algorithm and m-gram probability estimation. When full-form words are out-of-vocabulary (OOV), the system falls back on a second layer, which factorizes each Arabic word into its possible morphological constituents (prefix, root, pattern and suffix), then uses m-gram probability estimation and A* lattice search to select among the possible factorizations and obtain the most likely diacritization sequence. While the second layer has better coverage of possible Arabic forms, the first layer yields better disambiguation results for the same size of training corpus, especially for inferring syntactic (case-ending) diacritics. The presented hybrid system combines the advantages of both layers. The paper details the workings of both layers and the architecture of the hybrid system. The proposed system is compared with that of Habash et al., the best-performing system known to the authors, using their training and testing corpus. The reported word error rates are 5.5% for morphological diacritization and 9.4% for syntactic diacritization for Habash et al., against 3.1% for morphological diacritization and 9.4% for syntactic diacritization for the proposed system.
  • Keywords
    learning (artificial intelligence); natural language processing; probability; search problems; stochastic processes; text analysis; A* lattice search algorithm; hybrid system; m-gram probability estimation; machine learning; maximum marginal probability; morphological constituent; morphological diacritization; out-of-vocabulary; stochastic Arabic hybrid diacritizer; syntactic diacritization; text analysis; Computer science; Lattices; Morphology; Speech synthesis; Stochastic processes; Stochastic systems; System testing; Tagging; Training data; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
  • Conference_Location
    Dalian
  • Print_ISBN
    978-1-4244-4538-7
  • Electronic_ISBN
    978-1-4244-4540-0
  • Type

    conf

  • DOI
    10.1109/NLPKE.2009.5313742
  • Filename
    5313742
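
The two-layer scheme summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the m-gram model is reduced to a bigram, the A* search uses a zero heuristic (so it behaves like uniform-cost search, which is admissible because every edge cost -log P is non-negative), the fallback to the second layer is applied per word, and the second layer is reduced to a candidate generator feeding the same lattice rather than a separate search over morphological factorizations. All names (`build_lattice`, `best_path`, `diacritize`, `toy_factorize`) and the transliterated toy data are hypothetical.

```python
import heapq
import math

START = "<s>"  # hypothetical sentence-start symbol


def build_lattice(words, full_form_lexicon, factorize):
    """One column of candidate diacritizations per input word.

    Layer 1: look the word up in a full-form lexicon.
    Layer 2 (assumed per-word fallback): for OOV words, ask a
    morphological factorizer for candidates instead.
    """
    lattice = []
    for word in words:
        candidates = full_form_lexicon.get(word)
        if not candidates:
            candidates = factorize(word)
        lattice.append(list(candidates))
    return lattice


def best_path(lattice, bigram_logprob):
    """Best-first search (A* with h = 0) for the cheapest lattice path.

    bigram_logprob(prev, cur) must return log P(cur | prev) <= 0.
    Returns (path, cost) where cost is the total negative log-probability.
    """
    # Heap entries: (cost so far, column index, last candidate, path so far)
    heap = [(0.0, 0, START, ())]
    settled = set()
    while heap:
        cost, pos, last, path = heapq.heappop(heap)
        if pos == len(lattice):
            return list(path), cost
        if (pos, last) in settled:
            continue  # a cheaper route to this state was already expanded
        settled.add((pos, last))
        for cand in lattice[pos]:
            step = -bigram_logprob(last, cand)
            heapq.heappush(heap, (cost + step, pos + 1, cand, path + (cand,)))
    return [], math.inf  # some column had no candidates


def diacritize(words, full_form_lexicon, factorize, bigram_logprob):
    return best_path(build_lattice(words, full_form_lexicon, factorize),
                     bigram_logprob)


# Toy data (hypothetical, Latin transliteration stands in for Arabic).
toy_lexicon = {"ktb": ["kataba", "kutiba"]}
toy_probs = {
    (START, "kataba"): 0.6, (START, "kutiba"): 0.4,
    ("kataba", "xyz_a"): 0.1, ("kataba", "xyz_b"): 0.9,
    ("kutiba", "xyz_a"): 0.5, ("kutiba", "xyz_b"): 0.5,
}


def toy_factorize(word):
    # Stand-in for a prefix/root/pattern/suffix analyzer.
    return [word + "_a", word + "_b"]


def toy_logprob(prev, cur):
    # Unseen bigrams get a small floor probability.
    return math.log(toy_probs.get((prev, cur), 1e-6))


toy_path, toy_cost = diacritize(["ktb", "xyz"], toy_lexicon,
                                toy_factorize, toy_logprob)
```

In the toy run, "ktb" is resolved by the full-form lexicon while "xyz" is out-of-vocabulary and receives its candidates from the fallback generator; the search then selects one candidate per word so that the product of bigram probabilities is maximal.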