• DocumentCode
    530273
  • Title
    Urdu noun phrase chunking: HMM based approach
  • Author
    Ali, Wajid ; Malik, M. Kamran ; Hussain, Sarmad ; Siddiq, Shahid ; Ali, Aasim
  • Author_Institution
    Dept. of Comput. Sci., Nat. Univ. of Comput. & Emerging Sci. (NUCES), Lahore, Pakistan
  • Volume
    2
  • fYear
    2010
  • fDate
    17-19 Sept. 2010
  • Abstract
    Extraction of noun phrases (NPs) from text is useful for many natural language processing applications, such as named entity recognition, indexing, searching, and parsing. We present a noun phrase chunker for Urdu that is based on a statistical approach. A 100,000-word Urdu corpus is manually tagged with NP chunk tags and used to develop the approach. Initially, a statistical approach based on a standard HMM is developed for automatic NP chunking. In Urdu phrases, the case marker (CM) is appended at the end of a noun phrase and so indicates where the phrase ends. Thus, if one scans the sentence in reverse order, one may be able to better predict phrase endings. The technique is therefore enhanced by changing the scanning direction. It is further enhanced by merging chunk and POS tags to achieve maximum accuracy. The results of all experiments are reported, with a maximum overall accuracy of 97.61% achieved using the HMM-based approach with an extended tagset and right-to-left (RTL) scanning.
  • Keywords
    cognition; hidden Markov models; natural language processing; NP chunk tags; POS tags; Urdu noun phrase chunking; automatic NP chunking; case marker; chunk merging; noun phrase chunker; noun phrase extraction; phrase endings; scanning direction; standard HMM model; Cardiology; Hidden Markov models; Random access memory; Testing; HMM based chunking; NP chunking; Statistical Chunking; Urdu Noun Phrase; chunking
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Educational and Information Technology (ICEIT), 2010 International Conference on
  • Conference_Location
    Chongqing
  • Print_ISBN
    978-1-4244-8033-3
  • Electronic_ISBN
    978-1-4244-8035-7
  • Type
    conf
  • DOI
    10.1109/ICEIT.2010.5607623
  • Filename
    5607623
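
The abstract describes an HMM chunker that scans each sentence right to left and merges chunk and POS tags into an extended tagset. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a first-order HMM whose hidden states are merged "POS|chunk" labels, POS tags as the observed input, a B/I/O chunk scheme, and additive smoothing; the function names, the smoothing constant, and the toy training sentences are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical start-of-scan symbol


def train(sentences):
    """Count transitions between merged POS|chunk states, scanning right to left.

    `sentences` is a list of sentences, each a list of (pos, chunk) pairs.
    """
    trans = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        prev = START
        for pos, chunk_tag in reversed(sent):  # RTL scanning
            state = f"{pos}|{chunk_tag}"       # extended (merged) tag
            trans[prev][state] += 1
            prev = state
    return trans


def log_prob(trans, prev, state, n_states, alpha=0.1):
    """Additively smoothed log transition probability (alpha is an assumption)."""
    row = trans.get(prev, {})
    total = sum(row.values())
    return math.log((row.get(state, 0) + alpha) / (total + alpha * n_states))


def chunk(pos_tags, trans, chunk_tags=("B", "I", "O")):
    """Viterbi decoding over the reversed POS sequence.

    Because each state embeds its POS tag, only states that match the
    observed POS tag are considered at each position.
    """
    n_states = max(len({s for row in trans.values() for s in row}), 1)
    # Maps a state to (best log score, chunk tags chosen so far, in scan order).
    prev_scores = {START: (0.0, [])}
    for pos in reversed(pos_tags):
        cur = {}
        for c in chunk_tags:
            state = f"{pos}|{c}"
            score, path = max(
                ((s + log_prob(trans, p, state, n_states), pth)
                 for p, (s, pth) in prev_scores.items()),
                key=lambda t: t[0],
            )
            cur[state] = (score, path + [c])
        prev_scores = cur
    _, best_path = max(prev_scores.values(), key=lambda t: t[0])
    return list(reversed(best_path))  # restore left-to-right order


# Toy, invented training data: (POS, chunk) pairs; CM is a case marker.
toy = [
    [("NN", "B"), ("CM", "O"), ("NN", "B"), ("VB", "O")],
    [("ADJ", "B"), ("NN", "I"), ("CM", "O"), ("VB", "O")],
]
trans = train(toy)
print(chunk(["ADJ", "NN", "CM", "VB"], trans))
```

Reversing the sequence before counting and decoding is what lets a phrase-final case marker be seen before the noun phrase it closes, which is the intuition the abstract gives for RTL scanning.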