• DocumentCode
    3489798
  • Title

    A Dictionary Based Urdu Word Segmentation Using Maximum Matching Algorithm for Space Omission Problem

  • Author

    Rashid, Rasber; Latif, Saeed

  • Author_Institution
    Coll. of Telecommun. Eng., Nat. Univ. of Sci. & Technol. (NUST), Islamabad, Pakistan
  • fYear
    2012
  • fDate
    13-15 Nov. 2012
  • Firstpage
    101
  • Lastpage
    104
  • Abstract
    The first step in any Natural Language Processing system is word segmentation, which means dividing a sentence into the words it consists of. Urdu was selected for this research because very little work has been done on it. In Urdu, spaces cannot be relied on to mark word boundaries because they are not used consistently. Urdu word segmentation differs from that of other Asian languages in that it involves both the Space Omission and the Space Insertion problems. This paper discusses these problems and suggests a technique that solves both of them. It applies simple, established techniques in a different way to develop an efficient segmentation algorithm. Morphological analysis of Urdu text is also taken into account. A dictionary is used for verification and identification of Urdu words. The work has been tested on words collected from the Geo, Jang, and BBC news sites and from other documents available on the Internet. The proposed algorithm was tested on 11,995 words, and 97.2% of these words were segmented correctly.
  • Keywords
    Internet; dictionaries; electronic publishing; natural language processing; pattern matching; text analysis; word processing; Urdu word identification; Urdu word verification; dictionary based Urdu word segmentation; maximum matching algorithm; morphological Urdu text analysis; natural language processing system; news sites; online documents; space insertion problem; space omission problem; word boundary marking
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Asian Language Processing (IALP), 2012 International Conference on
  • Conference_Location
    Hanoi
  • Print_ISBN
    978-1-4673-6113-2
  • Electronic_ISBN
    978-0-7695-4886-9
  • Type

    conf

  • DOI
    10.1109/IALP.2012.11
  • Filename
    6473706
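
The title and abstract name a dictionary-based maximum matching approach to the Space Omission problem. The record does not give the authors' algorithm, so the following is only a minimal sketch of generic forward maximum matching, assuming a toy Latin-script dictionary in place of an Urdu lexicon. The function name `max_match`, its parameters, and the sample words are hypothetical, and the sketch leaves out the morphological analysis and the Space Insertion handling mentioned in the abstract.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching over a string written without spaces.

    Illustrative only: not the method from the paper, just the generic idea
    of always taking the longest dictionary word at the current position.
    """
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character as unknown.
            words.append(text[i])
            i += 1
    return words


# Hypothetical toy dictionary standing in for a real lexicon.
toy_dict = {"word", "segment", "segmentation", "is", "hard"}
print(max_match("wordsegmentationishard", toy_dict))
```

Because the match is greedy, "segmentation" is preferred over the shorter "segment" in the sample input. A character that starts no dictionary word is emitted on its own, which is one simple way to keep the scan moving past unknown material.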