• DocumentCode
    2844354
  • Title
    OOV words in an English-Arabic CLIR system
  • Author
    Bellaachia, Abdelghani ; Amor-Tijani, G.
  • Author_Institution
    Dept. of Comput. Sci., George Washington Univ., Washington, DC
  • fYear
    2008
  • fDate
    6-9 July 2008
  • Firstpage
    874
  • Lastpage
    882
  • Abstract
    Proper nouns are often the primary keys of a query, so translating them correctly can be essential to maintaining good retrieval performance in a cross-language information retrieval (CLIR) system. However, dictionaries include only the most commonly used proper nouns, such as major countries and capitals. Because proper nouns are spelling variants of each other in most languages, the common approach to finding the target-language counterparts of an original query key is approximate string matching against the target database index. Among approximate string matching techniques, the n-gram technique has proved the most effective. Because our English-Arabic CLIR system involves two languages with different alphabets, we combine transliteration with the n-gram technique to generate the spelling variants of out-of-vocabulary (OOV) words. We call this technique Transliteration Ngram (TNG). One issue that arises with Arabic is that words that are spelled alike can have different meanings depending on the context of the sentence. This is particularly true of proper names, which usually carry a meaning when used as a verb or adjective. To further enhance our transliteration approach, we use part-of-speech (POS) disambiguation to reduce the number of unrelated words in the set of transliterations obtained using TNG.
  • Keywords
    database indexing; information retrieval systems; language translation; natural language processing; query processing; string matching; vocabulary; English-Arabic CLIR system; N-gram technique; OOV words; POS; TNG; approximate string matching technique; cross language information retrieval system; original query key; out of vocabulary; part of speech disambiguation; target database index; transliteration Ngram; transliteration approach; Computer science; Databases; Degradation; Dictionaries; Indexes; Information retrieval; Speech enhancement; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computers and Communications, 2008. ISCC 2008. IEEE Symposium on
  • Conference_Location
    Marrakech
  • ISSN
    1530-1346
  • Print_ISBN
    978-1-4244-2702-4
  • Electronic_ISBN
    1530-1346
  • Type
    conf
  • DOI
    10.1109/ISCC.2008.4625724
  • Filename
    4625724
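
The approximate string matching step described in the abstract can be illustrated with a minimal sketch. It assumes character bigrams padded with `#` and a Dice coefficient as the similarity measure; the record does not state which n-gram size, similarity measure, or threshold the authors used, and the function names `char_ngrams`, `dice_similarity`, and `match_candidates` are hypothetical. The transliteration step itself is not shown: the candidate spellings are taken as given input.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with '#' at both ends."""
    padded = "#" * (n - 1) + word + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over the two words' n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def match_candidates(candidates, index_terms, threshold=0.6, n=2):
    """Rank index terms by their best similarity to any candidate spelling.

    `candidates` stands in for the transliterated variants of an OOV query
    word; `index_terms` stands in for the target database index. The
    threshold of 0.6 is an illustrative assumption, not a value from the paper.
    """
    scored = []
    for term in index_terms:
        best = max((dice_similarity(c, term, n) for c in candidates), default=0.0)
        if best >= threshold:
            scored.append((term, best))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `match_candidates(["marrakech"], ["marrakesh", "london"])` keeps only the index term that shares most of its bigrams with the candidate. The functions operate on plain Unicode strings, so the same sketch applies to Arabic-script candidates and index terms.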