OOV words in an English-Arabic CLIR system

Author

Bellaachia, Abdelghani ; Amor-Tijani, G.

Author_Institution

Dept. of Comput. Sci., George Washington Univ., Washington, DC

fYear

2008

fDate

6-9 July 2008

Firstpage

874

Lastpage

882

Abstract

Proper nouns are usually the primary keys of a query, so translating them correctly can be necessary to maintain good retrieval performance in a cross-language information retrieval (CLIR) system. However, dictionaries include only the most commonly used proper nouns, such as major countries and capitals. Because proper nouns are spelling variants of each other in most languages, the common approach is to apply an approximate string matching technique against the target database index to find the target-language correspondents of the original query key. The n-gram technique has proved to be the most effective of the approximate string matching techniques. Because an English-Arabic CLIR system involves two languages with different alphabets, we combine transliteration with the n-gram technique to generate the different spelling variants of out-of-vocabulary (OOV) words. We call this technique Transliteration Ngram (TNG). One issue that arises with Arabic is that words that are spelled similarly can have different meanings depending on the context of the sentence. This is particularly true of proper names, which usually have a meaning when used as a verb or adjective. To further enhance our transliteration approach, we use part-of-speech (POS) disambiguation to reduce the number of unrelated words in the set of transliterations obtained using TNG.
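
The n-gram matching step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' scoring function, n-gram size, or threshold, so everything here is an assumption: character bigrams with boundary padding, the Dice coefficient as the similarity measure, a cutoff of 0.5, and the hypothetical helper names `char_ngrams`, `dice_similarity`, and `match_candidates`. The transliteration step itself is not shown; the candidate is assumed to be already written in the same script as the index terms.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with '#' at both ends."""
    padded = "#" + word + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over the character n-gram sets of two strings (assumed measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def match_candidates(candidate, index_terms, threshold=0.5, n=2):
    """Return (term, score) pairs scoring at or above the threshold, best first."""
    scored = [(term, dice_similarity(candidate, term, n)) for term in index_terms]
    hits = [(term, score) for term, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy index in Latin script, for illustration only.
matches = match_candidates("marrakech", ["marrakesh", "madrid"])
```

Under these assumptions, a spelling variant that shares most of its bigrams with the candidate is retained, while an unrelated index term falls below the cutoff. A real system would run the same comparison over Arabic-script index terms and tune both `n` and `threshold` empirically.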

Keywords

database indexing; information retrieval systems; language translation; natural language processing; query processing; string matching; vocabulary; English-Arabic CLIR system; N-gram technique; OOV words; POS; TNG; approximate string matching technique; cross language information retrieval system; original query key; out of vocabulary; part of speech disambiguation; target database index; transliteration Ngram; transliteration approach; Computer science; Databases; Degradation; Dictionaries; Indexes; Information retrieval; Speech enhancement; Vocabulary;

fLanguage

English

Publisher

IEEE

Conference_Titel

Computers and Communications, 2008. ISCC 2008. IEEE Symposium on

Conference_Location

Marrakech

ISSN

1530-1346

Print_ISBN

978-1-4244-2702-4

Electronic_ISBN

1530-1346

Type

conf

DOI

10.1109/ISCC.2008.4625724

Filename

4625724