DocumentCode
2844354
Title
OOV words in an English-Arabic CLIR system
Author
Bellaachia, Abdelghani ; Amor-Tijani, G.
Author_Institution
Dept. of Comput. Sci., George Washington Univ., Washington, DC
fYear
2008
fDate
6-9 July 2008
Firstpage
874
Lastpage
882
Abstract
Proper nouns are usually the primary keys in a query. Translating them correctly can be necessary to maintain good retrieval performance in a cross-language information retrieval (CLIR) system. However, dictionaries include only the most commonly used proper nouns, such as major countries and capitals. Because proper nouns are spelling variants of each other in most languages, the common approach to finding the target-language counterparts of an original query key is approximate string matching against the target database index. The n-gram technique has proved to be the most effective of the approximate string matching techniques. Because we are dealing with an English-Arabic CLIR system, which involves two languages with different alphabets, we combine transliteration with the n-gram technique to generate the different spelling variants of out-of-vocabulary (OOV) words. We call this technique Transliteration Ngram (TNG). One issue that arises with Arabic is that words spelled alike can have different meanings depending on the context of the sentence. This is particularly true of proper names, which usually carry a meaning when used as a verb or adjective. To further enhance our transliteration approach, we use part-of-speech (POS) disambiguation to reduce the number of unrelated words in the set of transliterations obtained using TNG.
Keywords
database indexing; information retrieval systems; language translation; natural language processing; query processing; string matching; vocabulary; English-Arabic CLIR system; N-gram technique; OOV words; POS; TNG; approximate string matching technique; cross language information retrieval system; original query key; out of vocabulary; part of speech disambiguation; target database index; transliteration Ngram; transliteration approach; Computer science; Databases; Degradation; Dictionaries; Indexes; Information retrieval; Speech enhancement; Vocabulary;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computers and Communications, 2008. ISCC 2008. IEEE Symposium on
Conference_Location
Marrakech
ISSN
1530-1346
Print_ISBN
978-1-4244-2702-4
Electronic_ISBN
1530-1346
Type
conf
DOI
10.1109/ISCC.2008.4625724
Filename
4625724