DocumentCode
806663
Title
Identifying Language Origin of Named Entity With Multiple Information Sources
Author
You, Jia-Li ; Chen, Yi-Ning ; Chu, Min ; Soong, Frank K. ; Wang, Jin-Lin
Author_Institution
Inst. of Acoust., Chinese Acad. of Sci., Beijing
Volume
16
Issue
6
fYear
2008
Firstpage
1077
Lastpage
1086
Abstract
To identify the language origin of a named entity, morphological information associated with its spelling, such as letter N-grams, is commonly employed. With this information alone, however, named entities that have similar spellings but different language origins are difficult to differentiate. In this paper, a measure of "popularity," defined as the frequency or page count of the named entity in language-specific Web searches, is proposed for identifying its language origin. Morphological information, including letter or letter-chunk N-grams, is used in conjunction with Web-based page counts to enhance language identification performance. Six languages are tested: English, German, French, Portuguese, Chinese, and Japanese (Chinese and Japanese named entities are represented in their corresponding phonetic alphabets, i.e., Pinyin and Romaji). Experiments show that when classifying the four languages written in the Latin alphabet (English, German, French, and Portuguese), combining features from the different information sources yields a substantial improvement in classification accuracy over a letter 4-gram-based baseline system. Accuracy increases from 75.0% to 86.3%, a 45.2% relative error reduction.
Keywords
Internet; linguistics; search engines; Chinese; English; French; German; Japanese; Latin alphabets; Latin languages; Portuguese; Web-based page counts; language origin identification; language-specific Web search; letter N-grams; letter spelling; morphological information; named entity; Acoustics; Asia; Entropy; Frequency measurement; Hidden Markov models; Natural languages; Speech recognition; Speech synthesis; Testing; Web search; Language identification
fLanguage
English
Journal_Title
Audio, Speech, and Language Processing, IEEE Transactions on
Publisher
IEEE
ISSN
1558-7916
Type
jour
DOI
10.1109/TASL.2008.2001110
Filename
4566082