Title of article :
Using statistical and contextual information to identify two- and three-character words in Chinese text
Author/Authors :
Christopher S.G. Khoo, Yubin Dai, Teck Ee Loh
Issue Information :
Monthly, serial issue, 2002
Pages :
13
From page :
365
To page :
377
Abstract :
New statistical formulas were developed for identifying two- and three-character words in Chinese text. The formulas were constructed by performing stepwise logistic regression using a sample of sentences that had been manually segmented. For identifying two-character words, the relative frequency of the adjacent characters and the document frequency of the overlapping bigrams were found to be significant factors. These provide information about the immediate neighborhood or context of the character string. Contextual information was also found to be significant in predicting three-character words. Local information (the number of times the bigram or trigram occurs in the document being segmented) and the position of the bigram/trigram in the sentence were not found to be useful in identifying words. The new formulas, called contextual information formulas, were found to be substantially better than the mutual information formula in identifying two- and three-character words. Using the contextual information formulas for both two- and three-character words gave significantly better results than using the formula for two-character words alone. The method can also be used for identifying multiword terms in English text.
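For orientation, the mutual information baseline that the abstract compares against can be sketched as follows. This is a minimal illustration that assumes the common pointwise definition, log2(P(ab) / (P(a) P(b))), estimated from raw counts. It is not the authors' contextual information formula, which this record does not give. The function name `bigram_mutual_information` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bigram_mutual_information(corpus):
    """Score each adjacent character pair by log2(P(ab) / (P(a) * P(b))).

    `corpus` is an iterable of unsegmented sentences (strings).
    Probabilities are plain relative frequencies; no smoothing is applied
    (an assumption made for this sketch).
    """
    char_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        # Overlapping bigrams: positions 0-1, 1-2, 2-3, ...
        bigram_counts.update(
            sentence[i:i + 2] for i in range(len(sentence) - 1)
        )

    total_chars = sum(char_counts.values())
    total_bigrams = sum(bigram_counts.values())

    scores = {}
    for bigram, count in bigram_counts.items():
        p_ab = count / total_bigrams
        p_a = char_counts[bigram[0]] / total_chars
        p_b = char_counts[bigram[1]] / total_chars
        scores[bigram] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy usage: a higher score suggests the two characters co-occur more often
# than chance, so the pair is a stronger candidate for a two-character word.
toy_corpus = ["abab", "cd"]
toy_scores = bigram_mutual_information(toy_corpus)
```

In a segmenter built on this baseline, bigrams scoring above some cut-off would be treated as candidate words. The cut-off would have to be tuned on manually segmented text and is not specified here.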
Journal title :
Journal of the American Society for Information Science and Technology
Serial Year :
2002
Record number :
993221