Regional Information Center for Science and Technology - Using statistical and contextual information to identify two- and three-character words in Chinese text

Abstract:

New statistical formulas were developed for identifying two- and three-character words in Chinese text. The formulas were constructed by performing stepwise logistic regression using a sample of sentences that had been manually segmented. For identifying two-character words, the relative frequency of the adjacent characters and the document frequency of the overlapping bigrams were found to be significant factors. These provide information about the immediate neighborhood or context of the character string. Contextual information was also found to be significant in predicting three-character words. Local information (the number of times the bigram or trigram occurs in the document being segmented) and the position of the bigram/trigram in the sentence were not found to be useful in identifying words. The new formulas, called contextual information formulas, were found to be substantially better than the mutual information formula in identifying two- and three-character words. Using the contextual information formulas for both two- and three-character words gave significantly better results than using the formula for two-character words alone. The method can also be used for identifying multiword terms in English text.