Title of article
A Heuristic Method Based on a Statistical Approach for Chinese Text Segmentation
Author/Authors
Christopher C. Yang and K. W. Li (authors)
Issue Information
Monthly, with serial issue number, year 2005
Pages
10
From page
1438
To page
1447
Abstract
The authors propose a heuristic method for automatic
Chinese text segmentation based on a statistical approach.
This method is developed based on statistical
information about the association among adjacent characters
in Chinese text. Mutual information of bi-grams and
significant estimation of tri-grams are utilized. A heuristic
method with six rules is then proposed to determine the
segmentation points in a Chinese sentence. No dictionary
is required in this method. Chinese text segmentation is
important in Chinese text indexing and thus greatly
affects the performance of Chinese information retrieval.
Because Chinese text lacks word delimiters,
Chinese text segmentation is more difficult than English
text segmentation. In addition, segmentation ambiguities
and occurrences of out-of-vocabulary words (i.e.,
unknown words) are the major challenges in Chinese
segmentation. Many research studies dealing with the
problem of word segmentation have focused on the resolution
of segmentation ambiguities. The problem of
unknown word identification has not drawn much attention.
The experimental results show that the proposed
heuristic method is promising for segmenting unknown
words as well as known words. The authors further
investigated the distribution of the errors of commission
and the errors of omission caused by the proposed
heuristic method and benchmarked it against
a previously proposed technique,
boundary detection. It was found that the heuristic method
outperformed the boundary detection method.
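The mutual information of bi-grams mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common pointwise definition, log2(P(ab) / (P(a) P(b))), estimated from raw character counts; the abstract does not give the authors' exact formula. The function name `bigram_mutual_information` is hypothetical, and neither the tri-gram estimation nor the six heuristic rules are reproduced here.

```python
import math
from collections import Counter


def bigram_mutual_information(text):
    """Return {bigram: score} for adjacent character pairs in `text`.

    Assumption: pointwise mutual information log2(P(ab) / (P(a) * P(b))),
    with probabilities estimated from raw counts. This is an illustrative
    stand-in, not the formula from the paper.
    """
    # Drop whitespace so only the characters themselves are counted.
    chars = [c for c in text if not c.isspace()]
    unigrams = Counter(chars)
    bigrams = Counter(zip(chars, chars[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[a + b] = math.log2(p_ab / (p_a * p_b))
    return scores
```

Under this assumed definition, a high score suggests that two adjacent characters tend to occur together and probably belong to the same word, while a low score marks a candidate segmentation point. Because the scores come from corpus counts alone, no dictionary is involved.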
Journal title
Journal of the American Society for Information Science and Technology
Serial Year
2005
Record number
844037