Title of article

Discovering Chinese Words from Unsegmented Text

Author/Authors

Ge, Xianping; Pratt, Wanda; Smyth, Padhraic

Issue Information

Periodical with serial issue number, year 1999

Pages

-270

From page

271

To page

Abstract

In English written text, words are separated by spaces, but in written Chinese text, there are no such separators between words. (See Figure 1.) Thus, effective information retrieval of Chinese text first requires good word segmentation. In this paper, we investigate an efficient algorithm to discover the words and their occurrence probabilities from a corpus of unsegmented text without using a dictionary. Using the probabilities of the words, word segmentation is done according to the maximum likelihood principle. Comparing the segmentation output by the algorithm with the correct segmentation, recall/precision of 65.65%/71.91% is achieved. If some simple post-processing is performed, recall/precision can be boosted up to 97.72%/91.05%.
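The maximum likelihood segmentation step mentioned in the abstract can be illustrated with a short dynamic-programming sketch. This is not the authors' implementation: it assumes a unigram model in which the word probabilities have already been estimated, and the names `segment`, `word_prob`, and `max_len` are hypothetical, as is the toy vocabulary in the usage note. The step that discovers the words and their probabilities from the corpus is not shown.

```python
import math


def segment(text, word_prob, max_len=4):
    """Return the most probable split of `text` into known words, or None.

    Assumes a unigram model: the probability of a segmentation is the
    product of the probabilities of its words (an illustrative assumption).
    """
    n = len(text)
    # best[i] = (log-probability of the best split of text[:i],
    #            start index of the last word in that split)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_prob.get(text[start:end])
            if p is None or best[start][0] == -math.inf:
                continue  # unknown word, or prefix cannot be segmented
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no segmentation uses only known words
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]
```

With a toy vocabulary `{"ab": 0.4, "a": 0.2, "b": 0.2, "c": 0.2}`, calling `segment("abc", ...)` returns `["ab", "c"]`, because 0.4 × 0.2 exceeds 0.2 × 0.2 × 0.2. Working in log-probabilities avoids numeric underflow on long inputs.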

Keywords

comparing interfaces for information access, field/empirical studies of the information seeking process, Speech indexing and retrieval, User studies

Journal title

SIGIR FORUM

Serial Year

1999

Record number

16704

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=16704