Title :
Semi-supervised Chinese compound word extraction based on HMM
Author :
He, Hui ; Chen, Bo ; Guo, Jun
Author_Institution :
Sch. of Inf. Eng., Beijing Univ. of Posts & Telecommun., Beijing
Abstract :
In natural languages, compound words play an important role, and their automatic extraction is very helpful in information retrieval, information extraction, and text classification. In this paper we introduce a semi-supervised approach to Chinese compound extraction based on a hidden Markov model (HMM) with bootstrapping. First, we define a tag set BEMI {beginning, end, middle, independence} that denotes the position of a word within a compound. We then use an HMM in the BEMI tagging algorithm to extract compounds automatically. The compounds extracted from the corpus are ranked by word frequency and length in descending order, and the top N are added to the seed compound list. Through bootstrapping, the algorithm learns more Chinese compounds from the corpus. Experimental results show that this approach achieves much higher performance than an unsupervised one. Unlike compounds extracted by traditional methods, these Chinese compounds carry category information, which can be used as features in text classification and clustering. Because of its extensibility and versatility, the approach can also be applied to keyword recommendation systems that serve different kinds of advertisers.
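The tagging, ranking, and bootstrapping steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The HMM itself is omitted and replaced by a hypothetical `tagger` callable, and all function names are assumptions made for illustration. The abstract does not say how frequency and length are combined, so the sketch assumes frequency is the primary key and length the tie-breaker. The English tokens stand in for segmented Chinese words.

```python
from collections import Counter


def bemi_tags(words, seed_compounds):
    """Tag each word B, M, E (inside a seed compound) or I (independent).

    Stand-in for labelling by seed lookup; the paper uses an HMM instead.
    """
    tags = ["I"] * len(words)
    # Try longer seeds first so they win over shorter overlapping ones.
    for comp in sorted(seed_compounds, key=len, reverse=True):
        n = len(comp)
        if n < 2:
            continue  # a compound needs at least a beginning and an end
        for i in range(len(words) - n + 1):
            window_free = all(t == "I" for t in tags[i:i + n])
            if window_free and tuple(words[i:i + n]) == tuple(comp):
                tags[i] = "B"
                tags[i + n - 1] = "E"
                for j in range(i + 1, i + n - 1):
                    tags[j] = "M"
    return tags


def compounds_from_tags(words, tags):
    """Collect every B (M)* E span as a compound (tuple of words)."""
    found, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i
        elif tag == "E":
            if start is not None:
                found.append(tuple(words[start:i + 1]))
            start = None
        elif tag == "I":
            start = None  # an independent word breaks any open span
    return found


def top_n_compounds(candidates, n):
    """Rank by frequency, then length, both descending (assumed order)."""
    counts = Counter(candidates)
    ranked = sorted(counts, key=lambda c: (counts[c], len(c)), reverse=True)
    return ranked[:n]


def bootstrap(sentences, seeds, tagger, n, rounds):
    """Grow the seed list: tag, extract, rank, add the top n, repeat.

    `tagger(words, seeds)` is a hypothetical hook where a trained HMM
    would produce the BEMI tag sequence.
    """
    seeds = set(seeds)
    for _ in range(rounds):
        candidates = []
        for words in sentences:
            candidates.extend(compounds_from_tags(words, tagger(words, seeds)))
        seeds.update(top_n_compounds(candidates, n))
    return seeds


words = ["information", "retrieval", "system", "is", "useful"]
seeds = [("information", "retrieval", "system")]
tags = bemi_tags(words, seeds)
print(tags)
print(compounds_from_tags(words, tags))
```

Passing `bemi_tags` as the `tagger` only recovers compounds already in the seed list. New compounds appear only when the tagger generalizes beyond the seeds, which is the role the HMM plays in the approach described above.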
Keywords :
hidden Markov models; natural languages; text analysis; BEMI tagging algorithm; HMM; bootstrapping; information extraction; information retrieval; keyword recommendation system; semi-supervised Chinese compound word extraction; text classification; Automation; Clustering algorithms; Data mining; Helium; Intelligent control; Tagging; Text categorization; Compound word extraction
Conference_Title :
Intelligent Control and Automation, 2008. WCICA 2008. 7th World Congress on
Conference_Location :
Chongqing
Print_ISBN :
978-1-4244-2113-8
Electronic_ISBN :
978-1-4244-2114-5
DOI :
10.1109/WCICA.2008.4593244