DocumentCode
2481818
Title
Semi-supervised Chinese compound word extraction based on HMM
Author
He, Hui ; Chen, Bo ; Guo, Jun
Author_Institution
Sch. of Inf. Eng., Beijing Univ. of Posts & Telecommun., Beijing
fYear
2008
fDate
25-27 June 2008
Firstpage
2077
Lastpage
2081
Abstract
Compound words play an important role in natural languages, and extracting them automatically is very helpful in information retrieval, information extraction and text classification. In this paper we introduce a semi-supervised approach to Chinese compound extraction based on a hidden Markov model (HMM) with bootstrapping. First, we define a tag set BEMI {beginning, end, middle, independence} that marks the position of a word in a compound. We then employ an HMM to extract compounds automatically with the BEMI tagging algorithm. The compounds extracted from the corpus are ranked by word frequency and length in descending order, and the top N compounds are added to the seed compound list. By bootstrapping, the algorithm learns more Chinese compounds from the corpus. Experimental results show that this approach achieves much higher performance than an unsupervised one. Unlike compounds extracted by traditional methods, these Chinese compounds carry category information, which can be used as features in text classification and clustering. Because of its extensibility and versatility, the approach can also be applied in keyword recommendation systems that serve different kinds of advertisers.
Keywords
hidden Markov models; natural languages; text analysis; BEMI tagging algorithm; HMM; bootstrapping; information extraction; information retrieval; keyword recommendation system; natural language; semi-supervised Chinese compound word extraction; text classification; Automation; Clustering algorithms; Data mining; Helium; Intelligent control; Tagging; Text categorization; Compound word extraction
fLanguage
English
Publisher
IEEE
Conference_Titel
Intelligent Control and Automation, 2008. WCICA 2008. 7th World Congress on
Conference_Location
Chongqing
Print_ISBN
978-1-4244-2113-8
Electronic_ISBN
978-1-4244-2114-5
Type
conf
DOI
10.1109/WCICA.2008.4593244
Filename
4593244