• DocumentCode
    2481818
  • Title

    Semi-supervised Chinese compound word extraction based on HMM

  • Author

    He, Hui ; Chen, Bo ; Guo, Jun

  • Author_Institution
    Sch. of Inf. Eng., Beijing Univ. of Posts & Telecommun., Beijing
  • fYear
    2008
  • fDate
    25-27 June 2008
  • Firstpage
    2077
  • Lastpage
    2081
  • Abstract
    Compound words play an important role in natural languages, and extracting them automatically is very helpful in information retrieval, information extraction, and text classification. In this paper we introduce a semi-supervised approach to Chinese compound extraction based on a hidden Markov model (HMM) with bootstrapping. First, we define a tag set BEMI {beginning, end, middle, independence} that marks the position of each word within a compound. We then use the HMM to extract compounds automatically with the BEMI tagging algorithm. The compounds extracted from the corpus are ranked by word frequency and length in descending order, and the top N are added to the seed compound list. By bootstrapping, the algorithm learns more Chinese compounds from the corpus. Experimental results show that this approach performs much better than an unsupervised one. Unlike compounds extracted by traditional methods, these Chinese compounds carry category information, which can be used as features in text classification and clustering. The approach can also be applied to keyword recommendation systems in advertising for different kinds of advertisers because of its extensibility and versatility.
  • Keywords
    hidden Markov models; natural languages; text analysis; BEMI tagging algorithm; HMM; bootstrapping; information extraction; information retrieval; keyword recommendation system; natural language; semi-supervised Chinese compound word extraction; text classification; Automation; Clustering algorithms; Data mining; Helium; Intelligent control; Tagging; Text categorization; Compound word extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Control and Automation, 2008. WCICA 2008. 7th World Congress on
  • Conference_Location
    Chongqing
  • Print_ISBN
    978-1-4244-2113-8
  • Electronic_ISBN
    978-1-4244-2114-5
  • Type

    conf

  • DOI
    10.1109/WCICA.2008.4593244
  • Filename
    4593244
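
The BEMI tagging and seed-ranking steps described in the abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it labels words by matching against a seed list instead of training an HMM, it assumes the text is already segmented into words, and it assumes the ranking sorts by frequency first and compound length second. The function names `bemi_tags`, `compounds_from_tags`, and `top_n` are hypothetical.

```python
from collections import Counter


def bemi_tags(words, seeds):
    """Label each word B, M, E (inside a seed compound) or I (independent).

    `seeds` is a set of tuples of words (assumed representation).
    """
    tags = ["I"] * len(words)
    i = 0
    while i < len(words):
        # Prefer the longest seed compound that starts at position i.
        match = 0
        for seed in seeds:
            n = len(seed)
            if n >= 2 and n > match and tuple(words[i:i + n]) == seed:
                match = n
        if match:
            tags[i] = "B"
            for j in range(i + 1, i + match - 1):
                tags[j] = "M"
            tags[i + match - 1] = "E"
            i += match
        else:
            i += 1
    return tags


def compounds_from_tags(words, tags):
    """Collect every B (M)* E span as a compound (tuple of words)."""
    found, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i
        elif tag == "E" and start is not None:
            found.append(tuple(words[start:i + 1]))
            start = None
        elif tag == "I":
            start = None
    return found


def top_n(compounds, n):
    """Rank by frequency, then length, both descending (assumed ordering)."""
    counts = Counter(compounds)
    ranked = sorted(counts, key=lambda c: (counts[c], len(c)), reverse=True)
    return ranked[:n]
```

In a bootstrapping loop, the tag sequences produced from the current seed list would serve as training data for the HMM, the HMM would tag the corpus, and the output of `top_n` would be merged back into the seed list before the next round.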