• DocumentCode
    2348960
  • Title

    A pragmatic model for new Chinese word extraction

  • Author

    Zhang, Haijun ; Huang, Heyan ; Zhu, Chaoyong ; Shi, Shumin

  • Author_Institution
    Sch. of Comput. Sci. & Technol., Xinjiang Normal Univ., Urumqi, China
  • fYear
    2010
  • fDate
    21-23 Aug. 2010
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    This paper proposes a pragmatic model for repeat-based Chinese New Word Extraction (NWE). It makes two contributions. The first is a formal description of the NWE process, which gives theoretical guidance on feature selection. On this basis, the Conditional Random Fields (CRF) model is selected as the statistical framework for solving the formal description. The second is an improved algorithm for left (right) entropy that increases the efficiency of NWE. Compared with the baseline algorithm, the improved algorithm markedly speeds up entropy computation. Overall, experiments show that the proposed model is effective: the F score is 49.72% in the open test and 69.83% in word extraction, an evident improvement over previous similar work.
  • Keywords
    entropy; natural language processing; statistical analysis; conditional random fields model; pragmatic model; repeat-based Chinese new word extraction; statistical framework; Educational institutions; New words extraction; computational efficiency; formal description; left (right) entropy; repeat
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-6896-6
  • Type

    conf

  • DOI
    10.1109/NLPKE.2010.5587846
  • Filename
    5587846
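
The left (right) entropy named in the abstract is commonly understood as the entropy of the distribution of characters that appear immediately before (after) a candidate string; high values on both sides suggest the string behaves as a free-standing word. The sketch below is a plain baseline computation of that quantity, not the improved algorithm the paper describes, whose details this record does not give. The function name `branching_entropy`, the boundary sentinels `<s>` and `</s>`, and the use of base-2 logarithms are assumptions made for illustration.

```python
import math
from collections import Counter


def branching_entropy(corpus, candidate):
    """Return (left_entropy, right_entropy) of `candidate` in `corpus`.

    Hypothetical helper: a naive scan over the text, shown only to make
    the definition concrete.
    """
    left, right = Counter(), Counter()
    start = corpus.find(candidate)
    while start != -1:
        end = start + len(candidate)
        # Sentinels (an assumption) stand in for neighbours at text boundaries.
        left[corpus[start - 1] if start > 0 else "<s>"] += 1
        right[corpus[end] if end < len(corpus) else "</s>"] += 1
        # Advance by one character so overlapping occurrences are counted.
        start = corpus.find(candidate, start + 1)

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0  # candidate never occurs
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    return entropy(left), entropy(right)
```

For instance, `branching_entropy("我爱微博他玩微博你刷微博", "微博")` sees three different characters on each side of the candidate, so both entropies equal log2(3). A scan like this costs one pass over the corpus per candidate, which is the kind of cost an improved entropy algorithm would aim to reduce.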