• DocumentCode
    2710568
  • Title

    MWUs Extraction Based on Continuous Measurement of Inter-word Association with Frequency Adjustment

  • Author

    Wang, Zhifei ; Chen, Yue ; Jiang, XiaoYu

  • fYear
    2010
  • fDate
    7-10 May 2010
  • Firstpage
    647
  • Lastpage
    651
  • Abstract
    Extracting Multi-Word Units (MWUs) from raw text is a significant problem in natural language processing because MWUs describe concepts more accurately than single words. Statistical methods such as Mutual Information, Log-Likelihood Ratio, and the Chi-Squared test rely heavily on word frequency, because the component words of MWUs tend to co-occur more often and the main components of multi-word phrases are the core terms of the text document. These core terms generally have a very high frequency and very strong word-building power, so their frequency is far higher than that of the other component words of MWUs, which reduces the accuracy of these methods. We propose a method to adjust the frequency of the core words. Experimental results show that the method significantly improves the recall of multi-word combinations while preserving precision.
  • Keywords
    natural language processing; statistical analysis; text analysis; word processing; MWU extraction; continuous measurement; frequency adjustment; interword association; multiword combination; multiword unit; natural language processing; statistical method; text document; word building power; Data mining; Filtering; Frequency measurement; Large-scale systems; Mutual information; Natural language processing; Natural languages; Ontologies; Statistical analysis; Testing; Association Measurement; Frequency Adjustment; MWUs Extraction; Mutual Information; Term Extraction;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Research and Development, 2010 Second International Conference on
  • Conference_Location
    Kuala Lumpur
  • Print_ISBN
    978-0-7695-4043-6
  • Type

    conf

  • DOI
    10.1109/ICCRD.2010.140
  • Filename
    5489550