• DocumentCode
    475894
  • Title
    A new collocation extraction method combining multiple association measures
  • Author
    Lin, Jian-fang; Li, Sheng; Cai, Yuhan
  • Author_Institution
    MOE-MS Key Lab. of NLP & Speech, Harbin Inst. of Technol., Harbin
  • Volume
    1
  • fYear
    2008
  • fDate
    12-15 July 2008
  • Firstpage
    12
  • Lastpage
    17
  • Abstract
    As an important linguistic resource, collocations capture significant relations between words. Automatic collocation extraction is important for many natural language processing applications, such as word sense disambiguation, machine translation, and information retrieval. Traditional collocation extraction approaches rely on a single statistical measure, which may not be optimal because they cannot take advantage of multiple statistical measures. In this paper, we propose a logistic linear regression model (LLRM) that combines five classical lexical association measures: χ²-test, t-test, co-occurrence frequency, log-likelihood ratio, and mutual information. Experiments show that our approach significantly improves both precision and recall compared with each of the individual measures.
  • Keywords
    natural language processing; regression analysis; text analysis; automatic collocation extraction; collocation extraction method; information retrieval; log-likelihood ratio; logistic linear regression model; machine translation; multiple association measures; natural language processing applications; Cybernetics; Data mining; Frequency measurement; Information retrieval; Laboratories; Linear regression; Logistics; Machine learning; Magnetic heads; Mutual information; Co-occurrence frequency; Collocation; Log-likelihood ratio; Mutual information; T-test; X2-test;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2008 International Conference on
  • Conference_Location
    Kunming
  • Print_ISBN
    978-1-4244-2095-7
  • Electronic_ISBN
    978-1-4244-2096-4
  • Type
    conf
  • DOI
    10.1109/ICMLC.2008.4620370
  • Filename
    4620370
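
The abstract names five association measures and a logistic combination but gives no formulas. The following is a minimal sketch of how such a combination could look, assuming the standard textbook definitions of each measure on a 2x2 contingency table for a word pair. The function names (`association_measures`, `llrm_score`), the weights, and the bias are hypothetical; the paper's actual feature scaling, training procedure, and learned parameters are not given in this record.

```python
import math


def association_measures(c12, c1, c2, n):
    """Compute five association measures for a word pair (w1, w2).

    c12: co-occurrence count of the pair, c1 / c2: counts of w1 / w2,
    n: total number of pairs in the corpus.
    Standard textbook definitions; not taken from the paper itself.
    """
    if not (0 < c12 <= min(c1, c2) and max(c1, c2) < n and n - c1 - c2 + c12 >= 0):
        raise ValueError("counts do not form a valid contingency table")

    # Observed and expected (under independence) cells of the 2x2 table.
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    expected = [
        c1 * c2 / n,
        c1 * (n - c2) / n,
        (n - c1) * c2 / n,
        (n - c1) * (n - c2) / n,
    ]

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Log-likelihood ratio (G^2); empty cells contribute zero.
    llr = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    # t-score with the usual approximation of the sample variance by c12.
    t = (c12 - expected[0]) / math.sqrt(c12)
    # Pointwise mutual information in bits.
    pmi = math.log2(c12 / expected[0])

    return {"frequency": c12, "chi2": chi2, "t": t, "llr": llr, "pmi": pmi}


def llrm_score(measures, weights, bias=0.0):
    """Combine the measures with a logistic function of their weighted sum.

    `weights` maps measure name to a coefficient; values here are
    placeholders, since fitting them requires labelled collocations.
    """
    z = bias + sum(weights[name] * value for name, value in measures.items())
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    m = association_measures(c12=30, c1=200, c2=150, n=100000)
    # Hypothetical weights for illustration only.
    w = {"frequency": 0.01, "chi2": 0.001, "t": 0.2, "llr": 0.01, "pmi": 0.3}
    print(m, llrm_score(m, w, bias=-3.0))
```

In practice the five measures live on very different scales, so they would normally be standardized before the weights are fitted on pairs labelled as collocation or non-collocation. That step is omitted here.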