• DocumentCode
    2664991
  • Title

    Improving Xtract for Chinese collocation extraction

  • Author

    Lu, Qin ; Li, Yin ; Xu, Ruifeng

  • Author_Institution
    Dept. of Comput., Hong Kong Polytech. Univ., China
  • fYear
    2003
  • fDate
    26-29 Oct. 2003
  • Firstpage
    333
  • Lastpage
    338
  • Abstract
    We present a system which extracts word-based bigram and n-gram collocation information from a 60 MB corpus and then locates bigram pairs using strength and spread as defined in the Xtract system. For Xtract to work effectively with Chinese, we have readjusted the parameters. To obtain a higher recall rate, we have modified the algorithm to identify collocations with a low frequency of occurrence, a method which works particularly well for bigrams in which one word is high-frequency and the other low-frequency. In preliminary experiments, our system extracts bigram collocations with a precision of 61%, an 8% improvement over the direct use of Smadja's Xtract on Chinese. Further, we have improved the recall rate by 4.5% while extracting multiword collocations with 92% precision.
  • Keywords
    computational linguistics; natural languages; statistical analysis; Chinese collocation extraction; Xtract system; bigram collocation; statistical modeling; Application software; Computer worms; Data mining; Frequency; Humans; Mutual information; Statistical analysis; Sun
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
  • Conference_Location
    Beijing, China
  • Print_ISBN
    0-7803-7902-0
  • Type

    conf

  • DOI
    10.1109/NLPKE.2003.1275925
  • Filename
    1275925