• DocumentCode
    2247809
  • Title

    Study on Wikipedia for translation mining for CLIR

  • Author

    Yao, Jian-min ; Sun, Chang-long ; Hong, Yu ; Ge, Yun-dong ; Zhu, Qiao-min

  • Author_Institution
    Key Lab. of Comput. Inf. Process., Suzhou Univ., Suzhou, China
  • Volume
    6
  • fYear
    2010
  • fDate
    11-14 July 2010
  • Firstpage
    3374
  • Lastpage
    3379
  • Abstract
    The translation of out-of-vocabulary (OOV) query terms is one of the key factors affecting the performance of Cross-Language Information Retrieval (CLIR). Based on the data structure and language features of Wikipedia, the paper divides translation environments into target-existence and target-deficit environments. To overcome the difficulty of translation mining in the target-deficit environment, frequency change information and adjacency information are used to extract candidate units, and a mixed translation mining strategy is established based on a frequency-distance model, a surface pattern matching model and a summary-score model. Search-engine-based OOV translation mining is taken as the baseline, and performance is tested on TOP1 results. The mixed translation mining method based on Wikipedia achieves a precision of 0.6279, an improvement of 6.98% over the baseline.
  • Keywords
    computational linguistics; data mining; information retrieval; search engines; CLIR; OOV; Wikipedia data structure; cross-language information retrieval; frequency adjacency information; frequency change information; frequency-distance model; language features; mixed translation mining; out of vocabulary; search engine; summary-score model; surface pattern matching model; Data mining; Electronic publishing; Encyclopedias; Equations; Internet; Mathematical model
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics (ICMLC), 2010 International Conference on
  • Conference_Location
    Qingdao
  • Print_ISBN
    978-1-4244-6526-2
  • Type

    conf

  • DOI
    10.1109/ICMLC.2010.5580683
  • Filename
    5580683