• DocumentCode
    376267
  • Title
    Filtering noisy parallel corpora of web pages
  • Author
    Nie, Jian-Yun; Cai, Jian
  • Author_Institution
    Dept. d'Inf. et de Recherche Oper., Montreal Univ., Que., Canada
  • Volume
    1
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    453
  • Abstract
    In our previous study, we successfully built PTMiner, an automatic mining system for parallel texts from the Web that is able to determine a large number of parallel Web pages for different language pairs. However, there are a number of non-parallel text pairs in this corpus. This paper proposes a filtering approach to clean up the corpus. Our experiments show that once the corpus is cleaned, both the translation accuracy of the resulting translation models and the effectiveness of cross-language information retrieval (CLIR) using these models are improved significantly.
  • Keywords
    data mining; information retrieval; natural language interfaces; PTMiner; automatic mining system; cross-language information retrieval; filtering; non-parallel text pairs; parallel Web pages; Availability; Data mining; Databases; Dictionaries; Information filtering; Information filters; Information retrieval; Search engines; Terminology; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Man, and Cybernetics, 2001 IEEE International Conference on
  • Conference_Location
    Tucson, AZ
  • ISSN
    1062-922X
  • Print_ISBN
    0-7803-7087-2
  • Type
    conf
  • DOI
    10.1109/ICSMC.2001.969854
  • Filename
    969854