• DocumentCode
    2852992
  • Title
    Mining a large Chinese-English corpus from web
  • Author
    Wang, Xuwen ; Lin, Ye ; Wang, Xiaojie ; Tan, Yongmei
  • Author_Institution
    Center for Intell. Sci. & Technol., Beijing Univ. of Posts & Telecommun., Beijing, China
  • fYear
    2012
  • fDate
    24-27 June 2012
  • Firstpage
    711
  • Lastpage
    716
  • Abstract
    Bilingual parallel corpora provide a rich source of translation information for tasks such as cross-language information retrieval and data-driven machine translation. However, they are often scarce resources, limited in size, language coverage and language register. Researchers have to struggle to transfer and adapt the available technologies because only a few small-scale corpora are suitable. In this paper we introduce a large Chinese-English parallel corpus constructed by crawling and processing bilingual web documents from the Internet. It currently contains about 300,000 parallel sentence pairs. The tools and methodology used in this collection project are also described. In a cross-language retrieval task, we experimented with this self-constructed corpus to improve the quality of query translation and thereby achieve better retrieval performance.
  • Keywords
    Internet; data mining; document handling; information retrieval; language translation; natural language processing; Chinese-English corpus mining; bilingual Web document crawling; bilingual Web document processing; bilingual parallel corpora; cross-language information retrieval; data-driven machine translation systems; language coverage; language register; large parallel English corpus; parallel sentence pairs; query translation quality improvement; self-constructed corpus; Dictionaries; Unemployment; cross-language; parallel corpus; web crawling; web-based data mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electrical & Electronics Engineering (EEESYM), 2012 IEEE Symposium on
  • Conference_Location
    Kuala Lumpur
  • Print_ISBN
    978-1-4673-2363-5
  • Type
    conf
  • DOI
    10.1109/EEESym.2012.6258758
  • Filename
    6258758