Mining a large Chinese-English corpus from web

Author

Wang, Xuwen ; Lin, Ye ; Wang, Xiaojie ; Tan, Yongmei

Author_Institution

Center for Intell. Sci. & Technol., Beijing Univ. of Posts & Telecommun., Beijing, China

fYear

2012

fDate

24-27 June 2012

Firstpage

711

Lastpage

716

Abstract

Bilingual parallel corpora provide a rich source of translation information for tasks such as cross-language information retrieval and data-driven machine translation. However, they are often scarce resources, limited in size, language coverage, and language register. Researchers struggle to transfer and adapt the available technologies because only a few small-scale corpora are suitable. In this paper we introduce a large Chinese-English parallel corpus constructed by crawling and processing bilingual web documents from the Internet. It currently contains about 300,000 parallel sentence pairs. The tools and methodology used in this collection project are also described. We experimented with this self-constructed corpus in a cross-language retrieval task to improve the quality of query translation, with the aim of achieving better retrieval performance.

Keywords

Internet; data mining; document handling; information retrieval; language translation; natural language processing; Chinese-English corpus mining; bilingual Web document crawling; bilingual Web document processing; bilingual parallel corpora; cross-language information retrieval; data-driven machine translation systems; language coverage; language register; large parallel English corpus; parallel sentence pairs; query translation quality improvement; self-constructed corpus; Dictionaries; Unemployment; cross-language; parallel corpus; web crawling; web-based data mining

fLanguage

English

Publisher

IEEE

Conference_Titel

Electrical & Electronics Engineering (EEESYM), 2012 IEEE Symposium on

Conference_Location

Kuala Lumpur

Print_ISBN

978-1-4673-2363-5

Type

conf

DOI

10.1109/EEESym.2012.6258758

Filename

6258758