Title :
Mining a large Chinese-English corpus from web
Author :
Wang, Xuwen ; Lin, Ye ; Wang, Xiaojie ; Tan, Yongmei
Author_Institution :
Center for Intell. Sci. & Technol., Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
Bilingual parallel corpora provide a rich source of translation information for tasks such as cross-language information retrieval and data-driven machine translation. However, they are often scarce resources, limited in size, language coverage, and language register. Researchers struggle to transfer and adapt the available technologies because only a few small-scale corpora are suitable. In this paper we introduce a large Chinese-English parallel corpus constructed by crawling and processing bilingual web documents from the Internet. It currently contains about 300,000 parallel sentence pairs. The tools and methodology used in this collection project are also described. In a cross-language retrieval task, we experimented with this self-constructed corpus to improve the quality of query translation, with the aim of achieving better retrieval performance.
Keywords :
Internet; data mining; document handling; information retrieval; language translation; natural language processing; Chinese-English corpus mining; bilingual Web document crawling; bilingual Web document processing; bilingual parallel corpora; cross-language information retrieval; data-driven machine translation systems; language coverage; language register; large parallel English corpus; parallel sentence pairs; query translation quality improvement; self-constructed corpus; Dictionaries; Unemployment; cross-language; parallel corpus; web crawling; web-based data mining
Conference_Title :
Electrical & Electronics Engineering (EEESYM), 2012 IEEE Symposium on
Conference_Location :
Kuala Lumpur
Print_ISBN :
978-1-4673-2363-5
DOI :
10.1109/EEESym.2012.6258758