DocumentCode
2852992
Title
Mining a large Chinese-English corpus from web
Author
Wang, Xuwen ; Lin, Ye ; Wang, Xiaojie ; Tan, Yongmei
Author_Institution
Center for Intell. Sci. & Technol., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear
2012
fDate
24-27 June 2012
Firstpage
711
Lastpage
716
Abstract
Bilingual parallel corpora provide a rich source of translation information for tasks such as cross-language information retrieval and data-driven machine translation. However, they are often scarce resources, limited in size, language coverage and language register. Researchers have to struggle to transfer and adapt the available technologies because only a few small-scale corpora are suitable. In this paper we introduce a large Chinese-English parallel corpus constructed by crawling and processing bilingual web documents from the Internet. It currently contains about 300,000 parallel sentence pairs. The tools and methodology used in this collection project are also described. In a cross-language retrieval task, we experimented with this self-constructed corpus to improve the quality of query translation, in order to achieve better retrieval performance.
Keywords
Internet; data mining; document handling; information retrieval; language translation; natural language processing; Chinese-English corpus mining; Internet; bilingual Web document crawling; bilingual Web document processing; bilingual parallel corpora; cross-language information retrieval; data-driven machine translation systems; language coverage; language register; large parallel English corpus; parallel sentence pairs; query translation quality improvement; self-constructed corpus; Dictionaries; Unemployment; cross-language; parallel corpus; web crawling; web-based data mining;
fLanguage
English
Publisher
IEEE
Conference_Titel
Electrical & Electronics Engineering (EEESYM), 2012 IEEE Symposium on
Conference_Location
Kuala Lumpur
Print_ISBN
978-1-4673-2363-5
Type
conf
DOI
10.1109/EEESym.2012.6258758
Filename
6258758