Filtering noisy parallel corpora of web pages

Author

Nie, Jian-Yun ; Cai, Jian

Author_Institution

Dept. d'Inf. et de Recherche Oper., Montreal Univ., Que., Canada

Volume

fYear

2001

fDate

2001

Firstpage

453

Abstract

In our previous study, we successfully built PTMiner, an automatic mining system for parallel texts from the Web, which is able to identify a large number of parallel Web pages for different language pairs. However, the resulting corpus contains a number of non-parallel text pairs. This paper proposes a filtering approach to clean up the corpus. Our experiments show that once the corpus is cleaned, both the translation accuracy of the resulting translation models and the effectiveness of cross-language information retrieval (CLIR) using these models are improved significantly.

Keywords

data mining; information retrieval; natural language interfaces; PTMiner; automatic mining system; cross-language information retrieval; filtering; non-parallel text pairs; parallel Web pages; Availability; Data mining; Databases; Dictionaries; Information filtering; Information filters; Information retrieval; Search engines; Terminology; Web pages

fLanguage

English

Publisher

IEEE

Conference_Titel

Systems, Man, and Cybernetics, 2001 IEEE International Conference on

Conference_Location

Tucson, AZ

ISSN

1062-922X

Print_ISBN

0-7803-7087-2

Type

conf

DOI

10.1109/ICSMC.2001.969854

Filename

969854

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=376267