DocumentCode
3060424
Title
Web-based parallel corpora for statistical machine translation
Author
Li, Bo ; Liu, Juan ; Shi, Wenjuan
Author_Institution
Wuhan Univ., Wuhan
fYear
2007
fDate
13-15 Dec. 2007
Firstpage
444
Lastpage
449
Abstract
Statistical machine translation is a state-of-the-art technique based on sentence-level aligned parallel corpora. Progress in this technique is constrained by the lack of publicly available parallel corpora. The rapid growth of the World Wide Web offers a fair chance of constructing large-scale parallel corpora more easily. In this paper, we summarize the current strategies for fetching parallel corpora from the Web and classify them into three classes: structure-based, content-based, and hybrid. We compare these approaches and present some ideas that may be useful for improving the performance of the algorithms. In the discussion section, we raise some problems that should be considered in future research.
Keywords
Internet; language translation; statistical analysis; Web-based parallel corpora; World Wide Web; content-based strategy; statistical machine translation; structure-based strategy; Application software; Computer science; Feeds; Law; Machine learning; Surface-mount technology; Uniform resource locators; Web pages; Web sites
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Applications, 2007. ICMLA 2007. Sixth International Conference on
Conference_Location
Cincinnati, OH
Print_ISBN
978-0-7695-3069-7
Type
conf
DOI
10.1109/ICMLA.2007.24
Filename
4457270