Conference record number :
3540
Article title :
Extracting Parallel Fragments from Comparable Documents Using a Feature-Based Method
Author/Authors :
Z. Rahimi, Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran; M.H. Samani, Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran; S. Khadivi, Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran
Keywords :
Comparable Corpora, Parallel Fragments, Machine Translation
Conference title :
International Symposium on Artificial Intelligence and Signal Processing
Abstract (English) :
A novel method for extracting parallel sub-sentential fragments from comparable corpora
is presented. The proposed method aims to extract bilingual sentence fragments from noisy sentence
pairs. We define a similarity measure between bilingual sentence fragments as a
linear combination of several new features, including fragment length, LLR score, alignment
path specifications in the block, and translation coverage fraction. This method enables us to extract
useful machine translation training data from comparable corpora that contain no parallel sentence
pairs. Evaluations indicate that the proposed method is efficient and that it not only outperforms
existing similar systems in precision and recall but also helps improve the performance
of a statistical machine translation system.
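
The similarity measure described in the abstract is a linear combination of fragment features. The sketch below only illustrates that general idea and is not the authors' implementation: the class name FragmentFeatures, the function similarity, the way each feature is reduced to a single number, and all weights and feature values are hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class FragmentFeatures:
    """Hypothetical numeric features for one candidate fragment pair."""
    length: float          # fragment length, e.g. in tokens
    llr_score: float       # LLR association score of the fragment pair
    alignment_path: float  # assumed scalar summary of the alignment path in the block
    coverage: float        # fraction of words covered by a translation


def similarity(features: FragmentFeatures, weights: tuple) -> float:
    """Return the weighted sum of the feature values (a linear combination)."""
    values = (
        features.length,
        features.llr_score,
        features.alignment_path,
        features.coverage,
    )
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical weights and feature values, for illustration only.
weights = (0.5, 1.0, 2.0, 4.0)
candidate = FragmentFeatures(length=4.0, llr_score=2.0, alignment_path=0.5, coverage=0.75)
score = similarity(candidate, weights)
print(score)
```

In a setup like this, a candidate fragment pair would presumably be kept when its score exceeds some threshold; how the weights and any threshold are chosen is not stated in the abstract.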