Title :
Document and sentence alignment in comparable corpora using bipartite graph matching
Author :
Rahimi, Zahra ; Taghipour, K. ; Khadivi, Shahram ; Afhami, N.
Author_Institution :
Dept. of Comput. Eng., Amirkabir Univ. of Technol. (Tehran Polytech.), Tehran, Iran
Abstract :
Parallel corpora are an indispensable resource for statistical machine translation systems and can be obtained from parallel, comparable, or non-parallel documents. Parallel documents are the most suitable source, but because they are scarce, comparable and non-parallel documents are also used. In this paper, we formulate both document alignment and sentence alignment in comparable documents as an assignment problem on a weighted bipartite graph and seek the subgraph (the matching) with maximum total weight. The assignment problem is a combinatorial optimization problem with known mathematical solutions, and one of the best methods for solving it is the Hungarian algorithm. The advantages of the proposed method are language independence and the O(n^3) time complexity of the Hungarian algorithm. We applied this method to a bilingual Farsi-English corpus and obtained high precision and recall.
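The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a similarity score between every source item (document or sentence) and every target item has already been computed, which the abstract does not specify, and it uses a standard potentials-based variant of the Hungarian algorithm. The function name `max_weight_assignment` and the example scores are hypothetical.

```python
from typing import List, Tuple


def max_weight_assignment(
    weights: List[List[float]],
) -> Tuple[float, List[Tuple[int, int]]]:
    """Maximum-weight bipartite assignment via the Hungarian algorithm.

    weights[i][j] is an assumed similarity between source item i and
    target item j. Requires rows <= columns. Runs in O(n^2 * m) time,
    i.e. O(n^3) for a square matrix.
    """
    n = len(weights)
    m = len(weights[0]) if n else 0
    assert n <= m, "need at least as many target items as source items"

    # The algorithm below minimises cost, so negate the weights.
    cost = [[-w for w in row] for row in weights]

    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j] = row matched to column j, 0 = unmatched
    way = [0] * (m + 1)   # back-pointers along the augmenting path

    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = INF
            j1 = 0
            # Find the unused column with the smallest reduced cost.
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift potentials so the chosen edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1

    pairs = sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)
    total = sum(weights[i][j] for i, j in pairs)
    return total, pairs


# Hypothetical similarity scores: 2 source sentences x 2 target sentences.
scores = [[10, 9],
          [9, 1]]
total, pairs = max_weight_assignment(scores)
```

In this example a greedy choice would pair source 0 with target 0 (score 10) and be left with a score of 1 for the remaining pair, whereas the optimal assignment pairs source 0 with target 1 and source 1 with target 0. In practice one would likely also discard matched pairs whose weight falls below some threshold, since comparable corpora contain items with no true counterpart; that filtering is an assumption and is not shown here.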
Keywords :
document handling; graph theory; language translation; mathematical programming; natural language processing; parallel processing; pattern matching; Hungarian algorithm problem; bilingual Farsi-English corpus; bipartite graph matching; combinatorial optimization problem; comparable documents; document alignment; language independency; mathematical solutions; nonparallel documents; parallel corpora; sentence alignment; statistical machine translation systems; subgraphs; time complexity; Bipartite graph; Computational linguistics; Computational modeling; Data mining; Gaussian distribution; Mathematical model; comparable corpora; Hungarian algorithm; machine translation
Conference_Title :
Telecommunications (IST), 2012 Sixth International Symposium on
Conference_Location :
Tehran
Print_ISBN :
978-1-4673-2072-6
DOI :
10.1109/ISTEL.2012.6483098