DocumentCode
2188864
Title
Modelling Parallel Texts for Boosting Compression
Author
Adiego, Joaquín ; Martinez-Prieto, M.A. ; Hoyos-Torio, Javier E ; Sanchez-Martinez, Felipe
Author_Institution
Dept. de Inf., Univ. de Valladolid, Valladolid, Spain
fYear
2010
fDate
24-26 March 2010
Firstpage
517
Lastpage
517
Abstract
Bilingual parallel corpora, also known as bitexts, convey the same information in two different languages. A model of a bitext can therefore exploit the translation relationship between the two texts, which the text alignment task makes it possible to establish. A biword is defined as a pair of words, one from each text, that are mutual translations in the bitext; using biwords allows both texts in the bitext to be represented in a single model. Several biword-based schemes have been proposed, leading to good compression ratios. Following Melamed's statement that "the translation of a text into another language can be viewed as a detailed annotation of what that text means", we propose a new model for bitexts, dubbed MAR, that is in agreement with it. The idea is to represent the words in the right text with respect to the preceding word in the left text; thus, a first-order model based on alignment relationships is proposed.
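The idea of coding right-text words in the context of their aligned left-text words can be illustrated with a small sketch. This is not the authors' MAR implementation: the function names, the toy word alignment, and the use of a static model (whose own storage cost is ignored) are all assumptions made for illustration. It only compares the ideal code length of the right text under an order-0 model with its code length when each right word is conditioned on its aligned left word.

```python
import math
from collections import Counter, defaultdict


def order0_cost_bits(words):
    """Ideal code length (bits) of `words` under a static order-0 model."""
    counts = Counter(words)
    total = len(words)
    return sum(-math.log2(counts[w] / total) for w in words)


def aligned_cost_bits(aligned_pairs):
    """Ideal code length (bits) of the right words, each conditioned on the
    left word it is aligned with (hypothetical static first-order model)."""
    # context_counts[left][right] = how often `right` translates `left`
    context_counts = defaultdict(Counter)
    for left, right in aligned_pairs:
        context_counts[left][right] += 1

    bits = 0.0
    for left, right in aligned_pairs:
        counts = context_counts[left]
        probability = counts[right] / sum(counts.values())
        bits += -math.log2(probability)
    return bits


# Toy alignment (assumed for illustration): (left word, right word) pairs.
pairs = [("the", "la"), ("house", "casa"), ("the", "el"), ("house", "casa")]
right_words = [right for _, right in pairs]

print(order0_cost_bits(right_words))  # right text coded on its own
print(aligned_cost_bits(pairs))       # right text coded given the left text
```

In this toy example the conditioned model needs fewer bits because "house" always aligns with "casa", so that word costs nothing once the left word is known. A real compressor would also have to account for the cost of the model, the left text, and unaligned words.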
Keywords
data compression; text analysis; bilingual parallel corpora; bitext; biword-based scheme; boosting compression; parallel text; text alignment task; Boosting; Data compression; Dictionaries; Information retrieval; Bitext Compression; Compression Boosting; PPM;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Compression Conference (DCC), 2010
Conference_Location
Snowbird, UT
ISSN
1068-0314
Print_ISBN
978-1-4244-6425-8
Electronic_ISBN
1068-0314
Type
conf
DOI
10.1109/DCC.2010.86
Filename
5453473