• DocumentCode
    2188864
  • Title

    Modelling Parallel Texts for Boosting Compression

  • Author

    Adiego, Joaquín ; Martinez-Prieto, M.A. ; Hoyos-Torio, Javier E ; Sanchez-Martinez, Felipe

  • Author_Institution
    Dept. de Inf., Univ. de Valladolid, Valladolid, Spain
  • fYear
    2010
  • fDate
    24-26 March 2010
  • Firstpage
    517
  • Lastpage
    517
  • Abstract
    Bilingual parallel corpora, also known as bitexts, convey the same information in two different languages. This implies that to model a bitext we can take advantage of the translation relationship that exists between the two texts; the text alignment task makes it possible to establish such a translation relationship. A biword is defined as a pair of words, each from a different text, that are mutual translations in the bitext; the use of biwords allows both texts in the bitext to be represented in a single model. Several biword-based schemes have been proposed, leading to good compression ratios. Bearing in mind Melamed's affirmation, which states that "the translation of a text into another language can be viewed as a detailed annotation of what that text means", we propose a new model for bitexts in agreement with this affirmation, dubbed MAR. The idea is to represent the words in the right text with respect to the preceding word in the left text; thus, a first-order model based on alignment relationships is proposed.
  • Keywords
    data compression; text analysis; bilingual parallel corpora; bitext; biword-based scheme; boosting compression; parallel text; text alignment task; Boosting; Data compression; Dictionaries; Information retrieval; Bitext Compression; Compression Boosting; PPM
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference (DCC), 2010
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    978-1-4244-6425-8
  • Electronic_ISBN
    1068-0314
  • Type

    conf

  • DOI
    10.1109/DCC.2010.86
  • Filename
    5453473