Detecting the Same Text in Different Languages

Author

Koroutchev, Kostadin ; Cebrián, Manuel

Author_Institution

Depto. de Ingeniería Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain. k.koroutchev@uam.es

fYear

2006

fDate

Oct. 2006

Firstpage

337

Lastpage

341

Abstract

Compression-based similarity distances have the main drawback of requiring the same coding scheme for the objects being compared. When a text is translated, the original and the translation are significantly similar even though they share no literal coincidences. In this article, we present an algorithm that compares the redundancy structure of the data, extracted by means of a Lempel-Ziv compression scheme. Each text is represented as a graph, and two texts are considered similar under our measure if they have the same referential topology when compressed. We give empirical evidence that this measure detects similarity between data coded in different languages.
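
The idea of comparing referential topology rather than literal content can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say which Lempel-Ziv variant is used or how the graphs are compared, so the sketch assumes an LZ78-style parse, and the names `lz78_parents` and `topology_similarity` as well as the position-wise matching score are hypothetical choices made only for illustration. Each phrase of the parse points back to the earlier phrase it extends, and that list of back-references is the "graph" being compared.

```python
def lz78_parents(text):
    """Parse text LZ78-style and return, for each phrase, the index of the
    earlier dictionary phrase it refers to (0 denotes the empty phrase)."""
    dictionary = {"": 0}   # phrase -> index
    parents = []           # one back-reference per emitted phrase
    current = ""
    for symbol in text:
        candidate = current + symbol
        if candidate in dictionary:
            # Keep extending while the phrase is already known.
            current = candidate
        else:
            # New phrase: it extends the known phrase `current`.
            parents.append(dictionary[current])
            dictionary[candidate] = len(dictionary)
            current = ""
    if current:
        # Leftover input that is itself a known phrase.
        parents.append(dictionary[current])
    return parents


def topology_similarity(text_a, text_b):
    """Hypothetical score in [0, 1]: the fraction of phrase positions whose
    back-references agree, relative to the longer parse."""
    a, b = lz78_parents(text_a), lz78_parents(text_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


# Toy "different coding": a letter-substitution cipher changes every symbol
# but leaves the pattern of repetitions, and hence the parse, untouched.
original = "abracadabra abracadabra"
recoded = original.translate(str.maketrans("abrcd ", "qwezk_"))
print(lz78_parents(original) == lz78_parents(recoded))
print(topology_similarity(original, recoded))
```

A substitution cipher is a far easier case than a real translation, where word order and word lengths change; the sketch only shows why a structure-based comparison can succeed where a shared-dictionary compression distance finds no literal overlap.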

Keywords

Compression algorithms; Computer science education; Conferences; Data mining; Entropy; Humans; Information theory; Length measurement; Object detection; Topology

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Theory Workshop, 2006. ITW '06 Chengdu. IEEE

Conference_Location

Chengdu, China

Print_ISBN

1-4244-0067-8

Electronic_ISBN

1-4244-0068-6

Type

conf

DOI

10.1109/ITW2.2006.323816

Filename

4119314

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=1830907