DocumentCode :
1736386
Title :
Detecting the Same Text in Different Languages
Author :
Koroutchev, Kostadin ; Cebrian, M.
Author_Institution :
Depto. de Ingeniería Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain. k.koroutchev@uam.es
fYear :
2006
Firstpage :
337
Lastpage :
341
Abstract :
Compression-based similarity distances have the main drawback of needing the same coding scheme for the objects to be compared. When two texts are translated, there exists significant similarity with no literal coincidence. In this article, we present an algorithm that compares the redundancy structure of the data extracted by means of a Lempel-Ziv compression scheme. Each text is represented as a graph, and two texts are considered similar with our measure if they have the same referential topology when compressed. We give empirical evidence that this measure detects similarity between data coded in different languages.
Keywords :
Compression algorithms; Computer science education; Data mining; Entropy; H infinity control; Humans; Length measurement; Testing; Tin; Topology;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Theory Workshop, 2006. ITW '06 Punta del Este. IEEE
Conference_Location :
Punta del Este, Uruguay
Print_ISBN :
1-4244-0035-X
Electronic_ISBN :
1-4244-0036-8
Type :
conf
DOI :
10.1109/ITW.2006.322834
Filename :
4117489