Title :
A New Normalized Similarity for Discriminating Similar Documents
Author :
Ji, Jeong-Hoon ; Ryu, Chang-Keon ; Woo, Gyun ; Cho, Hwan-Gue
Author_Institution :
Dept. of Comput. Eng., Pusan Nat. Univ., Pusan
Abstract :
Finding similar document pairs in a set of documents requires a normalized similarity measure, because document sizes vary from one document to the next. However, the normalized similarities proposed so far remain sensitive to the size of the programs being compared. As a result, most existing similarity detection tools have difficulty determining a cutoff threshold that separates similar documents from the rest of the set. In this paper, we propose a new normalized similarity based on the Weibull distribution. To test its effectiveness, we applied it to detecting similar program pairs in a set of programs. In the experiment, the new similarity measure performed well in discriminating the very similar program pairs from the other pairs. The proposed normalized similarity is also effective in detecting similar documents written in natural languages.
Keywords :
Weibull distribution; document handling; natural language processing; natural languages; normalization similarities; normalized similarities; plagiarism detection; similar document discrimination; similarity detection tools; Automatic programming; Biology computing; Clustering algorithms; Computer networks; Electronic mail; Information management; Plagiarism; Sequences; ICPC; Programming Contest; Weibull
Conference_Title :
Networked Computing and Advanced Information Management, 2008. NCM '08. Fourth International Conference on
Conference_Location :
Gyeongju
Print_ISBN :
978-0-7695-3322-3
DOI :
10.1109/NCM.2008.189