Title :
ASPDD: An Efficient and Scalable Framework for Duplication Detection
Author :
Latha, K. ; Rajmohan, B. ; Rajaram, R.
Author_Institution :
Comput. Sci. & Eng. Dept., Anna Univ. Tiruchirappalli, Tiruchirappalli, India
Abstract :
This paper introduces a framework for the duplicate document detection problem that applies an efficient dynamic programming algorithm, All Pairs Shortest Path, to the text collection. Our goal in this work is to investigate this phenomenon and determine the approach that minimizes the impact of duplicates on search results. We show that our approach scales with the number of documents and works well for documents from all domains. We compared our solution to the state of the art and found that our method produces promising results: in addition to improving the accuracy of exact duplicate detection, it also detects partial and neighbor replicas. The robustness of these techniques is demonstrated through a set of experiments using data reflecting real-world degradation effects.
Keywords :
Character generation; Computer science; Costs; Degradation; Educational institutions; Fingerprint recognition; Information technology; Robustness; Sorting; Wildlife; All Pairs Shortest Path; Degradation Effects; Duplicate Document Detection; Neighbor Replica; Partial Replica;
Conference_Title :
Advances in Computer Engineering (ACE), 2010 International Conference on
Conference_Location :
Bangalore, Karnataka, India
Print_ISBN :
978-1-4244-7154-6
DOI :
10.1109/ACE.2010.61