Title :
Proof of duplication detection in data by applying similarity strategies
Author :
Varsha Wandhekar;Arti Mohanpurkar
Author_Institution :
Dr. D.Y. Patil School of Engg & Tech, Savitribai Phule University of Pune, MH, India
Abstract :
De-duplication is the process of identifying all records within a data set that represent the same real-world entity. Data gathered from various sources may have quality issues. The approach here is to identify duplicates by using windowing and blocking strategies. Various similarity metrics are commonly used to recognize similar field entries, so the main focus of this paper is on applying an appropriate similarity measure to the appropriate data in order to identify duplicates correctly. De-duplication also provides additional information about the similarities between two entities. In today's data-centric environment, similarity measures have a large number of defects, and as a result identifying duplicates has always been a challenging task. In this paper the primary focus is on exact identification of duplicates in a database by applying the concepts of windowing and blocking. The objective is to achieve better precision and good efficiency, and to reduce the false positive rate, all in accordance with the estimated similarities of records.
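To make the terms in the abstract concrete, the following is a minimal sketch of how blocking and windowing can be combined with a similarity threshold. It is not the authors' implementation: the function names, the blocking key (first character), the window size, the threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for any field metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records, block_key, sort_key, window=3, threshold=0.85):
    """Return sorted index pairs (i, j), i < j, judged to be duplicates."""
    # Blocking: only records that share a block key are ever compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    pairs = set()
    for members in blocks.values():
        # Windowing: sort the block, then compare each record only with
        # the next (window - 1) records in sorted order.
        members.sort(key=lambda i: sort_key(records[i]))
        for pos, i in enumerate(members):
            for j in members[pos + 1 : pos + window]:
                if similarity(records[i], records[j]) >= threshold:
                    pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data, not taken from the paper.
records = [
    "john smith",
    "jon smith",
    "mary jones",
    "john smith",
    "mark johnson",
]
dupes = find_duplicates(
    records,
    block_key=lambda r: r[0],  # assumed blocking key: first character
    sort_key=lambda r: r,      # assumed sorting key: the whole string
)
print(dupes)
```

Blocking limits comparisons to records with the same key, and the window limits them further to near neighbours in sorted order, which is where the efficiency gain comes from. The threshold governs the trade-off between missed duplicates and false positives.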
Keywords :
"Databases","Standardization","Measurement","Sorting","Detection algorithms","Algorithm design and analysis","Cleaning"
Conference_Titel :
Information Processing (ICIP), 2015 International Conference on
DOI :
10.1109/INFOP.2015.7489421