Proof of duplication detection in data by applying similarity strategies

Author

Varsha Wandhekar;Arti Mohanpurkar

Author_Institution

Dr. D.Y. Patil School of Engg & Tech, Savitribai Phule University of Pune, MH, India

fYear

2015

Firstpage

429

Lastpage

434

Abstract

De-duplication is the process of identifying all records within a data set that refer to the same real-world entity. Data gathered from various sources may have quality issues. The approach taken here is to identify duplicates using windowing and blocking strategies. The objective is to achieve better precision and good efficiency while reducing the false positive rate, all in accordance with the estimated similarities of records. Various similarity metrics are commonly used to recognize similar field entries, so the main focus of this paper is on applying an appropriate similarity measure to the appropriate data in order to identify duplicates correctly. De-duplication also provides additional information about the similarities between two entities. In today's data-centric environment, similarity measures suffer from a large number of defects, and as a result identifying duplicates has always been a challenging task. This paper focuses primarily on the exact identification of duplicates in a database by applying the concepts of windowing and blocking.
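
The combination of blocking, windowing and a similarity measure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout (`id`, `name`), the blocking key (first letter of the name), the window size, the threshold of 0.85 and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher
from itertools import groupby


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: the first letter of the name field."""
    return record["name"][:1].lower()


def find_duplicates(records, window=3, threshold=0.85):
    """Return id pairs whose names are similar, compared only within a
    block and only inside a sliding window over the sorted block."""
    pairs = []
    # Sort by blocking key first, then by name, so that similar entries
    # end up next to each other inside each block.
    ordered = sorted(records, key=lambda r: (blocking_key(r), r["name"].lower()))
    for _, block in groupby(ordered, key=blocking_key):
        block = list(block)
        for i, left in enumerate(block):
            # Windowing: compare each record with the next window - 1 records.
            for right in block[i + 1 : i + window]:
                if similarity(left["name"], right["name"]) >= threshold:
                    pairs.append((left["id"], right["id"]))
    return pairs


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathon Smith"},
    {"id": 3, "name": "Joan Smyth"},
    {"id": 4, "name": "Maria Garcia"},
    {"id": 5, "name": "Maria Garcia"},
]
duplicates = find_duplicates(records)
print(duplicates)  # → [(1, 2), (4, 5)]
```

Blocking limits comparisons to records that share a key, and the window limits them further to near neighbours in sorted order, which is where the efficiency gain comes from. The threshold governs the trade-off between missed duplicates and false positives.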

Keywords

"Databases","Standardization","Measurement","Sorting","Detection algorithms","Algorithm design and analysis","Cleaning"

Publisher

ieee

Conference_Titel

Information Processing (ICIP), 2015 International Conference on

Type

conf

DOI

10.1109/INFOP.2015.7489421

Filename

7489421