A Practical and Effective Sampling Selection Strategy for Large Scale Deduplication

Author

Dal Bianco, Guilherme ; Galante, Renata ; Gonçalves, Marcos André ; Canuto, Sergio ; Heuser, Carlos A.

Author_Institution

SENAI Coll. of Technol., Av. Assis Brasil, Assis, Brazil

Volume

27

Issue

9

fYear

2015

fDate

Sept. 1 2015

Firstpage

2305

Lastpage

2319

Abstract

The data deduplication task has attracted a considerable amount of attention from the research community in order to provide effective and efficient solutions. The information provided by the user to tune the deduplication process is usually represented by a set of manually labeled pairs. In very large datasets, producing this kind of labeled set is a daunting task since it requires an expert to select and label a large number of informative pairs. In this article, we propose a two-stage sampling selection strategy (T3S) that selects a reduced set of pairs to tune the deduplication process in large datasets. T3S selects the most representative pairs by following two stages. In the first stage, we propose a strategy to produce balanced subsets of candidate pairs for labeling. In the second stage, an active selection is incrementally invoked to remove the redundant pairs in the subsets created in the first stage in order to produce an even smaller and more informative training set. This training set is effectively used both to identify where the most ambiguous pairs lie and to configure the classification approaches. Our evaluation shows that T3S is able to reduce the labeling effort substantially while achieving a competitive or superior matching quality when compared with state-of-the-art deduplication methods in large datasets.
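
The two stages described in the abstract can be illustrated with a minimal sketch. It assumes each candidate pair already carries a similarity score in [0, 1]. The fixed-width similarity levels, the per-level sample size, and the tolerance-based redundancy rule are illustrative assumptions, not the procedure defined in the article; in particular, the redundancy rule is a simple stand-in for the article's active selection, and all function and parameter names are hypothetical.

```python
import random
from collections import defaultdict


def balanced_subsets(pairs, n_levels=10, per_level=3, seed=0):
    """Stage 1 (sketch): bucket pairs by similarity level, then draw the
    same number of pairs from every level so no level dominates."""
    rng = random.Random(seed)
    levels = defaultdict(list)
    for pair_id, sim in pairs:
        # Assumed binning: fixed-width levels over [0, 1].
        level = min(int(sim * n_levels), n_levels - 1)
        levels[level].append((pair_id, sim))
    return {
        level: rng.sample(members, min(per_level, len(members)))
        for level, members in sorted(levels.items())
    }


def drop_redundant(subsets, tol=0.02):
    """Stage 2 (sketch): keep a pair only if no already-kept pair has a
    similarity within `tol` of it. This is a stand-in redundancy rule."""
    kept = []
    for level in sorted(subsets):
        for pair_id, sim in sorted(subsets[level], key=lambda p: p[1]):
            if all(abs(sim - kept_sim) > tol for _, kept_sim in kept):
                kept.append((pair_id, sim))
    return kept


# Hypothetical input: 100 candidate pairs with evenly spread similarities.
pairs = [(i, i / 100) for i in range(100)]
subsets = balanced_subsets(pairs)
training_candidates = drop_redundant(subsets)
```

Only the pairs left in `training_candidates` would be sent to the expert for labeling, which is where the reduction in labeling effort would come from under these assumptions.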

Keywords

data analysis; pattern classification; sampling methods; T3S; classification approach; data deduplication task; deduplication process; informative training set; large scale deduplication; research community; sampling selection strategy; two-stage sampling selection strategy; very large datasets; Communities; Electronic mail; Indexes; Labeling; Learning systems; Training; Vectors; Deduplication; signature-based deduplication

fLanguage

English

Journal_Title

Knowledge and Data Engineering, IEEE Transactions on

Publisher

IEEE

ISSN

1041-4347

Type

jour

DOI

10.1109/TKDE.2015.2416734

Filename

7070725
