Regional Information Center for Science and Technology - Detecting unique column combinations on dynamic data

DocumentCode :

140912

Title :

Detecting unique column combinations on dynamic data

Author :

Abedjan, Ziawasch ; Quiane-Ruiz, Jorge-Arnulfo ; Naumann, Felix

Author_Institution :

Hasso Plattner Inst. (HPI), Potsdam, Germany

fYear :

2014

fDate :

March 31, 2014 - April 4, 2014

Firstpage :

1036

Lastpage :

1047

Abstract :

The discovery of all unique (and non-unique) column combinations in an unknown dataset is at the core of any data profiling effort. Unique column combinations resemble candidate keys of a relational dataset. Several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches are suitable for applications on dynamic datasets, such as transactional databases, social networks, and scientific applications. In these cases, data profiling techniques should be able to efficiently discover new uniques and non-uniques (and validate old ones) after tuple inserts or deletes, without re-profiling the entire dataset. We present the first approach to efficiently discover unique and non-unique constraints on dynamic datasets that is independent of the initial dataset size. In particular, Swan makes use of intelligently chosen indices to minimize access to old data. We perform an exhaustive analysis of Swan and compare it with two state-of-the-art techniques for unique discovery: Gordian and Ducc. The results show that Swan significantly outperforms both, as well as their incremental adaptations. For inserts, Swan is more than 63x faster than Gordian and up to 50x faster than Ducc. For deletes, Swan is more than 15x faster than Gordian and up to 1 order of magnitude faster than Ducc. In fact, Swan even improves on the static case by dividing the dataset into a static part and a set of inserts.
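As a rough illustration of the terms used in the abstract, the sketch below shows what a unique column combination is and how a hash index lets a single combination be re-validated after an insert or delete without rescanning old rows. It is not Swan, Gordian, or Ducc: the names `minimal_uniques` and `UniqueIndex` are hypothetical, the enumeration is plain brute force, and the index is only one simple way to avoid re-profiling.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if no two rows agree on every column in `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uniques(rows, n_columns):
    """Brute-force search for minimal unique column combinations.

    Exponential in the number of columns; for illustration only.
    """
    found = []
    for size in range(1, n_columns + 1):
        for combo in combinations(range(n_columns), size):
            # A superset of a known unique is unique but not minimal.
            if any(set(u) <= set(combo) for u in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


class UniqueIndex:
    """Hypothetical hash index over one column combination.

    Keeps a count per projected value so that inserts and deletes are
    checked in constant time instead of rescanning the old data.
    """

    def __init__(self, rows, columns):
        self.columns = columns
        self.counts = {}
        self.duplicated_keys = 0  # number of keys that occur more than once
        for row in rows:
            self.insert(row)

    def _key(self, row):
        return tuple(row[c] for c in self.columns)

    def insert(self, row):
        key = self._key(row)
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.counts[key] == 2:
            self.duplicated_keys += 1

    def delete(self, row):
        key = self._key(row)
        if self.counts[key] == 2:
            self.duplicated_keys -= 1
        self.counts[key] -= 1
        if self.counts[key] == 0:
            del self.counts[key]

    def is_unique(self):
        return self.duplicated_keys == 0


rows = [("a", 1, "x"), ("a", 2, "y"), ("b", 1, "y")]
print(minimal_uniques(rows, 3))

index = UniqueIndex(rows, (0, 1))
index.insert(("a", 1, "z"))   # duplicates the projection ("a", 1)
print(index.is_unique())
index.delete(("a", 1, "z"))   # removing the tuple restores uniqueness
print(index.is_unique())
```

In this toy table no single column is unique, but every pair of columns is. The insert turns the pair (0, 1) into a non-unique and the delete turns it back into a unique, which is the kind of change the abstract says a dynamic profiler has to track.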

Keywords :

data mining; Ducc technique; Gordian technique; Swan technique; data profiling techniques; dataset discovery; dynamic dataset; relational dataset; scientific applications; social networks; static dataset; transactional databases; unique column combinations detection; Data models; Data structures; Heuristic algorithms; Indexing; Proteins; Query processing

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Data Engineering (ICDE), 2014 IEEE 30th International Conference on

Conference_Location :

Chicago, IL

Type :

conf

DOI :

10.1109/ICDE.2014.6816721

Filename :

6816721

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=140912