DocumentCode
243704
Title
Context-Aware Duplicate Detection in Semi-structured Data Streams
Author
Shukla, Pitamber ; Somani, Arun K.
Author_Institution
Dept. of Electr. & Comput. Eng., Iowa State Univ., Ames, IA, USA
fYear
2014
fDate
June 27 2014-July 2 2014
Firstpage
216
Lastpage
223
Abstract
State-of-the-art duplicate detection in semi-structured data obtains significant improvements by exploiting schema-related knowledge. Such schema-bound duplicate detection approaches, however, have severe limitations when dealing with multi-sourced, heterogeneous, high-velocity data streams. In this paper, we propose a novel context-aware duplicate detection system that is workload- and complexity-aware and adaptable to the underlying computing platform. The system operates in a schema-oblivious manner and relies on an information-theory-based heuristic and a data shaping technique for efficient and scalable duplicate detection in multi-sourced, heterogeneous data sets. Experiments with real-world data sets show a speedup of up to 8X over state-of-the-art schemes while maintaining up to 92 percent accuracy. In addition, our data shaping technique for GPGPU processing increases duplicate detection throughput by up to two orders of magnitude.
Keywords
graphics processing units; ubiquitous computing; GPU processing; context-aware duplicate detection system; data shaping; heterogeneous data sets; high velocity data streams; information theory based heuristic; scalable duplicate detection; schema-bound duplicate detection; schema-related knowledge; semistructured data streams; Computer architecture; Context; Data integration; Data models; Encoding; Time complexity; XML; GPUs; data shaping; data streams; duplicate detection; novel architectures; semi-structured data;
fLanguage
English
Publisher
IEEE
Conference_Titel
Services (SERVICES), 2014 IEEE World Congress on
Conference_Location
Anchorage, AK
Print_ISBN
978-1-4799-5068-3
Type
conf
DOI
10.1109/SERVICES.2014.46
Filename
6903268