• DocumentCode
    243704
  • Title
    Context-Aware Duplicate Detection in Semi-structured Data Streams
  • Author
    Shukla, Pitamber ; Somani, Arun K.
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Iowa State Univ., Ames, IA, USA
  • fYear
    2014
  • fDate
    June 27 2014-July 2 2014
  • Firstpage
    216
  • Lastpage
    223
  • Abstract
    The state of the art in duplicate detection in semi-structured data obtains significant improvement by exploiting schema-related knowledge. Such schema-bound duplicate detection approaches, however, have severe limitations when dealing with multi-sourced, heterogeneous, high-velocity data streams. In this paper, we propose a novel context-aware duplicate detection system which is workload- and complexity-aware, and is adaptable to the underlying computing platform. The system operates in a schema-oblivious manner, and relies upon an information-theory-based heuristic and a data shaping technique for efficient and scalable duplicate detection in multi-sourced, heterogeneous data sets. Experiments with real-world data sets show a speedup of up to 8X over state-of-the-art schemes, while maintaining up to 92 percent accuracy. In addition, our data shaping technique for GPGPU processing speeds up the duplicate detection throughput by up to two orders of magnitude.
  • Keywords
    graphics processing units; ubiquitous computing; GPU processing; context-aware duplicate detection system; data shaping; heterogeneous data sets; high velocity data streams; information theory based heuristic; scalable duplicate detection; schema-bound duplicate detection; schema-related knowledge; semistructured data streams; Computer architecture; Context; Data integration; Data models; Encoding; Time complexity; XML; GPUs; data shaping; data streams; duplicate detection; novel architectures; semi-structured data;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Services (SERVICES), 2014 IEEE World Congress on
  • Conference_Location
    Anchorage, AK
  • Print_ISBN
    978-1-4799-5068-3
  • Type
    conf
  • DOI
    10.1109/SERVICES.2014.46
  • Filename
    6903268