• DocumentCode
    140753
  • Title
    Scalable distance-based outlier detection over high-volume data streams
  • Author
    Lei Cao ; Di Yang ; Qingyang Wang ; Yanwei Yu ; Jiayuan Wang ; Rundensteiner, E.A.
  • Author_Institution
    Worcester Polytech. Inst., Worcester, MA, USA
  • fYear
    2014
  • fDate
    March 31 2014-April 4 2014
  • Firstpage
    76
  • Lastpage
    87
  • Abstract
    The discovery of distance-based outliers from huge volumes of streaming data is critical for modern applications ranging from credit card fraud detection to moving object monitoring. In this work, we propose the first general framework to handle the three major classes of distance-based outliers in streaming environments, including the traditional distance-threshold-based and the nearest-neighbor-based definitions. Our LEAP framework encompasses two general optimization principles applicable across all three outlier types. First, our “minimal probing” principle uses a lightweight probing operation to gather minimal yet sufficient evidence for outlier detection. This principle overturns the state-of-the-art methodology, which requires routinely conducting expensive complete neighborhood searches to identify outliers. Second, our “lifespan-aware prioritization” principle leverages the temporal relationships among stream data points to prioritize the processing order among them during the probing process. Guided by these two principles, we design an outlier detection strategy that is proven to be optimal in the CPU cost needed to determine the outlier status of any data point during its entire life. Our comprehensive experimental studies, using both synthetic and real streaming data, demonstrate that our methods are 3 orders of magnitude faster than state-of-the-art methods across a rich diversity of tested scenarios and scale to high-dimensional streaming data.
  • Keywords
    data handling; pattern recognition; CPU costs; LEAP framework; credit card fraud detection; distance-based outliers; distance-threshold; general framework; general optimization principles; high dimensional streaming data; high-volume data streams; lifespan-aware prioritization principle; lightweight probing operation; minimal probing principle; moving object monitoring; nearest-neighbor-based definitions; neighborhood searches; probing process; scalable distance-based outlier detection; stream data points; streaming environments; Monitoring; Optimization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2014 IEEE 30th International Conference on
  • Conference_Location
    Chicago, IL
  • Type
    conf
  • DOI
    10.1109/ICDE.2014.6816641
  • Filename
    6816641
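
The two principles named in the abstract can be illustrated with a minimal sketch. This is not the LEAP algorithm from the paper; it is a simplified illustration that assumes the common distance-threshold definition (a point is an outlier if fewer than k points in the current window lie within distance r of it). The function name `minimal_probe`, the `(timestamp, coords)` tuple layout, and the parameters `r` and `k` are hypothetical choices made for this example. The probe stops as soon as k neighbors are found ("minimal probing") and scans the newest points first, since they expire last and so remain valid evidence longest ("lifespan-aware prioritization").

```python
import math


def minimal_probe(query, window, r, k):
    """Probe for just enough neighbors to decide the status of `query`.

    Hypothetical helper for illustration only, not the paper's method.
    query  : (timestamp, coords) tuple
    window : list of (timestamp, coords) tuples currently in the window
    r      : distance threshold (assumed distance-threshold definition)
    k      : number of neighbors required to be an inlier
    Returns (is_inlier, evidence_timestamps, points_examined).
    """
    q_ts, q_xy = query
    evidence = []
    examined = 0
    # Lifespan-aware order: newest points first, because they stay in the
    # window longest and therefore support the query for the longest time.
    for ts, xy in sorted(window, key=lambda p: p[0], reverse=True):
        if ts == q_ts:
            continue  # skip the query point itself
        examined += 1
        if math.dist(q_xy, xy) <= r:
            evidence.append(ts)
            # Minimal probing: stop once k neighbors are found instead of
            # completing the full neighborhood search.
            if len(evidence) == k:
                break
    return len(evidence) >= k, evidence, examined


if __name__ == "__main__":
    window = [
        (1, (0.0, 0.0)),
        (2, (0.1, 0.0)),
        (3, (5.0, 5.0)),
        (4, (0.0, 0.1)),
        (5, (0.05, 0.05)),
        (6, (9.0, 9.0)),
    ]
    print(minimal_probe(window[0], window, r=0.5, k=2))
    print(minimal_probe(window[5], window, r=0.5, k=2))
```

In this sketch the first query is settled after examining three points rather than all five candidates, while the second query has no neighbors within r and so the probe must scan the whole window before declaring it an outlier.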