Title :
Spark-based anomaly detection over multi-source VMware performance data in real-time
Author :
Solaimani, Mohiuddin ; Iftekhar, Mohammed ; Khan, Latifur ; Thuraisingham, Bhavani ; Ingram, Joey Burton
Author_Institution :
Dept. of Comput. Sci., Univ. of Texas at Dallas, Richardson, TX, USA
Abstract :
Anomaly detection refers to identifying patterns in data that deviate from expected behavior. These non-conforming patterns are often termed outliers, malware, anomalies, or exceptions in different application domains. This paper presents a novel, generic, real-time distributed anomaly detection framework for multi-source stream data. As a case study, we detect anomalies in a multi-source VMware-based cloud data center. The framework continuously monitors VMware performance stream data (e.g., CPU load and memory usage), collecting these data simultaneously from all the VMware machines connected to the network. When it identifies abnormal behavior in the collected data, it notifies the resource manager to reschedule its resources dynamically. We use Apache Spark, a distributed framework, to process the performance stream data and make predictions in real time. We chose Spark over traditional distributed frameworks (e.g., Hadoop MapReduce, Mahout), which are not ideal for stream data processing. We implemented a flat incremental clustering algorithm in our distributed Spark-based framework to model benign characteristics. We compared the average per-tuple processing latency during clustering and prediction in Spark with that in Storm, another distributed framework for stream data processing. Experimentally, we find that Spark processes a tuple much faster than Storm on average.
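The abstract does not spell out the flat incremental clustering algorithm, so the following is only a minimal single-machine sketch of the general idea, not the authors' implementation and not Spark code. It assumes Euclidean distance, a fixed distance threshold, running-mean centroid updates, and tuples of normalized (CPU load, memory usage) values; the class name `IncrementalClusterModel` and its methods are hypothetical.

```python
import math


class IncrementalClusterModel:
    """Hypothetical sketch of flat incremental clustering over benign data."""

    def __init__(self, threshold):
        # Assumed parameter: maximum distance for a point to join a cluster.
        self.threshold = threshold
        self.centroids = []  # one mutable feature list per cluster
        self.counts = []     # number of points absorbed by each cluster

    def _nearest(self, point):
        """Return (index, distance) of the closest centroid, or (None, inf)."""
        best_idx, best_dist = None, math.inf
        for i, centroid in enumerate(self.centroids):
            d = math.dist(point, centroid)
            if d < best_dist:
                best_idx, best_dist = i, d
        return best_idx, best_dist

    def learn(self, point):
        """Absorb one benign tuple: update the nearest cluster or open a new one."""
        idx, dist = self._nearest(point)
        if idx is None or dist > self.threshold:
            self.centroids.append(list(point))
            self.counts.append(1)
            return
        # Running-mean update, so past tuples never need to be stored.
        self.counts[idx] += 1
        n = self.counts[idx]
        centroid = self.centroids[idx]
        for j, x in enumerate(point):
            centroid[j] += (x - centroid[j]) / n

    def is_anomaly(self, point):
        """Flag a tuple that lies farther than the threshold from every cluster."""
        _, dist = self._nearest(point)
        return dist > self.threshold


# Illustrative values only: (CPU load, memory usage), each scaled to [0, 1].
model = IncrementalClusterModel(threshold=0.2)
for sample in [(0.30, 0.40), (0.32, 0.42), (0.28, 0.38), (0.80, 0.75), (0.82, 0.77)]:
    model.learn(sample)

print(len(model.centroids))
print(model.is_anomaly((0.31, 0.41)))
print(model.is_anomaly((0.05, 0.95)))
```

Because each tuple is processed once and only centroids and counts are kept, a model of this shape suits stream data. In a distributed setting the same update would run per micro-batch or per partition, which this sketch does not attempt.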
Keywords :
cloud computing; data handling; pattern clustering; resource allocation; security of data; system monitoring; virtual machines; Apache Spark; CPU load; Spark-based anomaly detection; Storm; VMware performance stream data monitoring; abnormal behavior identification; application domain; data collection; data pattern identification; distributed framework; dynamic resource rescheduling; exceptions; flat incremental clustering algorithm; generic real-time distributed anomaly detection framework; malwares; memory usage; multisource VMware performance data; multisource VMware-based cloud data center; multisource stream data; nonconforming patterns; outliers; performance stream data processing; processing latency; resource manager; Clustering algorithms; Data models; Dynamic scheduling; Predictive models; Real-time systems; Sparks; Training; Anomaly detection; Data center; Incremental clustering; Real-time anomaly detection; Resource scheduling;
Conference_Title :
Computational Intelligence in Cyber Security (CICS), 2014 IEEE Symposium on
Conference_Location :
Orlando, FL
DOI :
10.1109/CICYBS.2014.7013369