DocumentCode :
618863
Title :
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework
Author :
Therdphapiyanak, Jakrarin ; Piromsopa, Krerk
Author_Institution :
Dept. of Comput. Eng., Chulalongkorn Univ., Bangkok, Thailand
fYear :
2013
fDate :
15-17 May 2013
Firstpage :
1
Lastpage :
6
Abstract :
In this paper, we determine the appropriate number of clusters and the proper number of entries for applying K-means clustering to a TCPdump data set using the Apache Mahout/Hadoop framework. We aim to find a suitable configuration for efficiently analyzing a large data set in a limited amount of time. Our implementation applies Hadoop to large-scale log analysis, with the data set from the KDD'99 competition as test data. With the distributed system framework, we can analyze the whole KDD'99 data set after first applying our preprocessing. In addition, we use an anomaly detection model for log analysis. A key challenge is to make anomaly detection more accurate. For the K-means algorithm, a key challenge is to set an appropriate number of initial clusters (K). Moreover, we discuss whether the number of entries in log files affects the accuracy and detection rate of the system. Our implementation and experimental results therefore indicate the appropriate number of clusters and the proper number of entries in log files. Finally, we present the results of our experiments as a graph of accuracy rate against the number of initial clusters (K), an ROC curve, and a table of detection rate and false alarm rate.
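The abstract reports a detection rate and a false alarm rate but does not define them. As a minimal sketch, assuming the definitions commonly used in intrusion detection (detection rate = TP / (TP + FN), false alarm rate = FP / (FP + TN)), the two figures could be computed as below. The function name and the sample labels are hypothetical and are not taken from the paper.

```python
def detection_and_false_alarm(actual, predicted):
    """Return (detection_rate, false_alarm_rate) for binary labels.

    actual, predicted: equal-length sequences of booleans, where True
    means "attack". These definitions are an assumption, not the
    paper's stated method.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # attacks flagged
    fn = sum(1 for a, p in pairs if a and not p)      # attacks missed
    fp = sum(1 for a, p in pairs if not a and p)      # normal traffic flagged
    tn = sum(1 for a, p in pairs if not a and not p)  # normal traffic passed
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_alarm_rate


# Hypothetical labels: 3 attack records followed by 5 normal records.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
dr, far = detection_and_false_alarm(actual, predicted)
print(f"detection rate = {dr:.3f}, false alarm rate = {far:.3f}")
```

Repeating this computation for each candidate value of K would give one row per K of a detection rate and false alarm rate table, under the assumptions stated above.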
Keywords :
data mining; distributed processing; pattern clustering; security of data; Apache Mahout/Hadoop framework; KDD'99 competition; ROC curve; TCPdump data set; anomaly detection model; detection rate; distributed system framework; false alarm rate table; k-means algorithm; k-means clustering; large-scale log analysis; suitable parameters; Accuracy; Algorithm design and analysis; Clustering algorithms; Distributed databases; Indexes; Intrusion detection; Partitioning algorithms; Distributed log analysis; Hadoop; IDS; Intrusion Detection System; K-means algorithm; KDD'99; Log analysis; Mahout; Security;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2013 10th International Conference on
Conference_Location :
Krabi
Print_ISBN :
978-1-4799-0546-1
Type :
conf
DOI :
10.1109/ECTICon.2013.6559650
Filename :
6559650