Title :
Adaptive Naive Bayesian Classifier for Automatic Classification of Webpage from Massive Network Data
Author :
Xu LinBin ; Liu Jun ; Zhou WenLi ; Yan Qing
Author_Institution :
Beijing Key Lab. of Network Syst. Archit. & Convergence, Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
This paper applies a Naïve Bayesian classifier to the automatic classification of webpages. Its key contributions are that the massive empirical data set is derived from real traffic collected on the backbone network of a province in China, and that cumulative probability is used to determine the optimal size of the feature vector adaptively. The adaptive cumulative-probability threshold selection method used in this study is shown to be robust. The paper focuses on four feature selection methods: TF-IDF (term frequency-inverse document frequency), IG (information gain), MOR (multi-class odds ratio), and CDM (class discriminating measure). We find that the Naïve Bayesian classifier performs well in both speed and precision on big data sets: its precision, recall, and F1 score are all above 90% in all 6 webpage categories.
Keywords :
Bayes methods; Big Data; Web sites; document handling; feature selection; pattern classification; Big Data sets; CDM; China; IG; MOR; TF-IDF; Webpage automatic classification; adaptive naive Bayesian classifier; backbone network; class discriminating measure; cumulative probability threshold selection; feature selection methods; feature vector; information gain; massive empirical data; massive network data; multiclass odds ratio; term frequency-inverse document frequency; Games; Market research; Measurement; Robustness; Training; Vectors; adaptive threshold selection; naïve bayes; webpage classification
Conference_Title :
Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2014 Sixth International Conference on
Conference_Location :
Hangzhou
Print_ISBN :
978-1-4799-4956-4
DOI :
10.1109/IHMSC.2014.39