DocumentCode :
2008821
Title :
Comparative Analysis of the Impact of Discretization on the Classification with Naïve Bayes and Semi-Naïve Bayes Classifiers
Author :
Mizianty, Marcin ; Kurgan, Lukasz ; Ogiela, Marek
Author_Institution :
Fac. of Phys. & Appl. Comput. Sci., AGH Univ. of Sci. & Technol., Krakow, Poland
fYear :
2008
fDate :
11-13 Dec. 2008
Firstpage :
823
Lastpage :
828
Abstract :
While data can be discrete or continuous (the latter defined as ordinal numerical features), some classifiers, such as Naive Bayes (NB), work only with, or may perform better with, discrete data. We focus on NB due to its popularity and linear training time. We investigate the impact of eight discretization algorithms (Equal Width, Equal Frequency, Maximum Entropy, IEM, CADD, CAIM, MODL, and CACC) on classification with NB and two modern semi-NB classifiers, LBR and AODE. Our comprehensive empirical study indicates that the unsupervised discretization algorithms are the fastest, while among the supervised algorithms the fastest is Maximum Entropy, followed by CAIM and IEM. The CAIM and MODL discretizers generate the lowest and the highest numbers of discrete values, respectively. We compare the time to build the classification model and the classification accuracy when using raw and discretized data. We show that discretization improves classification with NB when compared with flexible NB, which models continuous features using Gaussian kernels. The AODE classifier obtains the best accuracy on average, while the best-performing setup combines discretization with IEM and classification with AODE. The runner-up setups include CAIM and CACC coupled with AODE, and CAIM and IEM coupled with LBR. IEM and CAIM are shown to provide statistically significant improvements across all considered datasets for the LBR and AODE classifiers when compared with using NB on the continuous data. We also show that the improved accuracy comes at the cost of substantially increased runtime.
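To illustrate the simplest of the eight algorithms compared in the abstract, the sketch below implements unsupervised Equal Width discretization. The function name and the toy data are illustrative assumptions, not taken from the study; the method itself is just the standard equal-width binning the paper evaluates.

```python
def equal_width_bins(values, k):
    """Split the range of `values` into k equal-width intervals and
    return the bin index (0..k-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    # Clamp to k-1 so the maximum value falls in the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

# Hypothetical continuous feature, discretized into 3 intervals.
feature = [0.1, 0.4, 0.35, 0.8, 0.95, 0.55]
print(equal_width_bins(feature, 3))
```

Because bin edges depend only on the feature's range, not on class labels, equal-width binning runs in a single pass, which is why the abstract reports the unsupervised discretizers as the fastest.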
Keywords :
Bayes methods; Gaussian processes; entropy; knowledge based systems; optimisation; pattern classification; unsupervised learning; Gaussian kernel; aggregating one-dependence estimator; class-attribute contingency coefficient; class-attribute dependency discretization algorithm; class-attribute interdependence maximization algorithm; discretized data classification model; information entropy maximization algorithm; lazy Bayes rule based classifier; linear training time; semi Naive Bayes classifier; unsupervised discretization algorithm; Application software; Classification tree analysis; Computer science; Decision trees; Entropy; Frequency; Machine learning; Performance analysis; Physics; CACC; CADD; CAIM; Discretization; Equal Frequency; Equal Width; IEM; MODL; Maximum Entropy; accuracy; AODE; classification; continuous features; LBR; naive Bayes; runtime; supervised discretization; unsupervised discretization;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Seventh International Conference on Machine Learning and Applications (ICMLA '08), 2008
Conference_Location :
San Diego, CA
Print_ISBN :
978-0-7695-3495-4
Type :
conf
DOI :
10.1109/ICMLA.2008.29
Filename :
4725074