DocumentCode :
1125998
Title :
A discretization algorithm based on a heterogeneity criterion
Author :
Liu, Xiaoyan ; Wang, Huaiqing
Author_Institution :
Dept. of Inf. Syst., City Univ. of Hong Kong, Kowloon, China
Volume :
17
Issue :
9
fYear :
2005
Firstpage :
1166
Lastpage :
1173
Abstract :
Discretization, as a preprocessing step for data mining, is the process of converting the continuous attributes of a data set into discrete ones so that machine learning algorithms can treat them as nominal features. The various discretization methods that use entropy-based criteria form a large class of algorithms. However, as a measure of class homogeneity, entropy cannot always accurately reflect the degree of class homogeneity of an interval. Therefore, in this paper, we propose a new measure of the class heterogeneity of intervals from the viewpoint of class probability itself. Based on the definition of heterogeneity, we present a new criterion to evaluate a discretization scheme and analyze its properties theoretically. Also, a heuristic method is proposed to find an approximately optimal discretization scheme. Finally, our method is compared, in terms of predictive error rate and tree size, with Ent-MDLC, a representative entropy-based discretization method well known for its good performance. Our method is shown to produce better results than Ent-MDLC, although the improvement is not significant. It can be a good alternative to entropy-based discretization methods.
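As a rough illustration of the background the abstract refers to, the sketch below maps a continuous attribute to interval indices and scores a set of cut points by the size-weighted class entropy of the resulting intervals. It shows only the generic entropy-based idea; it is neither the heterogeneity criterion proposed in the paper nor Ent-MDLC, neither of which is specified in this record. The function names, the sample data, and the convention that a value equal to a cut point falls in the lower interval are all assumptions made for the example.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of the class labels inside one interval."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def discretize(values, cut_points):
    """Map each continuous value to the index of the interval it falls in.

    Assumption: a value equal to a cut point belongs to the lower interval.
    """
    cuts = sorted(cut_points)
    return [sum(v > c for c in cuts) for v in values]


def weighted_entropy(values, labels, cut_points):
    """Size-weighted average class entropy over all intervals of a scheme."""
    groups = {}
    for interval, label in zip(discretize(values, cut_points), labels):
        groups.setdefault(interval, []).append(label)
    n = len(labels)
    return sum(len(g) / n * class_entropy(g) for g in groups.values())


# Hypothetical toy data: one continuous attribute and a two-class label.
values = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
labels = ["a", "a", "a", "b", "b", "b"]

print(discretize(values, [4.0]))
print(weighted_entropy(values, labels, [4.0]))  # one cut between the classes
print(weighted_entropy(values, labels, []))     # no cut: a single interval
```

A lower weighted entropy means the intervals are closer to class-pure, which is the quantity entropy-based discretizers typically try to reduce when choosing cut points.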
Keywords :
data analysis; data mining; heuristic programming; learning (artificial intelligence); probability; very large databases; Ent-MDLC; class homogeneity; class probability; data preparation; discretization algorithm; entropy-based discretization method; heterogeneity criterion; heuristic method; machine learning algorithm; predictive error rate; Computer Society; Decision trees; Discrete transforms; Entropy; Error analysis; Frequency conversion; Machine learning; Machine learning algorithms; Spatial databases; discretization; heterogeneity
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2005.135
Filename :
1490524