DocumentCode :
268086
Title :
A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning
Author :
García, Salvador ; Luengo, J. ; Sáez, José Antonio ; López, Victoria ; Herrera, Francisco
Author_Institution :
Dept. of Comput. Sci., Univ. of Jaen, Jaen, Spain
Volume :
25
Issue :
4
fYear :
2013
fDate :
April 2013
Firstpage :
734
Lastpage :
750
Abstract :
Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones by associating categorical values with intervals, thus transforming quantitative data into qualitative data. In this manner, symbolic data mining algorithms can be applied to continuous data, and the representation of information is simplified, making it more concise and specific. The literature provides numerous discretization proposals, and some attempts to categorize them into a taxonomy can be found. However, previous papers lack consensus on the definition of the properties, and no formal categorization has been established yet, which may be confusing for practitioners. Furthermore, only a small set of discretizers has been widely considered, while many other methods have gone unnoticed. With the intention of alleviating these problems, this paper provides a survey of discretization methods proposed in the literature from a theoretical and empirical perspective. From the theoretical perspective, we develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all the methods known to date. Empirically, we conduct an experimental study in supervised classification involving the most representative and newest discretizers, different types of classifiers, and a large number of data sets. Their performance, measured in terms of accuracy, number of intervals, and inconsistency, has been verified by means of nonparametric statistical tests. Additionally, a set of discretizers is highlighted as the best performing ones.
Keywords :
data mining; decision trees; learning (artificial intelligence); pattern classification; statistical analysis; categorical values; continuous attributes; data mining tasks; decision trees; discretization methods; discretization techniques; knowledge discovery; nonparametric statistical tests; qualitative data; quantitative data; supervised classification; supervised learning; symbolic data mining algorithms; Algorithm design and analysis; Delta modulation; Electronic mail; Heuristic algorithms; Merging; Supervised learning; Taxonomy; Discretization; classification; continuous attributes; data mining; data preprocessing; decision trees; taxonomy;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2012.35
Filename :
6152258