Title :
An improved naive Bayesian classifier technique coupled with a novel input solution method [rainfall prediction]
Author :
Liu, James N K ; Li, Bavy N L ; Dillon, Tharam S.
Author_Institution :
Dept. of Comput., Hong Kong Polytech. Univ., Hung Hom, China
fDate :
5/1/2001
Abstract :
Data mining is the study of how to determine underlying patterns in the data to help make optimal decisions on computers when the database involved is voluminous, hard to characterize accurately and constantly changing. It deploys techniques based on machine learning alongside more conventional methods. These techniques can generate decision or prediction models based on actual historical data. Therefore, they represent true evidence-based decision support. Rainfall prediction is a good problem to solve by data mining techniques. This paper proposes an improved naive Bayesian classifier (INBC) technique and explores the use of genetic algorithms (GAs) for the selection of a subset of input features in classification problems. It then compares the following algorithms on real meteorological data in Hong Kong: (1) genetic algorithms with average classification or general classification (GA-AC and GA-C), (2) C4.5 with pruning, and (3) INBC with relative frequency or initial probability density (INBC-RF and INBC-IPD). Two simple schemes are proposed to construct a suitable data set for improving their performance. Scheme I uses all the basic input parameters for rainfall prediction. Scheme II uses the optimal subset of input variables, which is selected by a GA. The results show that, among the methods compared, INBC achieved an accuracy rate of about 90% on the rain/no-rain classification problems. This method also attained reasonable performance on rainfall prediction with three-level depth and five-level depth, at around 65%-70% accuracy.
Keywords :
Bayes methods; data mining; forecasting theory; genetic algorithms; geophysics computing; learning (artificial intelligence); pattern classification; probability; rain; software performance evaluation; temporal databases; very large databases; weather forecasting; 3-level depth; 5-level depth; C4.5; Hong Kong; accuracy; algorithm performance; average classification; constantly changing data; decision models; evidence-based decision support; general classification; historical data; improved naive Bayesian classifier; initial probability density; input feature subset selection; input parameters; input solution method; large database; machine learning; meteorological data; optimal decisions; optimal input variables subset; prediction models; pruning; rainfall prediction; relative frequency; underlying pattern determination; Bayesian methods; Databases; Frequency; Input variables; Meteorology; Predictive models
Journal_Title :
Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on
DOI :
10.1109/5326.941848