Data mining the PIMA dataset using rough set theory with a special emphasis on rule reduction

Author

Khan, Aurangieb ; Revett, Kenneth

Author_Institution

Dept. of CIS, Luton Univ., UK

fYear

2004

Firstpage

334

Lastpage

339

Abstract

This paper describes how rough set theory can be used to analyze relatively complex decision tables such as the Pima Indian Diabetes Database (PIDD). We applied Rosetta, a public domain implementation of rough sets, to the PIDD to determine how to generate a predictive rule set with the highest accuracy and the fewest rules. A reduced rule set is advantageous because it focuses attention on the salient attributes and makes application in clinical practice more efficient (and more likely). We report the use of a genetic algorithm based rough set approach to the classification of diabetes, which achieved a success rate of 83% on the test set. This classification accuracy compares favorably with other reported results, which ranged from 65% to 75%. In addition, we achieved this accuracy with fewer than 100 rules. The high accuracy and low rule count support the use of rough sets as a data mining tool for biological databases.
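
As a rough illustration of the rough set idea the abstract relies on, the sketch below computes the lower and upper approximations of a decision class on a tiny decision table. It is not Rosetta and not the authors' genetic algorithm based method: the table values, the attribute names (`glucose`, `bmi`, `diabetic`), and the helper functions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def indiscernibility_classes(rows, attrs):
    """Group row indices that share the same values on the chosen attributes."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(i)
    return list(classes.values())

def approximations(rows, attrs, target):
    """Return the (lower, upper) approximations of the index set `target`."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(rows, attrs):
        if block <= target:   # block lies entirely inside the target class
            lower |= block
        if block & target:    # block overlaps the target class
            upper |= block
    return lower, upper

# Made-up toy decision table (NOT taken from the PIDD).
rows = [
    {"glucose": "high", "bmi": "high", "diabetic": 1},
    {"glucose": "high", "bmi": "high", "diabetic": 0},
    {"glucose": "high", "bmi": "low",  "diabetic": 1},
    {"glucose": "low",  "bmi": "high", "diabetic": 0},
    {"glucose": "low",  "bmi": "low",  "diabetic": 0},
]

# Indices of the rows in the decision class "diabetic == 1".
target = {i for i, r in enumerate(rows) if r["diabetic"] == 1}
lower, upper = approximations(rows, ["glucose", "bmi"], target)
print("lower:", sorted(lower), "upper:", sorted(upper))
```

Rows in the lower approximation can be classified with certainty from the chosen attributes, while rows that are only in the upper approximation are ambiguous. Dropping an attribute and checking whether the approximations change is one simple way to see which attributes matter, which is the intuition behind attribute and rule reduction.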

Keywords

biology computing; data mining; database management systems; genetic algorithms; rough set theory; Pima Indian Diabetes Database; biological databases; genetic algorithm; predictive rule set; rule reduction; Computational Intelligence Society; Data mining; Databases; Diseases; Genetics; Medical diagnostic imaging; Neural networks; Rough sets; Set theory; Testing

fLanguage

English

Publisher

IEEE

Conference_Titel

Multitopic Conference, 2004. Proceedings of INMIC 2004. 8th International

Print_ISBN

0-7803-8680-9

Type

conf

DOI

10.1109/INMIC.2004.1492899

Filename

1492899