• DocumentCode
    1646026
  • Title
    Data mining the PIMA dataset using rough set theory with a special emphasis on rule reduction
  • Author
    Khan, Aurangieb ; Revett, Kenneth
  • Author_Institution
    Dept. of CIS, Luton Univ., UK
  • fYear
    2004
  • Firstpage
    334
  • Lastpage
    339
  • Abstract
    This paper describes how rough set theory can be used as a tool for analyzing relatively complex decision tables such as the Pima Indian Diabetes Database (PIDD). We applied Rosetta, a public domain implementation of rough sets, to the PIDD in order to determine how to generate a predictive rule set with the highest accuracy and the fewest rules. A reduced rule set is advantageous because it focuses attention on the salient attributes and makes application in clinical practice more efficient (and more likely). We report the use of a genetic-algorithm-based rough set approach to the classification of diabetes, which achieved a success rate of 83% on the test set. This classification accuracy compares favorably with other reported results, which ranged from 65% to 75%. In addition, we achieved this accuracy with fewer than 100 rules. The high accuracy and low rule count support the use of rough sets as a data mining tool for biological databases.
  • Keywords
    biology computing; data mining; database management systems; genetic algorithms; rough set theory; Pima Indian Diabetes Database; biological databases; data mining; genetic algorithm; predictive rule set; rough set theory; rule reduction; Computational Intelligence Society; Data mining; Databases; Diseases; Genetics; Medical diagnostic imaging; Neural networks; Rough sets; Set theory; Testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Multitopic Conference, 2004. Proceedings of INMIC 2004. 8th International
  • Print_ISBN
    0-7803-8680-9
  • Type
    conf
  • DOI
    10.1109/INMIC.2004.1492899
  • Filename
    1492899