• DocumentCode
    3110504
  • Title
    Efficient cross validation over skewed noisy data
  • Author
    Dash, Manoranjan ; Hao, Oh Yin
  • Author_Institution
    Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore
  • fYear
    2008
  • fDate
    12-15 Oct. 2008
  • Firstpage
    749
  • Lastpage
    756
  • Abstract
    Cross-validation (CV), which is widely used in classification problems, gives a good estimate of the prediction accuracy of a classifier over unseen data. Thus, any improvement in the accuracy estimates produced by cross-validation benefits a wide range of research and applications. This paper focuses on skewed noisy datasets; fraud detection is an important example of an application with skewed data. For CV, simple random sampling (SRS) is usually performed to divide the data into the required number of folds, e.g., 10-fold CV requires the data to be divided into 10 folds. SRS is known to give poor performance (classification accuracy) when the data is skewed. We propose a new algorithm, based on the frequency histogram of each attribute value, to divide the dataset into the required number of folds. The effectiveness of the proposed algorithm vis-a-vis SRS is tested with datasets from the UCI machine learning repository. The results show that the proposed algorithm is significantly better than SRS at handling noisy skewed data.
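    For illustration only, the SRS baseline mentioned in the abstract can be sketched as below. This is not the authors' histogram-based algorithm, whose details the abstract does not give. The function name `srs_folds`, the seed, and the 990/10 label split are assumptions made for the example.

```python
import random


def srs_folds(n_rows, k, seed=0):
    """Assign row indices to k folds by simple random sampling (SRS)."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


# Hypothetical skewed labels: 990 majority rows (0) and 10 minority rows (1).
labels = [0] * 990 + [1] * 10
folds = srs_folds(len(labels), k=10)

# Count minority rows per fold; SRS does not control how they are spread.
minority_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(minority_per_fold)
```

    Because SRS ignores the class and attribute distributions, the minority count can differ from fold to fold and some folds may contain no minority rows at all, which is the weakness on skewed data that the abstract refers to.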
  • Keywords
    data mining; learning (artificial intelligence); random processes; cross-validation method; fraud detection; frequency histogram; machine learning; simple random sampling; skewed noisy data; Accuracy; Data engineering; Data mining; Decision trees; Frequency conversion; Histograms; Machine learning; Machine learning algorithms; Sampling methods; Testing; Sampling; classification; cross validation; noise (key words); skewed data;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on
  • Conference_Location
    Singapore
  • ISSN
    1062-922X
  • Print_ISBN
    978-1-4244-2383-5
  • Electronic_ISBN
    1062-922X
  • Type
    conf
  • DOI
    10.1109/ICSMC.2008.4811368
  • Filename
    4811368