• DocumentCode
    3114414
  • Title

    Clustering-based Missing Value Imputation for Data Preprocessing

  • Author

    Zhang, Chengqi ; Qin, Yongsong ; Zhu, Xiaofeng ; Zhang, Jilian ; Zhang, Shichao

  • Author_Institution
    Fac. of Inf. Technol., Univ. of Technol. Sydney, Broadway, NSW
  • fYear
    2006
  • fDate
    16-18 Aug. 2006
  • Firstpage
    1081
  • Lastpage
    1086
  • Abstract
    Missing value imputation is a practical yet challenging problem in machine learning and data mining. Imputation is a procedure that replaces the missing values in a dataset with plausible values, which are generally generated from the dataset using either a deterministic or a random method. In this paper we propose a new and efficient missing value imputation method based on data clustering, called CRI (clustering-based random imputation). In our approach, the missing values of an instance are filled in with plausible values generated from the data similar to that instance using a kernel-based random method. Specifically, we first divide the dataset (excluding instances with missing values) into clusters. Each instance with missing values is then assigned to the cluster most similar to it. Finally, the missing values of an instance A are filled in with plausible values generated by applying a kernel-based method to the instances in A's cluster. Our experiments (some of them with the decision tree induction system C5.0) demonstrate the effectiveness of the proposed method for the missing value imputation task.
  • Keywords
    data mining; learning (artificial intelligence); pattern clustering; random processes; clustering-based random imputation; data clustering; data mining; data preprocessing; kernel-based random method; machine learning; missing value imputation; Australia; Computer science; Data mining; Data preprocessing; Decision trees; Induction generators; Information technology; Machine learning; Nearest neighbor searches; Stochastic processes;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Industrial Informatics, 2006 IEEE International Conference on
  • Conference_Location
    Singapore
  • Print_ISBN
    0-7803-9700-2
  • Electronic_ISBN
    0-7803-9701-0
  • Type

    conf

  • DOI
    10.1109/INDIN.2006.275767
  • Filename
    4053540
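
The three steps described in the abstract (cluster the complete instances, assign each incomplete instance to its most similar cluster, then draw fill-in values from that cluster with a kernel-based random method) can be sketched roughly as follows. The abstract does not name the clustering algorithm, the similarity measure, or the kernel, so k-means, Euclidean distance over the observed attributes, and a Gaussian kernel with a `bandwidth` parameter are assumptions made purely for illustration. The names `nearest`, `kmeans`, and `cri_impute` are hypothetical and are not the authors' code.

```python
import numpy as np


def nearest(X, centroids):
    """Index of the closest centroid (Euclidean) for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(X, centroids)


def cri_impute(X, k=2, bandwidth=1.0, seed=0):
    """Illustrative clustering-based random imputation; NaN marks a missing value."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    complete_rows = ~missing.any(axis=1)
    complete = X[complete_rows]

    # Step 1: cluster only the instances without missing values.
    centroids, labels = kmeans(complete, k, rng)
    sizes = np.bincount(labels, minlength=k)

    filled = X.copy()
    for i in np.flatnonzero(~complete_rows):
        obs = ~missing[i]

        # Step 2: assign the incomplete instance to the most similar cluster,
        # comparing only on the attributes it actually has (assumption).
        d_cent = np.linalg.norm(centroids[:, obs] - X[i, obs], axis=1)
        d_cent[sizes == 0] = np.inf  # never pick an empty cluster
        members = complete[labels == d_cent.argmin()]

        # Step 3: kernel-based random draw. Cluster members closer to the
        # instance get a higher Gaussian-kernel weight; one member is sampled
        # as the donor of the missing attributes (assumed kernel and scheme).
        d = np.linalg.norm(members[:, obs] - X[i, obs], axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        total = w.sum()
        w = w / total if total > 0 else np.full(len(members), 1.0 / len(members))
        donor = members[rng.choice(len(members), p=w)]
        filled[i, ~obs] = donor[~obs]
    return filled


# Usage: two well-separated groups plus one instance missing its second attribute.
data = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11],
     [10.5, np.nan]],
    dtype=float,
)
imputed = cri_impute(data, k=2, bandwidth=1.0, seed=0)
```

Because the donor is sampled rather than averaged, repeated runs with different seeds give different fill-in values, which is what distinguishes a random imputation from a deterministic one such as a cluster mean.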