• DocumentCode
    1640656
  • Title

    Data labeling method based on cluster purity using relative rough entropy for categorical data clustering

  • Author

    Reddy, H. Venkateswara ; Viswanadha Raju, S. ; Agrawal, Pulin

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Vardhaman Coll. of Eng., Hyderabad, India
  • fYear
    2013
  • Firstpage
    500
  • Lastpage
    506
  • Abstract
    Clustering is an important technique in data mining, but clustering a large data set is difficult and time consuming. An approach called data labeling has been suggested for clustering large databases, using sampling to improve clustering efficiency. A sample is selected at random for initial clustering, and each data point that was not sampled, and is therefore unclustered, is then assigned a cluster label or marked as an outlier by a data labeling technique. Data labeling is easy in the numerical domain because it is based on the distance between a cluster and an unlabeled data point. In the categorical domain, however, distance is not well defined between data points or between a data point and a cluster, so data labeling is difficult for categorical data. In this paper, we propose a data labeling method for clustering categorical data that uses Relative Rough Entropy. The concept of entropy, introduced by Shannon in the context of information theory, is a powerful mechanism for measuring uncertainty. In this method, data labeling is performed by integrating entropy with rough sets. Cluster purity is also used for outlier detection. The experimental results show that the efficiency and clustering quality of this algorithm are better than those of previous algorithms.
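
    The general idea of entropy-driven labeling for categorical data can be illustrated with a minimal sketch. This is not the paper's Relative Rough Entropy method, whose definition the abstract does not give; it uses plain Shannon entropy summed over attributes as a stand-in. The function names (`shannon_entropy`, `cluster_entropy`, `label_point`) and the `max_increase` outlier threshold are hypothetical, introduced only for illustration.

    ```python
    from collections import Counter
    from math import log2


    def shannon_entropy(values):
        """Shannon entropy (in bits) of the empirical distribution of `values`."""
        counts = Counter(values)
        total = len(values)
        return -sum((c / total) * log2(c / total) for c in counts.values())


    def cluster_entropy(cluster):
        """Sum of per-attribute entropies over a cluster of equal-length tuples."""
        return sum(shannon_entropy(column) for column in zip(*cluster))


    def label_point(point, clusters, max_increase=None):
        """Assign `point` to the cluster whose entropy grows least when it is added.

        Returns the index of that cluster, or None (treated here as an outlier)
        when even the smallest increase exceeds the assumed `max_increase`.
        """
        best_index, best_increase = None, float("inf")
        for index, cluster in enumerate(clusters):
            increase = cluster_entropy(cluster + [point]) - cluster_entropy(cluster)
            if increase < best_increase:
                best_index, best_increase = index, increase
        if max_increase is not None and best_increase > max_increase:
            return None
        return best_index


    # Two small clusters obtained from an initial sample (made-up toy data).
    clusters = [
        [("red", "small"), ("red", "small")],
        [("blue", "large"), ("blue", "large")],
    ]
    matching_label = label_point(("red", "small"), clusters)
    outlier_label = label_point(("green", "medium"), clusters, max_increase=0.5)
    ```

    In this toy run the matching point leaves the first cluster's entropy unchanged, so it is labeled with that cluster, while the unseen value combination raises the entropy of every cluster past the threshold and is reported as an outlier.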
  • Keywords
    entropy; pattern clustering; rough set theory; very large databases; categorical data clustering; categorical domain; cluster purity; clustering efficiency; clustering quality; clustering technique; data labeling method; data mining; entropy concept; large database clustering; numerical domain; outlier detection; relative rough entropy; sampling technique; Clustering algorithms; Entropy; Information systems; Labeling; Set theory; Time complexity; Uncertainty; Categorical Data; Cluster Purity; Clustering; Data Labeling; Entropy; Outlier; Rough set;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advances in Computing, Communications and Informatics (ICACCI), 2013 International Conference on
  • Conference_Location
    Mysore
  • Print_ISBN
    978-1-4799-2432-5
  • Type

    conf

  • DOI
    10.1109/ICACCI.2013.6637222
  • Filename
    6637222