• DocumentCode
    3165153
  • Title
    Clustering Needles in a Haystack: An Information Theoretic Analysis of Minority and Outlier Detection
  • Author
    Ando, Shin
  • Author_Institution
    Yokohama Nat. Univ., Yokohama
  • fYear
    2007
  • fDate
    28-31 Oct. 2007
  • Firstpage
    13
  • Lastpage
    22
  • Abstract
    Identifying atypical objects is a traditional topic in machine learning. Recently, novel approaches, e.g., Minority Detection and One-class Clustering, have gone further to identify clusters of atypical objects that contrast strongly with the rest of the data in terms of their distribution or density. This paper analyzes such tasks from an information-theoretic perspective. Under an Information Bottleneck formalization, these tasks amount to increasing the average atypicalness of the clusters while reducing the complexity of the clustering. This formalization yields a unifying view of the new approaches as well as classic outlier detection. We also present a scalable minimization algorithm that exploits the localized form of the cost function over individual clusters. The proposed algorithm is evaluated on simulated datasets and a text classification benchmark, in comparison with an existing method.
  • Keywords
    learning (artificial intelligence); object detection; pattern classification; information bottleneck formalization; information theoretic analysis; machine learning; minority detection; needles clustering; one-class clustering; scalable minimization algorithm; simulated datasets; text classification; Clustering algorithms; Cost function; Data mining; Information analysis; Machine learning; Machine learning algorithms; Needles; Object detection; Rate distortion theory; Unsupervised learning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
  • Conference_Location
    Omaha, NE
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3018-5
  • Type
    conf
  • DOI
    10.1109/ICDM.2007.53
  • Filename
    4470225