• DocumentCode
    395317
  • Title

    A mixture model and EM algorithm for robust classification, outlier rejection, and class discovery

  • Author

    Miller, David J. ; Browning, John

  • Author_Institution
    Dept. of Electr. Eng., Pennsylvania State Univ., University Park, PA, USA
  • Volume
    2
  • fYear
    2003
  • fDate
    6-10 April 2003
  • Abstract
    Several authors have addressed learning a classifier given a mixed labeled/unlabeled training set. These works assume each unlabeled sample originates from one of the (known) classes. Here, we consider the scenario in which unlabeled points may belong either to known/predefined or to heretofore undiscovered classes. There are several practical situations where such data may arise. We propose a novel statistical mixture model which views as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each point. Two types of mixture components are posited to explain label presence/absence. "Predefined" components generate both labeled and unlabeled points and assume labels are missing at random. "Non-predefined" components only generate unlabeled points; thus, in localized regions, they capture data subsets that are exclusively unlabeled. Such subsets may represent an outlier distribution, or new classes. The components' predefined/non-predefined natures are data-driven, learned along with the other parameters via an algorithm based on expectation-maximization (EM). There are three natural applications: (1) robust classifier design, given a mixed training set with outliers; (2) classification with rejections; (3) identification of the unlabeled points (and their representative components) that originate from unknown classes, i.e. new class discovery. We evaluate our method and alternative approaches on both synthetic and real-world data sets.
  • Keywords
    identification; learning (artificial intelligence); optimisation; signal classification; statistical analysis; EM algorithm; class discovery; class label; data-driven components; exclusively unlabeled data subsets; expectation-maximization algorithm; feature vector; label presence/absence; localized regions; mixed labeled/unlabeled training set; mixed training set; mixture model; new class discovery; observed data; outlier distribution; outlier rejection; real-world data sets; robust classification; robust classifier design; statistical classifier learning; statistical mixture model; synthetic data sets; undiscovered classes; unlabeled points identification; Character recognition; Classification algorithms; Databases; Humans; Labeling; Maximum likelihood estimation; Robustness; Uncertainty;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on
  • ISSN
    1520-6149
  • Print_ISBN
    0-7803-7663-3
  • Type

    conf

  • DOI
    10.1109/ICASSP.2003.1202490
  • Filename
    1202490
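
The abstract's distinction between "predefined" and "non-predefined" components can be illustrated with a small sketch of an E-step posterior over components. This is not the authors' formulation, which the abstract does not spell out. It assumes a one-dimensional Gaussian mixture in which each component carries a hypothetical label-presence probability `p_label` and a hypothetical class distribution `class_probs`; a non-predefined component is modeled simply as one with `p_label = 0`. All parameter values and names are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Hypothetical component parameters (illustrative values, not from the paper).
# p_label is the assumed probability that a point from the component is labeled.
# The third component has p_label = 0, so it can only explain unlabeled points.
COMPONENTS = [
    {"weight": 0.4, "mean": -2.0, "var": 1.0, "p_label": 0.5,
     "class_probs": {"A": 0.9, "B": 0.1}},
    {"weight": 0.4, "mean": 2.0, "var": 1.0, "p_label": 0.5,
     "class_probs": {"A": 0.1, "B": 0.9}},
    {"weight": 0.2, "mean": 8.0, "var": 1.0, "p_label": 0.0,
     "class_probs": {}},
]


def responsibilities(x, label=None, components=COMPONENTS):
    """Posterior over components for one point; label=None means unlabeled.

    The observed data are the feature x, the fact of label presence or
    absence, and (if present) the class label itself.
    """
    joint = []
    for comp in components:
        density = comp["weight"] * gaussian_pdf(x, comp["mean"], comp["var"])
        if label is None:
            # Unlabeled point: weight by the probability that the label is absent.
            joint.append(density * (1.0 - comp["p_label"]))
        else:
            # Labeled point: label present, and the component emits this class.
            joint.append(density * comp["p_label"]
                         * comp["class_probs"].get(label, 0.0))
    total = sum(joint)
    if total == 0.0:
        raise ValueError("no component can explain this observation")
    return [j / total for j in joint]
```

Under these assumed parameters, an unlabeled point near `x = 8` is attributed almost entirely to the third component, while any labeled point receives zero responsibility from it. This mirrors how exclusively unlabeled regions could be flagged as outliers or candidate new classes.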