• DocumentCode
    1356699
  • Title

    Scalable feature mining for sequential data

  • Author

    Lesh, Neal ; Zaki, Mohammed J. ; Ogihara, M.

  • Author_Institution
    MERL, Cambridge, MA, USA
  • Volume
    15
  • Issue
    2
  • fYear
    2000
  • Firstpage
    48
  • Lastpage
    56
  • Abstract
    Many real-world data sets contain irrelevant or redundant attributes. This might be because the data was collected without data mining in mind or without a priori knowledge of the attribute dependences. Many data mining methods, such as classification and clustering, lose prediction accuracy when trained on data sets containing redundant or irrelevant attributes or features. Selecting the right feature set can not only improve accuracy but also reduce the running time of the predictive algorithms and lead to simpler, more understandable models. Good feature selection is thus a fundamental data preprocessing step in data mining. To provide good feature selection for sequential domains, we developed FeatureMine, a scalable feature mining algorithm that combines two powerful data mining paradigms: sequence mining and classification algorithms. Tests on three practical domains demonstrate that FeatureMine can efficiently handle very large data sets with thousands of items and millions of records.
  • Keywords
    classification; data mining; redundancy; very large databases; FeatureMine; attribute dependences; classification algorithms; data mining methods; data mining paradigms; data preprocessing step; feature mining algorithm; feature selection; feature set; prediction accuracy; predictive algorithms; real world data sets; redundant attributes; scalable feature mining; sequence mining; sequential data; sequential domains; very large data sets; Accuracy; Classification algorithms; Clustering algorithms; DNA; Data mining; Degradation; Prediction algorithms; Predictive models; Sequences; Testing
  • fLanguage
    English
  • Journal_Title
    Intelligent Systems and their Applications, IEEE
  • Publisher
    IEEE
  • ISSN
    1094-7167
  • Type

    jour

  • DOI
    10.1109/5254.850827
  • Filename
    850827