• DocumentCode
    806230
  • Title

    Effective and efficient dimensionality reduction for large-scale and streaming data preprocessing

  • Author

    Yan, Jun ; Zhang, Benyu ; Liu, Ning ; Yan, Shuicheng ; Cheng, Qiansheng ; Fan, Weiguo ; Yang, Qiang ; Xi, Wensi ; Chen, Zheng

  • Author_Institution
    Dept. of Inf. Sci., Peking Univ., Beijing, China
  • Volume
    18
  • Issue
    3
  • fYear
    2006
  • fDate
    3/1/2006
  • Firstpage
    320
  • Lastpage
    333
  • Abstract
    Dimensionality reduction is an essential data preprocessing technique for large-scale and streaming data classification tasks. It can be used to improve both the efficiency and the effectiveness of classifiers. Traditional dimensionality reduction approaches fall into two categories: feature extraction and feature selection. Techniques in the feature extraction category are typically more effective than those in the feature selection category. However, they may break down when processing large-scale data sets or data streams because of their high computational complexity. Feature selection approaches have a drawback of their own: their solutions are mostly obtained by greedy strategies and, hence, are not guaranteed to be optimal with respect to the criteria being optimized. In this paper, we give an overview of the popularly used feature extraction and selection algorithms under a unified framework. Moreover, we propose two novel dimensionality reduction algorithms based on the orthogonal centroid (OC) algorithm. The first is an incremental OC (IOC) algorithm for feature extraction. The second is an orthogonal centroid feature selection (OCFS) method, which can provide optimal solutions according to the OC criterion. Both are designed under the same optimization criterion. Experiments on the Reuters Corpus Volume-1 data set and some public large-scale text data sets indicate that the two algorithms are favorable in terms of their effectiveness and efficiency when compared with other state-of-the-art algorithms.
  • Keywords
    data analysis; data mining; feature extraction; optimisation; pattern classification; very large databases; Reuters Corpus Volume-1 data set; data classification tasks; data preprocessing; data streaming; dimensionality reduction algorithms; feature extraction; incremental orthogonal centroid algorithm; optimization; orthogonal centroid feature selection; public large-scale text data sets; Computational complexity; Data mining; Data preprocessing; Design optimization; Feature extraction; Information processing; Iron; Large-scale systems; Linear discriminant analysis; Principal component analysis; Index Terms: Feature extraction; feature selection; orthogonal centroid algorithm
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2006.45
  • Filename
    1583582
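
The abstract says that OCFS selects features optimally under the OC criterion but does not give the scoring rule. The following is a minimal sketch, assuming the criterion rewards between-class scatter of the class centroids: each feature is scored by the class-size-weighted squared distance between each class centroid and the global centroid along that feature, and the k highest-scoring features are kept. Under that assumption, picking the top k scores directly is optimal rather than greedy, because the objective is a sum of independent per-feature terms. The names `ocfs_scores` and `ocfs_select` are hypothetical and are not taken from the record.

```python
from collections import defaultdict


def ocfs_scores(X, y):
    """Score each feature by weighted centroid scatter (assumed OC criterion).

    X is a list of equal-length numeric rows, y the matching class labels.
    """
    n = len(X)
    d = len(X[0])
    # Centroid of the whole data set, one value per feature.
    global_centroid = [sum(row[i] for row in X) / n for i in range(d)]

    # Group the rows by class label.
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    scores = [0.0] * d
    for rows in by_class.values():
        n_j = len(rows)
        for i in range(d):
            class_centroid_i = sum(r[i] for r in rows) / n_j
            # Weight each class by its share of the samples.
            scores[i] += (n_j / n) * (class_centroid_i - global_centroid[i]) ** 2
    return scores


def ocfs_select(X, y, k):
    """Return the indices of the k highest-scoring features, in ascending order."""
    scores = ocfs_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy data: features 0 and 1 separate the classes, feature 2 does not.
    X = [[1, 0, 5], [1, 0, 4], [0, 1, 5], [0, 1, 4]]
    y = [0, 0, 1, 1]
    print(ocfs_scores(X, y))
    print(ocfs_select(X, y, 2))
```

Because the scores depend only on per-class sums and counts, they can in principle be updated as new labeled samples arrive, which fits the streaming setting the abstract describes; the sketch above recomputes them from scratch for clarity.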