• DocumentCode
    1421799
  • Title

    Trace-Oriented Feature Analysis for Large-Scale Text Data Dimension Reduction

  • Author

    Yan, Jun ; Liu, Ning ; Yan, Shuicheng ; Yang, Qiang ; Fan, Weiguo ; Wei, Wei ; Chen, Zheng

  • Author_Institution
    Sigma Center, Microsoft Res. Asia, Beijing, China
  • Volume
    23
  • Issue
    7
  • fYear
    2011
  • fDate
    7/1/2011
  • Firstpage
    1103
  • Lastpage
    1117
  • Abstract
    Dimension reduction for large-scale text data has attracted much attention owing to the rapid growth of the World Wide Web. Popular dimension reduction algorithms fall into two groups: feature extraction and feature selection. Feature extraction algorithms construct new features from the original ones through algebraic transformation. Though many of them have been validated as effective, they typically carry high computational overhead, which makes them difficult to apply to real-world text data. Feature selection algorithms select subsets of the original features directly. They are widely used in real-world tasks owing to their efficiency, but are often based on greedy strategies rather than optimal solutions. An important problem remains: it has been difficult to integrate these two types of algorithms into a single framework and thus to reap the benefits of both. In this paper, we formulate the two algorithm categories through a unified optimization framework, under which we develop a novel feature selection algorithm called Trace-Oriented Feature Analysis (TOFA). Specifically, we integrate the objective functions of several state-of-the-art feature extraction algorithms into a unified one under the optimization framework, and then optimize this objective function in the solution space of feature selection algorithms for dimensionality reduction. Because the objective function of TOFA integrates those of many prominent feature extraction algorithms, such as unsupervised Principal Component Analysis (PCA) and supervised Maximum Margin Criterion (MMC), TOFA can handle both supervised and unsupervised problems. In addition, by tuning a weight value, TOFA is also suitable for semisupervised learning problems. Experimental results on several real-world data sets validate the effectiveness and efficiency of TOFA for dimensionality reduction of text data.
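
The idea of optimizing a trace-style objective inside the feature selection solution space can be illustrated with a small sketch. It rests on assumptions the abstract does not spell out: the projection matrix W is restricted to columns of the identity matrix, so tr(W^T M W) reduces to the sum of the selected diagonal entries of M; M is taken to be the covariance matrix in the PCA-style case and S_b - S_w in the MMC-style case. The function names are hypothetical, the semisupervised weighting is omitted because the abstract does not define it, and this is not the authors' implementation.

```python
import numpy as np


def pca_style_scores(X):
    """Diagonal of the covariance matrix: the variance of each feature."""
    return X.var(axis=0)


def mmc_style_scores(X, y):
    """Diagonal of (S_b - S_w): between-class minus within-class scatter."""
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]
        prior = Xc.shape[0] / X.shape[0]
        between = (Xc.mean(axis=0) - overall_mean) ** 2
        within = Xc.var(axis=0)
        scores += prior * (between - within)
    return scores


def select_top(scores, p):
    """Indices of the p highest-scoring features (stable on ties)."""
    return np.argsort(-scores, kind="stable")[:p]


# Toy data: feature 0 separates the classes, feature 1 is high-variance
# noise shared by both classes, feature 2 is constant.
X = np.array([[0.0, 0.0, 1.0],
              [0.0, 10.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 10.0, 1.0]])
y = np.array([0, 0, 1, 1])

unsupervised_pick = select_top(pca_style_scores(X), 1)
supervised_pick = select_top(mmc_style_scores(X, y), 1)
```

Under these assumptions, maximizing the trace needs no eigendecomposition: ranking the per-feature diagonal scores is enough, which is why a selection-space formulation can be cheap on high-dimensional text data.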
  • Keywords
    data reduction; feature extraction; learning (artificial intelligence); principal component analysis; text analysis; World Wide Web; algebraic transformation; computational overhead; feature extraction algorithm; feature selection algorithm; greedy strategy; large-scale text data dimension reduction; semisupervised learning problem; supervised maximum margin criterion; trace-oriented feature analysis; unified optimization framework; unsupervised principal component analysis; Algorithm design and analysis; Clustering algorithms; Feature extraction; Information analysis; Large-scale systems; Principal component analysis; Semisupervised learning; Text analysis; Text processing; Web sites; Algebraic algorithms; computations on matrices; document analysis; global optimization
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2010.34
  • Filename
    5416720