• Title of article

    Using chi-square statistics to measure similarities for text categorization

  • Author/Authors

    Chen, Yao-Tsung and Chen, Meng Chang

  • Issue Information
    Journal issue with serial number, year 2011
  • Pages
    6
  • From page
    3085
  • To page
    3090
  • Abstract
    In this paper, we propose using chi-square statistics to measure similarities and chi-square tests to determine the homogeneity of two random samples of term vectors for text categorization. The properties of chi-square tests for text categorization are studied first. One advantage of the chi-square test is that its significance level is similar to the miss rate, which provides a foundation for a theoretical performance (i.e., miss rate) guarantee. In general, a classifier using cosine similarities with TF*IDF performs reasonably well in text categorization. However, its performance may fluctuate even near the optimal threshold value. To address this limitation, we propose the combined usage of chi-square statistics and cosine similarities. Extensive experimental results verify the properties of chi-square tests and the performance of the combined usage.
  • Keywords
    Nonparametric statistics, Text mining, Machine learning
  • Journal title
    Expert Systems with Applications
  • Serial Year
    2011
  • Record number

    2348959
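
The abstract describes testing the homogeneity of two samples of term vectors with a chi-square test. The following is a minimal sketch of the standard Pearson chi-square statistic for a 2 x k table of term counts, not the authors' implementation. The function name `chi_square_homogeneity` and the choice to skip terms absent from both samples are assumptions made for illustration.

```python
def chi_square_homogeneity(a, b):
    """Pearson chi-square statistic for two term-count vectors.

    a, b: sequences of non-negative term counts over the same vocabulary.
    Returns (statistic, degrees_of_freedom).
    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(a) != len(b):
        raise ValueError("term vectors must have the same length")
    total_a, total_b = sum(a), sum(b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each term vector needs at least one count")
    grand = total_a + total_b

    stat = 0.0
    used_terms = 0
    for x, y in zip(a, b):
        col = x + y
        if col == 0:
            continue  # assumption: ignore terms absent from both samples
        used_terms += 1
        # Expected counts if both samples share one term distribution.
        exp_a = total_a * col / grand
        exp_b = total_b * col / grand
        stat += (x - exp_a) ** 2 / exp_a + (y - exp_b) ** 2 / exp_b
    return stat, used_terms - 1


if __name__ == "__main__":
    print(chi_square_homogeneity([10, 20, 30], [10, 20, 30]))
    print(chi_square_homogeneity([10, 0], [0, 10]))
```

The statistic is compared against a chi-square critical value for the given degrees of freedom at a chosen significance level. With 1 degree of freedom and a 0.05 level, for example, the critical value is about 3.841, so a larger statistic would reject homogeneity.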