• DocumentCode
    2768605
  • Title
    Accurate SVM Text Classification for Highly Skewed Data Using Threshold Tuning and Query-Expansion-Based Feature Selection
  • Author
    Goertzel, Ben; Venuto, James
  • Author_Institution
    Virginia Tech's Nat. Capital Operation, Arlington
  • fYear
    2006
  • fDate
    0-0 0
  • Firstpage
    1220
  • Lastpage
    1225
  • Abstract
    A novel technique is described in which Support Vector Machines perform relatively effective text categorization from small numbers of positive examples (fewer than 10 in some cases). It is assumed that, in addition to the positive examples, a query describing the positive category is given (as a set of key phrases or a sentence). The technique combines two innovations: a way of adjusting the SVM score threshold based on the distribution of scores across the training set, and a method of feature selection that retains only features that display semantic association with the content words in the query (according to a word-association database produced by statistical analysis of a parsed corpus). Examples are given for a number of test cases drawn from the Reuters and FBIS news archives.
  • Keywords
    pattern classification; query processing; support vector machines; text analysis; FBIS news archive; Reuters news archive; SVM; feature selection; highly skewed data; query-expansion; semantic association; support vector machines; text categorization; text classification; threshold tuning; training set; Art; Displays; Image classification; Spatial databases; Statistical analysis; Support vector machine classification; Support vector machines; Technological innovation; Testing; Text categorization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Neural Networks, 2006. IJCNN '06. International Joint Conference on
  • Conference_Location
    Vancouver, BC
  • Print_ISBN
    0-7803-9490-9
  • Type
    conf
  • DOI
    10.1109/IJCNN.2006.246830
  • Filename
    1716241
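
The two ideas named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no details of either procedure, so everything here is an assumption made for illustration, not the authors' method: `select_features` keeps a feature when a hypothetical word-association table links it to a query term with at least `min_strength`, and `tune_threshold` moves the decision threshold away from the usual 0 only when a midpoint between adjacent training scores gives a strictly better training-set F1. The scores are assumed to be decision values already produced by some SVM; no SVM is trained here.

```python
from typing import Dict, Iterable, Sequence, Set


def select_features(
    vocabulary: Iterable[str],
    query_terms: Set[str],
    associations: Dict[str, Dict[str, float]],
    min_strength: float = 0.1,
) -> Set[str]:
    """Keep query terms plus features associated with any query term.

    `associations` is a hypothetical stand-in for a word-association
    database: associations[q][f] is the strength linking q and f.
    """
    kept = set()
    for feature in vocabulary:
        if feature in query_terms:
            kept.add(feature)
            continue
        for term in query_terms:
            if associations.get(term, {}).get(feature, 0.0) >= min_strength:
                kept.add(feature)
                break
    return kept


def f1_at(threshold: float, scores: Sequence[float], labels: Sequence[int]) -> float:
    """Training-set F1 when every score >= threshold is called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def tune_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pick a threshold from the training score distribution.

    Assumed criterion: maximize training F1 over midpoints between
    adjacent distinct scores. The default 0.0 is listed first, and max()
    returns the first best candidate, so 0.0 wins any tie.
    """
    distinct = sorted(set(scores))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    candidates = [0.0] + midpoints
    return max(candidates, key=lambda t: f1_at(t, scores, labels))


# Toy data: with few positives, one positive scores below zero.
train_scores = [-1.5, -1.0, -0.75, -0.5, -0.25, 0.5]
train_labels = [0, 0, 0, 0, 1, 1]
threshold = tune_threshold(train_scores, train_labels)

vocab = ["merger", "acquisition", "takeover", "weather", "football"]
assoc = {"merger": {"acquisition": 0.8, "takeover": 0.6, "weather": 0.01}}
selected = select_features(vocab, {"merger"}, assoc)
```

On the toy data the default threshold of 0 misses the positive example scored at -0.25, while the tuned threshold falls between -0.5 and -0.25 and recovers it. This mirrors the skewed-data situation the abstract describes, although a threshold fitted to so few training positives may not generalize.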