DocumentCode
2768605
Title
Accurate SVM Text Classification for Highly Skewed Data Using Threshold Tuning and Query-Expansion-Based Feature Selection
Author
Goertzel, Ben ; Venuto, James
Author_Institution
Virginia Tech's Nat. Capital Operation, Arlington
fYear
2006
fDate
0-0 0
Firstpage
1220
Lastpage
1225
Abstract
A novel technique is described in which Support Vector Machines (SVMs) perform relatively effective text categorization from small numbers of positive examples (fewer than 10 in some cases). It is assumed that, in addition to the positive examples, a query describing the positive category is given (in the form of a set of key phrases or a sentence). The technique combines two innovations: a way of adjusting the SVM score threshold based on the distribution of scores across the training set, and a method of feature selection that retains only features showing semantic association with the content words in the query (according to a word-association database produced by statistical analysis of a parsed corpus). Examples are given for a number of test cases drawn from the Reuters and FBIS news archives.
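The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which rule moves the threshold or how association is measured. The sketch assumes the threshold is chosen to maximise F1 over the training-set scores, and it stands in for the word-association database with a plain dictionary. The names tune_threshold, select_features and associations are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Set


def tune_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pick the score cut-off that maximises F1 on the training set.

    Assumption: F1 is the selection criterion. The abstract only says the
    threshold is altered based on the training-set score distribution.
    `scores` are SVM decision values; `labels` are 1 (positive) or 0.
    """
    best_threshold, best_f1 = 0.0, -1.0  # 0.0 is the usual SVM cut-off
    for candidate in sorted(set(scores)):
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, y in pairs if s >= candidate and y == 1)
        fp = sum(1 for s, y in pairs if s >= candidate and y != 1)
        fn = sum(1 for s, y in pairs if s < candidate and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest best cut-off
            best_threshold, best_f1 = candidate, f1
    return best_threshold


def select_features(
    vocabulary: Iterable[str],
    query_words: Iterable[str],
    associations: Dict[str, Set[str]],
) -> List[str]:
    """Keep query words and vocabulary terms associated with a query word.

    Assumption: `associations` maps a word to its associated words and
    stands in for the word-association database named in the abstract.
    """
    keep: Set[str] = set(query_words)
    for word in list(keep):
        keep |= associations.get(word, set())
    # Preserve the original vocabulary order in the returned feature list.
    return [term for term in vocabulary if term in keep]
```

With highly skewed data an SVM tends to score most positives below zero, so a cut-off tuned on the training scores can recover positives that the default threshold of 0 would miss. The feature filter shrinks the vocabulary to terms related to the query before training.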
Keywords
pattern classification; query processing; support vector machines; text analysis; FBIS news archive; Reuters news archive; SVM; feature selection; highly skewed data; query-expansion; semantic association; support vector machines; text categorization; text classification; threshold tuning; training set; Art; Displays; Image classification; Spatial databases; Statistical analysis; Support vector machine classification; Support vector machines; Technological innovation; Testing; Text categorization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Neural Networks, 2006. IJCNN '06. International Joint Conference on
Conference_Location
Vancouver, BC
Print_ISBN
0-7803-9490-9
Type
conf
DOI
10.1109/IJCNN.2006.246830
Filename
1716241