Title :
An optimized features extraction algorithm on VSM
Author :
Kui Fang ; Juan Wang
Author_Institution :
Coll. of Inf. Sci. & Technol., Hunan Agric. Univ., Changsha, China
Abstract :
VSM (Vector Space Model) is one of the most important methods for representing documents. However, the resulting feature vectors are typically high-dimensional, so feature extraction techniques have to be used to reduce their dimensionality. Many feature extraction algorithms exist, of which TF-IDF and TF-IDF-IG are widely used in practice. Neither of them, however, sufficiently considers the influence of text categories or the structure of HTML, which greatly limits the accuracy and applicability of both algorithms. To address this issue, we propose an optimized feature extraction algorithm. We also introduce a modifying factor into the new algorithm to avoid the data imbalance problem that results from the differing sizes of categories. In our experiment, the proposed algorithm was compared with TF-IDF and TF-IDF-IG. We found that the precision and recall of the new algorithm improve by more than 10.4% and 13.8%, respectively, over TF-IDF, and 4.6% and 2.9% over TF-IDF-IG, which shows that the new algorithm achieves better precision and recall.
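For readers unfamiliar with the baseline, the following is a minimal sketch of standard textbook TF-IDF weighting followed by top-k feature selection. It is not the optimized algorithm proposed in the paper: the abstract does not specify how text categories, HTML structure, or the modifying factor enter the weights, so none of them is modeled here. The function names `tf_idf` and `top_k_features` and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes the common formulation: tf = count / document length,
    idf = log(N / df), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_k_features(weights, k):
    """Keep the k terms with the highest peak weight across documents."""
    best = {}
    for doc_weights in weights:
        for term, value in doc_weights.items():
            best[term] = max(best.get(term, 0.0), value)
    return sorted(best, key=best.get, reverse=True)[:k]


# Hypothetical toy corpus, already tokenized.
docs = [
    ["rice", "yield", "soil", "rice"],
    ["soil", "water", "irrigation"],
    ["stock", "market", "price", "soil"],
]
weights = tf_idf(docs)
selected = top_k_features(weights, 3)
```

In this sketch a term that occurs in every document (here `soil`) receives weight zero regardless of which category its documents belong to, which illustrates the kind of category-blindness the abstract attributes to the baseline methods.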
Keywords :
document handling; feature extraction; information retrieval; vectors; TF-IDF algorithm; TF-IDF-IG algorithm; VSM; data imbalance problem avoidance; dimension reduction; documents representation; information representation process; optimized feature extraction algorithm; vector space model; Algorithm design and analysis; Classification algorithms; Educational institutions; Feature extraction; HTML; Information processing; Text categorization; TF-IDF; TF-IDF-IG; features extraction
Conference_Title :
Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on
Conference_Location :
Sichuan
Print_ISBN :
978-1-4673-0025-4
DOI :
10.1109/FSKD.2012.6233810