DocumentCode :
3334611
Title :
Enhancing Text Analysis via Dimensionality Reduction
Author :
Underhill, David G. ; McDowell, Luke K. ; Marchette, David J. ; Solka, Jeffrey L.
Author_Institution :
U.S. Naval Acad., Annapolis
fYear :
2007
fDate :
13-15 Aug. 2007
Firstpage :
348
Lastpage :
353
Abstract :
Many applications require analyzing vast amounts of textual data, but the size and inherent noise of such data can make processing very challenging. One approach to these issues is to mathematically reduce the data so as to represent each document using only a few dimensions. Techniques for performing such "dimensionality reduction" (DR) have been well studied for geometric and numerical data, but more rarely applied to text. In this paper, we examine the impact of five DR techniques on the accuracy of two supervised classifiers on three textual sources. This task mirrors important real-world problems, such as classifying Web pages or scientific articles. In addition, the accuracy serves as a proxy measure for how well each DR technique preserves the inter-document relationships while vastly reducing the size of the data, facilitating more sophisticated analysis. We show that, for a fixed number of dimensions, DR can be very successful at improving accuracy compared to using the original words as features. Surprisingly, we also find that one of the simplest DR techniques, MDS, is among the most effective. This suggests that textual data may often lie upon a linear manifold where the more complex non-linear DR techniques do not have an advantage.
Keywords :
data analysis; data reduction; pattern classification; text analysis; supervised classifiers; text analysis enhancement; textual data dimensionality reduction; textual document representation; Application software; Computer science; Data analysis; Government; Mirrors; Performance analysis; Principal component analysis; Size measurement; Text analysis; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Reuse and Integration, 2007. IRI 2007. IEEE International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
1-4244-1500-4
Electronic_ISBN :
1-4244-1500-4
Type :
conf
DOI :
10.1109/IRI.2007.4296645
Filename :
4296645