An Efficient Document Categorization Model Based on LSA and BPNN

Author

Li, Cheng Hua ; Park, Soon Cheol

fYear

2007

fDate

22-24 Aug. 2007

Firstpage

9

Lastpage

14

Abstract

This paper proposes a new document categorization model that combines latent semantic analysis (LSA) with a back-propagation neural network (BPNN). In traditional word-matching document categorization systems, the most popular and straightforward way to represent a document is the vector space model (VSM). However, this approach has two drawbacks. First, it needs a large number of features to represent the documents, so the dimensionality is very high. Second, it does not take into account the effects of synonymy and polysemy, which can affect classification accuracy. LSA can overcome these problems by using statistically derived conceptual indices instead of individual words. It constructs a conceptual vector space in which each term or document is represented as a vector. Introducing LSA into the model not only greatly reduces the dimensionality but also uncovers important associative relationships between terms. It also helps to accelerate training and improve classification accuracy. We test the categorization model on the standard Reuters collection; experimental evaluations show that the model with LSA achieves dramatic dimensionality reduction while still delivering good classification results.
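
As a rough illustration of the LSA step described in the abstract, the following Python sketch builds a small term-document matrix and projects the documents into a k-dimensional concept space with a truncated SVD. It is not the authors' implementation: the toy documents, the raw-count term weighting, the function names (`build_term_document_matrix`, `lsa_project`), and the choice k = 2 are all assumptions made for the example, and NumPy is assumed to be available.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return (vocab, X) where X[i, j] is the raw count of term i in document j.

    Raw counts are an assumption for this sketch; a real system would
    likely apply a weighting scheme before the decomposition.
    """
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.lower().split():
            X[index[word], j] += 1.0
    return vocab, X


def lsa_project(X, k):
    """Keep the k largest singular values of X and return the reduced space.

    Returns (U_k, s_k, doc_vectors), where each row of doc_vectors is one
    document expressed in k concept dimensions instead of len(vocab) terms.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_vectors = (np.diag(s_k) @ Vt_k).T  # shape: (n_docs, k)
    return U_k, s_k, doc_vectors


# Hypothetical toy corpus, used only to exercise the functions above.
docs = [
    "stock market prices rise",
    "share prices rise on the market",
    "wheat harvest grain exports",
    "grain exports fall after poor harvest",
]

vocab, X = build_term_document_matrix(docs)
U_k, s_k, doc_vectors = lsa_project(X, k=2)

print("vocabulary size:", len(vocab))
print("reduced document vectors shape:", doc_vectors.shape)
```

In a pipeline like the one the abstract outlines, the rows of `doc_vectors` would presumably replace the full VSM vectors as the low-dimensional inputs to the BPNN classifier; a new document's term vector `q` could be folded into the same space as `U_k.T @ q`.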

Keywords

Acceleration; Information analysis; Information technology; Neural networks; Ontologies; Semantic Web; Support vector machine classification; Support vector machines; Testing; Text categorization;

fLanguage

English

Publisher

IEEE

Conference_Title

Advanced Language Processing and Web Information Technology, 2007. ALPIT 2007. Sixth International Conference on

Conference_Location

Luoyang, Henan, China

Print_ISBN

978-0-7695-2930-1

Type

conf

DOI

10.1109/ALPIT.2007.88

Filename

4460607