DocumentCode
3104854
Title
An Efficient Document Categorization Model Based on LSA and BPNN
Author
Li, Cheng Hua ; Park, Soon Cheol
fYear
2007
fDate
22-24 Aug. 2007
Firstpage
9
Lastpage
14
Abstract
This paper proposes a new document categorization model that combines latent semantic analysis (LSA) with a back-propagation neural network (BPNN). In traditional word-matching document categorization systems, the most popular and straightforward way to represent a document is the vector space model (VSM). However, this approach has two drawbacks. First, it needs a large number of features to represent the documents, so the dimensionality is very high. Second, it does not take into account the effects of synonymy and polysemy, which can affect classification accuracy. LSA overcomes these problems by using statistically derived conceptual indices instead of individual words. It constructs a conceptual vector space in which each term or document is represented as a vector. Introducing LSA into our model not only greatly reduces the dimensionality but also uncovers important associative relationships between terms. It also helps to accelerate training and improve classification accuracy. We test our categorization model on the standard Reuters collection; experimental evaluations show that the model with LSA achieves a dramatic dimensionality reduction while maintaining good classification results.
Keywords
Acceleration; Information analysis; Information technology; Neural networks; Ontologies; Semantic Web; Support vector machine classification; Support vector machines; Testing; Text categorization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Advanced Language Processing and Web Information Technology, 2007. ALPIT 2007. Sixth International Conference on
Conference_Location
Luoyang, Henan, China
Print_ISBN
978-0-7695-2930-1
Type
conf
DOI
10.1109/ALPIT.2007.88
Filename
4460607
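
The LSA step described in the abstract can be illustrated with a minimal sketch. It assumes NumPy, a hand-made six-term, four-document count matrix, and the hypothetical helper names `lsa_project`, `fold_in`, and `cosine`; none of these come from the paper, which does not give its weighting scheme, rank, or network settings. The sketch only shows how a truncated SVD maps documents into a low-dimensional conceptual space where documents using the synonyms "car" and "automobile" end up close together.

```python
import numpy as np


def lsa_project(term_doc, k):
    """Rank-k truncated SVD of a terms x documents matrix.

    Returns (U_k, s_k, doc_vectors), where doc_vectors holds one
    k-dimensional row per document.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Document coordinates in the conceptual space: columns of S_k V_k^T.
    doc_vectors = (np.diag(s_k) @ Vt_k).T
    return U_k, s_k, doc_vectors


def fold_in(U_k, term_vector):
    """Project a document's term vector into the k-dimensional space."""
    return U_k.T @ term_vector


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy data (an assumption for illustration, not the Reuters collection).
# Rows: car, automobile, engine, apple, banana, fruit.
# Columns: doc0 "car engine", doc1 "automobile engine",
#          doc2 "apple fruit", doc3 "banana fruit".
A = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
], dtype=float)

U_k, s_k, docs = lsa_project(A, k=2)

# In the raw term space doc0 and doc1 share only "engine".
raw_sim = cosine(A[:, 0], A[:, 1])
# In the 2-dimensional LSA space the two vehicle documents coincide.
lsa_sim = cosine(docs[0], docs[1])
# Vehicle and fruit documents stay unrelated.
cross_sim = cosine(docs[0], docs[2])

print("raw cosine(doc0, doc1):", round(raw_sim, 3))
print("LSA cosine(doc0, doc1):", round(lsa_sim, 3))
print("LSA cosine(doc0, doc2):", round(cross_sim, 3))
```

In a pipeline like the one the abstract outlines, the rows of `docs` (two features instead of six here) would serve as the inputs to the classifier, and `fold_in` would map unseen documents into the same space before classification.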