• DocumentCode
    3104854
  • Title

    An Efficient Document Categorization Model Based on LSA and BPNN

  • Author

    Li, Cheng Hua ; Park, Soon Cheol

  • fYear
    2007
  • fDate
    22-24 Aug. 2007
  • Firstpage
    9
  • Lastpage
    14
  • Abstract
    This paper proposes a new document categorization model that combines latent semantic analysis (LSA) with a back-propagation neural network (BPNN). In traditional word-matching-based document categorization systems, the most popular and straightforward way to represent a document is the vector space model (VSM). This approach has two drawbacks. First, it needs a large number of features to represent the documents, so the dimensionality is very high. Second, it does not take into account the effects of synonymy and polysemy, which can affect classification accuracy. LSA can overcome these problems by using statistically derived conceptual indices instead of individual words. It constructs a conceptual vector space in which each term or document is represented as a vector. Introducing LSA into our model not only greatly reduces the dimensionality but also uncovers important associative relationships between terms. It also helps to speed up training and improve classification accuracy. We test our categorization model on the standard Reuters collection; experimental evaluations show that the model with LSA achieves dramatic dimensionality reduction while still achieving good classification results.
  • Keywords
    Acceleration; Information analysis; Information technology; Neural networks; Ontologies; Semantic Web; Support vector machine classification; Support vector machines; Testing; Text categorization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Language Processing and Web Information Technology, 2007. ALPIT 2007. Sixth International Conference on
  • Conference_Location
    Luoyang, Henan, China
  • Print_ISBN
    978-0-7695-2930-1
  • Type

    conf

  • DOI
    10.1109/ALPIT.2007.88
  • Filename
    4460607
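
The dimensionality reduction step described in the abstract can be illustrated with a minimal sketch, assuming LSA is computed as a truncated singular value decomposition of a term-document matrix. The record does not give the authors' implementation; the function names `lsa_project` and `fold_in`, the toy matrix, and the choice of k = 2 are hypothetical and serve only as illustration.

```python
import numpy as np


def lsa_project(term_doc, k):
    """Truncated SVD of a (n_terms, n_docs) matrix, keeping k latent dimensions.

    Returns the term basis U_k, the top-k singular values, and one
    k-dimensional row vector per document (V_k scaled by the singular values).
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    doc_vectors = Vt[:k, :].T * s_k  # shape (n_docs, k)
    return U_k, s_k, doc_vectors


def fold_in(term_vector, U_k):
    """Map a new term-weight vector into the same k-dimensional space."""
    return term_vector @ U_k


# Hypothetical toy term-document matrix: 6 terms (rows) x 4 documents (columns).
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)

U_k, s_k, docs = lsa_project(A, k=2)
print(docs.shape)  # each document is now a 2-dimensional vector
```

In a pipeline like the one the abstract outlines, the reduced document vectors (rather than the full term vectors) would be the input to the classifier, which is where the smaller input layer and faster training would come from.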