• DocumentCode
    2113369
  • Title

    Semi-Supervised Latent Dirichlet Allocation and Its Application for Document Classification

  • Author

    Wang, Di; Thint, M.; Al-Rubaie, Ahmad

  • Author_Institution
    Etisalat BT Innovation Center, Khalifa Univ., Abu Dhabi, United Arab Emirates
  • Volume
    3
  • fYear
    2012
  • fDate
    4-7 Dec. 2012
  • Firstpage
    306
  • Lastpage
    310
  • Abstract
    Latent Dirichlet Allocation (LDA) is an unsupervised topic modeling method widely applied in natural language processing. However, standard LDA does not permit the use of supervised labels to incorporate expert knowledge into the learning procedure. This paper describes a semi-supervised LDA (ssLDA) method that supports multiple topic labels per document, incorporating available expert knowledge during model construction. This improvement aligns the resulting model with human expectations for topic modeling and extraction. We apply ssLDA to the document classification problem on benchmark datasets. We investigate and compare how the size of the training set and the proportion of supervised data affect the final model structure and improve prediction accuracy.
  • Keywords
    document handling; natural language processing; benchmark datasets; document classification problem; expert knowledge; model construction; multiple topic labels; semisupervised LDA; semisupervised latent Dirichlet allocation; standard LDA; unsupervised topic modeling method; Latent Dirichlet allocation (LDA); semi-supervised LDA; semi-supervised learning; supervised learning; unsupervised learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on
  • Conference_Location
    Macau
  • Print_ISBN
    978-1-4673-6057-9
  • Type

    conf

  • DOI
    10.1109/WI-IAT.2012.211
  • Filename
    6511698