• DocumentCode
    3669388
  • Title
    Multilayer classification of web pages using random forest and semi-supervised latent Dirichlet allocation
  • Author
    Karim Sayadi;Quang Vu Bui;Marc Bui
  • Author_Institution
    University Pierre and Marie Curie, CHArt Laboratory EA 4004, Paris, France
  • fYear
    2015
  • fDate
    7/1/2015 12:00:00 AM
  • Firstpage
    1
  • Lastpage
    7
  • Abstract
    The classification of web page content is essential to many information retrieval tasks. In this paper, we propose a new methodology for multilayer soft classification. Our approach connects semi-supervised Latent Dirichlet Allocation (LDA) with the Random Forest classifier. We use LDA to compute the distribution of topics in each document and use the results to train the Random Forest classifier. The trained classifier is then able to categorize each web document at different layers of the category hierarchy. We applied our methodology to a data set collected from DMOZ and obtained satisfactory results.
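The pipeline outlined in the abstract (topic distributions from LDA used as features for a Random Forest, with a soft prediction at each layer of the category hierarchy) might be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn, substitutes scikit-learn's standard unsupervised `LatentDirichletAllocation` for the paper's semi-supervised LDA, trains one forest per hierarchy layer as one possible reading of "multilayer", and uses a hypothetical toy corpus with made-up category labels rather than the DMOZ data.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; each document has a label at two hierarchy layers.
docs = [
    "football match goal league team",
    "tennis racket serve match court",
    "football team coach league season",
    "tennis court player serve tournament",
    "python code compiler programming language",
    "laptop processor memory hardware chip",
    "programming language code function compiler",
    "hardware chip processor memory board",
]
layer1 = ["Sports"] * 4 + ["Computers"] * 4
layer2 = ["Football", "Tennis", "Football", "Tennis",
          "Programming", "Hardware", "Programming", "Hardware"]

# Step 1: bag-of-words counts, then per-document topic distributions.
# Assumption: plain unsupervised LDA stands in for semi-supervised LDA.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=4, random_state=0)
theta = lda.fit_transform(counts)  # shape (n_docs, n_topics), rows sum to 1

# Step 2: train one Random Forest per layer on the topic distributions.
classifiers = {}
for name, labels in (("layer1", layer1), ("layer2", layer2)):
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    classifiers[name] = forest.fit(theta, labels)


def classify(topic_vector):
    """Return soft (probabilistic) category assignments for each layer."""
    result = {}
    for name, clf in classifiers.items():
        probs = clf.predict_proba([topic_vector])[0]
        result[name] = dict(zip(clf.classes_, probs))
    return result


soft = classify(theta[0])
for name, distribution in soft.items():
    print(name, distribution)
```

For an unseen page, the same fitted objects would be reused: `lda.transform(vectorizer.transform([text]))[0]` yields its topic vector, which can then be passed to `classify`.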
  • Keywords
    "Mathematical model","Web pages","Resource management","Vegetation","Libraries","Standards","Nonhomogeneous media"
  • Publisher
    IEEE
  • Conference_Titel
    Innovations for Community Services (I4CS), 2015 15th International Conference on
  • Print_ISBN
    978-1-4673-7327-2
  • Type
    conf
  • DOI
    10.1109/I4CS.2015.7294479
  • Filename
    7294479