DocumentCode
3669388
Title
Multilayer classification of web pages using random forest and semi-supervised latent Dirichlet allocation
Author
Karim Sayadi;Quang Vu Bui;Marc Bui
Author_Institution
University Pierre and Marie Curie CHArt Laboratory EA 4004 Paris, France
fYear
2015
fDate
7/1/2015 12:00:00 AM
Firstpage
1
Lastpage
7
Abstract
The classification of web page content is essential to many information retrieval tasks. In this paper, we propose a new methodology for multilayer soft classification. Our approach connects semi-supervised Latent Dirichlet Allocation (LDA) with the Random Forest classifier. We use LDA to compute the distribution of topics in each document and use the results to train the Random Forest classifier. The trained classifier can then categorize each web document at different layers of the category hierarchy. We applied our methodology to a data set collected from dmoz and obtained satisfactory results.
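The pipeline described in the abstract (topic distributions from LDA used as features for a Random Forest, with one prediction per layer of the category hierarchy) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, swaps in plain unsupervised LDA because scikit-learn has no semi-supervised variant, trains one forest per layer as one possible reading of "multilayer", and relies on a hypothetical toy corpus with made-up category labels.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy corpus: (text, layer-1 category, layer-2 category).
# The labels are illustrative only and are not taken from the paper's data set.
docs = [
    ("python compiler code programming language", "Computers", "Programming"),
    ("java code compiler programming software", "Computers", "Programming"),
    ("router network protocol internet packet", "Computers", "Networking"),
    ("network internet protocol firewall router", "Computers", "Networking"),
    ("football goal match league team", "Sports", "Football"),
    ("team league football striker goal", "Sports", "Football"),
    ("tennis racket serve match court", "Sports", "Tennis"),
    ("court serve tennis set racket", "Sports", "Tennis"),
]
texts = [d[0] for d in docs]
labels_by_layer = {1: [d[1] for d in docs], 2: [d[2] for d in docs]}

# Step 1: bag-of-words counts, then LDA to get per-document topic distributions.
# Assumption: unsupervised LDA stands in for the semi-supervised LDA of the paper.
counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=4, random_state=0)
theta = lda.fit_transform(counts)  # shape (n_docs, n_topics), rows sum to 1

# Step 2: train one Random Forest per hierarchy layer on the topic features.
classifiers = {}
for layer, labels in labels_by_layer.items():
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    classifiers[layer] = clf.fit(theta, labels)

# Step 3: soft classification, i.e. class probabilities at each layer.
soft = {layer: clf.predict_proba(theta) for layer, clf in classifiers.items()}

for layer, clf in classifiers.items():
    print("layer", layer, "classes:", list(clf.classes_))
    print(np.round(soft[layer][0], 2))  # probabilities for the first document
```

Each row of `soft[layer]` is a probability vector over the categories of that layer, which is one way to read the "soft" classification mentioned above. The number of topics and trees here are arbitrary choices for the toy data.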
Keywords
"Mathematical model","Web pages","Resource management","Vegetation","Libraries","Standards","Nonhomogeneous media"
Publisher
IEEE
Conference_Titel
Innovations for Community Services (I4CS), 2015 15th International Conference on
Print_ISBN
978-1-4673-7327-2
Type
conf
DOI
10.1109/I4CS.2015.7294479
Filename
7294479