Title :
Multilayer classification of web pages using random forest and semi-supervised latent dirichlet allocation
Author :
Karim Sayadi;Quang Vu Bui;Marc Bui
Author_Institution :
University Pierre and Marie Curie, CHArt Laboratory EA 4004, Paris, France
Date :
7/1/2015 12:00:00 AM
Abstract :
The classification of web page content is essential to many information retrieval tasks. In this paper, we propose a new methodology for multilayer soft classification. Our approach connects semi-supervised Latent Dirichlet Allocation (LDA) with the Random Forest classifier. We use LDA to compute the distribution of topics in each document and use these distributions to train the Random Forest classifier. The trained classifier can then categorize each web document at different layers of the category hierarchy. We applied our methodology to a data set collected from dmoz and obtained satisfactory results.
Keywords :
"Mathematical model","Web pages","Resource management","Vegetation","Libraries","Standards","Nonhomogeneous media"
Conference_Title :
Innovations for Community Services (I4CS), 2015 15th International Conference on
Print_ISBN :
978-1-4673-7327-2
DOI :
10.1109/I4CS.2015.7294479
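
Illustration :
The pipeline outlined in the abstract (per-document topic distributions from LDA used as features for a Random Forest, with one soft prediction per hierarchy layer) might be sketched roughly as follows. This is a minimal sketch under several assumptions, not the authors' implementation: it uses scikit-learn's standard unsupervised LatentDirichletAllocation in place of the semi-supervised LDA named in the abstract, trains one independent forest per layer, and runs on a tiny made-up corpus with a hypothetical two-layer hierarchy rather than dmoz data. The names topic_features and train_layers are illustrative only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for web pages (hypothetical data, not from dmoz).
docs = [
    "football match goal league team",
    "tennis racket serve match court",
    "football team coach league season",
    "tennis court serve player tournament",
    "python code compiler programming language",
    "network router protocol packet internet",
    "python programming language code library",
    "router network packet protocol switch",
]
# One label list per layer of an assumed two-layer category hierarchy.
top_labels = ["Sports"] * 4 + ["Computers"] * 4
sub_labels = ["Football", "Tennis", "Football", "Tennis",
              "Programming", "Networking", "Programming", "Networking"]


def topic_features(texts, n_topics=4, seed=0):
    """Fit a bag-of-words model and LDA; return both plus the doc-topic matrix."""
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(texts)
    # Assumption: plain (unsupervised) LDA, not the paper's semi-supervised variant.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    theta = lda.fit_transform(counts)  # each row is a topic distribution
    return vectorizer, lda, theta


def train_layers(theta, labels_per_layer, seed=0):
    """Train one Random Forest per hierarchy layer on the topic distributions."""
    return [
        RandomForestClassifier(n_estimators=50, random_state=seed).fit(theta, labels)
        for labels in labels_per_layer
    ]


vectorizer, lda, theta = topic_features(docs)
forests = train_layers(theta, [top_labels, sub_labels])

# Soft classification of an unseen page: class probabilities at each layer.
new_theta = lda.transform(vectorizer.transform(["football league goal"]))
soft = [dict(zip(f.classes_, f.predict_proba(new_theta)[0])) for f in forests]
for layer, probs in enumerate(soft, start=1):
    print(f"layer {layer}:", {k: round(float(v), 2) for k, v in probs.items()})
```

Each entry of soft maps the categories of one layer to a probability, which is one way to read the "soft" classification mentioned in the abstract. With a corpus this small the predicted probabilities are not meaningful; the sketch only shows how the pieces fit together.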