Title of article :
Link-Based Similarity Measures for the Classification
of Web Documents
Author/Authors :
Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani
Issue Information :
Monthly, with serial issue number, year 2006
Abstract :
Traditional text-based document classifiers tend to perform
poorly on the Web. Text in Web documents is usually
noisy and often does not contain enough information
to determine their topic. However, the Web provides
a different source that can be useful to document classification:
its hyperlink structure. In this work, the authors
evaluate how the link structure of the Web can be used
to determine a measure of similarity appropriate for document
classification. They experiment with five different
similarity measures and determine their adequacy for
predicting the topic of a Web page. Tests performed on a
Web directory show that link information alone allows
classifying documents with an average precision of
86%. Further, when combined with a traditional text-based
classifier, precision increases to values of up to
90%, representing gains that range from 63 to 132% over
the use of text-based classification alone. Because the
measures proposed in this article are straightforward to
compute, they provide a practical and effective solution
for Web classification and related information retrieval
tasks. Further, the authors provide an important set of
guidelines on how link structure can be used effectively
to classify Web documents.
Journal title :
Journal of the American Society for Information Science and Technology