DocumentCode :
1825711
Title :
A comparative study of citations and links in document classification
Author :
Couto, Thierson ; Cristo, Marco ; Gonçalves, Marcos André ; Calado, Pável ; Ziviani, Nivio ; Moura, Edleno ; Ribeiro-Neto, Berthier
Author_Institution :
Comput. Sci. Dept., Fed. Univ. of Minas Gerais, Belo Horizonte
fYear :
2006
fDate :
June 2006
Firstpage :
75
Lastpage :
84
Abstract :
It is well known that links are an important source of information when dealing with Web collections. However, the question remains whether the same techniques that are used on the Web can be applied to collections of documents containing citations between scientific papers. In this work we present a comparative study of digital library citations and Web links in the context of automatic text classification. We show that there are in fact differences between citations and links in this context. For the comparison, we run a series of experiments using a digital library of computer science papers and a Web directory. In our reference collections, measures based on co-citation tend to perform better for pages in the Web directory, with gains of up to 37% over text-based classifiers, while measures based on bibliographic coupling perform better in a digital library. We also propose a simple and effective way of combining a traditional text-based classifier with a citation-link-based classifier. This combination is based on the notion of classifier reliability and yielded gains of up to 14% in micro-averaged F1 in the Web collection. However, no significant gain was obtained in the digital library. Finally, a user study was performed to further investigate the causes of these results. We discovered that misclassifications by the citation-link-based classifiers are in fact difficult cases, hard to classify even for humans.
Keywords :
Internet; citation analysis; classification; digital libraries; text analysis; Web collection; digital library; document classification; text classification; Computer science; Gain measurement; Humans; Information retrieval; Performance evaluation; Performance gain; Permission; Software libraries; Text categorization; Web pages; links; web directories
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Digital Libraries, 2006. JCDL '06. Proceedings of the 6th ACM/IEEE-CS Joint Conference on
Conference_Location :
Chapel Hill, NC
Print_ISBN :
1-59593-354-9
Type :
conf
DOI :
10.1145/1141753.1141766
Filename :
4119100