Title :
What's there and what's not?: focused crawling for missing documents in digital libraries
Author :
Zhuang, Ziming ; Wagle, Rohit ; Giles, C. Lee
Author_Institution :
Sch. of Inf. Sci. & Technol., Pennsylvania State Univ., University Park, PA
Abstract :
Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives and university and author homepages, and through authors' self-submissions. While these approaches have so far built reasonably sized libraries, they can suffer from having only a portion of the documents from specific publishing venues. We propose to use alternative online resources and techniques that maximally exploit other resources to build the complete document collection of any given publication venue. We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over a time frame of 1998 to 2004. We then identify the missing papers that are not indexed by CiteSeer. Using a fully automatic heuristic-based system that locates authors' homepages and then uses focused crawling to download the desired papers, we demonstrate that it is practical to use a focused crawler to harvest academic papers that are missing from our digital library. Our harvester achieves an average recall of 0.82 overall and 0.75 for the missing documents. Evaluation of the crawler's performance based on harvest rate shows definite advantages over other crawling approaches, and the crawler consistently outperforms a defined baseline crawler on a number of measures.
Keywords :
digital libraries; information retrieval; information retrieval system evaluation; search engines; CiteSeer; academic papers; author home page location; automatic heuristic-based system; computer science publishing venues; document collection; focused crawling; missing documents; online resources; publication metadata; Computer science; Crawlers; Large-scale systems; Machinery; Permission; Publishing; Robots; Software libraries; Web pages; ACM; DBLP; focused crawler; harvesting
Conference_Title :
Digital Libraries, 2005. JCDL '05. Proceedings of the 5th ACM/IEEE-CS Joint Conference on
Conference_Location :
Denver, CO
Print_ISBN :
1-58113-876-8
DOI :
10.1145/1065385.1065455