Regional Information Center for Science and Technology

DocumentCode :

168343

Title :

Finding pages on the unarchived Web

Author :

Huurdeman, Hugo C. ; Ben-David, Anat ; Kamps, Jaap ; Samar, Thaer ; de Vries, Arjen P.

Author_Institution :

Univ. of Amsterdam, Amsterdam, Netherlands

fYear :

2014

fDate :

8-12 Sept. 2014

Firstpage :

331

Lastpage :

340

Abstract :

Web archives preserve the fast-changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies: most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and we experiment with this approach on the Dutch Web archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting we can retrieve these pages within the first ranks on average.
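The idea in the abstract, describing an unarchived page by the anchor text of links that point to it from crawled pages, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the input layout (a dict from crawled URL to its outgoing links), the example URLs, and the simple term-overlap ranking, which stands in for whatever retrieval model the paper actually uses.

```python
from collections import defaultdict


def build_unarchived_representations(crawled_pages):
    """Collect anchor-text terms for link targets that were never crawled.

    crawled_pages: dict mapping a crawled URL to a list of
    (target_url, anchor_text) pairs found on that page (hypothetical layout).
    """
    archived = set(crawled_pages)
    representations = defaultdict(list)
    for links in crawled_pages.values():
        for target_url, anchor_text in links:
            # Only targets outside the crawl count as "unarchived".
            if target_url not in archived:
                representations[target_url].extend(anchor_text.lower().split())
    return dict(representations)


def rank_by_query(representations, query):
    """Rank unarchived URLs by how many anchor terms match the query.

    Plain term overlap, chosen for brevity; not the paper's ranking method.
    """
    query_terms = set(query.lower().split())
    scores = {
        url: sum(1 for term in terms if term in query_terms)
        for url, terms in representations.items()
    }
    matching = [url for url, score in scores.items() if score > 0]
    return sorted(matching, key=lambda url: (-scores[url], url))


# Toy crawl: pages a and b were crawled, pages c and d were not.
crawled = {
    "http://a.example/": [
        ("http://b.example/", "City Museum"),
        ("http://c.example/", "museum opening hours"),
        ("http://d.example/", "Contact form"),
    ],
    "http://b.example/": [
        ("http://c.example/", "Opening hours"),
        ("http://a.example/", "home"),
    ],
}

representations = build_unarchived_representations(crawled)
results = rank_by_query(representations, "opening hours")
```

In this toy crawl, `representations` holds entries only for the two uncrawled targets, and a known-item query such as "opening hours" retrieves the page whose inbound anchors mention those terms. A fuller treatment would also need URL normalization and deduplication of links, which this sketch omits.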

Keywords :

Web sites; information retrieval; search engines; Dutch Web archive; Web archives; anchor descriptions; crawling depth; crawling restrictions; known-item search setting; page retrieval; restrictive selection policies; skewed distribution; Context; Crawlers; Cultural differences; Internet; Libraries; Materials; Uniform resource locators; Anchor text; Link evidence; Web archiving; Web crawlers

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Digital Libraries (JCDL), 2014 IEEE/ACM Joint Conference on

Conference_Location :

London

Type :

conf

DOI :

10.1109/JCDL.2014.6970188

Filename :

6970188

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=168343