DocumentCode :
1900572
Title :
Web Contents Tracking by Learning of Page Grammars
Author :
Kukulenz, Dirk ; Reinke, Christoph ; Hoeller, Nils
Author_Institution :
Inst. of Inf. Syst., Luebeck Univ., Luebeck
fYear :
2008
fDate :
8-13 June 2008
Firstpage :
416
Lastpage :
425
Abstract :
A significant fraction of Web data is available only for short periods of time. We consider methods to keep track of and record such dynamic information automatically. The main problems are to find adequate reload times for Web data in order to reduce network traffic, to improve the freshness of obtained data, and to reduce the risk of losing information. Previous approaches usually improve reload strategies for Web data by considering the change dynamics of pages, modeling the behavior statistically, and then applying suitable reload strategies. Based on this approach, we first give a precise definition of data changes on the Web. Page changes are described by a page decomposition that is based on the estimation of grammars. Based on this decomposition, segments of Web pages are identified. The change behavior of individual segments is recorded and applied to optimize reload strategies. We show that the completeness of obtained data and the network traffic may be improved significantly by applying our new reload strategy.
Keywords :
Web sites; Web contents tracking; Web data; network traffic; page grammars learning; reload strategies; Bandwidth; Crawlers; HTML; Information systems; Navigation; Search engines; Telecommunication traffic; Traffic control; Web and internet services; Web pages; page decomposition; pruning; tracking;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Internet and Web Applications and Services, 2008. ICIW '08. Third International Conference on
Conference_Location :
Athens
Print_ISBN :
978-0-7695-3163-2
Electronic_ISBN :
978-0-7695-3163-2
Type :
conf
DOI :
10.1109/ICIW.2008.58
Filename :
4545649