Title :
Incremental Web Page Template Detection by Text Segments
Author :
Wang, Yu ; Fang, Bingxing ; Cheng, Xueqi ; Guo, Li ; Xu, Hongbo
Author_Institution :
Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing
Abstract :
Template detection is important for many applications. Most template detection methods use content repetition as a hint to detect template blocks, which requires many Web pages as input. They therefore usually process Web pages in batches, so a newly crawled page cannot be processed until enough pages have been collected. This consumes a large amount of storage to cache Web pages and causes a long delay in data refreshing. In this paper, we present an incremental framework for template detection in which a page is processed as soon as it has been crawled. Under this framework, we don't need to cache any Web pages. Experiments show that our framework consumes less than 7% of the storage used by traditional methods, and the delay in data refreshing induced by batch processing is completely eliminated.
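The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of incremental detection over text segments, not the authors' method. It assumes that a text segment is the text between two HTML tags, that segments are compared by a SHA-1 fingerprint, and that a segment counts as template once it has appeared in a sufficient share of the pages seen so far. The class name `IncrementalTemplateDetector` and the parameters `min_pages` and `min_ratio` are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


class IncrementalTemplateDetector:
    """Hypothetical sketch: stores only per-segment counters, never whole pages."""

    def __init__(self, min_pages=3, min_ratio=0.5):
        self.min_pages = min_pages    # assumed warm-up before anything is flagged
        self.min_ratio = min_ratio    # assumed share of pages that marks a template
        self.pages_seen = 0
        self.segment_pages = defaultdict(int)  # fingerprint -> pages containing it

    @staticmethod
    def split_segments(html):
        # Assumption: a text segment is the text between two tags,
        # with whitespace normalized and empty runs dropped.
        text_runs = re.split(r"<[^>]+>", html)
        return [" ".join(run.split()) for run in text_runs if run.strip()]

    @staticmethod
    def fingerprint(segment):
        return hashlib.sha1(segment.encode("utf-8")).hexdigest()

    def process(self, html):
        """Handle one page as soon as it arrives; return (template, content)."""
        segments = self.split_segments(html)
        self.pages_seen += 1
        # Count each distinct segment once per page (document frequency).
        for fp in {self.fingerprint(s) for s in segments}:
            self.segment_pages[fp] += 1
        template, content = [], []
        for s in segments:
            share = self.segment_pages[self.fingerprint(s)] / self.pages_seen
            if self.pages_seen >= self.min_pages and share >= self.min_ratio:
                template.append(s)
            else:
                content.append(s)
        return template, content
```

In this sketch the only state kept between pages is the counter table, so each page can be discarded right after `process` returns. A segment seen before the warm-up threshold is reported as content and is not reclassified later, which is one of several simplifications compared with a real system.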
Keywords :
Internet; text analysis; Web pages; incremental Web page template detection; text segments; Bars; Cache storage; Computers; Conferences; Degradation; Delay; Feeds; Navigation; Search engines
Conference_Title :
Semantic Computing and Systems, 2008. WSCS '08. IEEE International Workshop on
Conference_Location :
Huangshan
Print_ISBN :
978-0-7695-3316-2
Electronic_ISBN :
978-0-7695-3316-2
DOI :
10.1109/WSCS.2008.17