DocumentCode :
1621079
Title :
Page Digest for large-scale Web services
Author :
Rocco, Daniel ; Buttler, David ; Liu, Ling
Author_Institution :
Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
fYear :
2003
Firstpage :
381
Lastpage :
390
Abstract :
We introduce Page Digest, a mechanism for efficient storage and processing of Web documents. The Page Digest design encourages a clean separation of the structural elements of Web documents from their content. Its encoding transformation produces many of the advantages of traditional string digest schemes yet remains invertible without introducing significant additional cost or complexity. Using the Page Digest encoding can provide at least an order of magnitude speedup when traversing a Web document, compared with using a standard document object model implementation. Our experiments show that change detection using Page Digest operates in linear time, offering a 75% improvement in execution performance compared with existing systems. In addition, the Page Digest encoding can reduce the tag name redundancy found in Web documents, allowing a 30% to 50% reduction in document size.
Keywords :
Internet; abstracting; content management; document handling; information storage; HTML documents; Web document processing; Web document storage; change detection; content element; data management; document format; document layout; document object model implementation; document size reduction; encoding transformation; execution performance improvement; information collection; large scale Web service; linear time operation; magnitude speedup; page digest encoding; semantic information; string digest scheme; structural element separation; tag name redundancy; Costs; Educational institutions; Encoding; HTML; Knowledge management; Large-scale systems; Memory; Search engines; Web and internet services; Web services;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
E-Commerce, 2003. CEC 2003. IEEE International Conference on
Print_ISBN :
0-7695-1969-5
Type :
conf
DOI :
10.1109/COEC.2003.1210274
Filename :
1210274