Title :
Center Block Duplication Detection and Indexing for Efficient Web Information Retrieval
Author :
Cadenhead, Tyrone ; Chen, Jinlin ; Cook, Terry
Author_Institution :
Dept. of Comput. Sci., CUNY, Flushing, NY
Abstract :
Duplicated information on today's Web has a serious negative impact on search engines: it increases the size of the index and lowers the efficiency of Web information retrieval. A large amount of Web content duplication occurs at the block level in addition to the page level. Moreover, in most Web searches the desired information is located in the center block of a page. Based on these two observations, we propose a block-level duplication detection algorithm and index center blocks instead of entire Web pages for Web information retrieval. Experiments show that these strategies can effectively reduce index size and indexing time without sacrificing the effectiveness of Web information retrieval.
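The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea (fingerprint each page's center block and index each distinct block once), not the authors' method. It assumes pages are already segmented into text blocks, that the center block can be approximated as the longest block, and that exact-match hashing after whitespace and case normalization is sufficient. The names `fingerprint`, `pick_center_block`, and `index_center_blocks`, as well as the sample URLs, are hypothetical.

```python
import hashlib
import re


def fingerprint(block_text):
    """Hash a block after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", block_text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def pick_center_block(blocks):
    """Assumed heuristic (not from the paper): the longest block is the center block."""
    return max(blocks, key=len)


def index_center_blocks(pages):
    """Return a dict mapping fingerprint -> (block text, list of URLs sharing it)."""
    index = {}
    for url, blocks in pages.items():
        if not blocks:
            continue  # nothing to index for an empty page
        center = pick_center_block(blocks)
        key = fingerprint(center)
        if key in index:
            # Duplicate center block: store only the extra URL, not the text again.
            index[key][1].append(url)
        else:
            index[key] = (center, [url])
    return index


# Hypothetical input: each page is a list of already-segmented text blocks.
pages = {
    "http://example.org/a": ["Home | About", "Block-level  duplication is common on the Web.", "Copyright"],
    "http://example.org/b": ["Home | News", "block-level duplication is common on the web.", "Copyright"],
    "http://example.org/c": ["Home", "Center blocks usually hold the desired content.", "Copyright"],
}
index = index_center_blocks(pages)
```

In this sketch the first two pages share one index entry because their center blocks are identical after normalization, which illustrates how indexing deduplicated center blocks could shrink the index. A real system would likely need near-duplicate detection and a more robust center-block heuristic.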
Keywords :
Internet; indexing; information retrieval; search engines; Web information retrieval; center block duplication detection; indexing; search engine; Computer science; Detection algorithms; Educational institutions; Image processing; Indexing; Information retrieval; Infrared detectors; Search engines; Web pages; Web sites;
Conference_Title :
Digital Information Management, 2006 1st International Conference on
Conference_Location :
Bangalore
Print_ISBN :
1-4244-0682-X
DOI :
10.1109/ICDIM.2007.369232