DocumentCode :
3346374
Title :
An Improved Algorithm of STC for the Deletion of Duplicated Web pages Based on Repeated Strings
Author :
Wang Huijiao ; Yin Bo ; Hou Jie
Author_Institution :
Sch. of Comput. & Control of Comput. Sci., Guilin Univ. of Electron. Technol., Guilin, China
fYear :
2009
fDate :
14-17 Oct. 2009
Firstpage :
414
Lastpage :
417
Abstract :
This paper proposes an improved suffix tree clustering (STC) algorithm for deleting duplicated Web pages based on repeated strings. The core of the algorithm is the extraction of repeated character strings, which serve as the marker of each phrase when building the suffix tree. The suffix tree is then mapped onto the inverse index so that the STC algorithm can delete duplicates. The algorithm also aims to reduce the errors made by existing deletion algorithms. Experimental results indicate that the improved algorithm achieves better accuracy and good time and space performance.
Keywords :
Web sites; document handling; string matching; STC algorithm; duplicated Web page deletion; inverse index; repeated character string extraction; suffix tree; Algorithm design and analysis; Clustering algorithms; Computer science; Data mining; Fingerprint recognition; Genetics; Internet; Paper technology; Search engines; Web pages; deletion of duplicated Web pages; repeated string; the algorithm of STC
fLanguage :
English
Publisher :
IEEE
Conference_Title :
3rd International Conference on Genetic and Evolutionary Computing (WGEC '09), 2009
Conference_Location :
Guilin
Print_ISBN :
978-0-7695-3899-0
Type :
conf
DOI :
10.1109/WGEC.2009.97
Filename :
5402860