An Improved Algorithm of STC for the Deletion of Duplicated Web pages Based on Repeated Strings

Author

Wang Huijiao ; Yin Bo ; Hou Jie

Author_Institution

Sch. of Comput. & Control of Comput. Sci., Guilin Univ. of Electron. Technol., Guilin, China

fYear

2009

fDate

14-17 Oct. 2009

Firstpage

414

Lastpage

417

Abstract

This paper proposes an improved suffix tree clustering (STC) algorithm for deleting duplicated Web pages, based on repeated strings. The algorithm first extracts repeated character strings and uses them to mark each phrase when building the suffix tree. The suffix tree is then mapped onto an inverted index so that the STC algorithm can detect and delete duplicates. The algorithm also aims to reduce the errors made by existing deletion algorithms. Experimental results indicate that the improved algorithm achieves higher accuracy and good time and space characteristics.
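
The pipeline outlined in the abstract (repeated strings, a phrase index, duplicate detection) can be illustrated with a minimal sketch. This is not the authors' algorithm: it replaces the suffix tree with plain word n-gram enumeration, and the names `ngrams`, `build_inverted_index`, `duplicate_pairs`, the n-gram length `n`, and the `threshold` value are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def ngrams(text, n):
    """Return the set of distinct word n-grams (phrases) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_inverted_index(docs, n=3):
    """Map each phrase to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for phrase in ngrams(text, n):
            index[phrase].add(doc_id)
    return index


def duplicate_pairs(docs, n=3, threshold=0.8):
    """Return document pairs whose shared phrases cover at least
    `threshold` of the smaller document's phrases (hypothetical criterion)."""
    index = build_inverted_index(docs, n)
    sizes = [len(ngrams(text, n)) for text in docs]

    # Count, for every pair of documents, how many phrases they share.
    shared = defaultdict(int)
    for ids in index.values():
        if len(ids) < 2:
            continue  # phrase is not repeated across documents
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    pairs = []
    for (a, b), count in shared.items():
        smaller = min(sizes[a], sizes[b])
        if smaller and count / smaller >= threshold:
            pairs.append((a, b))
    return sorted(pairs)


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",
    "completely different text about suffix trees and clustering",
]
print(duplicate_pairs(docs))
```

In this sketch only phrases repeated across documents contribute to the pair counts, which loosely mirrors the role the abstract gives to repeated strings. A suffix tree would find repeated strings of any length without fixing `n` in advance.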

Keywords

Web sites; document handling; string matching; STC algorithm; duplicated Web page deletion; inverse index; repeated character string extraction; suffix tree; Algorithm design and analysis; Clustering algorithms; Computer science; Data mining; Fingerprint recognition; Genetics; Internet; Paper technology; Search engines; Web pages; deletion of duplicated Web pages; repeated string; the algorithm of STC;

fLanguage

English

Publisher

IEEE

Conference_Title

Genetic and Evolutionary Computing, 2009. WGEC '09. 3rd International Conference on

Conference_Location

Guilin

Print_ISBN

978-0-7695-3899-0

Type

conf

DOI

10.1109/WGEC.2009.97

Filename

5402860