DocumentCode :
2982140
Title :
Research and application of the detection on duplicate web pages on campus search engine
Author :
Gao, Yongbing ; Zhang, Fang ; Hao, Bin ; Gong, Wei
Author_Institution :
Sch. of Inf. Eng., Inner Mongolia Univ. of Sci. & Technol., Baotou, China
fYear :
2012
fDate :
22-24 June 2012
Firstpage :
555
Lastpage :
558
Abstract :
At present, general-purpose search engines, which are driven by commercial goals, cannot meet the needs of a campus network for timeliness, completeness of the collected information, and ranking of results. Universities therefore need to build their own campus network search engines. In addition, information is often reposted across department websites, so users frequently see duplicate pages with similar content in the search results. This paper analyzes the need for a campus search engine and proposes a duplicate-detection algorithm, based on the longest paragraph and a fingerprint, which is tested in Nutch. Analysis and experiments show that the algorithm efficiently reduces duplicate documents, with better noise resistance, lower time and space complexity, and higher recall and accuracy.
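The abstract names the technique but does not spell out its steps. As a rough illustration only, the sketch below assumes that a page's fingerprint is the MD5 digest (MD5 appears in the keywords) of its longest paragraph after whitespace normalization, and that pages sharing a fingerprint are treated as duplicates. The function names, the blank-line paragraph splitting, and the keep-first policy are assumptions made for this example; they are not taken from the paper or from Nutch.

```python
import hashlib
import re
from typing import Dict


def longest_paragraph(text: str) -> str:
    """Return the longest paragraph of a page, with whitespace collapsed.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return max(paragraphs, key=len, default="")


def fingerprint(text: str) -> str:
    """MD5 hex digest of the page's longest paragraph (assumed scheme)."""
    return hashlib.md5(longest_paragraph(text).encode("utf-8")).hexdigest()


def deduplicate(pages: Dict[str, str]) -> Dict[str, str]:
    """Keep the first page seen for each fingerprint; drop later matches."""
    seen = set()
    kept = {}
    for url, text in pages.items():
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            kept[url] = text
    return kept


# Hypothetical pages: "a" and "b" share their main paragraph but differ in
# surrounding navigation and footer text.
pages = {
    "a": "Home | News\n\nThe university will hold its annual sports meeting on Friday.\n\nCopyright",
    "b": "Dept of CS\n\nThe university will hold   its annual sports meeting on Friday.\n\nContact us",
    "c": "Menu\n\nLibrary opening hours change next week for the holiday.",
}
unique_pages = deduplicate(pages)
```

Hashing only the longest paragraph is one plausible reading of the noise resistance the abstract claims: short navigation and footer blocks never enter the fingerprint. An exact hash of this kind would still miss reposts whose main paragraph was edited.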
Keywords :
Web sites; educational institutions; search engines; Nutch; campus network search engine; department Website; detection algorithm; duplicate Web page detection; duplicate document; fingerprint; general search engine; information resource; longest paragraph; resistance noise; space complexity; time complexity; university; Accuracy; Educational institutions; Fingerprint recognition; Campus Search Engine; Duplicate Detection; MD5; Nutch; Paragraph Fingerprint;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Software Engineering and Service Science (ICSESS), 2012 IEEE 3rd International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4673-2007-8
Type :
conf
DOI :
10.1109/ICSESS.2012.6269527
Filename :
6269527