DocumentCode :
2638583
Title :
Webpage Duplicate Detection Using Combined POS and Sequence Alignment Algorithm
Author :
Elhadi, Mohamed ; Al-Tobi, Amjad
Author_Institution :
Dept. of Comput. Sci., Sultan Qaboos Univ., Oman
Volume :
1
fYear :
2009
fDate :
March 31 2009-April 2 2009
Firstpage :
630
Lastpage :
634
Abstract :
Syntactical categories and sequence alignment algorithms are combined to weed out duplicate and near-duplicate Web pages from search engine results. A POS tagger converts parts of a Web page's text into a string of POS tags that represents its syntactical structure. The resulting string is then subjected to longest common sequence (LCS) techniques, as is commonly done in computational biology, to detect duplicate and near-duplicate Web pages. Tagging and alignment are performed on a set of sentences extracted from the Web page as a representative of the page, with the query keywords used as the basis for sentence extraction. Experimental results show that this combined approach provides a promising similarity calculation and re-ranking measure, which can be used with reasonable efficiency to detect duplicates in results generated by search engines such as Google. The similarity measurements can further serve as a basis for text analysis of search results, allowing the detection of duplicates and near-duplicates and the clustering of documents in general.
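The tag-then-align idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy lexicon `TOY_LEXICON` stands in for a real POS tagger, the helper names `tag_sentence`, `lcs_length` and `tag_similarity` are hypothetical, and the normalisation 2 * LCS / (len1 + len2) is only one plausible way to turn an LCS length into a similarity score.

```python
from typing import Dict, List, Sequence

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_LEXICON: Dict[str, str] = {
    "the": "DT", "a": "DT",
    "cat": "NN", "dog": "NN", "mat": "NN", "rug": "NN",
    "sat": "VB", "slept": "VB",
    "on": "IN", "under": "IN",
    "quickly": "RB",
}


def tag_sentence(sentence: str) -> List[str]:
    """Map each whitespace-separated word to a POS tag; unknown words get 'UNK'."""
    return [TOY_LEXICON.get(word, "UNK") for word in sentence.lower().split()]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tag_similarity(s1: str, s2: str) -> float:
    """Similarity in [0, 1] between the POS-tag strings of two sentences.

    The normalisation used here is an assumption, not the paper's measure.
    """
    t1, t2 = tag_sentence(s1), tag_sentence(s2)
    if not t1 and not t2:
        return 1.0
    return 2.0 * lcs_length(t1, t2) / (len(t1) + len(t2))


if __name__ == "__main__":
    # Different words, identical tag string: DT NN VB IN DT NN.
    print(tag_similarity("the cat sat on the mat", "the dog slept under a rug"))
    # Shorter sentence sharing only the prefix DT NN VB.
    print(tag_similarity("the cat sat on the mat", "the cat sat quickly"))
```

Because the comparison runs on tags rather than words, two sentences with different vocabulary but the same syntactic shape align perfectly in this sketch, which is the kind of structural match a POS-based approach would flag for closer inspection.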
Keywords :
Internet; query processing; search engines; text analysis; Web page duplicate detection; combined POS algorithm; combined syntactical category; common sequence technique; keyword querying; near-duplicate Web-page; re-ranking measure; search engine; sentence extraction; sequence alignment algorithm; text similarity calculation; weed-out duplicate Web-page; Computational biology; Computer science; Content based retrieval; Information retrieval; Proteins; Search engines; Sequences; Tagging; Text analysis; Web pages; Copy Detection; Duplicate; LCS; Longest Common Sequence; POS; Part-of-speech; Search Engine;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Science and Information Engineering, 2009 WRI World Congress on
Conference_Location :
Los Angeles, CA
Print_ISBN :
978-0-7695-3507-4
Type :
conf
DOI :
10.1109/CSIE.2009.771
Filename :
5171248