Duplicate Detection in Documents and WebPages Using Improved Longest Common Subsequence and Documents Syntactical Structures

Author

Elhadi, Mohamed ; Al-Tobi, Amjad

Author_Institution

Dept. of Comput. Sci., Sultan Qaboos Univ., Muscat, Oman

fYear

2009

fDate

24-26 Nov. 2009

Firstpage

679

Lastpage

684

Abstract

This paper reports on experiments investigating the combined use of part-of-speech (POS) information and an improved longest common subsequence (LCS) algorithm to analyze and calculate the similarity between texts. Each text's syntactical structure was used as the representation of the document. The improved LCS algorithm was applied to this representation to compare documents and rank them by the similarity of their representative strings. The approach was applied to detecting duplicate documents within a corpus and to filtering search engine results. The results obtained were encouraging.
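The idea of comparing documents through their syntactical structure can be illustrated with a minimal sketch. It uses the textbook dynamic-programming LCS, not the improved variant the paper evaluates, which the abstract does not describe. The function names, the hand-written Penn Treebank style tag sequences, and the normalization `2 * LCS / (len(a) + len(b))` are all assumptions made for illustration only.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two tag sequences.

    Standard O(len(a) * len(b)) dynamic programming, keeping one row at a time.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalization: 2 * LCS / (len(a) + len(b)), in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


# Hypothetical POS tag strings standing in for two documents.
doc_a = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
doc_b = ["DT", "JJ", "NN", "VBZ", "DT", "NN"]
doc_c = ["PRP", "VBD", "RB"]

# Rank candidate documents by similarity to doc_a, most similar first.
corpus = {"doc_b": doc_b, "doc_c": doc_c}
ranked = sorted(corpus, key=lambda name: similarity(doc_a, corpus[name]), reverse=True)
print(ranked)
```

In a duplicate-detection setting, pairs whose score exceeds a chosen threshold would be flagged as likely duplicates; the threshold value is a tuning decision that this sketch does not attempt to set.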

Keywords

computational linguistics; document handling; WebPages; document syntactical structure; duplicate detection; longest common subsequence; part-of-speech; text syntactical structure; Computer science; Content based retrieval; Filtering; Fingerprint recognition; Information analysis; Information retrieval; Information technology; Performance analysis; Search engines; Speech analysis; Duplication Filtering; Syntactical Structure

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Sciences and Convergence Information Technology, 2009. ICCIT '09. Fourth International Conference on

Conference_Location

Seoul

Print_ISBN

978-1-4244-5244-6

Electronic_ISBN

978-0-7695-3896-9

Type

conf

DOI

10.1109/ICCIT.2009.235

Filename

5368928