DocumentCode
2908538
Title
Duplicate Detection in Documents and WebPages Using Improved Longest Common Subsequence and Documents Syntactical Structures
Author
Elhadi, Mohamed ; Al-Tobi, Amjad
Author_Institution
Dept. of Comput. Sci., Sultan Qaboos Univ., Muscat, Oman
fYear
2009
fDate
24-26 Nov. 2009
Firstpage
679
Lastpage
684
Abstract
This paper reports on experiments performed to investigate the combined use of part-of-speech (POS) tagging and an improved longest common subsequence (LCS) algorithm in the analysis and calculation of similarity between texts. The texts' syntactical structures were used as a representation for documents. An improved LCS algorithm was applied to this representation to compare and rank the documents according to the similarity of their representative strings. The approach was applied to detecting duplicate documents within a corpus and to filtering search engine results. Results obtained were encouraging.
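The idea in the abstract, comparing documents by the longest common subsequence of their POS-tag strings and ranking them by the resulting score, can be illustrated with a minimal sketch. This record does not describe the paper's improved LCS algorithm, so the code below uses the textbook dynamic-programming LCS instead. The function names, the normalization by the longer sequence, and the Penn Treebank-style tags are assumptions made for illustration only. Tag sequences are supplied by hand, so no tagger is required.

```python
from typing import Dict, List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (textbook DP, two rows)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer sequence, keep rows short
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return lcs_length(a, b) / longest


def rank_by_similarity(
    query: Sequence[str], corpus: Dict[str, Sequence[str]]
) -> List[Tuple[str, float]]:
    """Return (document name, score) pairs, most similar first."""
    scored = [(name, similarity(query, tags)) for name, tags in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical POS-tag strings standing in for tagged documents.
query_tags = ["DT", "JJ", "NN", "VBZ"]
corpus_tags = {
    "doc_a": ["DT", "JJ", "NN", "VBZ"],   # same syntactic structure
    "doc_b": ["DT", "NN", "VBZ", "RB"],   # partly overlapping structure
    "doc_c": ["IN", "PRP", "MD", "VB"],   # no tags in common
}
ranking = rank_by_similarity(query_tags, corpus_tags)
```

In a duplicate-detection setting, documents whose score exceeds some chosen threshold would be flagged as likely duplicates; the threshold itself is a tuning decision that this record does not specify.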
Keywords
computational linguistics; document handling; WebPages; document syntactical structure; duplicate detection; longest common subsequence; part-of-speech; text syntactical structure; Computer science; Content based retrieval; Filtering; Fingerprint recognition; Information analysis; Information retrieval; Information technology; Performance analysis; Search engines; Speech analysis; Duplication Filtering; Longest Common Subsequence; Part-of-Speech; Syntactical Structure
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Sciences and Convergence Information Technology, 2009. ICCIT '09. Fourth International Conference on
Conference_Location
Seoul
Print_ISBN
978-1-4244-5244-6
Electronic_ISBN
978-0-7695-3896-9
Type
conf
DOI
10.1109/ICCIT.2009.235
Filename
5368928