DocumentCode :
2908538
Title :
Duplicate Detection in Documents and WebPages Using Improved Longest Common Subsequence and Documents Syntactical Structures
Author :
Elhadi, Mohamed ; Al-Tobi, Amjad
Author_Institution :
Dept. of Comput. Sci., Sultan Qaboos Univ., Muscat, Oman
fYear :
2009
fDate :
24-26 Nov. 2009
Firstpage :
679
Lastpage :
684
Abstract :
This paper reports on experiments investigating the combined use of part-of-speech (POS) tagging and an improved longest common subsequence (LCS) algorithm for analyzing and calculating the similarity between texts. Each document was represented by the syntactical structure of its text. An improved LCS algorithm was applied to this representation to compare and rank documents according to the similarity of their representative strings. The approach was applied to detecting duplicate documents within a corpus and to filtering search engine results. The results obtained were encouraging.
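As a rough illustration of the idea in the abstract, the sketch below compares documents by the LCS of their POS tag sequences. It is not the authors' method: the abstract does not specify the "improved" LCS, so the textbook dynamic-programming LCS is used in its place. The function names (`lcs_length`, `similarity`, `rank`), the normalization `2 * LCS / (len(a) + len(b))`, and the assumption that each document has already been reduced to a list of POS tags are all hypothetical choices made for this example.

```python
from typing import Dict, List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (textbook DP, two rows).

    Assumption: this is the standard algorithm, not the paper's improved LCS.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Score in [0, 1]; the normalization is an assumption for illustration."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def rank(query: Sequence[str],
         corpus: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Order corpus documents by similarity of their tag sequence to the query."""
    scored = [(name, similarity(query, tags)) for name, tags in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Here each document is assumed to be a list of POS tags produced by some tagger beforehand. A pair scoring near 1.0 would be a candidate duplicate; the cut-off to use is left open, since the abstract gives no threshold.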
Keywords :
computational linguistics; document handling; WebPages; document syntactical structure; duplicate detection; longest common subsequence; part-of-speech; text syntactical structure; Computer science; Content based retrieval; Filtering; Fingerprint recognition; Information analysis; Information retrieval; Information technology; Performance analysis; Search engines; Speech analysis; Duplication Filtering; Longest Common Subsequence; Part-of-Speech; Syntactical Structure;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer Sciences and Convergence Information Technology, 2009. ICCIT '09. Fourth International Conference on
Conference_Location :
Seoul
Print_ISBN :
978-1-4244-5244-6
Electronic_ISBN :
978-0-7695-3896-9
Type :
conf
DOI :
10.1109/ICCIT.2009.235
Filename :
5368928