Title :
Duplicate Detection in Documents and WebPages Using Improved Longest Common Subsequence and Documents Syntactical Structures
Author :
Elhadi, Mohamed ; Al-Tobi, Amjad
Author_Institution :
Dept. of Comput. Sci., Sultan Qaboos Univ., Muscat, Oman
Abstract :
This paper reports on experiments investigating the combined use of part-of-speech (POS) information and an improved longest common subsequence (LCS) algorithm to analyze and calculate the similarity between texts. Each text's syntactical structure was used as the representation of the document. The improved LCS algorithm was applied to this representation to compare and rank documents according to the similarity of their representative strings. The approach was applied to detecting duplicate documents within a corpus and to filtering search engine results. The results obtained were encouraging.
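The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a sequence of POS tags (the Penn Treebank-style tags below are hand-written, hypothetical inputs, not output from the paper's tagger). It uses the textbook dynamic-programming LCS, not the improved LCS variant the paper reports, and the normalization in `lcs_similarity` is an illustrative choice rather than the authors' scoring formula.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two tag sequences.

    Textbook dynamic programming, keeping only one previous row,
    so memory use is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Tags match: extend the best subsequence ending before both.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Normalized similarity in [0, 1] (assumed normalization, not the paper's)."""
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


# Hypothetical POS-tag representations of two short documents.
doc_a = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
doc_b = ["DT", "JJ", "NN", "VBZ", "DT", "NN"]

score = lcs_similarity(doc_a, doc_b)
```

Ranking a corpus against a query document would then amount to sorting the documents by this score; a duplicate-detection threshold would have to be chosen empirically.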
Keywords :
computational linguistics; document handling; WebPages; document syntactical structure; duplicate detection; longest common subsequence; part-of-speech; text syntactical structure; Computer science; Content based retrieval; Filtering; Fingerprint recognition; Information analysis; Information retrieval; Information technology; Performance analysis; Search engines; Speech analysis; Duplication Filtering; Longest Common Subsequence; Part-of-Speech; Syntactical Structure
Conference_Title :
Computer Sciences and Convergence Information Technology, 2009. ICCIT '09. Fourth International Conference on
Conference_Location :
Seoul
Print_ISBN :
978-1-4244-5244-6
Electronic_ISBN :
978-0-7695-3896-9
DOI :
10.1109/ICCIT.2009.235