• DocumentCode
    2908538
  • Title

    Duplicate Detection in Documents and WebPages Using Improved Longest Common Subsequence and Documents Syntactical Structures

  • Author

    Elhadi, Mohamed ; Al-Tobi, Amjad

  • Author_Institution
    Dept. of Comput. Sci., Sultan Qaboos Univ., Muscat, Oman
  • fYear
    2009
  • fDate
    24-26 Nov. 2009
  • Firstpage
    679
  • Lastpage
    684
  • Abstract
    This paper reports on experiments performed to investigate the combined use of part-of-speech (POS) tagging and an improved longest common subsequence (LCS) algorithm in the analysis and calculation of similarity between texts. The texts' syntactical structures were used as a representation for documents. An improved LCS algorithm was applied to this representation to compare and rank the documents according to the similarity of their representative strings. The approach was applied to detecting duplicate documents within a corpus and to filtering search engine results. Results obtained were encouraging.
  • Keywords
    computational linguistics; document handling; WebPages; document syntactical structure; duplicate detection; longest common subsequence; part-of-speech; text syntactical structure; Computer science; Content based retrieval; Filtering; Fingerprint recognition; Information analysis; Information retrieval; Information technology; Performance analysis; Search engines; Speech analysis; Duplication Filtering; Longest Common Subsequence; Part-of-Speech; Syntactical Structure;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Sciences and Convergence Information Technology, 2009. ICCIT '09. Fourth International Conference on
  • Conference_Location
    Seoul
  • Print_ISBN
    978-1-4244-5244-6
  • Electronic_ISBN
    978-0-7695-3896-9
  • Type

    conf

  • DOI
    10.1109/ICCIT.2009.235
  • Filename
    5368928
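
The abstract describes comparing documents by running LCS over strings that represent their syntactical structure. The following is a minimal sketch of that general idea, not the authors' improved algorithm: it uses the textbook dynamic-programming LCS, and it assumes each document has already been reduced to a list of POS tags. The tag sequences, the function names, and the choice to normalize by the longer sequence are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (standard DP)."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(tags_a, tags_b):
    """LCS length divided by the longer sequence length (assumed normalization)."""
    longest = max(len(tags_a), len(tags_b))
    if longest == 0:
        return 1.0  # two empty documents are treated as identical
    return lcs_length(tags_a, tags_b) / longest


def rank(query_tags, corpus):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = [(name, similarity(query_tags, tags)) for name, tags in corpus.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical, hand-written POS tag sequences standing in for real tagger output.
doc_a = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
doc_b = ["DT", "NN", "VBZ", "JJ", "NN"]
doc_c = ["PRP", "VBD", "RB"]

ranking = rank(doc_a, {"doc_b": doc_b, "doc_c": doc_c})
```

In a duplicate-detection setting, pairs whose score exceeds some chosen threshold would be flagged as likely duplicates; the record does not state what threshold, if any, the paper used.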