• DocumentCode
    2774720
  • Title

    Detection of Verbatim or Partial Duplication from Multiple Source Documents Using Data Mining Techniques and Case-Based Reasoning Methodologies

  • Author

    Chaudhuri, Chitrita ; Chaudhuri, Atal

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Jadavpur Univ., Kolkata, India
  • fYear
    2011
  • fDate
    19-20 Feb. 2011
  • Firstpage
    129
  • Lastpage
    132
  • Abstract
    This paper aims to specify a Case-Based Reasoning strategy for correctly classifying and storing electronic text material and for preventing its duplication. Preserving complete source documents in order to check the similarity between them poses a daunting amount of spatial and computational complexity to researchers in this area. The problem is partially solved by applying certain preprocessing steps that substantially reduce the volume of data to be handled. The volume of text documents is reduced by applying a stemming algorithm and by eliminating stop words from the document, using text-mining measures such as TF-IDF. A third technique involves extracting keywords and storing them in a properly indexed base. These keywords can then serve the dual purpose of supporting Lazy Learning classification for automatic subject-wise archiving and forming relevant word sequences for the detection of plagiarism using Association Rule-mining techniques.
  • Keywords
    case-based reasoning; data mining; information retrieval; learning (artificial intelligence); pattern classification; reproduction (copying); text analysis; TF-IDF; association rule mining; automatic subject wise archiving; case based reasoning; data handling; electronic text material; keyword extraction; lazy learning classification; multiple source document; partial duplication; plagiarism detection; similarity check; stemming algorithm; stopword elimination; text document; text mining measure; verbatim detection; word sequences; Algorithm design and analysis; Classification algorithms; Cognition; Frequency conversion; Plagiarism; Time frequency analysis; Association Rule-mining techniques; Case-Based Reasoning strategies
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Emerging Applications of Information Technology (EAIT), 2011 Second International Conference on
  • Conference_Location
    Kolkata
  • Print_ISBN
    978-1-4244-9683-9
  • Type

    conf

  • DOI
    10.1109/EAIT.2011.31
  • Filename
    5734933
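
The preprocessing steps named in the abstract (stop-word elimination, stemming, and TF-IDF-based keyword extraction) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the suffix-stripping `crude_stem` function (a stand-in for a real stemming algorithm), and the `top_keywords` helper are all assumptions introduced here for illustration only.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not taken from the paper).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def crude_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def top_keywords(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    token_lists = [preprocess(d) for d in docs]
    n_docs = len(token_lists)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    results = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the dog sat on the log"]
    print(top_keywords(corpus, k=2))
```

Storing only the top-ranked stems per document, rather than the full text, is one way to realize the volume reduction the abstract describes; how the paper indexes those keywords or builds word sequences from them is not specified in this record.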