• DocumentCode
    3107061
  • Title
    Plagiarism Detection in arXiv
  • Author
    Sorokina, Daria ; Gehrke, Johannes ; Warner, Simeon ; Ginsparg, Paul
  • Author_Institution
    Dept. of Comput. Sci., Cornell Univ., Ithaca, NY
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    1070
  • Lastpage
    1075
  • Abstract
    We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.
  • Keywords
    research and development; text analysis; arXiv; plagiarism detection; problematic author behaviors; research document collections; Application software; Computer science; Displays; History; Information science; Large-scale systems; Physics computing; Plagiarism; Sequences; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2006. ICDM '06. Sixth International Conference on
  • Conference_Location
    Hong Kong
  • ISSN
    1550-4786
  • Print_ISBN
    0-7695-2701-7
  • Type
    conf
  • DOI
    10.1109/ICDM.2006.126
  • Filename
    4053155