DocumentCode
3107061
Title
Plagiarism Detection in arXiv
Author
Sorokina, Daria ; Gehrke, Johannes ; Warner, Simeon ; Ginsparg, Paul
Author_Institution
Dept. of Comput. Sci., Cornell Univ., Ithaca, NY
fYear
2006
fDate
18-22 Dec. 2006
Firstpage
1070
Lastpage
1075
Abstract
We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.
Keywords
research and development; text analysis; arXiv; plagiarism detection; problematic author behaviors; research document collections; Application software; Computer science; Displays; History; Information science; Large-scale systems; Physics computing; Plagiarism; Sequences; Testing;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining, 2006. ICDM '06. Sixth International Conference on
Conference_Location
Hong Kong
ISSN
1550-4786
Print_ISBN
0-7695-2701-7
Type
conf
DOI
10.1109/ICDM.2006.126
Filename
4053155