Title of article :
Managing Déjà Vu: Collection Building for the
Identification of Nonidentical Duplicate Documents
Author/Authors :
Jack G. Conrad (author), Cindy P. Schriber (author)
Issue Information :
Monthly, with serial issue numbering, year 2006
Abstract :
As online document collections continue to expand,
both on the Web and in proprietary environments, the
need for duplicate detection becomes more critical. Few
users wish to retrieve search results consisting of sets
of duplicate documents, whether identical duplicates
or close variants. The goal of this work is to facilitate
(a) investigations into the phenomenon of near duplicates
and (b) algorithmic approaches to minimizing its
deleterious effect on search results. Harnessing the
expertise of both client-users and professional searchers,
we establish principled methods to generate a test collection
for identifying and handling nonidentical duplicate
documents. We subsequently examine a flexible
method of characterizing and comparing documents to
permit the identification of near duplicates. This method
has produced promising results following an extensive
evaluation using a production-based test collection created
by domain experts.
Journal title :
Journal of the American Society for Information Science and Technology