Abstract:
The existence, public availability, and widespread acceptance
of a standard benchmark for a given information
retrieval (IR) task are beneficial to research on
this task, because they allow different researchers to
experimentally compare their own systems by comparing
the results they have obtained on this benchmark.
The Reuters-21578 test collection, together with
its earlier variants, has been such a standard benchmark
for the text categorization (TC) task throughout
the last 10 years.However , the benefits that this has
brought about have somehow been limited by the fact
that different researchers have “carved” different subsets
out of this collection and tested their systems on
one of these subsets only; systems that have been
tested on different Reuters-21578 subsets are thus not
readily comparable.In this article, we present a systematic,
comparative experimental study of the three
subsets of Reuters-21578 that have been most popular
among TC researchers. The results we obtain allow us
to determine the relative hardness of these subsets,
thus establishing an indirect means of comparing TC
systems that have been, or will be, tested on these different
subsets.