DocumentCode :
1791634
Title :
Recall estimation for rare topic retrieval from large corpuses
Author :
Bommannavar, Praveen ; Kolcz, Alek ; Rajaraman, Anand
fYear :
2014
fDate :
27-30 Oct. 2014
Firstpage :
825
Lastpage :
834
Abstract :
The problem of finding documents pertaining to a particular topic arises in a variety of scenarios. Indeed, the demand for topically pertinent documents has led myriad companies to offer services that find and deliver them (perhaps along with sentiment analysis or clustering) to customers for any topics of interest. The methodologies used to uncover relevant documents range from manually curated keyword filters to trained classification models. Any serious topical analysis requires a sound understanding of the key metrics behind the retrieval process, two of the most important being precision and recall. While precision can be measured easily and inexpensively by sampling from classified documents and using (paid) human computation to mark incorrectly classified instances, it is not as straightforward to use the same approach for measuring recall. With most topics occurring relatively sparsely, an unbiased sampling approach becomes prohibitively expensive. In this paper, we introduce a recall measurement procedure requiring only relatively few human judgements. The technique makes use of pairs of sufficiently independent classifiers, and the paper provides a detailed discussion of how such classifier pairs can be constructed in practice, with a focus on social media classifiers. We report the performance of the proposed method with simple keyword filters as well as with classifiers of progressive levels of complexity and show that, under reasonable conditions, recall can be estimated to within 0.10 absolute error and 15% relative error, and often closer, with a cost reduction of as much as 1000x compared with unbiased sampling.
Keywords :
document handling; information retrieval; pattern classification; pattern clustering; classification models; classified documents; clustering; keyword filters; pertinent documents; rare topic retrieval; recall estimation; recall measurement procedure; sentiment analysis; social media classifiers; topical analysis; unbiased sampling; Companies; Estimation; Joints; Labeling; Measurement; Media; Twitter; classifier evaluation; human evaluation; mechanical turk; social media
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
Type :
conf
DOI :
10.1109/BigData.2014.7004312
Filename :
7004312