Title :
Query revision during cluster based search on large unstructured corpora
Author :
Deolalikar, Vinay
Author_Institution :
Hewlett-Packard Res., Sunnyvale, CA, USA
Abstract :
We investigate a frequently occurring issue in search (retrieval) in the age of big unstructured data. Searches conducted on large unstructured corpora result in long results lists. Such results lists are often clustered and reranked for ease of navigation. Should a query be revised during time-critical examinations of such long cluster-based reranked lists? This question arises naturally during early stages of commercially important applications of IR such as eDiscovery, but has not yet received research attention. Four factors compound the difficulty of this question in the setting of eDiscovery: (a) the query sources (the technical experts) are different from the legal staff who actually execute the query and use the retrieval system, (b) the retrieved lists for each query tend to be very long, (c) the user might be accessing these retrieved results through a clustering interface, and (d) all decisions must be transparent and easy to explain due to the litigious nature of the application. Analogous difficulties arise in other applications involving search over large unstructured corpora. We introduce a framework to help users make the decision of “whether to revise.” Our framework consists of two components. First, we introduce a “limited view,” which is a summary of a long cluster-based reranked list; this is the first input to the user. Second, we construct query predictors for this limited view, and provide their prediction as a second input to the user. This prediction is used to corroborate the inspection of the summary limited view.
The proposed combination of a limited view and query performance prediction can assist search staff in determining whether to pursue an expensive query revision, and can save precious time by precluding inspections of lists with very few relevant documents during the early stages of commercially important applications such as eDiscovery.
Keywords :
document handling; pattern clustering; query processing; IR; big unstructured data; cluster based search; clustering interface; eDiscovery; large unstructured corpora; legal staff; long-cluster-based reranked lists; query performance prediction; query predictors; query revision; query sources; results list clustering; results list re-ranking; retrieval system; retrieved result access; search staff; summary limited view inspection; technical experts; time-critical examinations; Clustering algorithms; Law; Measurement; Search problems; Standards; Vectors;
Conference_Title :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/BigData.2014.7004314