Title :
The Adaptive Projection Forest: Using adjustable exclusion and parallelism in metric space indexes
Author :
Thompson, Lee Parnell ; Xu, Weijia ; Miranker, Daniel P.
Author_Institution :
Univ. of Texas at Austin, Austin, TX, USA
Abstract :
This paper introduces an indexing method for searching diverse data types that is easily parallelizable for use with large data sets. This method, the Adaptive Projection Forest (APF), is a partition-based metric-space indexing method that provides generic retrieval solutions for data sets in which similarity is defined by a metric distance function. The APF is uniquely suited to alleviate problems typically encountered in metric-space indexing because it adaptively incorporates exclusion, a method that removes data near a partition boundary and creates multiple trees for use in parallel computing. The use of exclusion allows the index to be more effective when data falls near partition boundaries, where traditional pruning is not always possible. The APF's use of exclusion also allows it to have greater success in parallel environments, meaning that the APF algorithm can be used more effectively on large data sets with diverse data types. In the APF index, the proportion of excluded data is adjusted dynamically at each index node by locally determining the dimension, k, of the projection of the metric space onto the real numbers. The algorithm, which provides asymptotic algorithmic guarantees for nearest neighbor search, is presented along with a parallel implementation of the APF. Across a suite of real-world and synthetic benchmarks, the APF demonstrates favorable empirical results, measured in number of distance calculations, when compared with the emVP, MVP, and SA indexes. Experiments also reveal that the number of calculations can be minimized when a critical parameter, the width of the exclusion region, is set much smaller than the value suggested by asymptotic algorithmic analysis.
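To illustrate the idea of exclusion described in the abstract, the following is a minimal sketch of a single index node that sets aside points near a partition boundary. It is not the APF algorithm itself: it assumes one pivot, a median split, and a fixed exclusion width, and the names `build_node`, `range_search`, and the `width` parameter are hypothetical. In a forest, the excluded points would presumably be indexed in a separate tree; here they are simply scanned.

```python
from statistics import median


def build_node(points, pivot, dist, width):
    """Split points by distance to the pivot, excluding those near the boundary.

    Assumption: the boundary is the median pivot distance, and a point is
    excluded when its pivot distance lies within `width` of that median.
    """
    dists = [dist(pivot, x) for x in points]
    m = median(dists)
    inner, outer, excluded = [], [], []
    for x, d in zip(points, dists):
        if abs(d - m) <= width:
            excluded.append(x)   # too close to the boundary to prune safely
        elif d < m:
            inner.append(x)
        else:
            outer.append(x)
    return {"pivot": pivot, "median": m, "width": width,
            "inner": inner, "outer": outer, "excluded": excluded}


def range_search(node, query, radius, dist):
    """Return all indexed points within `radius` of `query`.

    When radius <= width, the triangle inequality guarantees that the
    partition on the far side of the boundary holds no results, so only
    one partition (plus the excluded set) has to be examined.
    """
    if radius > node["width"]:
        raise ValueError("this sketch only prunes when radius <= width")
    dq = dist(node["pivot"], query)
    side = node["inner"] if dq <= node["median"] else node["outer"]
    candidates = side + node["excluded"]
    return [x for x in candidates if dist(query, x) <= radius]
```

For example, with integer points on a line and absolute difference as the metric, `build_node(list(range(11)), 0, lambda a, b: abs(a - b), 1)` excludes the points 4, 5, and 6, and a query at 3.5 with radius 1 never touches the outer partition.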
Keywords :
indexing; information retrieval; parallel processing; trees (mathematics); very large databases; APF algorithm; APF index; MVP index; SA index; adaptive projection forest; asymptotic algorithmic analysis; asymptotic algorithmic guarantees; diverse data types; emVP index; exclusion region; generic retrieval solution; large data set; metric space indexes; metric-distance function; multiple trees; nearest neighbor search; parallel computing; parallel environment; partition boundary; partition-based metric-space indexing method; pruning; Algorithm design and analysis; Indexing; Partitioning algorithms; Program processors; Throughput; Vegetation;
Conference_Title :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/BigData.2014.7004283