Regional Information Center for Science and Technology (RICeST) - Fast approximate similarity search in extremely high-dimensional data sets

DocumentCode :

2848206

Title :

Fast approximate similarity search in extremely high-dimensional data sets

Author :

Houle, Michael E. ; Sakuma, Jun

Author_Institution :

Nat. Inst. of Informatics, Tokyo, Japan

Year :

2005

Date :

5-8 April 2005

Firstpage :

619

Lastpage :

630

Abstract :

This paper introduces a practical index for approximate similarity queries of large multi-dimensional data sets: the spatial approximation sample hierarchy (SASH). A SASH is a multi-level structure of random samples, recursively constructed by building a SASH on a large randomly selected sample of data objects, and then connecting each remaining object to several of its approximate nearest neighbors from within the sample. Queries are processed by first locating approximate neighbors within the sample, and then using the pre-established connections to discover neighbors within the remainder of the data set. The SASH index relies on a pairwise distance measure, but otherwise makes no assumptions regarding the representation of the data. Experimental results are provided for query-by-example operations on protein sequence, image, and text data sets, including one consisting of more than 1 million vectors spanning more than 1.1 million terms, far in excess of what spatial search indices can handle efficiently. For sets of this size, the SASH can return a large proportion of the true neighbors roughly 2 orders of magnitude faster than sequential search.
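The construction and query procedure outlined in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `SampleHierarchy`, the parameters `parents` and `width`, the doubling level sizes, and the Euclidean distance are all assumptions made for illustration, and the paper's actual SASH has further details that are not reproduced here. The sketch only shows the general idea of linking each object to approximate neighbors in a smaller random sample and then following those links at query time. It assumes a non-empty list of points.

```python
import random
from collections import defaultdict


def euclidean(a, b):
    """Plain Euclidean distance; any pairwise distance function would do."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class SampleHierarchy:
    """Toy SASH-like index (hypothetical name, simplified structure)."""

    def __init__(self, points, dist=euclidean, parents=3, width=12, seed=0):
        self.points, self.dist = points, dist
        self.children = defaultdict(list)  # object -> objects linked beneath it
        order = list(range(len(points)))
        random.Random(seed).shuffle(order)
        # Split the shuffled objects into levels of size 1, 2, 4, ...;
        # levels[0..d] taken together form a random sample of the data.
        self.levels, start, size = [], 0, 1
        while start < len(order):
            self.levels.append(order[start:start + size])
            start, size = start + size, size * 2
        # Attach each new object to its closest approximate neighbours
        # in the sample above it, found using only the links built so far.
        for depth, level in enumerate(self.levels[1:]):
            for obj in level:
                _, frontier = self._descend(points[obj], width, depth)
                for p in frontier[:parents]:
                    self.children[p].append(obj)

    def _descend(self, q, width, steps):
        """Follow links downward, keeping the `width` closest objects per step."""
        frontier = list(self.levels[0])
        visited = set(frontier)
        for _ in range(steps):
            cand = {c for p in frontier for c in self.children[p]}
            if not cand:
                break
            frontier = sorted(cand, key=lambda i: self.dist(q, self.points[i]))[:width]
            visited.update(frontier)
        return visited, frontier

    def query(self, q, k, width=None):
        """Return indices of up to k approximate nearest neighbours of q."""
        width = width or 4 * k
        visited, _ = self._descend(q, width, len(self.levels))
        return sorted(visited, key=lambda i: self.dist(q, self.points[i]))[:k]
```

In this sketch, a call such as `SampleHierarchy(points).query(q, k=5)` examines at most `width` candidates per level instead of every object, which is where the speed-up over sequential search would come from. The index touches the data only through `dist`, mirroring the abstract's point that a pairwise distance measure is the sole requirement. Setting `width` to the data set size makes the descent exhaustive, so the result then matches sequential search.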

Keywords :

data structures; distributed databases; query formulation; query processing; very large databases; approximate nearest neighbors; data objects; fast approximate similarity search; large multidimensional data sets; spatial approximation sample hierarchy; Buildings; Content based retrieval; Data mining; Extraterrestrial measurements; Informatics; Information retrieval; Joining processes; Nearest neighbor searches; Protein sequence; Spatial databases

Language :

English

Publisher :

IEEE

Conference_Title :

Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on

ISSN :

1084-4627

Print_ISBN :

0-7695-2285-8

Type :

conf

DOI :

10.1109/ICDE.2005.66

Filename :

1410179

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2848206