DocumentCode :
2290781
Title :
A signature technique for similarity-based queries
Author :
Faloutsos, C. ; Jagadish, H.V. ; Mendelzon, A.O. ; Milo, T.
Author_Institution :
Maryland Univ., MD, USA
fYear :
1997
fDate :
11-13 Jun 1997
Firstpage :
2
Lastpage :
20
Abstract :
Jagadish et al. (see Proc. ACM SIGACT-SIGMOD-SIGART PODS, pp. 36-45, 1995) developed a general framework for posing queries based on similarity. The framework enables a formal definition of the notion of similarity for an application domain of choice, and then its use in queries to perform similarity-based search. We adapt this framework to the specialized domain of real-valued sequences, although some of the ideas we present are applicable to other types of data as well. In particular, we focus on whole-match queries, that is, queries in which the user has to specify the whole sequence. Similarity-based search can be computationally very expensive, and the computation cost depends heavily on the length of the sequences being compared. To make such similarity testing feasible on large data sets, we propose the use of a signature-based technique. In a nutshell, our approach is to "shrink" the data sequences into signatures and search the signatures instead of the real sequences, with further comparison being required only when a possible match is indicated. Being shorter, signatures can usually be compared much faster than the original sequences. In addition, signatures are usually easier to index. For such a signature-based technique to be effective, one has to ensure that (1) the signature comparison is fast, and (2) the signature comparison gives few false alarms and no false dismissals. We obtain measures of goodness for our technique. The technique is illustrated with a couple of very different examples.
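The filter-and-refine idea in the abstract can be sketched as follows. This is a minimal illustration only, not the signature construction from the paper: it assumes Euclidean distance as the similarity measure and uses a hypothetical signature made of scaled segment sums, chosen because the distance between two such signatures never exceeds the distance between the original sequences, so discarding on the signature causes no false dismissals. The names `signature`, `euclidean`, `whole_match`, `eps`, and `k` are all assumptions introduced for this example.

```python
import math


def signature(seq, k):
    """Shrink a sequence into k values (hypothetical signature, not the paper's).

    Each value is a segment sum divided by sqrt(segment length). By the
    Cauchy-Schwarz inequality, the Euclidean distance between two such
    signatures is at most the Euclidean distance between the sequences.
    """
    n = len(seq)
    if k <= 0 or n % k != 0:
        raise ValueError("sequence length must be a positive multiple of k")
    m = n // k
    return [sum(seq[i * m:(i + 1) * m]) / math.sqrt(m) for i in range(k)]


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def whole_match(query, database, eps, k):
    """Return (matches, false_alarms) for a whole-match query within eps."""
    q_sig = signature(query, k)
    # In practice the signatures would be precomputed and indexed.
    sigs = [signature(s, k) for s in database]
    matches = []
    false_alarms = 0
    for seq, sig in zip(database, sigs):
        # Filter step: cheap comparison on the short signatures.
        if euclidean(q_sig, sig) > eps:
            continue  # safe to discard: true distance is at least this large
        # Refine step: full comparison only for candidates.
        if euclidean(query, seq) <= eps:
            matches.append(seq)
        else:
            false_alarms += 1
    return matches, false_alarms


query = [1, 2, 3, 4]
database = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [4, 3, 2, 1],
    [2, 1, 4, 3],  # same segment sums as the query, so it passes the filter
    [10, 10, 10, 10],
]
matches, false_alarms = whole_match(query, database, eps=1.0, k=2)
```

In this sketch, a smaller `k` gives shorter signatures and faster comparisons but tends to let more false alarms through to the refine step, which is the trade-off the abstract's two effectiveness conditions describe.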
Keywords :
information retrieval; query processing; sequences; computation cost; data sequences; false alarms; index; large data sets; real-valued sequences; sequence length; signature technique; similarity testing; similarity-based queries; similarity-based search; whole-match queries; Bioinformatics; Costs; Databases; Euclidean distance; Extraterrestrial measurements; Fourier transforms; Genomics; Image retrieval; Proteins; Testing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Compression and Complexity of Sequences 1997. Proceedings
Conference_Location :
Salerno
Print_ISBN :
0-8186-8132-2
Type :
conf
DOI :
10.1109/SEQUEN.1997.666899
Filename :
666899