• DocumentCode
    1186214
  • Title

    Properties of embedding methods for similarity searching in metric spaces

  • Author

    Hjaltason, Gísli R. ; Samet, Hanan

  • Author_Institution
    Sch. Comput. Sci., Waterloo Univ., Ont., Canada
  • Volume
    25
  • Issue
    5
  • fYear
    2003
  • fDate
    5/1/2003 12:00:00 AM
  • Firstpage
    530
  • Lastpage
    549
  • Abstract
    Complex data types, such as images, documents, and DNA sequences, are becoming increasingly important in modern database applications. A typical query in many of these applications seeks objects that are similar to some target object, where (dis)similarity is defined by some distance function. Often, the cost of evaluating the distance between two objects is very high. Thus, the number of distance evaluations should be kept to a minimum, while (ideally) maintaining the quality of the result. One way to approach this goal is to embed the data objects in a vector space so that the distances between the embedded objects approximate the actual distances. Thus, queries can be performed (for the most part) on the embedded objects. We are especially interested in examining whether or not the embedding methods ensure that no relevant objects are left out. Particular attention is paid to the SparseMap, FastMap, and MetricMap embedding methods. SparseMap is a variant of Lipschitz embeddings, while FastMap and MetricMap are inspired by dimension reduction methods for Euclidean spaces. We show that, in general, none of these embedding methods guarantees that queries on the embedded objects have no false dismissals, while also demonstrating the limited cases in which the guarantee does hold. Moreover, we describe a variant of SparseMap that allows queries with no false dismissals. In addition, we show that with FastMap and MetricMap, the distances between the embedded objects can be much greater than the actual distances. This makes it impossible (or at least impractical) to modify FastMap and MetricMap to guarantee no false dismissals.
  • Keywords
    multimedia databases; query processing; singular value decomposition; DNA sequences; Euclidean spaces; FastMap; Lipschitz embeddings; MetricMap; SparseMap; complex data types; contractiveness; dimension reduction methods; distance evaluations; distortion; documents; embedding methods; images; metric spaces; similarity search; similarity searching; Application software; Computer Society; Costs; DNA; Design automation; Extraterrestrial measurements; Image databases; Proteins; Sequences
  • fLanguage
    English
  • Journal_Title
    Pattern Analysis and Machine Intelligence, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0162-8828
  • Type

    jour

  • DOI
    10.1109/TPAMI.2003.1195989
  • Filename
    1195989
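
The abstract's central concern, queries on embedded objects with no false dismissals, can be illustrated with a minimal sketch. It is not SparseMap, FastMap, or MetricMap as described in the paper. It assumes the simplest form of Lipschitz embedding, in which each coordinate is the distance to a single reference object, and compares embedded vectors with the Chebyshev (maximum-coordinate) distance. Under those assumptions the triangle inequality makes the embedded distance a lower bound on the true distance, so filtering in the embedded space cannot discard a true answer. All function names, the choice of edit distance, and the sample strings are hypothetical and chosen only for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance between two strings (a metric)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def lipschitz_embed(obj, pivots, dist):
    """Embed obj as its distances to each reference object (assumed singleton sets)."""
    return tuple(dist(obj, p) for p in pivots)


def chebyshev(u, v):
    """Maximum coordinate-wise difference between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def range_query(query, radius, objects, pivots, dist):
    """Filter with the embedded distance, then refine with the true distance.

    Returns the matching objects and the number of candidates that needed
    a true distance evaluation in the refinement step.
    """
    q_vec = lipschitz_embed(query, pivots, dist)
    # In a real system the object vectors would be computed once and stored;
    # they are recomputed here only to keep the sketch short.
    candidates = [o for o in objects
                  if chebyshev(lipschitz_embed(o, pivots, dist), q_vec) <= radius]
    hits = [o for o in candidates if dist(o, query) <= radius]
    return hits, len(candidates)


# Hypothetical sample data: short DNA-like strings.
objects = ["ACGT", "ACGA", "TTGT", "GGGG", "ACCT", "AGGT", "TTTTTT"]
pivots = ["ACGT", "GGGG"]
hits, n_candidates = range_query("ACGG", 1, objects, pivots, edit_distance)
print(hits, n_candidates)
```

The filter step may let through objects that the refinement step later rejects, which costs extra distance evaluations but does not affect correctness. The reverse failure, an embedded distance that exceeds the true distance, is what the abstract reports for FastMap and MetricMap.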