• DocumentCode
    1815301
  • Title
    On optimizing distance-based similarity search for biological databases
  • Author
    Mao, Rui ; Xu, Weijia ; Ramakrishnan, Smriti ; Nuckolls, Glen ; Miranker, Daniel P.
  • Author_Institution
    Dept. of Comput. Sci., Texas Univ., Austin, TX, USA
  • fYear
    2005
  • fDate
    8-11 Aug. 2005
  • Firstpage
    351
  • Lastpage
    361
  • Abstract
    Similarity search leveraging distance-based index structures is increasingly being used for both multimedia and biological database applications. We consider distance-based indexing for three important biological data types: protein k-mers with the metric PAM model, DNA k-mers with Hamming distance, and peptide fragmentation spectra with a pseudo-metric derived from cosine distance. To date, the primary driver of this research has been multimedia applications, where similarity functions are often Euclidean norms on high-dimensional feature vectors. We develop results showing that the character of these biological workloads differs from that of multimedia workloads. In particular, they are not intrinsically very high dimensional and deserve different optimization heuristics. Based on MVP-trees, we develop a pivot selection heuristic that seeks centers and show that it outperforms the most widely used corner-seeking heuristic. Similarly, we develop a data partitioning approach sensitive to the actual data distribution in lieu of median splits.
  • Keywords
    DNA; biochemistry; biology computing; molecular biophysics; multimedia databases; proteins; DNA k-mers; Euclidean norms; Hamming distance; MVP-trees; biological data type; biological database application; biological database search; biological workload; cosine distance; data distribution; data partitioning approach; distance-based indexing; metric PAM model; multimedia; multimedia workload; optimization heuristics; peptide fragmentation spectra; primary driver; protein k-mers; Bioinformatics; Biological system modeling; Biology computing; Data structures; Hamming distance; Indexes; Indexing; Information retrieval; Multimedia databases; Peptides;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Systems Bioinformatics Conference, 2005. Proceedings. 2005 IEEE
  • Print_ISBN
    0-7695-2344-7
  • Type
    conf
  • DOI
    10.1109/CSB.2005.42
  • Filename
    1498036
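
The abstract's idea of distance-based indexing over DNA k-mers with Hamming distance can be illustrated with a minimal sketch. This is not the paper's MVP-tree or its pivot selection heuristic; it is a single-pivot table that uses the triangle inequality to skip distance computations during a range query. The names `hamming`, `build_pivot_table`, and `range_query`, and the sample k-mers, are assumptions made for illustration only.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length k-mers."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def build_pivot_table(data: List[str], pivot: str) -> List[Tuple[int, str]]:
    """Precompute each k-mer's distance to one pivot (hypothetical index layout)."""
    return [(hamming(s, pivot), s) for s in data]


def range_query(
    table: List[Tuple[int, str]], pivot: str, query: str, radius: int
) -> Tuple[List[str], int]:
    """Return k-mers within `radius` of `query`, plus the number of
    query-to-data distances actually computed."""
    dq = hamming(query, pivot)
    hits: List[str] = []
    computed = 0
    for d, s in table:
        # Triangle inequality: d(query, s) >= |d(query, pivot) - d(s, pivot)|.
        # If that lower bound already exceeds the radius, s cannot match.
        if abs(d - dq) > radius:
            continue
        computed += 1
        if hamming(query, s) <= radius:
            hits.append(s)
    return hits, computed


# Illustrative data, not taken from the paper.
data = ["ACGT", "ACGA", "TTTT", "ACCT", "GGGG"]
table = build_pivot_table(data, pivot="ACGT")
hits, computed = range_query(table, pivot="ACGT", query="ACGG", radius=1)
print(hits, computed)
```

In this toy run the pivot lets the query skip two of the five k-mers without computing their distance to the query. How much pruning a pivot gives depends on where it sits relative to the data, which is the kind of question a pivot selection heuristic addresses.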