• DocumentCode
    3248922
  • Title

    Efficient similarity queries via lossy compression

  • Author

    Ochoa, Idoia ; Ingber, Amir ; Weissman, Tsachy

  • Author_Institution
    Dept. of Electr. Eng., Stanford Univ., Stanford, CA, USA
  • fYear
    2013
  • fDate
    2-4 Oct. 2013
  • Firstpage
    883
  • Lastpage
    889
  • Abstract
    The generation of new databases and the amount of data on existing ones is growing exponentially. As a result, executing queries on large databases is becoming a timely and challenging task. With this in mind, we study the problem of compressing sequences in a database so that similarity queries can be performed efficiently on the compressed database. The fundamental limits of this problem characterize the tradeoff between compression rate and the reliability of the queries performed on the compressed data. While those asymptotic limits have been studied and characterized in past work, how to approach these limits in practice has remained largely unexplored. In this paper, we propose an approach to this task, based in part on existing lossy compression algorithms. Specifically, we consider queries of the form: “which sequences in the database are similar to a given sequence y?”. For the case where similarity between sequences is measured via Hamming distortion, we construct schemes whose performance is close to the fundamental limits. Furthermore, we test our scheme on a sample database of real DNA sequences, and show significant compression while still allowing highly reliable query answers.
  • Keywords
    DNA; bioinformatics; data compression; database management systems; query processing; DNA sequences; Hamming distortion; compressed database; compression rate; efficient similarity queries; lossy compression; reliable query answers; sequence compression; Compression algorithms; Computer architecture; DNA; Databases; Joints; Reliability; Simulation;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference on
  • Conference_Location
    Monticello, IL
  • Print_ISBN
    978-1-4799-3409-6
  • Type

    conf

  • DOI
    10.1109/Allerton.2013.6736618
  • Filename
    6736618