• DocumentCode
    3134198
  • Title
    BASS: approximate search on large string databases
  • Author
    Yang, Jiong ; Wang, Wei ; Yu, Philip
  • Author_Institution
    UIUC, Urbana, IL, USA
  • fYear
    2004
  • fDate
    21-23 June 2004
  • Firstpage
    181
  • Lastpage
    190
  • Abstract
    In this paper, we study the problem of how to build an index structure for large string databases that efficiently supports various types of string matching without requiring that substrings be mapped to a numerical space (e.g., string B-tree and MRS-index) or that the index reside in memory (e.g., suffix tree and suffix array). Towards this goal, we propose a new indexing scheme, BASS-tree, to efficiently support general approximate substring matching (in terms of certain symbol substitutions and misalignments) in sublinear time on a large string database. The key idea behind the design is that all positions in each string are grouped recursively into a fully balanced tree according to the similarities of the subsequent segments starting at those positions. Each node is labeled with a regular expression that describes the commonality of the substrings indexed through the subtree. Any search can then be directed quickly to the portion of the database with a high potential of matching. With the BASS-tree in place, wildcards in the query pattern can also be handled seamlessly. In addition, the search for a long pattern can be decomposed into a series of searches for short segments, followed by a process to join the results. Our experiments demonstrate that the potential performance improvement brought by BASS-tree is an order of magnitude over alternative methods.
  • Keywords
    database indexing; database theory; query processing; string matching; symbolic substitution; tree data structures; tree searching; very large databases; BASS-tree; approximate search; approximate substring matching; balanced tree; index structure; large string databases; numerical space; query pattern; sublinear time; subsequent segments; DNA; Databases; Decoding; Delay; Genetic mutations; Indexes; Indexing; Pattern matching; Proteins; Sequences
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on
  • ISSN
    1099-3371
  • Print_ISBN
    0-7695-2146-0
  • Type
    conf
  • DOI
    10.1109/SSDM.2004.1311210
  • Filename
    1311210