• DocumentCode
    3269051
  • Title
    Engineering a fast online persistent suffix tree construction
  • Author
    Bedathur, Srikanta J.; Haritsa, Jayant R.
  • Author_Institution
    Database Syst. Lab, Indian Inst. of Sci., Bangalore, India
  • fYear
    2004
  • fDate
    30 March-2 April 2004
  • Firstpage
    720
  • Lastpage
    731
  • Abstract
    Online persistent suffix tree construction has been considered impractical due to its excessive I/O costs. However, prior studies have not taken into account the effects of the buffer management policy and the internal node structure of the suffix tree on the I/O behavior of construction and of subsequent retrievals over the tree. We study these two issues in detail in the context of large genomic DNA and protein sequences. In particular, we make the following contributions: (i) a novel, low-overhead buffering policy called TOP-Q, which improves the on-disk behavior of suffix tree construction and subsequent retrievals, and (ii) empirical evidence that the space-efficient linked-list representation of suffix tree nodes provides significantly inferior performance when compared to the array representation. These results demonstrate that a careful choice of implementation strategies can make online persistent suffix tree construction considerably more scalable, in terms of the length of sequences indexed with a fixed memory budget, than currently perceived.
  • Keywords
    DNA; buffer storage; genetics; proteins; tree data structures; trees (mathematics); TOP-Q; array representation; buffer management policy; genomic DNA; internal node structure; linked-list representation; low-overhead buffering policy; online persistent suffix tree construction; protein sequence; Bioinformatics; Costs; DNA; Database systems; Genetics; Genomics; Indexes; Indexing; Proteins; Sequences
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2004. Proceedings. 20th International Conference on
  • ISSN
    1063-6382
  • Print_ISBN
    0-7695-2065-0
  • Type
    conf
  • DOI
    10.1109/ICDE.2004.1320040
  • Filename
    1320040