• DocumentCode
    2169684
  • Title

    Breaking a time-and-space barrier in constructing full-text indices

  • Author

    Hon, Wing-Kai ; Sadakane, Kunihiko ; Sung, Wing-Kin

  • Author_Institution
    Hong Kong Univ., China
  • fYear
    2003
  • fDate
    11-14 Oct. 2003
  • Firstpage
    251
  • Lastpage
    260
  • Abstract
    Suffix trees and suffix arrays are the most prominent full-text indices, and their construction algorithms are well studied. It has long been open whether these indices can be constructed in both o(n log n) time and o(n log n)-bit working space, where n denotes the length of the text. In the literature, the fastest algorithm runs in O(n) time but requires O(n log n)-bit working space, whereas the most space-efficient algorithm requires O(n)-bit working space but runs in O(n log n) time. This paper breaks the long-standing time-and-space barrier under the unit-cost word RAM. We give an algorithm for constructing the suffix array which takes O(n) time and O(n)-bit working space, for texts with constant-size alphabets. Note that both the time and the space bounds are optimal. For constructing the suffix tree, our algorithm requires O(n log^ε n) time and O(n)-bit working space for any 0 < ε < 1. Our algorithm can also be adapted to build other existing full-text indices, such as the Compressed Suffix Tree, Compressed Suffix Arrays and the FM-index. We also study the general case where the size of the alphabet A is not constant. Our algorithm can construct a suffix array and a suffix tree using optimal O(n log |A|)-bit working space while running in O(n log log |A|) time and O(n log^ε n) time, respectively. These are the first algorithms that achieve o(n log n) time with optimal working space, under the reasonable assumption that log |A| = o(log n).
  • Keywords
    computational complexity; indexing; text analysis; tree data structures; trees (mathematics); FM-index; compressed suffix array; compressed suffix tree; constant-size alphabets; construction algorithm; full-text index; polynomial time; space-efficient algorithm; time-and-space barrier; unit-cost word RAM; Biotechnology; Chromium; Computer science; DNA; Data structures; Indexing; Information technology; Proteins; Read-write memory; Sequences
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Foundations of Computer Science, 2003. Proceedings. 44th Annual IEEE Symposium on
  • ISSN
    0272-5428
  • Print_ISBN
    0-7695-2040-5
  • Type

    conf

  • DOI
    10.1109/SFCS.2003.1238199
  • Filename
    1238199