I/O-Efficient Compressed Text Indexes: From Theory to Practice

Author

Chiu, Sheng-Yuan ; Hon, Wing-Kai ; Shah, Rahul ; Vitter, Jeffrey Scott

Author_Institution

Dept. of Comput. Sci., Nat. Tsing Hua Univ., Hsinchu, Taiwan

fYear

2010

fDate

24-26 March 2010

Firstpage

426

Lastpage

434

Abstract

Pattern matching on text data has been a fundamental field of computer science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are suffix trees and suffix arrays, and the most popular external-memory index structure is the String B-tree. However, the practical applicability of these indexes has been limited, mainly because of their space consumption and I/O issues. These structures use far more space (almost 20 to 50 times more) than the original text data and are often disk-resident. Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al. (2008) presented a compact text index in external memory based on the concept of the Geometric Burrows-Wheeler Transform. They also presented lower bounds suggesting that it may be hard to obtain a good index structure in external memory. In this paper, we investigate this issue from a practical point of view. On the positive side, we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves about an order of magnitude in space compared to the standard String B-tree. While saving space, these structures also maintain I/O efficiency comparable to that of the String B-tree. We also show various space vs. I/O efficiency trade-offs for our structures.

Keywords

data compression; database indexing; string matching; text analysis; tree data structures; data structure; external memory index structure; internal memory index structures; pattern matching; string B-tree; suffix arrays; suffix trees; text data; text index compression; text indexing; Computer networks; Computer science; Data compression; Data structures; Databases; Indexes; Indexing; Pattern matching; Random access memory; Tree data structures; Data Compression; External-Memory Algorithms; I/O-Efficient; Pattern Matching; Text Indexing;

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Compression Conference (DCC), 2010

Conference_Location

Snowbird, UT

ISSN

1068-0314

Print_ISBN

978-1-4244-6425-8

Electronic_ISBN

1068-0314

Type

conf

DOI

10.1109/DCC.2010.45

Filename

5453486