DocumentCode :
2189237
Title :
I/O-Efficient Compressed Text Indexes: From Theory to Practice
Author :
Chiu, Sheng-Yuan ; Hon, Wing-Kai ; Shah, Rahul ; Vitter, Jeffrey Scott
Author_Institution :
Dept. of Comput. Sci., Nat. Tsing Hua Univ., Hsinchu, Taiwan
fYear :
2010
fDate :
24-26 March 2010
Firstpage :
426
Lastpage :
434
Abstract :
Pattern matching on text data has been a fundamental field of computer science for nearly 40 years. Databases supporting full-text indexing on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are suffix trees and suffix arrays, and the most popular external-memory index structure is the String B-tree. However, the practical applicability of these indexes has been limited, mainly because of their space consumption and I/O issues. These structures use far more space (almost 20 to 50 times more) than the original text data and are often disk-resident. Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al. (2008) presented a compact text index in external memory based on the concept of the Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggested that it may be hard to obtain a good index structure in external memory. In this paper, we investigate this issue from a practical point of view. On the positive side, we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves space by about an order of magnitude compared to the standard String B-tree. While saving space, these structures also maintain I/O efficiency comparable to that of the String B-tree. We also show various space vs. I/O efficiency trade-offs for our structures.
Keywords :
data compression; database indexing; string matching; text analysis; tree data structures; data structure; external memory index structure; internal memory index structures; pattern matching; string B-tree; suffix arrays; suffix trees; text data; text index compression; text indexing; Computer networks; Computer science; Data compression; Data structures; Databases; Indexes; Indexing; Pattern matching; Random access memory; Tree data structures; Data Compression; External-Memory Algorithms; I/O-Efficient; Pattern Matching; Text Indexing;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Compression Conference (DCC), 2010
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
978-1-4244-6425-8
Electronic_ISBN :
1068-0314
Type :
conf
DOI :
10.1109/DCC.2010.45
Filename :
5453486