Tradeoffs in XML database compression

Author

Cheney, James

Author_Institution

Edinburgh Univ., UK

fYear

2006

fDate

28-30 March 2006

Firstpage

392

Lastpage

401

Abstract

Large XML data files, or XML databases, are now a common way to distribute scientific and bibliographic data, and storing such data efficiently is an important concern. A number of approaches to XML compression have been proposed in the last five years. The most competitive approaches employ one or more statistical text compressors based on PPM or arithmetic coding in which some of the context is provided by the XML document structure. The purpose of this paper is to investigate the relationship between the extant proposals in more detail. We review the two main statistical modeling approaches proposed so far, and evaluate their performance on two representative XML databases. Our main finding is that while a recently proposed multiple-model approach can provide better overall compression for large databases, it uses much more memory and converges more slowly than an older single-model approach.

Keywords

XML; arithmetic codes; data compression; database management systems; statistical analysis; XML data files; XML database compression; XML document structure; arithmetic coding; multiple-model approach; statistical modeling approaches; statistical text compressors; Arithmetic; Compressors; Containers; Context modeling; Data compression; Databases; Proposals; Proteins; Switches; XML

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Compression Conference, 2006. DCC 2006. Proceedings

ISSN

1068-0314

Print_ISBN

0-7695-2545-8

Type

conf

DOI

10.1109/DCC.2006.79

Filename

1607274