  • DocumentCode
    3225245
  • Title
    Table Compression by Record Intersections
  • Author
    Apostolico, Alberto ; Cunial, Fabio ; Kaul, Vineith
  • Author_Institution
    Georgia Inst. of Technol., Atlanta
  • fYear
    2008
  • fDate
    25-27 March 2008
  • Firstpage
    13
  • Lastpage
    22
  • Abstract
    Saturated patterns with don't care like those emerged in biosequence motif discovery have proven a valuable notion also in the design of lossless and lossy compression of sequence data. In independent endeavors, the peculiarities inherent to the compression of tables have been examined, leading to compression schemata advantageously hinged on a prudent rearrangement of columns. The present paper introduces off-line table compression by textual substitution in which the patterns used in compression are chosen among models or templates that capture recurrent record subfields. A model record is to be interpreted here as a sequence of intermixed solid and don't care characters that obeys, in addition, some conditions of saturation: most notably, it must be not possible to replace a don't care in the model by a solid character without having to forfeit some of its occurrences in the table. Saturation is expected to save on the size of the codebook at the outset, and hence to improve compression. It also induces some clustering of the records in the table, which may present independent interest. Results from preliminary experiments show the savings and potential for classification brought about by this method in connection with a table of specimens collected in a context of biodiversity studies.
  • Keywords
    codes; data compression; biodiversity context; biosequence motif discovery; codebook; compression schemata; data sequence; off-line table compression; pattern saturation; record intersections; table compression; Biodiversity; Biology computing; Data compression; Educational institutions; Feature extraction; Sequences; Solid modeling; Spatial databases; Telecommunication traffic; USA Councils; intrecord; saturated record
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 2008. DCC 2008
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    978-0-7695-3121-2
  • Type
    conf
  • DOI
    10.1109/DCC.2008.105
  • Filename
    4483279
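
The saturation condition stated in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, only one reading of the definition under stated assumptions: records are equal-length strings, the don't care symbol is written `*` and never occurs in the data, and the helper names `intersect`, `occurrences`, and `is_saturated` are hypothetical. A model counts as saturated here when every don't care position takes at least two distinct values across the records the model matches, so that fixing any solid character there would forfeit an occurrence.

```python
DONT_CARE = "*"  # assumed wildcard symbol; must not appear in the data


def intersect(a, b):
    """Position-wise intersection of two equal-length records:
    keep a character where both agree, otherwise emit a don't care."""
    return "".join(x if x == y else DONT_CARE for x, y in zip(a, b))


def occurrences(model, table):
    """Indices of the records in the table that the model matches."""
    return [
        i
        for i, record in enumerate(table)
        if all(m == DONT_CARE or m == c for m, c in zip(model, record))
    ]


def is_saturated(model, table):
    """True if no don't care in the model can be replaced by a solid
    character without losing at least one occurrence in the table."""
    occ = occurrences(model, table)
    if not occ:
        return False
    for j, m in enumerate(model):
        if m == DONT_CARE:
            values = {table[i][j] for i in occ}
            if len(values) == 1:
                # All occurrences agree here, so the don't care could be
                # replaced by that character at no cost: not saturated.
                return False
    return True


table = ["abcd", "abed", "xbcd"]
model = intersect(table[0], table[1])  # "ab*d"
print(model, occurrences(model, table), is_saturated(model, table))
print("a**d", is_saturated("a**d", table))
```

In this toy table the intersection of the first two records is `ab*d`, which matches records 0 and 1 and is saturated, whereas `a**d` matches the same records but is not, since its second position could be filled with `b` without losing either occurrence.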