• DocumentCode
    2477772
  • Title

    Index compression through document reordering

  • Author

    Blandford, Dan ; Blelloch, Guy

  • Author_Institution
    Dept. of Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    342
  • Lastpage
    351
  • Abstract
    An important concern in the design of search engines is the construction of an inverted index. An inverted index, also called a concordance, contains a list of documents (or posting list) for every possible search term. These posting lists are usually compressed with difference coding. Difference coding yields the best compression when the lists to be coded have high locality. Coding methods have been designed to specifically take advantage of locality in inverted indices. Here, we describe an algorithm to permute the document numbers so as to create locality in an inverted index. This is done by clustering the documents. Our algorithm, when applied to the TREC ad hoc database (disks 4 and 5), improves the performance of the best difference coding algorithm we found by fourteen percent. The improvement increases as the size of the index increases, so we expect that greater improvements would be possible on larger datasets.
  • Keywords
    data compression; database indexing; document handling; search engines; TREC ad hoc database; concordance; difference coding; document clustering; document reordering; index compression; inverted index; locality; posting lists; search engines; Chromium; Data compression;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Compression Conference, 2002. Proceedings. DCC 2002
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-1477-4
  • Type

    conf

  • DOI
    10.1109/DCC.2002.999972
  • Filename
    999972