Title :
Compression of Boolean inverted files by document ordering
Author :
Gelbukh, Alexander ; Han, Sangyong ; Sidorov, Grigori
Author_Institution :
Comput. Res. Center, Nat. Polytech. Inst., Zacatenco, Mexico
Abstract :
Boolean queries are used to search a document collection for the documents that contain specific terms, independently of the frequency of a term in the document. To perform such queries, a search engine maintains an inverted file, which lists for each keyword the documents containing it. The size of such a file is comparable with that of the document collection, which is a considerable storage overhead. We show how the inverted file can be compressed by ordering the documents in the collection in a specific way. Finding the near-optimal order can be recast as a Hamming-distance traveling salesman problem.
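The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the inverted file is encoded or how the traveling salesman problem is solved, so the choices below are assumptions made for illustration only: postings are stored as gaps between consecutive document positions, each gap is charged its Elias-gamma code length, and the document order comes from a simple greedy nearest-neighbor tour under Hamming distance. The function names (`build_postings`, `gamma_bits`, `index_cost`, `greedy_order`) are hypothetical and not taken from the paper.

```python
def build_postings(docs):
    """Boolean inverted file: map each term to the ascending list of
    positions of the documents (given as sets of terms) that contain it."""
    postings = {}
    for position, terms in enumerate(docs):
        for term in terms:
            postings.setdefault(term, []).append(position)
    return postings


def gamma_bits(gap):
    """Length in bits of the Elias-gamma code for a gap >= 1.
    (Assumed cost model; the abstract does not specify an encoding.)"""
    return 2 * gap.bit_length() - 1


def index_cost(docs):
    """Total bits needed to gap-encode every posting list."""
    total = 0
    for positions in build_postings(docs).values():
        previous = -1  # so the first gap is position + 1, which is >= 1
        for position in positions:
            total += gamma_bits(position - previous)
            previous = position
    return total


def greedy_order(docs):
    """Nearest-neighbor heuristic for the Hamming-distance tour:
    start at document 0 and repeatedly append the unvisited document
    whose term set differs least from the last one placed.
    (A stand-in heuristic, not necessarily the authors' method.)"""
    if not docs:
        return []
    order = [0]
    remaining = list(range(1, len(docs)))
    while remaining:
        last = docs[order[-1]]
        # len(a ^ b) is the Hamming distance between the Boolean term vectors
        nearest = min(remaining, key=lambda i: len(last ^ docs[i]))
        remaining.remove(nearest)
        order.append(nearest)
    return order


# Toy collection: two topics interleaved.
docs = [{"a", "b"}, {"c", "d"}, {"a", "b"}, {"c", "d"}, {"a", "b"}, {"c", "d"}]
order = greedy_order(docs)
reordered = [docs[i] for i in order]
cost_before = index_cost(docs)
cost_after = index_cost(reordered)
```

On this toy collection the heuristic places the three `{"a", "b"}` documents first and the three `{"c", "d"}` documents after them, and the assumed cost falls from 32 bits to 20 bits. Hamming distance is a natural tour metric here because the sum of distances between adjacent documents equals the number of positions at which some term's posting list starts or stops a run, so a shorter tour means fewer and longer runs.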
Keywords :
Boolean algebra; data compression; document handling; information retrieval; search engines; travelling salesman problems; Boolean inverted file compression; Boolean query; Boolean search; Hamming-distance traveling salesman problem; document ordering; near-optimal order; search engine; Computer science; Frequency; Influenza; Internet; Maintenance engineering; Natural languages; Pressing
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
Conference_Location :
Beijing, China
Print_ISBN :
0-7803-7902-0
DOI :
10.1109/NLPKE.2003.1275907