Title :
Recovery From Random Samples in a Big Data Set
Author :
Molavipour, Sina ; Gohari, Amin
Author_Institution :
Dept. of Electr. Eng., Sharif Univ. of Technol., Tehran, Iran
Abstract :
Consider a collection of files, each of which is a sequence of letters. One of these files is randomly chosen and a random subsequence of the file is revealed. This random subsequence can be the result of a random sampling of the file. The goal is to recover the identity of the file, assuming a simple greedy matching algorithm to search the file collection. We study the fundamental limits on the maximum size of the file collection for reliable recovery in terms of the length of the random subsequence. The sequence of each file is assumed to follow a hidden Markov model (HMM), which is a common model for many data structures such as voice or DNA sequences. The connection between this problem and coding over a deletion channel with greedy decoders is discussed.
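The setup in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: "greedy matching" is taken to mean leftmost-first subsequence matching, and sampling keeps each letter independently with a fixed probability. The function names `random_subsequence`, `greedy_match`, and `recover` are hypothetical, and the HMM source model is not reproduced here.

```python
import random


def random_subsequence(file_seq, keep_prob, rng):
    """Keep each letter independently with probability keep_prob.

    This i.i.d. sampling model is an assumption made for illustration.
    """
    return [letter for letter in file_seq if rng.random() < keep_prob]


def greedy_match(sample, file_seq):
    """Return True if sample is a subsequence of file_seq.

    Each sample letter is matched to its earliest occurrence after the
    previous match (assumed meaning of "greedy matching").
    """
    pos = 0
    for letter in sample:
        # Skip file letters until the next occurrence of the sample letter.
        while pos < len(file_seq) and file_seq[pos] != letter:
            pos += 1
        if pos == len(file_seq):
            return False  # Ran out of file before matching this letter.
        pos += 1  # Consume the matched position.
    return True


def recover(sample, files):
    """Return the indices of all files consistent with the sample."""
    return [i for i, f in enumerate(files) if greedy_match(sample, f)]
```

In this sketch, recovery is unambiguous only when `recover` returns a single index. Intuitively, a larger collection or a shorter sample makes spurious matches more likely, which is the trade-off the abstract describes.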
Keywords :
Big Data; file organisation; greedy algorithms; hidden Markov models; random sequences; Big Data set; DNA sequences; HMM; data structures; deletion channel; file collection; file identity; greedy decoders; greedy matching algorithm; hidden Markov model; random file sampling; random file subsequence; random samples; random subsequence length; voice sequences; DNA; Decoding; Indexes; Joints; Markov processes; Upper bound; greedy match; search
Journal_Title :
Communications Letters, IEEE
DOI :
10.1109/LCOMM.2015.2478815