Title :
A MinHash Approach for Clustering Large Collections of Binary Programs
Author_Institution :
Bitdefender, Tech. Univ. of Cluj-Napoca, Cluj-Napoca, Romania
Abstract :
Clustering large collections of binary programs is a challenging task for two reasons. First, a way to determine whether two samples are similar is required. Second, pairwise comparison is impractical on collections comprising millions of items. This paper focuses mainly on the second problem and proposes a clustering algorithm based on the properties of MinHash functions. The algorithm consists of several iterations, in which such functions are used to partition the collection of samples into smaller groups, such that elements of the same group are likely to be similar. Several heuristics are proposed to improve the algorithm's performance while preserving the quality of the results. The experimental evaluation showed that the proposed solution can cluster 10 million binary programs in less than two and a half hours.
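The partitioning idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each sample has already been reduced to a non-empty set of integer feature ids, and it uses a hypothetical hash family of the form (a*x + b) mod p. The names make_hash, minhash, partition and jaccard are illustrative only. For a random hash function of this kind, two samples receive the same MinHash value with probability approximately equal to their Jaccard similarity, so grouping by MinHash value tends to place similar samples together without comparing every pair.

```python
import random
from collections import defaultdict

# Mersenne prime 2^31 - 1; assumed to be larger than any feature id.
PRIME = 2_147_483_647


def make_hash(rng):
    """Return one random hash x -> (a*x + b) mod PRIME (illustrative choice)."""
    a = rng.randrange(1, PRIME)  # a != 0 keeps the mapping one-to-one
    b = rng.randrange(0, PRIME)
    return lambda x: (a * x + b) % PRIME


def minhash(features, h):
    """MinHash of a non-empty feature set under hash function h."""
    return min(h(x) for x in features)


def partition(samples, h):
    """Group sample names by MinHash value; one pass, no pairwise comparison."""
    groups = defaultdict(list)
    for name, features in samples.items():
        groups[minhash(features, h)].append(name)
    return list(groups.values())


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = {
        "a": {1, 2, 3},
        "b": {1, 2, 3},    # identical to "a": always shares its group
        "c": {100, 200},   # disjoint from "a": never shares its group
    }
    print(partition(samples, make_hash(rng)))
```

Repeating the partition step with fresh hash functions, and merging or refining groups between iterations, is one way such a scheme could be iterated; the heuristics actually used in the paper are not described in this record.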
Keywords :
feature extraction; file organisation; pattern clustering; Boolean feature extraction; MinHash function; binary program clustering; binary programs; clustering algorithm; sample collection partitioning; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Couplings; Feature extraction; Heuristic algorithms; Partitioning algorithms; MinHash; binary code analysis; clustering; locality-sensitive hashing; single linkage;
Conference_Title :
Control Systems and Computer Science (CSCS), 2015 20th International Conference on
Conference_Location :
Bucharest
Print_ISBN :
978-1-4799-1779-2
DOI :
10.1109/CSCS.2015.27