Title :
Hamming distance based approximate similarity text search algorithm
Author :
Haifeng Hu ; Liang Zhang ; Jianshen Wu
Author_Institution :
Dept. of Telecommun. & Inf. Eng., Nanjing Univ. of Posts & Telecommun., Nanjing, China
Abstract :
We propose a Hamming distance based approximate similarity text search (HASTS) algorithm to improve the quality of queries over massive text data. The HASTS algorithm first constructs an index table from substrings extracted randomly from the feature fingerprints generated by the SimHash algorithm. It then assigns weights to text terms to reduce the size of the candidate set. The final query result is obtained by comparing the Hamming distance between the query term and the text terms in the candidate set. Finally, extensive simulations are conducted to analyze the influence of different parameters on the query performance of the HASTS algorithm and to compare its performance with that of an existing search algorithm. The results show that the HASTS algorithm can satisfy the query requirements of massive text data with relatively low overhead.
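The pipeline described in the abstract (SimHash fingerprints, an index table keyed on randomly extracted substrings, a weighted candidate set, and a final Hamming distance check) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the fingerprint length, how substrings are sampled, or how weights are assigned. The sketch assumes 64-bit fingerprints, treats a "substring" as a fixed set of randomly chosen bit positions, and uses the number of index tables in which a document collides with the query as its weight. All function names and the parameters `num_tables`, `sub_bits`, `min_votes`, and `max_dist` are hypothetical.

```python
import hashlib
import random
from collections import Counter, defaultdict

BITS = 64  # assumed fingerprint length; the abstract does not state one


def simhash(text):
    """Standard SimHash over whitespace tokens, weighted by term frequency."""
    totals = [0] * BITS
    for token, freq in Counter(text.lower().split()).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            totals[i] += freq if (h >> i) & 1 else -freq
    return sum(1 << i for i in range(BITS) if totals[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def make_masks(num_tables=8, sub_bits=12, seed=0):
    """Assumption: each 'substring' is a random sample of bit positions."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(BITS), sub_bits)))
            for _ in range(num_tables)]


def substring(fp, positions):
    """Extract the bits of fp at the given positions as a hashable key."""
    return tuple((fp >> p) & 1 for p in positions)


def build_index(fingerprints, masks):
    """One table per mask, mapping substring key -> list of document ids."""
    tables = [defaultdict(list) for _ in masks]
    for doc_id, fp in fingerprints.items():
        for table, positions in zip(tables, masks):
            table[substring(fp, positions)].append(doc_id)
    return tables


def search(query_fp, fingerprints, tables, masks, min_votes=1, max_dist=3):
    """Collect candidates from the index, prune by weight, verify by distance."""
    votes = Counter()
    for table, positions in zip(tables, masks):
        for doc_id in table.get(substring(query_fp, positions), ()):
            votes[doc_id] += 1  # hypothetical weight: number of table hits
    candidates = [d for d, v in votes.items() if v >= min_votes]
    return sorted(d for d in candidates
                  if hamming(query_fp, fingerprints[d]) <= max_dist)
```

In this sketch, raising `min_votes` shrinks the candidate set before any Hamming distances are computed, at the cost of possibly missing true neighbours; `max_dist` sets how many differing bits are tolerated in the final check. A typical call sequence would compute `simhash` for every document, pass the resulting dictionary to `build_index`, and then call `search` with the fingerprint of the query text.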
Keywords :
query processing; text analysis; HASTS algorithm; Hamming distance; SimHash algorithm; approximate similarity text search algorithm; feature fingerprints; index table; massive text data query; query performance; query quality; query requirements; query term; Electronic publishing; Fingerprint recognition; Indexes; Information services; Internet
Conference_Title :
Advanced Computational Intelligence (ICACI), 2015 Seventh International Conference on
Conference_Location :
Wuyi
Print_ISBN :
978-1-4799-7257-9
DOI :
10.1109/ICACI.2015.7184772