DocumentCode :
3742343
Title :
Refining high-frequency-queries-based filter for similarity join
Author :
Jaruloj Chongstitvatana;Natthee Thitinanrungkit
Author_Institution :
Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
fYear :
2015
Firstpage :
1
Lastpage :
5
Abstract :
Similarity join and similarity search are important for text databases and data cleaning. Filter-and-verification methods are applied to reduce the processing time of similarity join and similarity search. The high-frequency-queries-based filter partitions a dataset according to the similarity between each data record and a chosen high-frequency query, and these partitions are stored in a similarity table. In the filter process, data in some rows of a similarity table are selected as candidates. Many high-frequency queries can be used to improve the pruning power. However, the time to choose an appropriate high-frequency query (i.e., to choose an appropriate similarity table) increases with the number of high-frequency queries. This paper proposes a refinement of the high-frequency-queries-based filter to reduce the filter time and the number of candidates. To reduce the filter time, inverted lists of high-frequency queries are used to speed up token counting, which reduces the time for choosing an appropriate similarity table. To reduce the number of candidates, binary search in each row of a similarity table is applied to further eliminate non-candidates. Experiments show that the refined filter method takes less time and gives better pruning power than the original method.
Keywords :
"Filtering algorithms","Databases","Partitioning algorithms","Computer science","Power filters"
Publisher :
IEEE
Conference_Title :
Computer Science and Engineering Conference (ICSEC), 2015 International
Type :
conf
DOI :
10.1109/ICSEC.2015.7401405
Filename :
7401405