Regional Information Center for Science and Technology (RICEST) - Fast Matching for All Pairs Similarity Search

DocumentCode :

1867856

Title :

Fast Matching for All Pairs Similarity Search

Author :

Awekar, Amit ; Samatova, Nagiza F.

Volume :

fYear :

2009

fDate :

15-18 Sept. 2009

Firstpage :

295

Lastpage :

300

Abstract :

All pairs similarity search is the problem of finding all pairs of records that have a similarity score above a specified threshold. Many real-world systems, such as search engines, online social networks, and digital libraries, frequently have to solve this problem for datasets with millions of records in a high-dimensional space, where the records are often sparse. The challenge is to design algorithms with feasible time requirements. To meet this challenge, algorithms have been proposed based on the inverted index, which maps each dimension to a list of records with non-zero projection along that dimension. Common to these algorithms is a three-phase framework of data preprocessing, pairs matching, and indexing. Matching is the most time-consuming phase. Within this framework, we propose a fast matching technique that uses the sparse nature of real-world data to effectively reduce the size of the search space through a systematic set of tighter filtering conditions and heuristic optimizations. We integrate our technique with the fastest-to-date algorithm in the field and achieve up to 6.5X speed-up on three large real-world datasets.
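
The three-phase framework named in the abstract (preprocessing, matching, indexing) can be illustrated with a minimal sketch. This is not the authors' fast matching technique: it is an unoptimized inverted-index baseline with no filtering conditions, and it assumes cosine similarity over sparse vectors stored as dictionaries. The names `normalize` and `all_pairs` are hypothetical, and pairs scoring exactly at the threshold are included here for simplicity.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Preprocessing: scale a sparse vector (dimension -> weight) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return {}
    return {dim: w / norm for dim, w in vec.items()}


def all_pairs(records, threshold):
    """Return {(i, j): score} for record pairs i < j whose cosine
    similarity is at least `threshold` (assumed similarity measure)."""
    index = defaultdict(list)  # inverted index: dimension -> [(record id, weight)]
    results = {}
    for rid, raw in enumerate(records):
        vec = normalize(raw)

        # Matching: accumulate partial dot products against every
        # previously indexed record that shares a non-zero dimension.
        scores = defaultdict(float)
        for dim, w in vec.items():
            for other, other_w in index[dim]:
                scores[other] += w * other_w
        for other, score in scores.items():
            if score >= threshold:
                results[(other, rid)] = score

        # Indexing: add the current record to the inverted index.
        for dim, w in vec.items():
            index[dim].append((rid, w))
    return results


records = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0}, {"c": 1.0}]
pairs = all_pairs(records, threshold=0.9)
```

Because only records sharing at least one non-zero dimension are ever compared, sparse data already shrinks the candidate set; the filtering conditions described in the abstract aim to shrink it further.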

Keywords :

Algorithm design and analysis; Conferences; Data preprocessing; Indexing; Intelligent agent; Laboratories; Search engines; Social network services; Software libraries; USA Councils; data mining; inverted index; similarity search;

fLanguage :

English

Publisher :

iet

Conference_Title :

Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT '09. IEEE/WIC/ACM International Joint Conferences on

Conference_Location :

Milan, Italy

Print_ISBN :

978-0-7695-3801-3

Electronic_ISBN :

978-1-4244-5331-3

Type :

conf

DOI :

10.1109/WI-IAT.2009.52

Filename :

5286059

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1867856