مرکز منطقه ای اطلاع رساني علوم و فناوري - TJJE: An efficient algorithm for top-k join on massive data

Abstract :

In many applications, top-k join is an important operation to return the k most important join tuples among the potentially huge answer space according to a given ranking function. PBRJ is an algorithm template that generalizes previous top-k join algorithms. In this paper, our analysis shows that PBRJ needs to maintain a large quantity of candidate tuples on massive data. Based on the analysis, this paper proposes a novel top-k join algorithm TJJE which is suitable for handling massive data. By some pre-computed information, TJJE first estimates an upper-bound on scan depth of each joined table. Then it determines the file that contains the join positional index pairs of the top-k join results. A novel algorithm is proposed to retrieve the required join tuples by a single sequential and selective scan on the joined tables. Finally, the top-k join results are obtained by a single scan on the retrieved join tuples. The correctness proof and cost analysis of TJJE are presented in this paper. Extensive experiments show that TJJE maintains up to three orders of magnitude fewer candidate tuples and obtains up to one order of magnitude speedup compared to PBRJ.