Regional Information Center for Science and Technology - Attribute-Based Subsequence Matching and Mining

Abstract :

Sequence analysis is very important in our daily life. Typically, each sequence is associated with an ordered list of elements. For example, in a movie rental application, a customer's movie rental record, which contains an ordered list of movies, is an example of a sequence. Most studies of sequence analysis focus on subsequence matching, which finds all sequences stored in the database such that a given query sequence is a subsequence of each of them. In many applications, elements are associated with properties or attributes. For example, each movie is associated with attributes such as "Director" and "Actors". Unfortunately, to the best of our knowledge, no existing study of sequence analysis considers the attributes of elements. In this paper, we propose two problems. The first problem is: given a query sequence and a set of sequences, and considering the attributes of elements, find all sequences that are matched by the query sequence. This problem is called attribute-based subsequence matching (ASM). All existing applications of the traditional subsequence matching problem can also be applied to our new problem, provided that the attributes of elements are given. We propose an efficient algorithm for problem ASM. The key to the efficiency of this algorithm is to compress each whole sequence, with potentially many associated attributes, into just a triplet of numbers. By working with these highly compressed representations, we greatly speed up attribute-based subsequence matching. The second problem is to find all frequent attribute-based subsequences. We adapt an existing efficient algorithm to this second problem to show that the algorithm developed for the first problem can be reused. Empirical studies show that our algorithms scale to large datasets. In particular, our algorithms run at least an order of magnitude faster than a straightforward method in most cases.
This work can stimulate a number of existing data mining problems that are fundamentally based on subsequence matching, such as sequence classification, frequent sequence mining, motif detection and sequence matching in bioinformatics.
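To make the two matching notions concrete, the following is a minimal Python sketch of a straightforward (uncompressed) approach; it is not the paper's triplet-based algorithm, whose details the abstract does not give. The movie data, the function names and, in particular, the matching rule for ASM (a query item is a set of attribute values, and it matches an element whose attribute set contains all of them) are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, List, Sequence

# Hypothetical toy data: each movie (element) maps to its attribute values.
ATTRIBUTES: Dict[str, FrozenSet[str]] = {
    "Inception": frozenset({"Director:Nolan", "Actor:DiCaprio"}),
    "Titanic": frozenset({"Director:Cameron", "Actor:DiCaprio"}),
    "Avatar": frozenset({"Director:Cameron"}),
}


def is_subsequence(query: Sequence[str], sequence: Sequence[str]) -> bool:
    """Traditional matching: query elements appear in order in sequence."""
    pos = 0  # index of the next query element still to be matched
    for element in sequence:
        if pos < len(query) and element == query[pos]:
            pos += 1
    return pos == len(query)


def attribute_match(
    query: Sequence[FrozenSet[str]],
    sequence: Sequence[str],
    attributes: Dict[str, FrozenSet[str]],
) -> bool:
    """Assumed attribute-based rule: each query item is a set of attribute
    values; it matches an element whose attributes include all of them."""
    pos = 0
    for element in sequence:
        element_attrs = attributes.get(element, frozenset())
        if pos < len(query) and query[pos] <= element_attrs:
            pos += 1  # greedy leftmost matching suffices for subsequences
    return pos == len(query)


def find_matches(
    query: Sequence[FrozenSet[str]],
    database: Dict[str, List[str]],
    attributes: Dict[str, FrozenSet[str]],
) -> List[str]:
    """Scan every stored sequence and keep the ids of those that match."""
    return [
        seq_id
        for seq_id, seq in database.items()
        if attribute_match(query, seq, attributes)
    ]
```

For example, with rental records `{"c1": ["Inception", "Avatar"], "c2": ["Avatar", "Titanic"]}` and the query "a DiCaprio movie followed by a Cameron movie", only `c1` matches under this assumed rule. A linear scan of this kind is the sort of straightforward baseline that a compressed representation would aim to outperform.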

Keywords :

data analysis; data mining; query processing; sequences; attribute-based subsequence matching; attribute-based subsequence mining; bioinformatics; data mining problems; frequent attribute-based subsequence; frequent sequence mining; motif detection; movie rental application; movie rental record; query sequence; sequence analysis; sequence classification; sequence matching; Approximation algorithms; Bioinformatics; Biology; Communities; Data mining; Equations; Motion pictures;