DocumentCode :
2826215
Title :
Efficient Algorithms for Mining Maximal Frequent Concatenate Sequences in Biological Datasets
Author :
Pan, Jin ; Wang, Peng ; Wang, Wei ; Shi, Baile ; Yang, Genxing
Author_Institution :
Fudan Univ., Shanghai
fYear :
2005
fDate :
21-23 Sept. 2005
Firstpage :
98
Lastpage :
104
Abstract :
The growth of bioinformatics has resulted in datasets with new characteristics. DNA sequences typically contain a large number of items. From them, biologists assemble the whole genome of a species based on frequent concatenate sequences, which ordinarily have hundreds of items. Such datasets pose a great challenge for existing frequent pattern discovery algorithms. Almost all of them are Apriori-like and so have an exponential dependence on the average sequence length. PrefixSpan, which introduced the projection-based sequential pattern-growth approach, is the most efficient of these algorithms. However, it grows sequential patterns by exploring length-1 frequent patterns and so is not suitable for biological datasets with long frequent concatenate sequences. In this paper, we propose two novel algorithms, called MacosFSpan and MacosVSpan, to mine maximal frequent concatenate sequences. They are specially designed to handle datasets having long frequent concatenate sequences. Our performance study shows that MacosFSpan outperforms the traditional methods based on length-1 sequence exploration, and that MacosVSpan is more efficient than MacosFSpan.
Keywords :
data mining; medical information systems; pattern classification; sequences; DNA sequence; MacosFSpan algorithm; MacosVSpan algorithm; PrefixSpan algorithm; bioinformatics; biological datasets; genome; length-1 frequent pattern; maximal frequent concatenate sequence mining; pattern discovery algorithm; projection-based sequential pattern-growth approach; Assembly; Bioinformatics; Biological information theory; Biology; DNA; Databases; Frequency; Genomics; Sequences; Software testing;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer and Information Technology, 2005. CIT 2005. The Fifth International Conference on
Conference_Location :
Shanghai
Print_ISBN :
0-7695-2432-X
Type :
conf
DOI :
10.1109/CIT.2005.106
Filename :
1562635