Title :
An efficient sequential pattern mining algorithm for motifs with gap constraints
Author :
Liao, Vance Chiang-Chi ; Chen, Ming-Syan
Author_Institution :
Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei, Taiwan
Abstract :
Mining biological data can provide insight into many areas of biology; for example, finding co-occurring biosequences is essential for biological analysis. Sequential pattern mining reveals implicit motifs of all lengths, which have specific structures and functional significance in biological sequences. Traditional sequential pattern mining algorithms are inefficient for small alphabets and long sequences, such as DNA and protein sequences, so a different approach is needed. This work proposes DFSG, a Depth-First Spelling algorithm for mining sequential patterns (motifs) with Gap constraints in biological sequences. On biological sequences, the runtime of DFSG is substantially shorter than that of GenPrefixSpan, a method based on PrefixSpan, which is one of the fastest traditional sequential pattern mining algorithms.
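The abstract does not describe how DFSG works internally, so the following is only a minimal sketch of the general idea it names: growing patterns one symbol at a time, depth first, under a gap constraint. It is not the authors' algorithm. The function name `mine_gapped_motifs`, its parameters, the definition of a gap (the number of skipped positions between consecutive pattern symbols), and the definition of support (the number of sequences containing at least one occurrence) are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def mine_gapped_motifs(
    sequences: List[str],
    alphabet: str,
    min_support: int,
    min_gap: int,
    max_gap: int,
    max_len: int,
) -> Dict[str, int]:
    """Hypothetical sketch: return {pattern: support} for frequent gapped patterns.

    Assumption: a pattern occurs in a sequence if its symbols appear in order
    with between min_gap and max_gap skipped positions between neighbours.
    """
    results: Dict[str, int] = {}

    def extend(pattern: str, ends: Dict[int, Set[int]]) -> None:
        # ends maps a sequence index to the positions where an occurrence
        # of `pattern` ends in that sequence.
        if len(pattern) == max_len:
            return
        for symbol in alphabet:
            new_ends: Dict[int, Set[int]] = {}
            for sid, positions in ends.items():
                seq = sequences[sid]
                hits: Set[int] = set()
                for p in positions:
                    lo = p + 1 + min_gap
                    hi = min(p + 1 + max_gap, len(seq) - 1)
                    for q in range(lo, hi + 1):
                        if seq[q] == symbol:
                            hits.add(q)
                if hits:
                    new_ends[sid] = hits
            # Prune: an infrequent pattern cannot have a frequent extension,
            # because every occurrence of the extension contains the prefix.
            if len(new_ends) >= min_support:
                results[pattern + symbol] = len(new_ends)
                extend(pattern + symbol, new_ends)

    for symbol in alphabet:
        start: Dict[int, Set[int]] = {}
        for sid, seq in enumerate(sequences):
            positions = {i for i, c in enumerate(seq) if c == symbol}
            if positions:
                start[sid] = positions
        if len(start) >= min_support:
            results[symbol] = len(start)
            extend(symbol, start)
    return results
```

Tracking only the end positions of occurrences keeps each extension step local to a small window of the sequence, which is one plausible reason a spelling-style search suits small alphabets and long sequences; whether DFSG does this is not stated in the abstract.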
Keywords :
DNA; RNA; bioinformatics; data mining; molecular biophysics; molecular configurations; DFSG runtime; DNA sequences; biological analysis; biological data mining; biological sequences; depth-first spelling algorithm; functional significance; gap constraints; genprefixspan process; protein sequences; realms; traditional sequential pattern mining algorithms; Algorithm design and analysis; Classification algorithms; Proteins; Runtime; sequential patterns;
Conference_Title :
Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on
Conference_Location :
Philadelphia, PA
Print_ISBN :
978-1-4673-2559-2
Electronic_ISBN :
978-1-4673-2558-5
DOI :
10.1109/BIBM.2012.6392660