Common substring in multiple sequences using hash based technique

Author

Dheenadayalan, Kumar ; Muralidhara, V.N. ; Katru, Jayakrishna

Author_Institution

Int. Inst. of Inf. Technol., Bangalore, India

fYear

2013

fDate

23-26 June 2013

Firstpage

140

Lastpage

145

Abstract

Searching for the longest common substring in multiple sequences has significant practical applications in bioinformatics. This paper proposes two memory-efficient solutions to the problem of finding common substrings in multiple sequences. The first algorithm combines a hashing technique with a suffix tree to find common substrings in long DNA or protein sequences; it is three times more memory efficient than other alternative data structures. The k-truncated suffix tree, a variation of the suffix tree, was recently proposed for finding common substrings in short sequences. The second algorithm uses hashing with separate chaining for short sequences and offers a memory advantage of around 10 times over the k-truncated suffix tree. Both algorithms also offer great potential for parallelizing the search process, which can improve the run time of the search by a large factor.
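
The idea of hashing with separate chaining for short sequences can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details are not given in the abstract; it only assumes that every length-k substring is stored in a chained hash table together with a bitmask of the sequences it occurs in. The names ChainedTable, common_kmers, and longest_common_substrings are hypothetical.

```python
from typing import List, Set


class ChainedTable:
    """Hash table with separate chaining (illustrative, not the paper's structure).

    Each bucket is a list of [key, mask] entries, where bit i of mask is set
    when the key has been seen in sequence i.
    """

    def __init__(self, n_buckets: int = 1024) -> None:
        self.buckets: List[list] = [[] for _ in range(n_buckets)]

    def mark(self, key: str, seq_index: int) -> None:
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for entry in bucket:  # walk the chain looking for the key
            if entry[0] == key:
                entry[1] |= 1 << seq_index
                return
        bucket.append([key, 1 << seq_index])  # new key: extend the chain

    def keys_seen_in_all(self, n_seqs: int) -> Set[str]:
        full = (1 << n_seqs) - 1
        return {key for bucket in self.buckets
                for key, mask in bucket if mask == full}


def common_kmers(seqs: List[str], k: int) -> Set[str]:
    """Return all substrings of length k that occur in every sequence."""
    table = ChainedTable()
    for i, seq in enumerate(seqs):
        for start in range(len(seq) - k + 1):
            table.mark(seq[start:start + k], i)
    return table.keys_seen_in_all(len(seqs))


def longest_common_substrings(seqs: List[str]) -> Set[str]:
    """Try decreasing lengths until some substring is common to all sequences."""
    if not seqs:
        return set()
    for k in range(min(len(s) for s in seqs), 0, -1):
        found = common_kmers(seqs, k)
        if found:
            return found
    return set()


if __name__ == "__main__":
    print(longest_common_substrings(["ACGTACGT", "TTACGTAA", "GGACGTCC"]))
```

Because each sequence is scanned independently, the marking loop could in principle be split across workers, which is one way to read the parallelization remark in the abstract. The sketch makes no attempt to reproduce the memory figures reported there.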

Keywords

DNA; bioinformatics; molecular biophysics; proteins; string matching; tree data structures; tree searching; data structures; hash-based technique; long DNA sequences; longest common substring search; multiple sequences; protein sequences; search process parallelization; short sequences; truncated suffix tree; Genomics; Irrigation; hashing; k-truncated suffix tree; longest common substring; suffix tree

fLanguage

English

Publisher

IEEE

Conference_Titel

Technology, Informatics, Management, Engineering, and Environment (TIME-E), 2013 International Conference on

Conference_Location

Bandung

Print_ISBN

978-1-4673-5730-2

Type

conf

DOI

10.1109/TIME-E.2013.6611980

Filename

6611980