Title :
Parallel pattern identification in biological sequences on clusters
Author :
Huang, Chun-Hsi ; Rajasekaran, Sanguthevar
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Connecticut, Storrs, CT, USA
fDate :
3/1/2003
Abstract :
Tandem repeats are ubiquitous sequence features in both prokaryotic and eukaryotic genomes, and they are known to cause several inherited neurological diseases in humans. Identifying these patterns is highly computation-intensive. Previous parallel implementations apply straightforward domain decomposition to existing sequential algorithms and rely on parallel machines with low-latency interconnection networks and fast hardware support for processor synchronization. Our research instead exploits the cost effectiveness and flexibility of low-cost clusters to speed up biological computations by designing communication-efficient parallel algorithms for pattern identification. This paper presents a low communication-overhead parallel algorithm for pattern identification in biological sequences. Given a biological sequence of length n and a pattern of length m, the algorithm runs in five computation/communication phases, each requiring O(n) computation time and only O(p) message units. This low communication overhead is essential for achieving reasonable speedups on clusters, where interprocessor communication latency is usually higher than on dedicated parallel machines.
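To make the setting concrete, the following is a minimal sketch of the straightforward domain decomposition that the abstract attributes to earlier parallel implementations. It is not the five-phase algorithm of the paper, whose details the abstract does not give. The sketch assumes that p denotes the number of processors, uses exact pattern matching as a stand-in for pattern identification, and simulates the p workers with a sequential loop. The function names `chunk_bounds`, `local_matches`, and `decomposed_find` are hypothetical.

```python
def chunk_bounds(n, p):
    """Split positions 0..n-1 into p nearly equal contiguous blocks."""
    base, extra = divmod(n, p)
    bounds, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def local_matches(seq, pattern, start, end):
    """Return match positions that begin in [start, end).

    The worker reads up to m - 1 characters past its block so that a
    match straddling a block boundary is found exactly once.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    window_end = min(len(seq), end + m - 1)
    hits = []
    pos = seq.find(pattern, start, window_end)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(pattern, pos + 1, window_end)
    return hits


def decomposed_find(seq, pattern, p):
    """Simulate p workers (assumed: p = number of processors) in a loop."""
    results = []
    for start, end in chunk_bounds(len(seq), p):
        # On a real cluster each iteration would run on its own node and
        # send its hit list back to a coordinator.
        results.extend(local_matches(seq, pattern, start, end))
    return results


if __name__ == "__main__":
    print(decomposed_find("ACGTACGTACGT", "ACGT", 5))
```

In this sketch each worker only needs its own block plus an overlap of m - 1 characters, so the blocks can be searched independently. On a cluster, the data volume and the number of messages exchanged between nodes then dominate the cost, which is the overhead the paper sets out to reduce.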
Keywords :
biology computing; diseases; genetics; molecular biophysics; parallel algorithms; pattern matching; sequences; biological computations; biological sequences; clusters; communication-efficient parallel algorithms; cost effectiveness; eukaryotic genomes; five computation/communication phases; flexibility; highly computation-intensive process; humans; inherited neurological diseases; interprocessor communication latency; low communication-overhead parallel algorithm; low-cost clusters; message units; parallel pattern identification; prokaryotic genomes; tandem repeats; ubiquitous sequence features; Bioinformatics; Biology computing; Clustering algorithms; Diseases; Genomics; Humans; Multiprocessor interconnection networks; Parallel algorithms; Parallel machines; Pervasive computing; Algorithms; Cluster Analysis; Computer Communication Networks; Computing Methodologies; Gene Expression Profiling; Pattern Recognition, Automated; Sequence Alignment; Sequence Analysis, DNA; Tandem Repeat Sequences;
Journal_Title :
NanoBioscience, IEEE Transactions on
DOI :
10.1109/TNB.2003.810165