An Efficient Parallel Implementation of the Hidden Markov Methods for Genomic Sequence-Search on a Massively Parallel System

Author

Jiang, Karl ; Thorsen, Oystein ; Peters, Amanda ; Smith, Brian ; Sosa, Carlos P.

Author_Institution

IBM, Rochester

Volume

19

Issue

1

fYear

2008

Firstpage

15

Lastpage

23

Abstract

Bioinformatics databases used for sequence comparison and sequence alignment are growing exponentially. This has popularized programs that carry out database searches. Current implementations of sequence alignment methods based on hidden Markov models (HMM) have proven to be computationally intensive and, hence, amenable to architectures with multiple processors. In this paper, we describe a modified version of the original parallel implementation of HMMs on a massively parallel system. This is part of the HMMER bioinformatics code. HMMER 2.3.2 uses profile HMMs for sensitive database searching based on statistical descriptions of a sequence family's consensus (Durbin et al., 1998). Two of the nine programs were further parallelized to take advantage of the large number of processors, namely, hmmsearch and hmmpfam. For our study, we start by porting the parallel virtual machine (PVM) versions of these two programs currently available as part of the HMMER suite of programs. We report the performance of these nonoptimized versions as baselines. Our work also includes the introduction of an alternate sequence file indexing, multiple-master configuration, dynamic data collection and, finally, load balancing via the indexed sequence files. This set of optimizations constitutes our modified version for massively parallel systems. Our results show parallel performance improvements of more than one order of magnitude (16 times) for hmmsearch and hmmpfam.

Keywords

biology computing; database indexing; genetics; hidden Markov models; resource allocation; virtual machines; HMMER 2.3.2; HMMER bioinformatics code; alternate sequence file indexing; bioinformatics databases; database searches; dynamic data collection; genomic sequence search; hmmpfam; hmmsearch; load balancing; massively parallel systems; multiple processors; multiple-master configuration; nonoptimized versions; parallel virtual machine; sensitive database searching; sequence comparison; HMMER; bioinformatics; genomic sequence-search; multiple master parallelization; parallel implementation

fLanguage

English

Journal_Title

Parallel and Distributed Systems, IEEE Transactions on

Publisher

IEEE

ISSN

1045-9219

Type

jour

DOI

10.1109/TPDS.2007.70712

Filename

4359412