Nested similarity searching for elucidation of evolutionary distant sequences

Author

Jean, Angela ; Lin, Feng ; Tong, Joo Chuan

Author_Institution

Dept. of Biochem., Nat. Univ. of Singapore, Singapore, Singapore

fYear

2010

fDate

6-8 Oct. 2010

Firstpage

266

Lastpage

271

Abstract

Large sets of related gene and protein data are often collected and examined to deduce their relationships and to provide insight into their evolution. Typically, sequences from primitive organisms would have undergone various mutations to give rise to orthologous sequences in more modern organisms. Heuristic tools are suitable for quick retrieval of similar sequences from databases. However, they are often unable to retrieve sequences that are distant yet evolutionarily related. Other tools are pattern-centric and focus on recurring conserved domains, but they also lack the capability of retrieving sequences of primitive organisms that have evolved through the addition or deletion of functional domains that are present only in more modern organisms or otherwise. To solve this problem, we devised a new algorithm that performs a nested search on BLAST results. Through this algorithm, we are able to elucidate sequences that would otherwise have eluded a single-pass searching process. In addition, because each retrieved sequence can be related back to the query sequence, a path of evolving sequences can be traced. Furthermore, by identifying tasks that can be executed concurrently, the proposed algorithm is parallelized and can be executed in a distributed environment. This avoids prohibitive running times for large-scale dataset searches while preserving the integrity of the results. Our experiments showed the effectiveness and efficiency of the algorithm running in a multi-processor, distributed environment. Although the process is resource-intensive, this cost is mitigated by pipelining and parallelizing parts of the algorithm, which maximizes the use of computing resources and minimizes idle time. In addition, iterative searching using full-length sequences ensures that the resultant set of related sequences can be used effectively for evolutionary comparative studies.
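
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of a nested search with path tracing, not the authors' implementation. It assumes a hypothetical `search` callable that returns the hit IDs for one query; here a toy dictionary (`TOY_HITS`, with invented IDs) stands in for real BLAST runs. The names `nested_search`, `trace_path`, `max_depth`, and `workers` are likewise illustrative assumptions. Each round re-submits the newly found hits as queries, runs those searches concurrently, and records which sequence led to which.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a BLAST run: maps a query ID to the IDs of its hits.
# These IDs and links are invented for illustration only.
TOY_HITS = {
    "query": ["seqA", "seqB"],
    "seqA": ["query", "seqC"],
    "seqB": ["query"],
    "seqC": ["seqA", "seqD"],
    "seqD": ["seqC"],
}


def toy_search(seq_id):
    """Hypothetical search function: return the hit IDs for one query."""
    return TOY_HITS.get(seq_id, [])


def nested_search(query, search, max_depth=3, workers=4):
    """Re-run `search` on each new hit, level by level.

    Returns a dict mapping every sequence found to the sequence whose
    search first retrieved it (the original query maps to None).
    """
    parent = {query: None}
    frontier = [query]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth):
            if not frontier:
                break
            # Searches within one level are independent, so run them concurrently.
            results = list(pool.map(search, frontier))
            next_frontier = []
            for source, hits in zip(frontier, results):
                for hit in hits:
                    if hit not in parent:  # skip sequences already seen
                        parent[hit] = source
                        next_frontier.append(hit)
            frontier = next_frontier
    return parent


def trace_path(parent, seq_id):
    """Follow the recorded links from a sequence back to the original query."""
    path = []
    while seq_id is not None:
        path.append(seq_id)
        seq_id = parent[seq_id]
    return path[::-1]


found = nested_search("query", toy_search)
print(trace_path(found, "seqD"))
```

In this toy data, a single pass (`max_depth=1`) reaches only `seqA` and `seqB`, whereas the nested rounds also reach `seqC` and `seqD`, and `trace_path` recovers the chain of intermediate sequences linking `seqD` to the query.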

Keywords

biology computing; genomics; query formulation; sequences; very large databases; BLAST results; computing resources; databases; distributed environment; elucidation; evolutionary distant sequences; gene data; heuristic tools; iterative searching; large-scale dataset; nested similarity searching; orthologous sequences; parallelizing parts; pipelining parts; protein data; query sequence; resource intensive process; sequences retrieval; single-pass searching process; Algorithm design and analysis; Bioinformatics; Databases; Heuristic algorithms; Organisms; Proteins; Runtime; Bioinformatics; comparative genomics; evolution; parallelized and pipelined computing; similarity searching;

fLanguage

English

Publisher

IEEE

Conference_Titel

Signal Processing Systems (SIPS), 2010 IEEE Workshop on

Conference_Location

San Francisco, CA

ISSN

1520-6130

Print_ISBN

978-1-4244-8932-9

Electronic_ISBN

1520-6130

Type

conf

DOI

10.1109/SIPS.2010.5624799

Filename

5624799