Title :
GSQCT: A solution to screening gene sequences for phylogenetics analysis
Author :
Meng, Zhen ; Li, Jianhui ; Zhou, Yunchun ; Cao, Wei ; Xiao, Xiao ; Zhao, Jing ; Dong, Hui ; Zhang, Shouzhou
Author_Institution :
Sci. Data Center, Comput. Network Inf. Center, Beijing, China
Abstract :
Screening data for phylogenetic analysis is a known Gordian knot. This paper presents GSQCT (Gene Sequence Quality Control Tool), a solution for screening gene sequence data. It first extracts initial datasets using gene annotation information. It then evaluates the sequences one by one: it calculates the content of uncertain characters to assess sequencing quality, detects stop codons to avoid pseudogenes, detects custom serial strings to remove contaminating sequence fragments, and calculates protein similarity against a template protein of the object gene for homology detection. Finally, it decides whether to select each sequence according to predetermined threshold ranges. A report of the screening result is produced, and multiple sequence alignment can be performed to verify homology with already verified sequences. The solution addresses two problems of existing gene data filtering, erroneous or ambiguous annotations and uneven sequencing accuracy, both of which lead to incorrect phylogenetic trees. An evaluation of the solution is presented and shows good accuracy and effectiveness. A parallel implementation with Hadoop (Map/Reduce) is available for download: http://www.darwintree.cn/tools.htm
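The per-sequence checks described in the abstract can be illustrated with a minimal Python sketch. This is not the GSQCT implementation: the function names, the threshold values (`max_uncertain`, `min_similarity`), the fixed reading frame, and the use of `difflib.SequenceMatcher` as a stand-in for a real protein similarity calculation are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def uncertain_fraction(seq):
    """Fraction of characters that are not an unambiguous A, C, G or T."""
    seq = seq.upper()
    if not seq:
        return 1.0
    return sum(ch not in "ACGT" for ch in seq) / len(seq)


def translate(seq):
    """Translate in reading frame 0; codons with ambiguous bases become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def has_internal_stop(protein):
    """True if a stop codon ('*') occurs anywhere before the final position."""
    return "*" in protein[:-1]


def similarity(protein, template):
    """Crude similarity ratio in [0, 1]; a stand-in for a real alignment score."""
    matcher = SequenceMatcher(None, protein.rstrip("*"), template, autojunk=False)
    return matcher.ratio()


def screen(seq, template, contaminants=(), max_uncertain=0.01, min_similarity=0.8):
    """Apply the four checks in order; thresholds here are hypothetical."""
    seq = seq.upper()
    if uncertain_fraction(seq) > max_uncertain:
        return False, "too many uncertain characters"
    protein = translate(seq)
    if has_internal_stop(protein):
        return False, "internal stop codon"
    if any(c.upper() in seq for c in contaminants):
        return False, "custom string found"
    if similarity(protein, template) < min_similarity:
        return False, "low similarity to template"
    return True, "accepted"
```

For example, `screen("ATGGCCAAATAA", "MAK")` accepts the sequence, while `screen("ATGTAAGCCAAA", "MAK")` rejects it because of the internal stop codon. A production tool would replace `similarity` with a proper alignment-based score and would consider all reading frames.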
Keywords :
biology computing; genetics; information filtering; molecular biophysics; proteins; quality control; GSQCT; Gordian knot; Hadoop MapReduce implementation; ambiguous annotation error; contaminative sequence fragment removal; custom serial string detection; gene annotation information; gene data filtering; gene sequence data screening; gene sequence quality control tool; homology detection; incorrect phylogenetics trees; multiple sequence alignment; object gene; phylogenetics analysis; protein similarity calculation; protein similarity detection; pseudogene avoidance; sequencing accuracy; sequencing quality detection; stop codon detection; template protein; Accuracy; Bioinformatics; Databases; Phylogeny; Protein sequence; Gene sequence data screening; MapReduce; Phylogenetics;
Conference_Title :
Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on
Conference_Location :
Sichuan
Print_ISBN :
978-1-4673-0025-4
DOI :
10.1109/FSKD.2012.6234066