Title :
Using geometric structures to improve the error correction algorithm of high-throughput sequencing data on MapReduce framework
Author :
Wei-Chun Chung ; Yu-Jung Chang ; D.T. Lee ; Jan-Ming Ho
Author_Institution :
Inst. of Inf. Sci., Taipei, Taiwan
Abstract :
Next-generation sequencing (NGS) data are a rapidly growing example of big data and a source of new knowledge in science. However, sequencing errors remain unavoidable and reduce the quality of NGS data. Error correction is therefore a critical step in the successful use of NGS data in applications such as de novo genome assembly and DNA resequencing. Since NGS throughput doubles approximately every five months and the length of NGS records (i.e., reads) is increasing, computational strategies must become more efficient and more effective. In this study, we aim to improve the performance of CloudRS, an open-source MapReduce application designed to correct sequencing errors in NGS data. We introduce the read-message (RM) diagram to represent the set of messages, i.e., the key-value pairs generated on each read. We also present the Gradient-number Votes (GNV) scheme, which trims off portions of the RM diagram and thereby reduces the total size of the messages associated with each read. Experimental results show that the GNV scheme reduces execution time and improves the quality of the de novo genome assembly.
Keywords :
Big Data; bioinformatics; diagrams; error correction; genetics; CloudRS; GNV; MapReduce framework; NGS data; RM diagram; error correction algorithm; geometric structures; gradient-number votes; next-generation sequencing data; read-message diagram; Assembly; DNA; Genomics; Sequential analysis; geometric structure; mapreduce; next-generation sequencing
Conference_Title :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/BigData.2014.7004306