Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Massachusetts, Lowell, MA, USA
Abstract :
In many counter-terrorism and natural-disaster scenarios that require geographically distributed, large-scale, sensor-based identification and prediction of biochemical agents or microorganisms, such as WMD events, as well as in many health care and medical applications, efficient large-scale metagenomics is crucial for purposes such as rapid and timely decontamination to restore the normal environment, or rapid and timely discovery of the right drug or therapy for injured individuals. Metagenomics is the study of all biochemicals and organisms collected directly from large natural environments, including geographically distributed disaster sites. Most of these collected biochemicals and organisms cannot be cultivated in a laboratory and hence cannot be sequenced as individual organisms. Metagenomics methods therefore allow relatively rapid sequencing of genomes from organisms that are obtained directly from a natural environment and cannot be cultured in a laboratory. Sorting the random fragments obtained from whole-genome shotgun sequencing into taxa-based groups is known as binning. Currently, there are two different classes of binning methods: sequence similarity methods and sequence composition methods. Sequence similarity methods are usually based on aligning sequences to known genomes, using tools like BLAST or MEGAN. Sequence composition methods are based on compositional features of a given DNA sequence, like k-mers or other genomic signatures, as used by tools like TETRA, PhyloPythia, CompostBin, and LikelyBin. In this paper we propose machine learning predictive DNA sequence feature selection algorithms to solve binning problems. In our prior work we showed feature selection/reduction and binning prediction based on direct nucleotide k-mers. Here we use combinations of two codons (amino acids) as features to differentiate between sequences. There are 20 different amino acids found in proteins. The combinations of two amino acids produce 20 × 20 = 400 features, which we use to differentiate between the metagenomic sequences.
The data reads used in this work have average lengths of 250, 500, 1000, and 2000 base pairs. Experimental results of the codon-based feature reduction and binning prediction algorithms, namely using a Random Forest classifier and a Bayes classifier, respectively, are presented along with a comparison to their DNA-based k-mer counterparts. The accuracy of the proposed algorithm is tested on a variety of data sets, and our findings show that the classification/prediction accuracy achieved is between 59% and 92% across data sets using the Random Forest classifier and between 44% and 64% using the Naïve Bayes classifier. The Random Forest classifier outperformed Naïve Bayes on all data sets.
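As an illustration of the 400-feature representation described in the abstract, the following is a minimal Python sketch, not the authors' implementation. It translates a DNA read into amino acids with the standard genetic code and counts adjacent amino-acid pairs. Several details are assumptions, since the abstract does not state them: a single forward reading frame, overlapping pairs, pairs containing a stop codon or an ambiguous base being skipped, and counts being normalized to frequencies. The names `translate` and `dipeptide_features` are hypothetical.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop.
BASES = "TCAG"
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: CODE[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# All 20 x 20 = 400 ordered amino-acid pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
INDEX = {pair: i for i, pair in enumerate(DIPEPTIDES)}


def translate(read, frame=0):
    """Translate a DNA read codon by codon; unknown codons become 'X'."""
    read = read.upper()
    return "".join(
        CODON_TABLE.get(read[i:i + 3], "X")
        for i in range(frame, len(read) - 2, 3)
    )


def dipeptide_features(read, frame=0):
    """Return a 400-element vector of amino-acid pair frequencies.

    Assumption: overlapping pairs are counted, and pairs containing a
    stop ('*') or unknown ('X') symbol are skipped.
    """
    protein = translate(read, frame)
    counts = [0] * len(DIPEPTIDES)
    for i in range(len(protein) - 1):
        j = INDEX.get(protein[i:i + 2])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


if __name__ == "__main__":
    features = dipeptide_features("ATGGCCATGGCC")
    print(len(features))
```

Each read maps to one fixed-length vector regardless of read length, so vectors of this kind could be passed to a classifier such as the Random Forest or Naïve Bayes classifiers named above.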
Keywords :
biology computing; genomics; learning (artificial intelligence); molecular biophysics; pattern classification; BLAST genome; Bayes classifier; CompostBin genomes; DNA sequence; MEGAN genome; Phylopythia genomes; TETRA genomes; WMD applications; WMD events; amino acids; binning methods; counter-terrorism; data reads; direct nucleotide k-mers; drug discovery; health care applications; likelyBin genomes; medical applications; metagenomics sequence; microorganisms; natural disasters; organism genomes; predictive DNA-codon metagenomics; proteins; random forest classifier; random-forest-based comparative machine learning; sensor-based biochemicals agent; sequence composition methods; sequence similarity methods; Accuracy; Amino acids; DNA; Genomics; Organisms; Sequential analysis; Testing; Binning; Bioinformatics; Computational intelligence; Machine learning; Metagenomics; Next generation Sequencing; Pattern Classification; Random forest; Reduction methods; bagged decision tree; codon; forward sequential feature selection; k-mers;