Title :
A machine learning approach to identify DNA replication proteins from sequence-derived features
Author :
Runtao Yang ; Chengjin Zhang ; Rui Gao ; Lina Zhang
Author_Institution :
Sch. of Control Sci. & Eng., Shandong Univ., Jinan, China
Abstract :
DNA replication, a critical step in cell division and proliferation, is the process of producing two identical replicas from one original DNA molecule. Although great advances have been made in DNA replication research, its detailed mechanism is still unresolved. Faithful DNA replication requires the cooperation of many proteins, and failures in replication leave mutations in the genome that can cause cancers and other diseases. Accurately identifying DNA replication proteins may therefore assist both in understanding the molecular mechanisms of DNA replication and in drug development. Because experimental methods are expensive and labor intensive, an accurate computational method for identifying DNA replication proteins is highly desirable. In this paper, a machine learning approach to identify DNA replication proteins is developed using a Naïve Bayes classifier and sequence-derived features. The prediction performance of features extracted from the Reduced Amino Acid Composition (RAAC) model and from two Pseudo Amino Acid Composition (PseAAC) models is investigated for each model separately. The prediction results indicate that the PseAAC (type 2) model yields the best performance. Using the PseAAC (type 2) model, we then compare our method with the similarity search method on an independent test dataset. The comparison results show that it is feasible to identify DNA replication proteins with machine learning algorithms. The proposed method may provide candidate DNA replication proteins for future experimental verification, supporting the study of the molecular mechanisms of DNA replication and drug development for the treatment of human diseases.
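The general idea in the abstract, composition features fed to a Naïve Bayes classifier, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the five-group amino acid reduction, the Gaussian form of the classifier, the variance floor, and the toy sequences are all hypothetical and are not taken from the paper, whose actual RAAC clustering, PseAAC parameters, and datasets are not given in the abstract.

```python
import math
from collections import Counter

# Hypothetical 5-group reduction of the 20 standard amino acids.
# This grouping is an illustrative assumption, not the paper's clustering.
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "charged": "KRDE",
    "special": "GP",
}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}
GROUP_ORDER = list(GROUPS)


def raac(sequence):
    """Return the fraction of residues falling in each reduced group."""
    counts = Counter(
        AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(GROUP_ORDER)
    return [counts[g] / total for g in GROUP_ORDER]


def fit_gaussian_nb(X, y, var_floor=1e-3):
    """Fit per-class log prior, feature means, and floored variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        # var_floor (an assumed value) avoids zero variance on tiny data.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(cols, means)
        ]
        model[label] = (math.log(n / len(y)), means, variances)
    return model


def predict(model, x):
    """Return the label with the highest Gaussian log posterior."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
            for v, m, s in zip(x, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


# Toy, made-up sequences (not real proteins): label 1 is charged-rich,
# label 0 is aliphatic-rich, purely to exercise the code.
train_seqs = ["KKRDEEKA", "RRKDEDEG", "AVLLIVMG", "LLIVAVCK"]
train_labels = [1, 1, 0, 0]
model = fit_gaussian_nb([raac(s) for s in train_seqs], train_labels)
```

Calling `predict(model, raac(seq))` on a new sequence returns the toy label whose training composition it most resembles. A realistic pipeline would replace the toy data with curated positive and negative protein sets and evaluate on a held-out test set, as the abstract describes.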
Keywords :
Bayes methods; DNA; biology computing; drugs; genetics; learning (artificial intelligence); proteins; DNA molecule; DNA replication protein; cell division; drug development; genome; machine learning; molecular mechanism; naive Bayes classifier; pseudo amino acid composition; reduced amino acid composition; sequence-derived feature; Accuracy; Amino acids; Diseases; Feature extraction; Sensitivity
Conference_Title :
Electrical and Computer Engineering (CCECE), 2015 IEEE 28th Canadian Conference on
Conference_Location :
Halifax, NS
Print_ISBN :
978-1-4799-5827-6
DOI :
10.1109/CCECE.2015.7129092