Title :
Bio-M: Data mining on HCV genotype 1 core sequences
Author :
Rakshmy, C.S. ; Abdul Nazeer, K.A. ; Chandra, S. S. Vinod
Author_Institution :
Dept. of Comput. Sci., Nat. Inst. of Technol., Calicut, India
Abstract :
Hepatitis C Virus (HCV) has become a major risk factor for the development of Hepatocellular Carcinoma (HCC). A framework comprising clustering, feature selection and classification has been developed to identify genomic markers in HCV sequences that are associated with HCC. A new hash-table-based method for extracting features from genomic sequences is proposed; it requires less memory than Generalized Suffix Tree based methods. Biomarkers are selected as features, and a Random Forest (RF) classifier is trained on them to classify HCV sequences with and without HCC. Using HCV sequence data from the European HCV Database (euHCVdb) and Los Alamos National Laboratory, we show that the performance of RF is comparable to that of an SVM classifier.
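The abstract does not describe the hash-table feature extraction in detail. As a rough illustration only, the sketch below assumes the features are fixed-length substrings (k-mers) indexed in a hash table that maps each substring to the sequences containing it. The function names, the `min_support` threshold and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def kmer_support(sequences, k):
    """Map each length-k substring to the set of sequence indices containing it."""
    table = defaultdict(set)  # hash table: k-mer -> indices of sequences
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(idx)
    return table


def feature_vectors(sequences, k, min_support=2):
    """Build binary presence/absence vectors over k-mers seen in >= min_support sequences.

    min_support is an assumed filtering rule, not the paper's selection criterion.
    """
    table = kmer_support(sequences, k)
    features = sorted(m for m, ids in table.items() if len(ids) >= min_support)
    vectors = [
        [1 if idx in table[m] else 0 for m in features]
        for idx in range(len(sequences))
    ]
    return features, vectors


# Toy strings for illustration only; not data from euHCVdb or Los Alamos.
toy = ["MSTNPKPQRK", "MSTNPKPQKK", "MSTIPKPQRK"]
features, vectors = feature_vectors(toy, k=3)
```

Vectors of this kind could then be passed to any standard classifier, such as a Random Forest or an SVM. The table holds one entry per distinct k-mer, whereas a generalized suffix tree indexes every suffix of every sequence.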
Keywords :
biology computing; data mining; diseases; feature extraction; genomics; medical computing; microorganisms; pattern classification; pattern clustering; support vector machines; tree data structures; trees (mathematics); Bio-M; European HCV Database; HCV genotype 1 core sequences; Los Alamos National Laboratory; SVM classifier; euHCVdb; generalized suffix tree based methods; genomic marker identification; genomic sequences; hash tables; hepatitis C virus; hepatocellular carcinoma; random forest classifier; Accuracy; Bioinformatics; Radio frequency; Vegetation
Conference_Title :
Data Science & Engineering (ICDSE), 2012 International Conference on
Conference_Location :
Cochin, Kerala
Print_ISBN :
978-1-4673-2148-8
DOI :
10.1109/ICDSE.2012.6282307