• DocumentCode
    78199
  • Title

    A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting

  • Author

    Ruiqi Liao ; Ruichang Zhang ; Jihong Guan ; Shuigeng Zhou

  • Author_Institution
    Shanghai Key Lab. of Intell. Inf. Process., Fudan Univ., Shanghai, China
  • Volume
    11
  • Issue
    1
  • fYear
    2014
  • fDate
    Jan.-Feb. 2014
  • Firstpage
    42
  • Lastpage
    54
  • Abstract
    The rapid development of high-throughput technologies enables researchers to sequence the whole metagenome of a microbial community sampled directly from the environment. Assigning these sequence reads to species or taxonomic classes, known as binning of metagenomic data, is a crucial step in metagenomic analysis. Most traditional binning methods rely on known reference genomes to assign reads accurately, and therefore cannot classify reads from unknown species that lack close references. To overcome this drawback, approaches based on unsupervised learning have been proposed, which do not need a reference genome from any known species. In this paper, we introduce a novel unsupervised method called MCluster for binning metagenomic sequences. This method uses N-grams to extract sequence features and automatic feature weighting to improve the performance of the basic K-means clustering algorithm. We evaluate MCluster on a variety of simulated data sets and one real data set, and compare it with three recent binning methods: AbundanceBin, MetaCluster 3.0, and MetaCluster 5.0. Experimental results show that MCluster achieves clearly better overall performance (F-measure) than AbundanceBin and MetaCluster 3.0 on long metagenomic reads (≥800 bp). Compared with MetaCluster 5.0, MCluster obtains higher sensitivity and a comparable yet more stable F-measure on short metagenomic reads ( bp). This suggests that MCluster can serve as a promising tool for effectively binning metagenomic sequences.
  • Keywords
    biology computing; feature extraction; genomics; microorganisms; sensitivity; sequences; unsupervised learning; AbundanceBin; F-measure; MCluster; MetaCluster 3.0; MetaCluster 5.0; N-grams; automatic feature weighting; basic K-means clustering algorithm; high-throughput technologies; metagenomic analysis; metagenomic data binning; metagenomic sequences; real data set; sampled microbial community; sequence feature extraction; short metagenomic reads; simulated data sets; taxonomical classes; traditional binning methods; unsupervised binning approach; Bioinformatics; Clustering algorithms; Vectors; Metagenomics; algorithms; binning; feature weighting
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2013.137
  • Filename
    6654133
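
The Abstract describes two ingredients: N-gram features extracted from each read, and K-means clustering improved by automatic feature weighting. The sketch below illustrates that general idea only and is not the MCluster implementation. The record does not say which weighting scheme the authors use, so an entropy-style rule (weights proportional to exp(-dispersion / gamma)) is swapped in as an assumption. The function names, the `gamma` parameter, and the farthest-point initialisation are all hypothetical choices made for illustration.

```python
from itertools import product
import math

ALPHABET = "ACGT"


def ngram_profile(read, n=2):
    """Return the normalised frequency of every length-n DNA word in `read`."""
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(read) - n + 1):
        word = read[i:i + n]
        if word in counts:  # skip words containing N or other symbols
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid dividing by zero on short reads
    return [counts[w] / total for w in words]


def weighted_sq_dist(a, b, weights):
    """Squared Euclidean distance with one weight per feature."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def init_centers(vectors, k, weights):
    """Deterministic farthest-point initialisation (an assumption, not from the paper)."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        far = max(vectors,
                  key=lambda v: min(weighted_sq_dist(v, c, weights) for c in centers))
        centers.append(list(far))
    return centers


def weighted_kmeans(vectors, k, iters=20, gamma=1.0):
    """K-means whose feature weights are re-estimated after every update step."""
    dim = len(vectors[0])
    weights = [1.0 / dim] * dim
    centers = init_centers(vectors, k, weights)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest center under the current feature weights.
        labels = [min(range(k), key=lambda c: weighted_sq_dist(v, centers[c], weights))
                  for v in vectors]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        # Weighting step (assumed rule): features with low within-cluster
        # dispersion receive larger weights; weights are normalised to sum to 1.
        disp = [sum((v[j] - centers[lab][j]) ** 2 for v, lab in zip(vectors, labels))
                for j in range(dim)]
        raw = [math.exp(-d / gamma) for d in disp]
        norm = sum(raw)
        weights = [r / norm for r in raw]
    return labels, weights


# Toy usage: two AT-rich and two GC-rich reads, binned into two clusters.
reads = ["ATATATATATAT", "AATTAATTAATT", "GCGCGCGCGCGC", "GGCCGGCCGGCC"]
profiles = [ngram_profile(r, n=2) for r in reads]
labels, weights = weighted_kmeans(profiles, k=2)
```

On real metagenomic reads the N-gram length, the number of clusters, and `gamma` would all need tuning, and a smaller `gamma` makes the weights diverge more sharply from uniform.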