• DocumentCode
    2349751
  • Title

    An unsupervised protein sequences clustering algorithm using functional domain information

  • Author

    Chen, Wei-Bang ; Zhang, Chengcui ; Zhong, Hua

  • Author_Institution
    Department of Computer and Information Sciences, University of Alabama at Birmingham, 35294, USA
  • fYear
    2008
  • fDate
    13-15 July 2008
  • Firstpage
    76
  • Lastpage
    81
  • Abstract
    In this paper, we present a novel unsupervised approach to protein sequence clustering that incorporates functional domain information into the clustering process. In the proposed framework, the domain boundaries predicted by the ProDom database are used to provide a better measure of sequence similarity. In addition, we use an unsupervised clustering algorithm as the kernel, which comprises hierarchical clustering in the first phase to pre-cluster the protein sequences and partitioning clustering in the second phase to refine the clustering results. More specifically, we perform agglomerative hierarchical clustering on the protein sequences in the first phase to obtain the initial clusters for the subsequent partitioning clustering, and then build a profile Hidden Markov Model (HMM) for each cluster to represent its centroid. In the second phase, HMM-based k-means clustering is performed to refine the clusters into protein families. The experimental results show that our model is effective and efficient in clustering protein families.
  • Keywords
    Biomedical measurements; Clustering algorithms; Clustering methods; Data mining; Databases; Hidden Markov models; Kernel; Merging; Partitioning algorithms; Protein sequence; Data Mining and Knowledge Discovery; ProDom database; Profile Hidden Markov Model (HMM); Protein Sequences Clustering;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Reuse and Integration, 2008. IRI 2008. IEEE International Conference on
  • Conference_Location
    Las Vegas, NV, USA
  • Print_ISBN
    978-1-4244-2659-1
  • Electronic_ISBN
    978-1-4244-2660-7
  • Type

    conf

  • DOI
    10.1109/IRI.2008.4583008
  • Filename
    4583008
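
The two-phase scheme described in the abstract (agglomerative pre-clustering, then k-means-style refinement around one representative per cluster) can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on loud assumptions: the domain-aware similarity derived from ProDom is replaced by a plain string distance from Python's `difflib`, and the profile HMM "centroid" is replaced by a medoid sequence. All function names (`distance`, `agglomerate`, `medoid`, `refine`, `two_phase_cluster`) are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a, b):
    # ASSUMPTION: generic string distance, standing in for the paper's
    # domain-aware sequence similarity.
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def agglomerate(seqs, k):
    # Phase 1: average-linkage agglomerative clustering, merging the two
    # closest clusters until only k remain. Clusters hold sequence indices.
    clusters = [[i] for i in range(len(seqs))]

    def linkage(c1, c2):
        total = sum(distance(seqs[i], seqs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


def medoid(seqs, members):
    # ASSUMPTION: the member with the smallest total distance to the others
    # stands in for the per-cluster profile HMM.
    return min(members, key=lambda i: sum(distance(seqs[i], seqs[j]) for j in members))


def refine(seqs, clusters, max_iter=10):
    # Phase 2: k-means-style loop. Recompute representatives, reassign every
    # sequence to its nearest representative, stop when nothing changes.
    for _ in range(max_iter):
        centers = [medoid(seqs, c) for c in clusters]
        new = [[] for _ in centers]
        for i in range(len(seqs)):
            best = min(range(len(centers)),
                       key=lambda c: distance(seqs[i], seqs[centers[c]]))
            new[best].append(i)
        new = [c for c in new if c]  # drop clusters that lost all members
        if sorted(map(sorted, new)) == sorted(map(sorted, clusters)):
            break
        clusters = new
    return clusters


def two_phase_cluster(seqs, k):
    return refine(seqs, agglomerate(seqs, k))
```

Calling `two_phase_cluster(sequences, k)` on a list of sequence strings returns a list of clusters, each a list of indices into the input. In a setting closer to the one the abstract describes, `distance` would be replaced by a domain-aware measure and `medoid` by a model fitted to each cluster, with the reassignment step scoring each sequence against every model.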