• DocumentCode
    404826
  • Title

    An efficient incremental protein sequence clustering algorithm

  • Author

    Vijaya, P.A. ; Murty, M. Narasimha ; Subramanian, D.K.

  • Author_Institution
    Dept. of Comput. Sci. & Autom., Indian Inst. of Sci., Bangalore, India
  • Volume
    1
  • fYear
    2003
  • fDate
    15-17 Oct. 2003
  • Firstpage
    409
  • Abstract
    Clustering is the division of data into groups of similar objects. The main objective of this unsupervised learning technique is to find a natural grouping or meaningful partition by using a distance or similarity function. Clustering techniques are applied to reduce data in processing schemes in which the data size is very large. An efficient incremental clustering algorithm, 'leaders-subleaders', an extension of the leader algorithm suitable for protein sequences in bioinformatics, is proposed for effective clustering and prototype selection for pattern classification. It is a simple and efficient technique for generating a hierarchical structure that exposes the subgroups/subclusters within each cluster, which may be used to find the superfamily, family and subfamily relationships of protein sequences. The experimental results (classification accuracy using the prototypes obtained, and computation time) of the proposed algorithm are compared with those of the leader-based and nearest neighbour classifier (NNC) methods. The proposed algorithm is found to be computationally efficient when compared to NNC. Classification accuracy obtained using the representatives generated by the leaders-subleaders method is found to be better than that obtained using leaders as representatives, and it approaches that of NNC if sequential search is used on the sequences from the selected subcluster.
  • Keywords
    computational complexity; medical computing; pattern classification; pattern clustering; proteins; sequences; unsupervised learning; bioinformatics; classification accuracy; distance function; incremental clustering algorithm; leader algorithm; leaders-subleaders method; nearest neighbour classifier; protein sequence clustering algorithm; similarity function; Bioinformatics; Clustering algorithms; Data analysis; Data mining; Partitioning algorithms; Pattern analysis; Pattern classification; Protein sequence; Prototypes; Sequences
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    TENCON 2003. Conference on Convergent Technologies for the Asia-Pacific Region
  • Print_ISBN
    0-7803-8162-9
  • Type

    conf

  • DOI
    10.1109/TENCON.2003.1273355
  • Filename
    1273355
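
The two-level scheme summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no thresholds or distance measure, so the two thresholds (`threshold`, `sub_threshold`), the toy `mismatch_distance` function, and all function names below are assumptions made for illustration. A real protein application would more plausibly use an alignment-based score. The sketch applies the standard single-scan leader algorithm once over all sequences, applies it again inside each cluster with a tighter threshold to obtain subleaders, and then classifies a query by descending leader, subleader, and finally a sequential search within the selected subcluster.

```python
from typing import Callable, List, Tuple

Distance = Callable[[str, str], float]
Cluster = Tuple[str, List[str]]               # (leader, members)
Hierarchy = List[Tuple[str, List[Cluster]]]   # (leader, subclusters)


def mismatch_distance(a: str, b: str) -> float:
    """Toy distance (assumption): fraction of mismatched positions,
    counting any length difference as mismatches."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longer


def leader_clusters(items: List[str], threshold: float,
                    dist: Distance) -> List[Cluster]:
    """Single-scan leader clustering: an item joins the first leader
    within `threshold`; otherwise it becomes a new leader."""
    clusters: List[Cluster] = []
    for item in items:
        for leader, members in clusters:
            if dist(item, leader) <= threshold:
                members.append(item)
                break
        else:
            clusters.append((item, [item]))
    return clusters


def leaders_subleaders(items: List[str], threshold: float,
                       sub_threshold: float,
                       dist: Distance = mismatch_distance) -> Hierarchy:
    """Leaders at `threshold`, then subleaders inside each cluster
    at the tighter `sub_threshold`."""
    if not sub_threshold < threshold:
        raise ValueError("sub_threshold must be smaller than threshold")
    return [(leader, leader_clusters(members, sub_threshold, dist))
            for leader, members in leader_clusters(items, threshold, dist)]


def nearest_in_subcluster(query: str, hierarchy: Hierarchy,
                          dist: Distance = mismatch_distance) -> str:
    """Pick the nearest leader, then the nearest subleader, then search
    sequentially among that subcluster's members."""
    _, subclusters = min(hierarchy, key=lambda c: dist(query, c[0]))
    _, members = min(subclusters, key=lambda s: dist(query, s[0]))
    return min(members, key=lambda m: dist(query, m))


if __name__ == "__main__":
    toy = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "CCCC"]
    tree = leaders_subleaders(toy, threshold=0.5, sub_threshold=0.25)
    for leader, subclusters in tree:
        print(leader, subclusters)
    print(nearest_in_subcluster("AATG", tree))
```

Because only leaders and subleaders are compared before the final sequential search, a query is measured against far fewer sequences than in a full nearest-neighbour scan, which is the efficiency trade-off the abstract describes. As with the plain leader algorithm, the result depends on the order in which sequences are presented.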