• DocumentCode
    1092726
  • Title

    Strategies for Identifying Statistically Significant Dense Regions in Microarray Data

  • Author

    Yip, A.M. ; Ng, M.K. ; Wu, E.H. ; Chan, T.F.

  • Author_Institution
    Nat. Univ. of Singapore, Singapore
  • Volume
    4
  • Issue
    3
  • fYear
    2007
  • Firstpage
    415
  • Lastpage
    429
  • Abstract
    We propose and study the notion of dense regions for the analysis of categorized gene expression data and present some searching algorithms for discovering them. The algorithms can be applied to any categorical data matrices derived from gene expression level matrices. We demonstrate that dense regions are simple but useful and statistically significant patterns that can be used to 1) identify genes and/or samples of interest and 2) eliminate genes and/or samples corresponding to outliers, noise, or abnormalities. Some theoretical studies on the properties of the dense regions are presented, which allow us to characterize dense regions into several classes and to derive tailor-made algorithms for different classes of regions. Moreover, an empirical simulation study on the distribution of the size of dense regions is carried out, which is then used to assess the significance of dense regions and to derive effective pruning methods to speed up the searching algorithms. Real microarray data sets are employed to test our methods. Comparisons with six other well-known clustering algorithms using synthetic and real data are also conducted, which confirm the superiority of our methods in discovering dense regions. The DRIFT code and a tutorial are available as supplemental material, which can be found on the Computer Society Digital Library at http://computer.org/tcbb/archives.htm.
  • Keywords
    arrays; biology computing; genetics; molecular biophysics; pattern clustering; statistical analysis; Computer Society Digital Library; DRIFT code; categorical data matrices; categorized gene expression data analysis; clustering algorithms; coexpressed genes; data abnormalities; data noise; data outliers; effective pruning methods; gene elimination; gene expression level matrices; gene identification; microarray data dense regions; searching algorithms; Algorithm design and analysis; Analysis of variance; Bayesian methods; Clustering algorithms; Computer Society; Conducting materials; Data analysis; Data mining; Gene expression; Testing; Dense region; bicluster; categorical data; clustering; coexpressed genes; gene expression; microarray; Algorithms; Cluster Analysis; Data Interpretation, Statistical; Database Management Systems; Databases, Protein; Gene Expression Profiling; Information Storage and Retrieval; Multigene Family; Oligonucleotide Array Sequence Analysis
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2007.1022
  • Filename
    4288067