  • DocumentCode
    3409237
  • Title
    Biclustering gene-feature matrices for statistically significant dense patterns
  • Author
    Koyutürk, Mehmet ; Szpankowski, Wojciech ; Grama, Ananth
  • Author_Institution
    Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
  • fYear
    2004
  • fDate
    16-19 Aug. 2004
  • Firstpage
    480
  • Lastpage
    484
  • Abstract
    Biclustering is an important problem that arises in diverse applications, including analysis of gene expression and drug interaction data. The problem can be formalized in various ways through different interpretations of the data and associated optimization functions. We focus on the problem of finding unusually dense patterns in binary (0-1) matrices. This formulation is appropriate for analyzing experimental datasets that come not only from binary quantization of gene expression data, but also from more comprehensive datasets such as gene-feature matrices that include functions of coded proteins and motifs in the coding sequence. We formalize the notion of an "unusually" dense submatrix to evaluate the interestingness of a pattern in terms of statistical significance, based on the assumption of a uniform memoryless source. We then simplify this formulation to assess the statistical significance of discovered patterns. Using statistical significance as an objective function, we formulate the problem as one of finding significant dense submatrices of a large sparse matrix. Adopting a simple iterative heuristic along with randomized initialization techniques, we derive fast algorithms for discovering binary biclusters. We conduct experiments on a binary gene-feature matrix and a quantized breast tumor gene expression matrix. Our experimental results show that the proposed method quickly discovers all interesting patterns in these datasets.
  • Keywords
    biological organs; genetics; iterative methods; medical computing; molecular biophysics; pattern clustering; randomised algorithms; statistical analysis; tumours; biclustering; binary biclusters; coded proteins; coding sequence; drug interaction; gene expression; gene-feature matrices; iterative heuristic; quantized breast tumor gene expression matrix; randomized initialization techniques; statistically significant dense patterns; uniform memoryless source; Application software; Breast tumors; Data analysis; Drugs; Gene expression; Iterative algorithms; Proteins; Quantization; Sequences; Sparse matrices
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE
  • Print_ISBN
    0-7695-2194-0
  • Type
    conf
  • DOI
    10.1109/CSB.2004.1332467
  • Filename
    1332467