DocumentCode :
2849943
Title :
Dependencies between transcription factor binding sites: comparison between ICA, NMF, PLSA and frequent sets
Author :
Hiisilä, Heli ; Bingham, Ella
Author_Institution :
Neural Networks Res. Centre, Helsinki Univ. of Technol., Finland
fYear :
2004
fDate :
1-4 Nov. 2004
Firstpage :
114
Lastpage :
121
Abstract :
Gene expression of eucaryotes is regulated through transcription factors, which are molecules able to attach to binding sites in the DNA sequence. These binding sites are short pieces of DNA usually found upstream from the gene they regulate. Because the binding sites play an important role in gene expression, it is of interest to find out their characteristics. In this paper, we look for dependencies and independencies between these binding sites using independent component analysis (ICA), non-negative matrix factorization (NMF), probabilistic latent semantic analysis (PLSA) and the method of frequent sets. The data used are human gene upstream regions and possible binding sites listed in a biological database. Results on baker's yeast (S. cerevisiae) upstream regions are also briefly discussed for comparison. ICA, NMF and PLSA are latent variable methods that decompose the observed data into smaller components. Of these, ICA and NMF were originally intended for continuous data. We show that these methods can be successfully used on discrete DNA data as well. PLSA and the method of frequent sets were created for discrete data sets. The above methods reveal partially overlapping sets of possible binding sites such that the binding sites within a set are dependent on each other. The methods of frequent sets and NMF give a good overview of the most common data structures, whereas using ICA and PLSA we find large sets that are surprisingly frequent. That is, sets of very frequently occurring possible binding sites can be found near hundreds or thousands of genes; less frequent but still interesting sets also co-occur surprisingly often.
Keywords :
DNA; biology computing; data mining; genetics; independent component analysis; matrix decomposition; probability; DNA sequence; S. Cerevisiae; baker yeast; biological database; data structures; eucaryotes; frequent sets; gene expression; human gene upstream regions; independent component analysis; nonnegative matrix factorization; probabilistic latent semantic analysis; transcription factor binding sites; transcription factors; Computer networks; DNA; Databases; Fungi; Gene expression; Humans; Independent component analysis; Laboratories; Neural networks; Sequences;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2004. ICDM '04. Fourth IEEE International Conference on
Print_ISBN :
0-7695-2142-8
Type :
conf
DOI :
10.1109/ICDM.2004.10086
Filename :
1410274