Regional Information Center for Science and Technology - Dependencies between transcription factor binding sites: comparison between ICA, NMF, PLSA and frequent sets

DocumentCode :

2849943

Title :

Dependencies between transcription factor binding sites: comparison between ICA, NMF, PLSA and frequent sets

Author :

Hiisilä, Heli ; Bingham, Ella

Author_Institution :

Neural Networks Res. Centre, Helsinki Univ. of Technol., Finland

fYear :

2004

fDate :

1-4 Nov. 2004

Firstpage :

114

Lastpage :

121

Abstract :

Gene expression of eucaryotes is regulated through transcription factors, which are molecules able to attach to the binding sites in the DNA sequence. These binding sites are small pieces of DNA usually found upstream from the gene they regulate. As the binding sites play an important role in gene expression, it is of interest to find out their characteristics. In this paper, we look for dependencies and independencies between these binding sites using independent component analysis (ICA), non-negative matrix factorization (NMF), probabilistic latent semantic analysis (PLSA) and the method of frequent sets. The data used are human gene upstream regions and possible binding sites listed in a biological database. Also, results on the baker's yeast (S. cerevisiae) upstream regions are briefly discussed for comparison. ICA, NMF and PLSA are latent variable methods that decompose the observed data into smaller components. Of these, ICA and NMF were originally designed for continuous data. We show that these methods can be successfully used on discrete DNA data as well. PLSA and the method of frequent sets were created for discrete data sets. The above methods reveal partially overlapping sets of possible binding sites such that the binding sites within a set are dependent on each other. The methods of frequent sets and NMF give a good overview of the most common data structures, whereas using ICA and PLSA we find large sets that are surprisingly frequent. That is, sets of very frequently occurring possible binding sites can be found near hundreds or thousands of genes; also interesting but less frequent ones co-occur surprisingly often.
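The method of frequent sets named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data, the site names, the function name `frequent_sets` and the support threshold are all assumptions made for illustration. Each upstream region is treated as a set of possible binding sites, and a level-wise (Apriori-style) search keeps every site set that occurs in at least `min_support` regions.

```python
from itertools import combinations

# Hypothetical toy data (not from the paper): each gene's upstream region
# is represented by the set of possible binding sites found in it.
upstream_sites = {
    "gene1": {"A", "B", "C"},
    "gene2": {"A", "B"},
    "gene3": {"A", "B", "D"},
    "gene4": {"C", "D"},
    "gene5": {"A", "B", "C"},
}


def frequent_sets(transactions, min_support):
    """Return {site_set: count} for every set of sites that occurs in at
    least min_support transactions, using a level-wise Apriori-style search."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # size-1 candidates
    result = {}
    while current:
        # Count how many transactions contain each candidate set.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join step: build candidates one element larger from frequent sets.
        k = len(current[0]) + 1
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in frequent
                          for s in combinations(c, k - 1))]
    return result


found = frequent_sets(upstream_sites.values(), min_support=3)
for site_set, count in sorted(found.items(),
                              key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(site_set), count)
```

In this toy data the pair of sites A and B co-occurs in four of the five regions, so it survives the threshold, while the pairs involving C do not. On real upstream-region data the threshold would need to be chosen with the data size in mind.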

Keywords :

DNA; biology computing; data mining; genetics; independent component analysis; matrix decomposition; probability; DNA sequence; S. Cerevisiae; baker yeast; biological database; data structures; eucaryotes; frequent sets; gene expression; human gene upstream regions; independent component analysis; nonnegative matrix factorization; probabilistic latent semantic analysis; transcription factor binding sites; transcription factors; Computer networks; DNA; Databases; Fungi; Gene expression; Humans; Independent component analysis; Laboratories; Neural networks; Sequences;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Data Mining, 2004. ICDM '04. Fourth IEEE International Conference on

Print_ISBN :

0-7695-2142-8

Type :

conf

DOI :

10.1109/ICDM.2004.10086

Filename :

1410274

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2849943