DocumentCode :
80961
Title :
The Depth Problem: Identifying the Most Representative Units in a Data Group
Author :
Irigoien, Itziar ; Mestres, Francesc ; Arenas, Concepcion
Author_Institution :
Dept. of Comput. Sci. & Artificial Intell., Univ. of the Basque Country, Donostia, Spain
Volume :
10
Issue :
1
fYear :
2013
fDate :
Jan.-Feb. 2013
Firstpage :
161
Lastpage :
172
Abstract :
This paper presents a solution to the problem of how to identify the units in groups or clusters that have the greatest degree of centrality and best characterize each group. This problem frequently arises in the classification of data such as types of tumor, gene expression profiles or general biomedical data. It is particularly important in the common context that many units do not properly belong to any cluster. Furthermore, in gene expression data classification, good identification of the most central units in a cluster enables recognition of the most important samples in a particular pathological process. We propose a new depth function that allows us to identify central units. As our approach is based on a measure of distance or dissimilarity between any pair of units, it can be applied to any kind of multivariate data (continuous, binary or multiattribute data). Therefore, it is very valuable in many biomedical applications, which usually involve noncontinuous data, such as clinical, pathological, or biological data sources. We validate the approach using artificial examples and apply it to empirical data. The results show the good performance of our statistical approach.
Keywords :
bioinformatics; biological techniques; biomedical engineering; data analysis; pattern classification; pattern clustering; statistical analysis; binary data; biological data sources; biomedical applications; centrality degree; clinical data sources; data group; depth problem; gene expression data classification; gene expression profiles; general biomedical data; most representative units; multiattribute data; multivariate data; noncontinuous data; pathological data sources; statistical approach; tumor types; Bioinformatics; Computational biology; Context; Covariance matrix; Gaussian distribution; Gene expression; Kernel; Cluster analysis; central unit; data depth; depth function; gene expression data; geometric variability; kernel; proximity function; Algorithms; Cluster Analysis; Computational Biology; Computer Simulation; Databases, Factual; Gene Expression Profiling; Humans; Models, Biological; Neoplasms; Reproducibility of Results;
fLanguage :
English
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher :
IEEE
ISSN :
1545-5963
Type :
jour
DOI :
10.1109/TCBB.2012.147
Filename :
6365622