DocumentCode
3499591
Title
Optimistic bias in the assessment of high dimensional classifiers with a limited dataset
Author
Chen, Weijie ; Brown, David G.
Author_Institution
Food and Drug Administration, Silver Spring, MD, USA
fYear
2011
fDate
July 31 2011-Aug. 5 2011
Firstpage
2698
Lastpage
2703
Abstract
It is commonly recognized that using the same dataset for training and testing a classifier introduces optimistic bias in estimating classifier performance. However, bias of the same kind may still exist even when independent datasets are used for training and testing a classifier. This problem is especially important in the setting of high dimensional feature space and limited data. Bioinformatics data is typically characterized by a tremendous amount of data per patient but from a limited number of patients. Often the entire dataset is utilized in a "pre-training" stage during which the feature set is winnowed to a manageable number, and the parameters of the training algorithm are established. Subsequently the data is bifurcated into training and test sets; however, bias has already been introduced into the classifier development process. We investigate the significance of this bias by performing simulated gene expression experiments. We find that, for data with moderate intrinsic separability and modest sample size, any observed separation is due to selection bias introduced in the aforementioned pre-training process. For greater intrinsic separability, correct data hygiene, i.e., complete separation of development and validation data, yields a positive result, but one far less impressive than that mistakenly obtained using incomplete data separation.
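The pre-training bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experiment: the Gaussian noise model, the t-like feature score, the nearest-centroid classifier, the sample sizes, and all function names (`select_features`, `run_trial`, `simulate`) are assumptions chosen for illustration. It compares feature selection on the entire dataset before splitting ("leaky") with feature selection on the training half only ("clean").

```python
import numpy as np


def select_features(X, y, k):
    """Indices of the k features with the largest absolute t-like score."""
    a, b = X[y == 0], X[y == 1]
    diff = a.mean(axis=0) - b.mean(axis=0)
    spread = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.argsort(-np.abs(diff / spread))[:k]


def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Train a nearest-centroid classifier and return its test accuracy."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = ((X_te - c0) ** 2).sum(axis=1)
    d1 = ((X_te - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_te).mean())


def run_trial(rng, n_per_class=20, n_features=2000, k=10, delta=0.0):
    """One simulated dataset; returns (leaky accuracy, clean accuracy).

    delta is the assumed mean shift of class 1 in the first k features;
    delta=0.0 means the classes have no intrinsic separability at all.
    """
    y = np.repeat([0, 1], n_per_class)
    X = rng.standard_normal((2 * n_per_class, n_features))
    X[y == 1, :k] += delta

    # Stratified split: half of each class for training, half for testing.
    idx0 = rng.permutation(n_per_class)
    idx1 = n_per_class + rng.permutation(n_per_class)
    half = n_per_class // 2
    train = np.concatenate([idx0[:half], idx1[:half]])
    test = np.concatenate([idx0[half:], idx1[half:]])

    # Leaky: features chosen using ALL samples, test samples included.
    leaky = select_features(X, y, k)
    acc_leaky = nearest_centroid_accuracy(
        X[train][:, leaky], y[train], X[test][:, leaky], y[test]
    )

    # Clean: features chosen using the training samples only.
    clean = select_features(X[train], y[train], k)
    acc_clean = nearest_centroid_accuracy(
        X[train][:, clean], y[train], X[test][:, clean], y[test]
    )
    return acc_leaky, acc_clean


def simulate(n_trials=200, seed=0, **kwargs):
    """Mean (leaky, clean) test accuracy over repeated simulated datasets."""
    rng = np.random.default_rng(seed)
    results = np.array([run_trial(rng, **kwargs) for _ in range(n_trials)])
    return tuple(results.mean(axis=0))


if __name__ == "__main__":
    leaky, clean = simulate()
    print(f"selection on all data:   mean test accuracy = {leaky:.3f}")
    print(f"selection on train only: mean test accuracy = {clean:.3f}")
```

With `delta=0.0` the data carry no class information, so any accuracy above chance in the leaky arm comes from selection bias alone, while the clean arm should stay near 0.5. Raising `delta` gives a rough way to explore the abstract's second case, in which correct separation yields a real but smaller effect than the leaky estimate suggests.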
Keywords
bioinformatics; genetics; pattern classification; bioinformatics data; classifier development process; gene expression; high dimensional classifier; optimistic bias; training algorithm; Breast cancer; Classification algorithms; Covariance matrix; Measurement; Signal to noise ratio; Testing; Training;
fLanguage
English
Publisher
IEEE
Conference_Titel
Neural Networks (IJCNN), The 2011 International Joint Conference on
Conference_Location
San Jose, CA
ISSN
2161-4393
Print_ISBN
978-1-4244-9635-8
Type
conf
DOI
10.1109/IJCNN.2011.6033572
Filename
6033572