Title :
Empirical evaluation of ensemble feature subset selection methods for learning from a high-dimensional database in drug design
Author :
Mamitsuka, Hiroshi
Author_Institution :
Inst. for Chem. Res., Kyoto Univ., Uji, Japan
Abstract :
Discovering a new drug is one of the most important goals not only in the pharmaceutical field but also in a variety of other fields, including molecular biology, chemistry and medical science. The importance of computationally understanding the relationship between a given chemical compound and its drug activity has been emphasized. In a data set on the drug activity of chemical compounds, each row corresponds to a chemical compound, and the columns are the descriptors of the compound and a label indicating its drug activity. Recently, the number of descriptors has grown in order to obtain more detailed information from a given set of compounds; in some drug data sets, the number of columns (attributes or features) reaches hundreds of thousands or even a million. The purpose of this paper is to empirically evaluate the performance of ensemble feature subset selection strategies by applying them to such a high-dimensional data set actually used in the process of drug design. We examined the performance of three ensemble methods, including a query learning based method, and compared it with that of one of the latest feature subset selection methods. The evaluation was performed on a data set that contains approximately 140,000 features. Our results show that the query learning based method outperformed the other three methods in terms of both final prediction accuracy and time efficiency. We also examined the effect of noise in the data and found that the advantage of the method becomes more pronounced at larger noise levels.
Keywords :
biochemistry; database management systems; learning (artificial intelligence); medical computing; patient treatment; pharmaceutical industry; chemical compound; chemistry; data set; drug activity; drug design; ensemble feature subset selection methods; feature subset selection methods; feature subset selection strategies; high-dimensional database; medical science; molecular biology; noise levels; pharmaceutical field; query learning based method; Biology computing; Chemical compounds; Chemistry; Drugs; Learning systems; Noise level; Performance evaluation; Pharmaceuticals; Process design; Spatial databases;
Conference_Title :
Bioinformatics and Bioengineering, 2003. Proceedings. Third IEEE Symposium on
Print_ISBN :
0-7695-1907-5
DOI :
10.1109/BIBE.2003.1188959