A scalable solution for group feature selection

Author

Priya Govindan;Ruobing Chen;Katya Scheinberg;Soundararajan Srinivasan

Author_Institution

Rutgers University

fYear

2015

Firstpage

2846

Lastpage

2848

Abstract

In many applications, we want to build a classifier with high confidence while reducing the number of features. We consider the case where features are assigned to predefined groups and cannot be removed individually. An additional, important constraint is that the datasets may be very large and may not fit in memory. We use logistic regression with a group penalty, which yields solutions that are sparse at the group level. In our implementation, we apply L-BFGS to build a quadratic approximation of the logistic regression loss function and use block coordinate descent to solve for each group. Our contributions can be summarized as follows: (1) we discuss different scalable approaches, depending on characteristics of the dataset, such as a large number of data points, a large number of features, or a large number of groups; (2) for datasets with a large number of data points and few groups of features, we identify the bottlenecks for scalability; (3) we present Spark solutions in Python and discuss the advantages of our solution over alternative solutions; (4) we present experiments and results on synthetic data and on real data from manufacturing applications.
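To illustrate how a group penalty removes whole groups of features at once, here is a minimal single-machine sketch in pure Python. It is not the method described in the abstract: instead of L-BFGS with block coordinate descent on Spark, it uses plain proximal gradient descent, chosen only because it is short. The function names, the toy dataset, the step size, and the sqrt(group size) weighting of the penalty are all assumptions made for illustration.

```python
import itertools
import math


def group_soft_threshold(block, threshold):
    """Proximal operator of threshold * ||block||_2.

    Shrinks the whole block toward zero and sets it exactly to zero
    when its Euclidean norm is at most the threshold.
    """
    norm = math.sqrt(sum(v * v for v in block))
    if norm <= threshold:
        return [0.0] * len(block)
    scale = 1.0 - threshold / norm
    return [scale * v for v in block]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_gradient(X, y, w):
    """Gradient of the mean logistic loss; labels in y are 0 or 1."""
    n = len(X)
    grad = [0.0] * len(w)
    for row, label in zip(X, y):
        residual = sigmoid(sum(a * b for a, b in zip(row, w))) - label
        for j, a in enumerate(row):
            grad[j] += residual * a / n
    return grad


def fit_group_lasso_logistic(X, y, groups, lam, step=0.5, iters=500):
    """Proximal gradient descent for group-penalized logistic regression.

    groups is a list of index lists, one per feature group (assumed
    non-overlapping). Each group's penalty is weighted by
    sqrt(group size), a common convention assumed here.
    """
    w = [0.0] * len(X[0])
    for _ in range(iters):
        grad = logistic_gradient(X, y, w)
        w = [wj - step * gj for wj, gj in zip(w, grad)]
        for idx in groups:
            threshold = step * lam * math.sqrt(len(idx))
            shrunk = group_soft_threshold([w[j] for j in idx], threshold)
            for j, v in zip(idx, shrunk):
                w[j] = v
    return w


def make_toy_data():
    """Hypothetical toy data: all 16 sign patterns of 4 features.

    The label depends only on feature 0, so the group [2, 3] carries
    no information about the label.
    """
    X = [list(map(float, row)) for row in itertools.product([-1, 1], repeat=4)]
    y = [1.0 if row[0] > 0 else 0.0 for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    weights = fit_group_lasso_logistic(X, y, groups=[[0, 1], [2, 3]], lam=0.1)
    print(weights)
```

On this toy dataset the uninformative group [2, 3] is set exactly to zero, while the group containing feature 0 is kept as a unit, which is the group-level sparsity the abstract refers to. Because the gradient is a sum over data points, the `logistic_gradient` step is the part one would distribute across partitions when the data does not fit in memory.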

Keywords

"Sparks","Logistics","Runtime","Sparse matrices","Approximation methods","Machine learning algorithms","Big data"

Publisher

IEEE

Conference_Title

Big Data (Big Data), 2015 IEEE International Conference on

Type

conf

DOI

10.1109/BigData.2015.7364098

Filename

7364098