DocumentCode
2729184
Title
A novel algorithm for technical articles classification based on gene selection
Author
Kilany, R. ; Ammar, Reda ; Rajasekaran, Sanguthevar
Author_Institution
Comput. Sci. & Eng. Dept., Univ. of Connecticut, Storrs, CT, USA
fYear
2012
fDate
1-4 July 2012
Abstract
Research in science and engineering has resulted in the generation of voluminous datasets. For instance, biological databases such as PubMed now have millions of articles. Given this growth in data, retrieving information relevant to a specific topic has become a major challenge. In this paper we focus on the problem of retrieving articles pertaining to a given topic from among a huge collection of articles. In particular, we investigate the problem of classifying articles. Though numerous techniques and tools are available for document classification, a common shortcoming is that they take too much time. In this paper we present generic computational techniques that can classify articles efficiently. Our algorithms are based on algorithms that have been proposed for a related problem called gene selection. Gene selection is the problem of identifying a minimum set of genes that are responsible for certain events (for example, the presence of cancer). Even though gene selection was originally proposed for biological data analysis, the technique itself is generic. For example, 'genes' can be thought of as generic variables. A typical tool that we envision will take as input a set of keywords (that characterize the information of interest) and will develop a learner that identifies a small subset of the keywords capable of classifying papers into two types. A paper is of the first type if it has information of interest and of the second type if it does not. Experiments show that the new algorithm obtains a higher classification accuracy using a smaller number of selected keywords when compared to one of the best algorithms reported in the literature.
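The keyword-selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify; it assumes a plain greedy forward selection with a simple weighted-vote classifier, and every name in it (`keyword_weights`, `predict`, `select_keywords`, the toy papers) is hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Paper = Set[str]  # the keywords that occur in one paper


def keyword_weights(papers: Sequence[Paper], labels: Sequence[bool],
                    keywords: Sequence[str]) -> Dict[str, float]:
    """Weight = keyword frequency in relevant papers minus frequency in the rest."""
    pos = [p for p, y in zip(papers, labels) if y]
    neg = [p for p, y in zip(papers, labels) if not y]
    weights = {}
    for kw in keywords:
        freq_pos = sum(kw in p for p in pos) / max(len(pos), 1)
        freq_neg = sum(kw in p for p in neg) / max(len(neg), 1)
        weights[kw] = freq_pos - freq_neg
    return weights


def predict(paper: Paper, selected: Sequence[str],
            weights: Dict[str, float]) -> bool:
    """A paper is 'of interest' if its selected keywords vote positive overall."""
    return sum(weights[kw] for kw in selected if kw in paper) > 0


def accuracy(papers: Sequence[Paper], labels: Sequence[bool],
             selected: Sequence[str], weights: Dict[str, float]) -> float:
    hits = sum(predict(p, selected, weights) == y for p, y in zip(papers, labels))
    return hits / len(papers)


def select_keywords(papers: Sequence[Paper], labels: Sequence[bool],
                    keywords: Sequence[str]) -> Tuple[List[str], float]:
    """Greedy forward selection: add the keyword that most improves training
    accuracy, and stop as soon as no keyword improves it (assumed criterion)."""
    weights = keyword_weights(papers, labels, keywords)
    selected: List[str] = []
    best = accuracy(papers, labels, selected, weights)
    while True:
        scored = [(accuracy(papers, labels, selected + [kw], weights), kw)
                  for kw in keywords if kw not in selected]
        if not scored:
            break
        top_acc, top_kw = max(scored, key=lambda t: t[0])
        if top_acc <= best:
            break
        selected.append(top_kw)
        best = top_acc
    return selected, best


# Toy data (hypothetical): two papers of interest, two that are not.
papers = [{"svm", "kernel", "gene"}, {"svm", "training"},
          {"protein", "gene"}, {"protein", "training"}]
labels = [True, True, False, False]
keywords = ["svm", "kernel", "gene", "protein", "training"]

selected, acc = select_keywords(papers, labels, keywords)
print(selected, acc)
```

On this toy input a single keyword is enough to separate the two types of paper, which mirrors the stated goal of classifying with a small keyword subset. A realistic version would measure accuracy on held-out papers rather than on the training set.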
Keywords
data mining; information retrieval; learning (artificial intelligence); pattern classification; support vector machines; text analysis; article collection; article retrieval; data growth; document classification; gene selection; generic computational technique; generic variable; information characterization; information retrieval; keyword selection; keyword subset identification; learning; paper classification; technical article classification; text mining; voluminous dataset; Accuracy; Algorithm design and analysis; Classification algorithms; Correlation; Kernel; Support vector machines; Training; Data Mining; SVM; document classification; text categorization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computers and Communications (ISCC), 2012 IEEE Symposium on
Conference_Location
Cappadocia
ISSN
1530-1346
Print_ISBN
978-1-4673-2712-1
Electronic_ISBN
1530-1346
Type
conf
DOI
10.1109/ISCC.2012.6249300
Filename
6249300