• DocumentCode
    2729184
  • Title
    A novel algorithm for technical articles classification based on gene selection
  • Author
    Kilany, R. ; Ammar, Reda ; Rajasekaran, Sanguthevar

  • Author_Institution
    Comput. Sci. & Eng. Dept., Univ. of Connecticut, Storrs, CT, USA
  • fYear
    2012
  • fDate
    1-4 July 2012
  • Abstract
    Research in science and engineering has resulted in the generation of voluminous datasets. For instance, biological databases such as PubMed now have millions of articles. Given this growth in data, the problem of retrieving information relevant to a specific topic has become a major challenge. In this paper we focus on the problem of retrieving articles pertaining to a given topic from among a huge collection of articles. In particular, we investigate the problem of classifying articles. Though numerous techniques and tools are available for document classification, a shortcoming is that they take too much time. In this paper we present generic computational techniques that can classify articles efficiently. Our algorithms are based on algorithms that have been proposed for a related problem called gene selection. Gene selection is the problem of identifying a minimum set of genes that are responsible for certain events (for example, the presence of cancer). Even though gene selection was originally proposed for biological data analysis, the technique itself is generic. For example, 'genes' can be thought of as generic variables. A typical tool that we envision will take as input a set of keywords (that characterize the information of interest) and will develop a learner that will identify a small subset of the keywords that is capable of classifying papers into two types. A paper is of the first type if it has information of interest and of the second type if it does not. Experiments show that the new algorithm obtains a higher classification accuracy using a smaller number of selected keywords when compared to one of the best algorithms reported in the literature.
  • Keywords
    data mining; information retrieval; learning (artificial intelligence); pattern classification; support vector machines; text analysis; article collection; article retrieval; data growth; document classification; gene selection; generic computational technique; generic variable; information characterization; information retrieval; keyword selection; keyword subset identification; learning; paper classification; technical article classification; text mining; voluminous dataset; Accuracy; Algorithm design and analysis; Classification algorithms; Correlation; Kernel; Support vector machines; Training; Data Mining; SVM; document classification; text categorization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computers and Communications (ISCC), 2012 IEEE Symposium on
  • Conference_Location
    Cappadocia
  • ISSN
    1530-1346
  • Print_ISBN
    978-1-4673-2712-1
  • Electronic_ISBN
    1530-1346
  • Type
    conf
  • DOI
    10.1109/ISCC.2012.6249300
  • Filename
    6249300
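
The abstract describes a tool that takes a set of candidate keywords, keeps a small subset of them, and uses that subset to sort papers into two types (of interest or not). The record does not give the authors' algorithm, so the following is only a minimal sketch of that general idea under stated assumptions: keywords are scored by the gap in document frequency between the two classes, and a paper is classified by a signed vote over the selected keywords. This scoring rule, the function names, and the toy data are all hypothetical illustrations; the paper itself builds on gene selection and support vector machines, which are not reproduced here.

```python
def keyword_scores(docs, labels, keywords):
    """Signed score per keyword: document frequency among relevant papers
    (label 1) minus document frequency among non-relevant papers (label 0).
    Assumed scoring rule for illustration only, not the authors' method."""
    pos = [set(d.lower().split()) for d, y in zip(docs, labels) if y == 1]
    neg = [set(d.lower().split()) for d, y in zip(docs, labels) if y == 0]
    scores = {}
    for kw in keywords:
        p = sum(kw in d for d in pos) / max(len(pos), 1)
        n = sum(kw in d for d in neg) / max(len(neg), 1)
        scores[kw] = p - n
    return scores


def select_keywords(scores, k):
    """Keep the k keywords with the largest absolute score."""
    return sorted(scores, key=lambda kw: abs(scores[kw]), reverse=True)[:k]


def classify(doc, selected, scores):
    """Return 1 (of interest) if the selected keywords present in the
    document have a positive total score, otherwise 0."""
    words = set(doc.lower().split())
    vote = sum(scores[kw] for kw in selected if kw in words)
    return 1 if vote > 0 else 0


# Toy data (hypothetical): label 1 = has information of interest.
docs = [
    "gene expression in tumor cells",
    "tumor gene mutation study",
    "network routing protocol design",
    "wireless network protocol analysis",
]
labels = [1, 1, 0, 0]
keywords = ["gene", "tumor", "network", "protocol", "study", "design"]

scores = keyword_scores(docs, labels, keywords)
selected = select_keywords(scores, 2)
prediction = classify("tumor gene panel results", selected, scores)
```

In this sketch the number of retained keywords `k` is the knob that trades subset size against accuracy, which is the trade-off the abstract reports on; a real implementation would replace the frequency-gap score and the voting rule with a proper feature selection method and a trained classifier.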