• DocumentCode
    2324343
  • Title
    Evolved Apache Lucene SpanFirst queries are good text classifiers
  • Author
    Hirsch, Laurie
  • Author_Institution
    Dept. of Comput., Sheffield Hallam Univ., Sheffield, UK
  • fYear
    2010
  • fDate
    18-23 July 2010
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    Human-readable text classifiers have a number of advantages over classifiers based on complex and opaque mathematical models. For some time, search queries or rules, constructed either manually or automatically, have been used for classification. We have performed experiments using genetic algorithms to evolve text classifiers in search query format with the combined objective of classifier accuracy and classifier readability. We have found that a small set of disjunct Lucene SpanFirst queries effectively meets both goals. This kind of query evaluates to true for a document if a particular word occurs within the first N words of the document. Previously researched classifiers based on queries using combinations of words connected with OR, AND and NOT were found to be generally less accurate and (arguably) less readable. The approach is evaluated using the standard test sets Reuters-21578 and Ohsumed and compared against several classification algorithms.
  • Keywords
    mathematical analysis; pattern classification; query processing; text analysis; Ohsumed; Reuters-21578; document; evolved Apache Lucene SpanFirst queries; opaque mathematical models; search query format; text classifiers; Accuracy; Classification algorithms; Construction industry; Humans; Petroleum; Text categorization; Training
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Evolutionary Computation (CEC), 2010 IEEE Congress on
  • Conference_Location
    Barcelona
  • Print_ISBN
    978-1-4244-6909-3
  • Type
    conf
  • DOI
    10.1109/CEC.2010.5585955
  • Filename
    5585955
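
The abstract describes a SpanFirst query as one that is true for a document when a particular word occurs within the first N words, and a classifier as a small disjunction (OR) of such queries. A minimal Python sketch of that rule follows. It does not use Lucene itself and is not the paper's implementation: the tokenizer, the function names, and the example words and N values are all hypothetical and chosen only for illustration.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def span_first(tokens, term, end):
    """Return True if `term` occurs among the first `end` tokens of the document."""
    return term in tokens[:end]


def classify(text, queries):
    """OR together SpanFirst-style queries: any single match labels the document positive."""
    tokens = tokenize(text)
    return any(span_first(tokens, term, end) for term, end in queries)


# Hypothetical query set for an illustrative "crude oil" category;
# these (word, N) pairs are invented, not taken from the paper.
CRUDE_QUERIES = [("oil", 20), ("crude", 50), ("opec", 100)]

doc_a = "OPEC ministers agreed to cut crude output at a meeting on Monday."
doc_b = "The bank raised interest rates; analysts later mentioned oil prices."

print(classify(doc_a, CRUDE_QUERIES))
print(classify(doc_b, [("oil", 5)]))
```

Each (word, N) pair can be read on its own, which is the readability property the abstract emphasizes; in the paper, the pairs are found by a genetic algorithm rather than written by hand.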