• DocumentCode
    548904
  • Title
    Semi-automatic creation of a reference news corpus for fine-grained multi-label scenarios
  • Author
    Teixeira, Jorge ; Sarmento, Luis ; Oliveira, Eugenio
  • Author_Institution
    Labs. SAPO UP, FEUP, Porto, Portugal
  • fYear
    2011
  • fDate
    15-18 June 2011
  • Firstpage
    1
  • Lastpage
    7
  • Abstract
    In this paper we tackle the problem of creating a reference corpus for the classification of news items in fine-grained multi-label scenarios. These scenarios are particularly challenging for text classification techniques, and the availability of reference corpora is an important bottleneck for developing and testing new classification strategies. We propose a semi-automatic approach for creating a reference corpus that uses three auxiliary classification methods (one based on Support Vector Machines, one based on Nearest Neighbor classifiers, and one based on a dictionary-based classification heuristic) to suggest to human annotators topic-related labels that describe different facets of the news item being annotated. Using this approach, we semi-automatically produce a corpus of 1,600 news items with 865 different labels, with an average of 3.63 labels per news item. We evaluate the contribution of each of the auxiliary classification methods to the annotation process and conclude that: (i) none of the methods alone is capable of suggesting all relevant labels, (ii) the dictionary-based classification heuristic contributes significantly, and (iii) the Nearest Neighbor classifier performs very efficiently in the most extreme multi-label part of the problem and is robust to the very unbalanced item-to-class distribution.
  • Keywords
    classification; dictionaries; information resources; support vector machines; text analysis; dictionary-based classification; fine-grained multilabel scenarios; human annotators topic-related labels; nearest neighbor classifiers; reference news corpus; text classification; Dictionaries; Manuals; System-on-a-chip; USA Councils
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Systems and Technologies (CISTI), 2011 6th Iberian Conference on
  • Conference_Location
    Chaves
  • Print_ISBN
    978-1-4577-1487-0
  • Type
    conf
  • Filename
    5974354