• Title of article

    Improving statistical keyword detection in short texts: Entropic and clustering approaches

  • Author/Authors

    Carretero-Campos, C. and Bernaola-Galván, P. and Coronado, A.V. and Carpena, P.

  • Issue Information
    Journal with serial number, year 2013
  • Pages
    12
  • From page
    1481
  • To page
    1492
  • Abstract
    In recent years, two successful approaches have been introduced to tackle the problem of statistical keyword detection in a text without the use of external information: (i) the entropic approach, where Shannon’s entropy of information is used to quantify the information content of the sequence of occurrences of each word in the text; and (ii) the clustering approach, which links the heterogeneity of the spatial distribution of a word in the text (clustering) with its relevance. In this paper, we first present some modifications to both techniques which improve their results. Then, we propose new metrics to evaluate the performance of keyword detectors based specifically on the needs of a typical user, and we employ them to find out which approach performs better. Although both approaches work well in long texts, we find in general that measures based on word clustering perform at least as well as the entropic measure, which can only be applied given a convenient partition of the text, such as the chapters of a book. For the entropic approach, we also show that the chosen partition of the text strongly affects its results. Finally, we focus on short texts, a case of high practical importance that includes short reports, web pages, scientific articles, etc. We show that word-clustering measures also perform well in generic short texts, since these measures discriminate the degree of relevance of low-frequency words better than the entropic approach does.
  • Keywords
    Keyword detection, linguistic, Statistical analysis, entropy
  • Journal title
    Physica A: Statistical Mechanics and its Applications
  • Serial Year
    2013
  • Record number

    1736718