• DocumentCode
    2915653
  • Title

    TF-SIDF: Term frequency, sketched inverse document frequency

  • Author

    Baena-García, Manuel ; Carmona-Cejudo, José M. ; Castillo, Gladys ; Morales-Bueno, Rafael

  • Author_Institution
    Dipt. Lenguajes y Cienc. de la Comput., Univ. de Malaga, Malaga, Spain
  • fYear
    2011
  • fDate
    22-24 Nov. 2011
  • Firstpage
    1044
  • Lastpage
    1049
  • Abstract
    Exact calculation of the TF-IDF weighting function in massive streams of documents involves challenging memory space requirements. In this work, we propose TF-SIDF, a novel solution for extracting relevant words from streams of documents with a high number of terms. TF-SIDF relies on the Count-Min Sketch data structure, which makes it possible to estimate the counts of all the terms in the stream. Results of the experiments conducted with two datasets show that this sketch-based algorithm achieves good approximations of the TF-IDF weighting values (as a rule, the top terms with the highest TF-IDF values remain the same), while substantial savings in memory usage are observed. It is also observed that performance is highly correlated with the sketch size, and that wider sketch configurations are preferable for a given sketch size.
  • Keywords
    data mining; data structures; storage management; text analysis; TF-IDF weighting function; TF-IDF weighting values; TF-SIDF; count-min sketch data structure; exact calculation; massive streams; memory space requirements; memory usage; sketch configurations; sketch size; sketch-based algorithm; sketched inverse document frequency; term frequency; Approximation methods; Correlation; Data structures; Graphics; Intelligent systems; Measurement; Radiation detectors; count-min sketch; text mining; tfidf
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on
  • Conference_Location
    Cordoba
  • ISSN
    2164-7143
  • Print_ISBN
    978-1-4577-1676-8
  • Type

    conf

  • DOI
    10.1109/ISDA.2011.6121796
  • Filename
    6121796
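
The idea summarized in the abstract, keeping term counts in a Count-Min Sketch instead of an exact dictionary, can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `CountMinSketch` and `SketchedTfIdf`, the BLAKE2b-based row hashing, the default `width` and `depth`, and the particular TF-IDF variant (length-normalized term frequency times `log(N / df)`) are all assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row (assumed scheme, chosen for determinism).
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum over rows is the least collision-inflated counter.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


class SketchedTfIdf:
    """Hypothetical helper: TF is exact per document, DF is sketched."""

    def __init__(self, width=2048, depth=4):
        self.df = CountMinSketch(width, depth)
        self.n_docs = 0

    def update(self, tokens):
        # Count each distinct term once per document (document frequency).
        self.n_docs += 1
        for term in set(tokens):
            self.df.add(term)

    def weights(self, tokens):
        # Assumed variant: (count / doc length) * log(N / estimated DF).
        tf = Counter(tokens)
        return {
            term: (count / len(tokens))
            * math.log(self.n_docs / max(self.df.estimate(term), 1))
            for term, count in tf.items()
        }
```

Memory use here is fixed at `width * depth` counters regardless of vocabulary size. Because collisions can only inflate a counter, the estimated document frequency is never below the true one, so the sketched IDF is never above the exact IDF under this formula.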