• DocumentCode
    3570920
  • Title
    Leveraging the web for automating tag expansion for low-content items
  • Author
    Singhal, Ayush ; Srivastava, Jaideep
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Univ. of Minnesota, Minneapolis, MN, USA
  • fYear
    2014
  • Firstpage
    545
  • Lastpage
    552
  • Abstract
    Tags, as high-quality semantic descriptors, are used in categorization, clustering and efficient retrieval of various items in the web corpus. Images, videos, songs and similar multimedia items are the most common items which are tagged either manually or in a semiautomatic manner. However, the tagging process becomes complicated when the content structure of an item is not interpretable. Such a problem occurs in items like scientific research datasets or documents with very little text content. In this work, we propose a generalized approach to automate tag expansion for such low-content items. We leverage the intelligence of the web to generate secondary content for such items for the tag expansion process. While automating tag expansion, we also address the problem of topic drift by automating the removal of noisy tags from the set of candidate new tags. The effectiveness of the proposed approach is tested on a real-world dataset. The performance of the proposed approach is compared with Wikipedia-based nearest neighbor tagging (WikiSem) and non-negative matrix factorization (NMF) based tag expansion approaches. Based on the Mean Reciprocal Rank (MRR) metric, the proposed approach was twice as accurate as the WikiSem baseline (0.27 vs 0.13) and at least 2.25 times as accurate as the NMF baselines (0.27 vs 0.12).
  • Keywords
    Internet; Web sites; information retrieval; text analysis; MRR metric; NMF based tag expansion approach; Web corpus; WikiSem; Wikipedia based nearest neighbor tagging; high quality semantic descriptors; item content structure; low-content items; mean reciprocal rank metric; multimedia items; nonnegative matrix factorization; tag expansion automation process; text content; topic drift problem; Databases; Google; Noise measurement; Search engines; Semantics; Tagging;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on
  • Type
    conf
  • DOI
    10.1109/IRI.2014.7051937
  • Filename
    7051937
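
The abstract reports its results as Mean Reciprocal Rank (MRR). As a rough illustration of the metric only, the sketch below uses the standard textbook definition: for each item, take the reciprocal of the rank at which the first correct tag appears in the predicted list (0 if none appears), then average over items. This record does not describe the paper's exact evaluation protocol, so the function names and the toy tag lists here are hypothetical.

```python
from typing import Hashable, Sequence, Set


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant tag, or 0.0 if none is found."""
    for position, tag in enumerate(ranked, start=1):
        if tag in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    ranked_lists: Sequence[Sequence[Hashable]],
    relevant_sets: Sequence[Set[Hashable]],
) -> float:
    """Average the reciprocal rank over all items (0.0 for an empty input)."""
    pairs = list(zip(ranked_lists, relevant_sets))
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(r, s) for r, s in pairs) / len(pairs)


# Toy data (made up for illustration, not taken from the paper).
predicted = [
    ["data mining", "biology", "clustering"],  # first hit at rank 3
    ["images", "tagging"],                     # first hit at rank 1
    ["physics"],                               # no hit
]
ground_truth = [{"clustering"}, {"images"}, {"chemistry"}]

mrr = mean_reciprocal_rank(predicted, ground_truth)
print(f"MRR = {mrr:.3f}")
```

Under this definition, an MRR of 0.27 means the first correct tag appears, on average, at a reciprocal rank of 0.27, which is how the scores of 0.27, 0.13 and 0.12 in the abstract are compared.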