  • DocumentCode
    1650936
  • Title
    Dual-layer bag-of-frames model for music genre classification
  • Author
    Chin-Chia Michael Yeh; Li Su; Yi-Hsuan Yang
  • Author_Institution
    Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
  • fYear
    2013
  • Firstpage
    246
  • Lastpage
    250
  • Abstract
    This paper concerns the development of a music dictionary-based model for summarizing local feature descriptors computed over time. Compared with a holistic representation, this text-like, bag-of-frames representation better captures the rich and time-varying information of music. However, the dictionary used in the classical bag-of-frames model captures only frame-level elements of the music; thus, there is a semantic gap between the dictionary elements and commonly used music descriptions. To reduce this gap, a new feature representation called dual-layer bag-of-frames is proposed in this paper. It models the music with a two-layer structure, where the first-layer dictionary captures the frame-level characteristics and the second-layer dictionary captures the segment-level semantics. This hierarchical structure resembles the alphabet-word-document structure of text. Our results demonstrate that the proposed dual-layer bag-of-frames feature achieves state-of-the-art accuracy in music genre classification. The classification accuracy on the GTZAN benchmark reaches 86.7% with a dictionary trained on GTZAN, and 83.6% with a dictionary trained on another data set, USPOP.
  • Keywords
    audio signal processing; dictionaries; music; signal classification; signal representation; text analysis; GTZAN benchmark; alphabet-word-document structure; audio word; classification accuracy; dictionary element; dual-layer bag-of-frames model; feature representation; first-layer dictionary; frame-level characteristics; frame-level elements; hierarchical structure; local feature descriptors; music description; music dictionary-based model; music genre classification; second-layer dictionary; segment-level semantics; semantic gap; text-like bag-of-frames representation; time-varying information; Accuracy; Dictionaries; Encoding; Histograms; Kernel; Support vector machines; Training; Sparse coding; audio alphabets; audio words; deep structure
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on
  • Conference_Location
    Vancouver, BC
  • ISSN
    1520-6149
  • Type
    conf
  • DOI
    10.1109/ICASSP.2013.6637646
  • Filename
    6637646