Learning Optimal Features for Polyphonic Audio-to-Score Alignment

Author

Joder, Cyril ; Essid, Slim ; Richard, Gaël

Author_Institution

Inst. for Human-Machine Commun., Tech. Univ. Munich, Munich, Germany

Volume

21

Issue

10

fYear

2013

fDate

Oct. 2013

Firstpage

2118

Lastpage

2128

Abstract

This paper addresses the design of feature functions for matching a musical recording to the symbolic representation of the piece (the score). These feature functions are defined as dissimilarity measures between the audio observations and template vectors corresponding to the score. By expressing the template construction as a linear mapping from the symbolic to the audio representation, one can learn the feature functions by optimizing the linear transformation. In this paper, we explore two different learning strategies. The first one uses a best-fit criterion (minimum divergence), while the second one exploits a discriminative framework based on a Conditional Random Fields model (maximum likelihood criterion). We evaluate the influence of the feature functions in an audio-to-score alignment task, on a large database of popular and classical polyphonic music. The results show that, with several types of models using different temporal constraints, the learned mappings have the potential to outperform the classic heuristic mappings. Several representations of the audio observations, along with several distance functions, are compared in this alignment task. Our experiments favor the symmetric Kullback-Leibler divergence as the distance function. Moreover, both the spectrogram and a CQT-based representation turn out to provide very accurate alignments, detecting more than 97% of the onsets within a tolerance of 100 ms with our most complex system.
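
The template construction and distance described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_template`, `normalize`, `symmetric_kl`), the tiny 4-bin, 2-pitch mapping, and the small flooring constant `eps` are all assumptions made for illustration, and the mapping here is set by hand rather than learned.

```python
import math

def build_template(mapping, active_pitches):
    """Linear mapping from score to audio domain (hypothetical helper).

    `mapping[k][p]` is the energy that pitch p contributes to audio bin k.
    The template is the sum of the columns of the active pitches.
    """
    n_bins = len(mapping)
    template = [0.0] * n_bins
    for p in active_pitches:
        for k in range(n_bins):
            template[k] += mapping[k][p]
    return template

def normalize(vec, eps=1e-10):
    """Floor the entries (assumed eps) and rescale so they sum to one."""
    floored = [max(v, 0.0) + eps for v in vec]
    total = sum(floored)
    return [v / total for v in floored]

def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler divergence D(p||q) + D(q||p)."""
    p, q = normalize(p), normalize(q)
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))

# Hand-set toy mapping: 4 audio bins (rows) by 2 pitches (columns).
mapping = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 0.0],
]

# An audio frame whose energy sits in the bins of pitch 0.
observation = [0.5, 0.0, 0.5, 0.0]

d_match = symmetric_kl(observation, build_template(mapping, [0]))
d_mismatch = symmetric_kl(observation, build_template(mapping, [1]))
print(d_match < d_mismatch)
```

In this sketch the feature function is the divergence between a frame and the template of a candidate score event, so a lower value means a better match. Learning, as the abstract describes it, would amount to adjusting the entries of `mapping` instead of fixing them by hand.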

Keywords

audio signal processing; maximum likelihood estimation; signal representation; CQT-based representation; audio observations; conditional random fields model; discriminative framework; feature functions design; heuristic mappings; learning optimal features; linear transformation; maximum likelihood criterion; musical recording; polyphonic audio-to-score alignment; polyphonic music; spectrogram; symbolic representation; symmetric Kullback-Leibler divergence; template construction; template vectors; temporal constraints; Music information retrieval; audio-to-score alignment; conditional random fields; discriminative learning;

fLanguage

English

Journal_Title

Audio, Speech, and Language Processing, IEEE Transactions on

Publisher

IEEE

ISSN

1558-7916

Type

jour

DOI

10.1109/TASL.2013.2266794

Filename

6525340