Deep Gaussian processes for music mood estimation and retrieval with locally aggregated acoustic Fisher vector
SANTOSH CHAPANERI DEEPAK JAYASWAL
Due to the subjective nature of music mood, it is challenging to computationally model the affective content of music. In this work, we propose novel features known as locally aggregated acoustic Fisher vectors based on the Fisher kernel paradigm. To preserve the temporal context, onset-detected variable-length segments of the audio songs are obtained, for which a variational Bayesian approach is used to learn the universal background Gaussian mixture model (GMM) representation of the standard acoustic features. The local Fisher vectors obtained with the soft assignment of the GMM are aggregated, yielding better performance than the global Fisher vector. A deep Gaussian process (DGP) regression model inspired by deep learning architectures is proposed to learn the mapping between the proposed Fisher vector features and the mood dimensions of valence and arousal. Since exact inference in a DGP is intractable, a pseudo-data approximation is used to reduce the training complexity, and Monte Carlo sampling is used to resolve the intractability during training. A detailed derivation of a 3-layer DGP is presented that can be easily generalized to an L-layer DGP. The proposed work is evaluated on the PMEmo dataset containing valence and arousal annotations of Western popular music, and achieves an improvement in R² of 25% for arousal and 52% for valence for music mood estimation, and an improvement in the Gamma statistic of 68% for music mood retrieval, relative to the baseline single-layer Gaussian process.
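To make the Fisher vector construction concrete, the following is a minimal sketch of how a segment-level Fisher vector can be computed from the soft assignments of a diagonal-covariance GMM. This is an illustration of the standard Fisher kernel encoding (gradients with respect to the GMM means and variances, with power and L2 normalisation), not the authors' exact implementation; the feature dimensions, component count, and random data below are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(frames, gmm):
    """Fisher vector of frame-level features w.r.t. a diagonal GMM.

    frames : (T, D) array of acoustic descriptors for one segment.
    gmm    : fitted sklearn GaussianMixture with covariance_type='diag'.
    Returns a 2*K*D vector (gradients w.r.t. means and variances).
    """
    T, D = frames.shape
    gamma = gmm.predict_proba(frames)            # (T, K) soft assignments
    w, mu = gmm.weights_, gmm.means_             # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)            # (K, D) diagonal std devs

    # Normalised deviation of each frame from each component mean
    diff = (frames[:, None, :] - mu[None]) / sigma[None]          # (T, K, D)

    # Gradients of the per-segment log-likelihood w.r.t. means and variances
    g_mu = (gamma[..., None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_sig = (gamma[..., None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])

    fv = np.hstack([g_mu.ravel(), g_sig.ravel()])
    # Power normalisation followed by L2 normalisation, as is customary
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

rng = np.random.default_rng(0)
background = rng.normal(size=(500, 12))          # stand-in for MFCC-like features
gmm = GaussianMixture(4, covariance_type="diag", random_state=0).fit(background)
fv = fisher_vector(rng.normal(size=(80, 12)), gmm)
print(fv.shape)                                  # 2 * K * D = 2 * 4 * 12 = 96
```

In the paper's pipeline, one such local Fisher vector would be computed per onset-detected segment and the segment-level vectors then aggregated into the song-level representation fed to the DGP regressor.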
Volume 48, 2023