Timbre representations of musical instruments, essential for diverse applications such as musical audio synthesis and separation, might be learned as bottleneck features from an instrument recognition model. Given the similarities between speaker recognition and musical instrument recognition, we investigate how to adapt successful speaker recognition algorithms to musical instrument recognition in order to learn meaningful instrumental timbre representations. To address the mismatch between musical audio and models devised for speech, we introduce a group of trainable filters that generate suitable acoustic features from raw input waveforms, making it easier to optimize a model in an input-agnostic, end-to-end manner. In experiments on the NSynth and RWC databases, covering both closed-set musical instrument identification and open-set verification, the modified speaker recognition model generated discriminative embeddings for instrument and instrument-family identities. We further conducted extensive experiments to characterize the information encoded in the learned timbre embeddings.
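To make the open-set verification scenario concrete, the following is a minimal sketch of how two fixed-length embeddings could be compared. It assumes cosine similarity with a fixed decision threshold, which is a common scoring choice in speaker verification but is not stated in the abstract; the function names and the threshold value of 0.5 are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verify(enroll_embedding, test_embedding, threshold=0.5):
    """Accept the trial if the two embeddings are similar enough.

    The threshold is an assumed placeholder; in practice it would be
    tuned on held-out trials.
    """
    return cosine_similarity(enroll_embedding, test_embedding) >= threshold
```

In this sketch, a trial pairs an enrollment embedding from one instrument recording with a test embedding from another, and the pair is accepted as the same instrument when the score reaches the threshold. Because the decision depends only on the embeddings, the same procedure applies to instruments that were never seen during training.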
In this paper, we learn disentangled representations of timbre and pitch for musical instrument sounds. We adapt a framework based on variational autoencoders with Gaussian mixture latent distributions. Specifically, we use two separate encoders to l
Recent neural waveform synthesizers such as WaveNet, WaveGlow, and the neural-source-filter (NSF) model have shown good performance in speech synthesis despite their different methods of waveform generation. The similarity between speech and music au
The objective of this paper is open-set speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance.
Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather th
Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical voc