Do you want to publish a course? Click here

PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network

387   0   0.0 ( 0 )
 Added by Bryan Wang
 Publication date 2018
and research's language is English




Ask ChatGPT about the research

Music creation is typically composed of two parts: composing the musical score, and then performing the score with instruments to make sounds. While recent work has made much progress in automatic music generation in the symbolic domain, few attempts have been made to build an AI model that can render realistic music audio from musical scores. Directly synthesizing audio with sound sample libraries often leads to mechanical and deadpan results, since musical scores do not contain performance-level information, such as subtle changes in timing and dynamics. Moreover, while the task may sound like a text-to-speech synthesis problem, there are fundamental differences since music audio has rich polyphonic sounds. To build such an AI performer, we propose in this paper a deep convolutional model that learns in an end-to-end manner the score-to-audio mapping between a symbolic representation of music called the piano rolls and an audio representation of music called the spectrograms. The model consists of two subnets: the ContourNet, which uses a U-Net structure to learn the correspondence between piano rolls and spectrograms and to give an initial result; and the TextureNet, which further uses a multi-band residual network to refine the result by adding the spectral texture of overtones and timbre. We train the model to generate music clips of the violin, cello, and flute, with a dataset of moderate size. We also present the result of a user study that shows our model achieves higher mean opinion score (MOS) in naturalness and emotional expressivity than a WaveNet-based model and two commercial sound libraries. We open our source code at https://github.com/bwang514/PerformanceNet



rate research

Read More

We present in this paper PerformacnceNet, a neural network model we proposed recently to achieve score-to-audio music generation. The model learns to convert a music piece from the symbolic domain to the audio domain, assigning performance-level attributes such as changes in velocity automatically to the music and then synthesizing the audio. The model is therefore not just a neural audio synthesizer, but an AI performer that learns to interpret a musical score in its own way. The code and sample outputs of the model can be found online at https://github.com/bwang514/PerformanceNet.
Dance and music typically go hand in hand. The complexities in dance, music, and their synchronisation make them fascinating to study from a computational creativity perspective. While several works have looked at generating dance for a given music, automatically generating music for a given dance remains under-explored. This capability could have several creative expression and entertainment applications. We present some early explorations in this direction. We present a search-based offline approach that generates music after processing the entire dance video and an online approach that uses a deep neural network to generate music on-the-fly as the video proceeds. We compare these approaches to a strong heuristic baseline via human studies and present our findings. We have integrated our online approach in a live demo! A video of the demo can be found here: https://sites.google.com/view/dance2music/live-demo.
Audio signals are often represented as spectrograms and treated as 2D images. In this light, deep convolutional architectures are widely used for music audio tasks even though these two data types have very different structures. In this work, we attempt to open the black-box on deep convolutional models to inform future architectures for music audio tasks, and explain the excellent performance of deep convolutions that model spectrograms as 2D images. To this end, we expand recent explainability discussions in deep learning for natural image data to music audio data through systematic experiments using the deep features learned by various convolutional architectures. We demonstrate that deep convolutional features perform well across various target tasks, whether or not they are extracted from deep architectures originally trained on that task. Additionally, deep features exhibit high similarity to hand-crafted wavelet features, whether the deep features are extracted from a trained or untrained model.
We propose in this work a multi-view learning approach for audio and music classification. Considering four typical low-level representations (i.e. different views) commonly used for audio and music recognition tasks, the proposed multi-view network consists of four subnetworks, each handling one input types. The learned embedding in the subnetworks are then concatenated to form the multi-view embedding for classification similar to a simple concatenation network. However, apart from the joint classification branch, the network also maintains four classification branches on the single-view embedding of the subnetworks. A novel method is then proposed to keep track of the learning behavior on the classification branches and adapt their weights to proportionally blend their gradients for network training. The weights are adapted in such a way that learning on a branch that is generalizing well will be encouraged whereas learning on a branch that is overfitting will be slowed down. Experiments on three different audio and music classification tasks show that the proposed multi-view network not only outperforms the single-view baselines but also is superior to the multi-view baselines based on concatenation and late fusion.
The identification of structural differences between a music performance and the score is a challenging yet integral step of audio-to-score alignment, an important subtask of music information retrieval. We present a novel method to detect such differences between the score and performance for a given piece of music using progressively dilated convolutional neural networks. Our method incorporates varying dilation rates at different layers to capture both short-term and long-term context, and can be employed successfully in the presence of limited annotated data. We conduct experiments on audio recordings of real performances that differ structurally from the score, and our results demonstrate that our models outperform standard methods for structure-aware audio-to-score alignment.

suggested questions

comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا