A Comprehensive Survey on Deep Music Generation: Multi-level Representations, Algorithms, Evaluations, and Future Directions

312 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Shulei Ji

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Shulei Ji - Jing Luo - Xinyu Yang

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The utilization of deep learning techniques in generating various contents (such as image, text, etc.) has become a trend. Especially music, the topic of this paper, has attracted widespread attention of countless researchers.The whole process of producing music can be divided into three stages, corresponding to the three levels of music generation: score generation produces scores, performance generation adds performance characteristics to the scores, and audio generation converts scores with performance characteristics into audio by assigning timbre or generates music in audio format directly. Previous surveys have explored the network models employed in the field of automatic music generation. However, the development history, the model evolution, as well as the pros and cons of same music generation task have not been clearly illustrated. This paper attempts to provide an overview of various composition tasks under different music generation levels, covering most of the currently popular music generation tasks using deep learning. In addition, we summarize the datasets suitable for diverse tasks, discuss the music representations, the evaluation methods as well as the challenges under different levels, and finally point out several future directions.

قيم البحث

179 - Kristen Masada , Razvan Bunescu 2018

We present a new approach to harmonic analysis that is trained to segment music into a sequence of chord spans tagged with chord labels. Formulated as a semi-Markov Conditional Random Field (semi-CRF), this joint segmentation and labeling approach en ables the use of a rich set of segment-level features, such as segment purity and chord coverage, that capture the extent to which the events in an entire segment of music are compatible with a candidate chord label. The new chord recognition model is evaluated extensively on three corpora of classical music and a newly created corpus of rock music. Experimental results show that the semi-CRF model performs substantially better than previous approaches when trained on a sufficient number of labeled examples and remains competitive when the amount of training data is limited.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

A Functional Taxonomy of Music Generation Systems

247 - Dorien Herremans , Ching-Hua Chuan , Elaine Chew 2018

Digital advances have transformed the face of automatic music generation since its beginnings at the dawn of computing. Despite the many breakthroughs, issues such as the musical tasks targeted by different machines and the degree to which they succe ed remain open questions. We present a functional taxonomy for music generation systems with reference to existing systems. The taxonomy organizes systems according to the purposes for which they were designed. It also reveals the inter-relatedness amongst the systems. This design-centered approach contrasts with predominant methods-based surveys and facilitates the identification of grand challenges to set the stage for new breakthroughs.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

Multi-Level and Multi-Scale Feature Aggregation Using Sample-level Deep Convolutional Neural Networks for Music Classification

163 - Jongpil Lee , Juhan Nam 2017

Music tag words that describe music audio by text have different levels of abstraction. Taking this issue into account, we propose a music classification approach that aggregates multi-level and multi-scale features using pre-trained feature extracto rs. In particular, the feature extractors are trained in sample-level deep convolutional neural networks using raw waveforms. We show that this approach achieves state-of-the-art results on several music classification datasets.

أنظمة الصوت في الحاسوب التعلم الآلي الوسائط المتعددة

Multi-view Audio and Music Classification

153 - Huy Phan , Huy Le Nguyen , Oliver Y. Chen 2021

We propose in this work a multi-view learning approach for audio and music classification. Considering four typical low-level representations (i.e. different views) commonly used for audio and music recognition tasks, the proposed multi-view network consists of four subnetworks, each handling one input types. The learned embedding in the subnetworks are then concatenated to form the multi-view embedding for classification similar to a simple concatenation network. However, apart from the joint classification branch, the network also maintains four classification branches on the single-view embedding of the subnetworks. A novel method is then proposed to keep track of the learning behavior on the classification branches and adapt their weights to proportionally blend their gradients for network training. The weights are adapted in such a way that learning on a branch that is generalizing well will be encouraged whereas learning on a branch that is overfitting will be slowed down. Experiments on three different audio and music classification tasks show that the proposed multi-view network not only outperforms the single-view baselines but also is superior to the multi-view baselines based on concatenation and late fusion.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام

Symbolic Music Generation with Diffusion Models

85 - Gautam Mittal , Jesse Engel , Curtis Hawthorne 2021

Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizing the discrete domain in the continuous latent space of a pre-trained variational autoencoder. Our method is non-autoregressive and learns to generate sequences of latent embeddings through the reverse process and offers parallel generation with a constant number of iterative refinement steps. We apply this technique to modeling symbolic music and show strong unconditional generation and post-hoc conditional infilling results compared to autoregressive language models operating over the same continuous embeddings.

أنظمة الصوت في الحاسوب التعلم الآلي معالجة الصوت والكلام