Do you want to publish a course? Click here

Learning to Recognize Musical Genre from Audio

84   0   0.0 ( 0 )
 Publication date 2018
and research's language is English




Ask ChatGPT about the research

We here summarize our experience running a challenge with open data for musical genre recognition. Those notes motivate the task and the challenge design, show some statistics about the submissions, and present the results.



rate research

Read More

Audio-to-score alignment aims at generating an accurate mapping between a performance audio and the score of a given piece. Standard alignment methods are based on Dynamic Time Warping (DTW) and employ handcrafted features, which cannot be adapted to different acoustic conditions. We propose a method to overcome this limitation using learned frame similarity for audio-to-score alignment. We focus on offline audio-to-score alignment of piano music. Experiments on music data from different acoustic conditions demonstrate that our method achieves higher alignment accuracy than a standard DTW-based method that uses handcrafted features, and generates robust alignments whilst being adaptable to different domains at the same time.
90 - Rong Gong , Xavier Serra 2018
In this paper, we propose an efficient and reproducible deep learning model for musical onset detection (MOD). We first review the state-of-the-art deep learning models for MOD, and identify their shortcomings and challenges: (i) the lack of hyper-parameter tuning details, (ii) the non-availability of code for training models on other datasets, and (iii) ignoring the network capability when comparing different architectures. Taking the above issues into account, we experiment with seven deep learning architectures. The most efficient one achieves equivalent performance to our implementation of the state-of-the-art architecture. However, it has only 28.3% of the total number of trainable parameters compared to the state-of-the-art. Our experiments are conducted using two different datasets: one mainly consists of instrumental music excerpts, and another developed by ourselves includes only solo singing voice excerpts. Further, inter-dataset transfer learning experiments are conducted. The results show that the model pre-trained on one dataset fails to detect onsets on another dataset, which denotes the importance of providing the implementation code to enable re-training the model for a different dataset. Datasets, code and a Jupyter notebook running on Google Colab are publicly available to make this research understandable and easy to reproduce.
The automated recognition of music genres from audio information is a challenging problem, as genre labels are subjective and noisy. Artist labels are less subjective and less noisy, while certain artists may relate more strongly to certain genres. At the same time, at prediction time, it is not guaranteed that artist labels are available for a given audio segment. Therefore, in this work, we propose to apply the transfer learning framework, learning artist-related information which will be used at inference time for genre classification. We consider different types of artist-related information, expressed through artist group factors, which will allow for more efficient learning and stronger robustness to potential label noise. Furthermore, we investigate how to achieve the highest validation accuracy on the given FMA dataset, by experimenting with various kinds of transfer methods, including single-task transfer, multi-task transfer and finally multi-task learning.
We present a semantic vector space model for capturing complex polyphonic musical context. A word2vec model based on a skip-gram representation with negative sampling was used to model slices of music from a dataset of Beethovens piano sonatas. A visualization of the reduced vector space using t-distributed stochastic neighbor embedding shows that the resulting embedded vector space captures tonal relationships, even without any explicit information about the musical contents of the slices. Secondly, an excerpt of the Moonlight Sonata from Beethoven was altered by replacing slices based on context similarity. The resulting music shows that the selected slice based on similar word2vec context also has a relatively short tonal distance from the original slice.
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects in the audio stream. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene. In support of this new task, we develop a large-scale dataset that uses realistic acoustic renderings of speech in real-world 3D scans of homes offering a variety of room acoustics. Demonstrating our approach on both simulated and real imagery for speech enhancement, speech recognition, and speaker identification, we show it achieves state-of-the-art performance and substantially improves over traditional audio-only methods. Project page: http://vision.cs.utexas.edu/projects/learning-audio-visual-dereverberation.

suggested questions

comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا