ﻻ يوجد ملخص باللغة العربية
In natural language processing (NLP), the semantic similarity task requires large-scale, high-quality human-annotated labels for fine-tuning or evaluation. By contrast, in cases of music similarity, such labels are expensive to collect and largely dependent on the annotators artistic preferences. Recent research has demonstrated that embedding calibration technique can greatly increase semantic similarity performance of the pre-trained language model without fine-tuning. However, it is yet unknown which calibration method is the best and how much performance improvement can be achieved. To address these issues, we propose using composer information to construct labels for automatically evaluating music similarity. Under this paradigm, we discover the optimal combination of embedding calibration which achieves superior metrics than the baseline methods.
Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to
Symbolic music understanding, which refers to the understanding of music from the symbolic data (e.g., MIDI format, but not audio), covers many music applications such as genre classification, emotion classification, and music pieces matching. While
A key aspect of machine learning models lies in their ability to learn efficient intermediate features. However, the input representation plays a crucial role in this process, and polyphonic musical scores remain a particularly complex type of inform
We present a new approach to harmonic analysis that is trained to segment music into a sequence of chord spans tagged with chord labels. Formulated as a semi-Markov Conditional Random Field (semi-CRF), this joint segmentation and labeling approach en
This paper proposes a serialized multi-layer multi-head attention for neural speaker embedding in text-independent speaker verification. In prior works, frame-level features from one layer are aggregated to form an utterance-level representation. Ins