Descriptions are often provided along with recommendations to help users discovery. Recommending automatically generated music playlists (e.g. personalised playlists) introduces the problem of generating descriptions. In this paper, we propose a method for generating music playlist descriptions, which is called as music captioning. In the proposed method, audio content analysis and natural language processing are adopted to utilise the information of each track.
As music streaming services dominate the music industry, the playlist is becoming an increasingly crucial element of music consumption. Con- sequently, the music recommendation problem is often casted as a playlist generation prob- lem. Better understanding of the playlist is there- fore necessary for developing better playlist gen- eration algorithms. In this work, we analyse two playlist datasets to investigate some com- monly assumed hypotheses about playlists. Our findings indicate that deeper understanding of playlists is needed to provide better prior infor- mation and improve machine learning algorithms in the design of recommendation systems.
Generating accurate descriptions for online fashion items is important not only for enhancing customers shopping experiences, but also for the increase of online sales. Besides the need of correctly presenting the attributes of items, the expressions in an enchanting style could better attract customer interests. The goal of this work is to develop a novel learning framework for accurate and expressive fashion captioning. Different from popular work on image captioning, it is hard to identify and describe the rich attributes of fashion items. We seed the description of an item by first identifying its attributes, and introduce attribute-level semantic (ALS) reward and sentence-level semantic (SLS) reward as metrics to improve the quality of text descriptions. We further integrate the training of our model with maximum likelihood estimation (MLE), attribute embedding, and Reinforcement Learning (RL). To facilitate the learning, we build a new FAshion CAptioning Dataset (FACAD), which contains 993K images and 130K corresponding enchanting and diverse descriptions. Experiments on FACAD demonstrate the effectiveness of our model.
Recent advances in biosensors technology and mobile electroencephalographic (EEG) interfaces have opened new application fields for cognitive monitoring. A computable biomarker for the assessment of spontaneous aesthetic brain responses during music listening is introduced here. It derives from well-established measures of cross-frequency coupling (CFC) and quantifies the music-induced alterations in the dynamic relationships between brain rhythms. During a stage of exploratory analysis, and using the signals from a suitably designed experiment, we established the biomarker, which acts on brain activations recorded over the left prefrontal cortex and focuses on the functional coupling between high-beta and low-gamma oscillations. Based on data from an additional experimental paradigm, we validated the introduced biomarker and showed its relevance for expressing the subjective aesthetic appreciation of a piece of music. Our approach resulted in an affordable tool that can promote human-machine interaction and, by serving as a personalized music annotation strategy, can be potentially integrated into modern flexible music recommendation systems. Keywords: Cross-frequency coupling; Human-computer interaction; Brain-computer interface
In this paper we propose a deep learning method for performing attributed-based music-to-image translation. The proposed method is applied for synthesizing visual stories according to the sentiment expressed by songs. The generated images aim to induce the same feelings to the viewers, as the original song does, reinforcing the primary aim of music, i.e., communicating feelings. The process of music-to-image translation poses unique challenges, mainly due to the unstable mapping between the different modalities involved in this process. In this paper, we employ a trainable cross-modal translation method to overcome this limitation, leading to the first, to the best of our knowledge, deep learning method for generating sentiment-aware visual stories. Various aspects of the proposed method are extensively evaluated and discussed using different songs.
Sequential modelling entails making sense of sequential data, which naturally occurs in a wide array of domains. One example is systems that interact with users, log user actions and behaviour, and make recommendations of items of potential interest to users on the basis of their previous interactions. In such cases, the sequential order of user interactions is often indicative of what the user is interested in next. Similarly, for systems that automatically infer the semantics of text, capturing the sequential order of words in a sentence is essential, as even a slight re-ordering could significantly alter its original meaning. This thesis makes methodological contributions and new investigations of sequential modelling for the specific application areas of systems that recommend music tracks to listeners and systems that process text semantics in order to automatically fact-check claims, or speed read text for efficient further classification. (Rest of abstract omitted due to arXiv abstract limit)