Research on mid-level image representations has conventionally concentrated on relatively obvious attributes and overlooked non-obvious attributes, i.e., characteristics that are not readily observable when images are viewed independently of their context or function. Non-obvious attributes are not necessarily easily nameable, but they nonetheless play a systematic role in people's interpretation of images. Clusters of related non-obvious attributes, called interpretation dimensions, emerge when people are asked to compare images, and they provide important insight into which aspects of social images are considered relevant. In contrast to aesthetic or affective approaches to image analysis, non-obvious attributes are not tied to the personal perspective of the viewer. Instead, they encode a conventional understanding of the world that is tacit rather than explicitly expressed. This paper introduces a procedure for discovering non-obvious attributes using crowdsourcing. We discuss this procedure using a concrete example of a crowdsourcing task on Amazon Mechanical Turk carried out in the domain of fashion. An analysis comparing the discovered non-obvious attributes with user tags demonstrates the added value delivered by our procedure.
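To make the comparison-based task design concrete, the following minimal sketch generates pairwise image-comparison HITs as a CSV for upload to Amazon Mechanical Turk. The URL pool, column names, and sampling strategy are illustrative assumptions, not the paper's actual task design:

```python
import csv
import itertools
import random

# Toy pool of image URLs standing in for a fashion image collection
# (hypothetical placeholder URLs).
image_urls = [f"https://example.org/fashion/{i}.jpg" for i in range(12)]

random.seed(0)
pairs = list(itertools.combinations(image_urls, 2))
sampled = random.sample(pairs, k=min(20, len(pairs)))

# Write one comparison pair per HIT; the column names must match the
# variables referenced in the (assumed) MTurk HIT template.
with open("hits.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image_url_a", "image_url_b"])
    writer.writerows(sampled)
```

Each HIT would then ask workers to describe how the two images are similar or different, and the free-text answers would be clustered into interpretation dimensions.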
Musical preferences have been considered a mirror of the self. In this age of Big Data, online music streaming services allow us to capture ecologically valid music listening behavior and provide a rich source of information for identifying several user-specific aspects. Studies have shown musical engagement to be an indirect representation of internal states, including internalized symptomatology and depression. The current study aims at unearthing patterns and trends in individuals at risk for depression as they manifest in naturally occurring music listening behavior. Mental well-being scores, musical engagement measures, and listening histories of Last.fm users (N=541) were acquired. Social tags associated with each listener's most popular tracks were analyzed to unearth the moods/emotions and genres associated with the users. Results revealed that the social tags prevalent among users at risk for depression were predominantly related to emotions depicting Sadness, associated with genre tags representing neo-psychedelic, avant-garde, and dream pop. This study opens up avenues for an MIR-based approach to characterizing and predicting risk for depression, which can be helpful in early detection and can additionally provide a basis for designing music recommendations accordingly.
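The core tag analysis can be pictured as a frequency comparison between groups of listeners. Below is a minimal sketch on toy data; the group split, tag source, and plain-count statistic are illustrative assumptions, not the study's actual pipeline:

```python
from collections import Counter

# tags_by_user maps each user to the social tags of their most
# popular tracks (hypothetical toy data).
tags_by_user = {
    "u1": ["sadness", "dream pop", "melancholy"],
    "u2": ["happy", "pop", "dance"],
    "u3": ["sadness", "avant-garde", "neo-psychedelic"],
}
at_risk = {"u1", "u3"}  # users below an assumed well-being cutoff
control = {"u2"}

def top_tags(users, k=5):
    # Aggregate tag counts over a group of users.
    counts = Counter(t for u in users for t in tags_by_user[u])
    return counts.most_common(k)

print("at risk:", top_tags(at_risk))
print("control:", top_tags(control))
```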
This paper aims at discovering meaningful subsets of related images from large image collections without annotations. We search for groups of images related at different semantic levels, i.e., either instances or visual classes. While k-means is usually considered the gold standard for this task, we evaluate and demonstrate the value of diffusion methods that have been neglected by the state of the art, such as the Markov Clustering algorithm. We report results on the ImageNet and Paris500k instance datasets, both enlarged with images from YFCC100M. We evaluate our methods with a labelling cost that reflects how much effort a human would require to correct the generated clusters. Our analysis highlights several properties. First, when powered by an efficient GPU implementation, the cost of the discovery process is small compared to that of computing the image descriptors, even for collections as large as 100 million images. Second, we show that descriptors selected for instance search improve the discovery of object classes. Third, the Markov Clustering technique consistently outperforms the other methods; to our knowledge, it has never been considered in this large-scale scenario.
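For readers unfamiliar with the diffusion method highlighted here, the following is a minimal NumPy sketch of the Markov Clustering algorithm on a toy similarity graph. The expansion/inflation parameters and the dense toy matrix are illustrative; a large-scale run would instead use sparse matrices and the GPU implementation mentioned above:

```python
import numpy as np

def mcl(adj, expansion=2, inflation=2.0, iters=50):
    # Column-stochastic transition matrix over the similarity graph.
    m = adj / adj.sum(axis=0, keepdims=True)
    for _ in range(iters):
        m = np.linalg.matrix_power(m, expansion)  # expansion: spread flow
        m = m ** inflation                        # inflation: sharpen intra-cluster flow
        m = m / m.sum(axis=0, keepdims=True)      # renormalize columns
    # Simplified readout: group nodes by their strongest "attractor" row.
    clusters = {}
    for j in range(m.shape[1]):
        clusters.setdefault(int(np.argmax(m[:, j])), []).append(j)
    return list(clusters.values())

# Toy graph with two obvious groups {0,1,2} and {3,4} (self-loops included).
a = np.array([[1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1]], dtype=float)
print(mcl(a))  # -> [[0, 1, 2], [3, 4]]
```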
Social networks store and disseminate a tremendous amount of user-shared images. Deep hashing is an efficient indexing technique for supporting large-scale social image retrieval, owing to its deep representation capability, fast retrieval speed, and low storage cost. In particular, unsupervised deep hashing scales well because it does not require any manually labelled data for training. However, owing to the lack of label guidance, existing methods suffer from a severe semantic shortage when optimizing a large number of deep neural network parameters. In this paper, we propose a Dual-level Semantic Transfer Deep Hashing (DSTDH) method to alleviate this problem within a unified deep hash learning framework. Our model aims to learn semantically enhanced deep hash codes by specifically exploiting the user-generated tags associated with social images. Specifically, we design a complementary dual-level semantic transfer mechanism to efficiently discover the potential semantics of tags and seamlessly transfer them into binary hash codes. On the one hand, instance-level semantics are directly preserved in the hash codes from the associated tags, with adverse noise removed. On the other hand, an image-concept hypergraph is constructed to indirectly transfer the latent high-order semantic correlations of images and tags into the hash codes. Moreover, the hash codes are obtained simultaneously with the deep representation learning by a discrete hash optimization strategy. Extensive experiments on two public social image retrieval datasets validate the superior performance of our method compared with state-of-the-art hashing methods. The source code of our method can be obtained at https://github.com/research2020-1/DSTDH
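As a rough illustration of tag-guided hash learning (a generic formulation, not the paper's dual-level transfer mechanism), the sketch below trains a tanh-relaxed hash head so that images sharing tags receive similar binary codes. The network, the pairwise loss, and the toy tag-similarity matrix are all assumptions:

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    def __init__(self, feat_dim=512, bits=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, bits)

    def forward(self, feats):
        # Relaxed hash codes in (-1, 1); sign() gives discrete codes later.
        return torch.tanh(self.proj(feats))

def pairwise_loss(codes, tag_sim):
    # tag_sim[i, j] = +1 if images i and j share a tag, else -1 (toy input).
    inner = codes @ codes.t() / codes.shape[1]  # normalized inner product
    return ((inner - tag_sim) ** 2).mean()

head = HashHead()
feats = torch.randn(8, 512)                          # stand-in CNN features
tag_sim = torch.randint(0, 2, (8, 8)).float() * 2 - 1  # toy similarity
loss = pairwise_loss(head(feats), tag_sim)
loss.backward()                                      # one illustrative step
binary_codes = torch.sign(head(feats)).detach()      # final discrete codes
```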
For over a decade, TV series have been drawing increasing interest, both from audiences and from various academic fields. But while most viewers are hooked on the continuous plots of TV serials, the few annotated datasets available to researchers focus on standalone episodes of classical TV series. We aim to fill this gap by providing the multimedia/speech processing communities with Serial Speakers, an annotated dataset of 161 episodes from three popular American TV serials: Breaking Bad, Game of Thrones, and House of Cards. Serial Speakers is suitable both for investigating multimedia retrieval in realistic use-case scenarios and for addressing lower-level speech-related tasks in especially challenging conditions. We publicly release annotations for every speech turn (boundaries, speaker) and scene boundary, along with annotations for shot boundaries, recurring shots, and interacting speakers in a subset of episodes. Because of copyright restrictions, the textual content of the speech turns is encrypted in the public version of the dataset, but we provide users with a simple online tool to recover the plain text from their own subtitle files.
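The recovery idea can be pictured as a simple alignment rule: match each annotated speech turn to the subtitle cue with maximal temporal overlap. The matching rule and data layout below are assumptions for illustration, not the dataset's actual tool:

```python
def overlap(a_start, a_end, b_start, b_end):
    # Length of the temporal intersection of two intervals (0 if disjoint).
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def recover(turns, cues):
    # turns: [(start, end)] from the annotations;
    # cues: [(start, end, text)] parsed from the user's subtitle file.
    recovered = []
    for t_start, t_end in turns:
        best = max(cues, key=lambda c: overlap(t_start, t_end, c[0], c[1]))
        hit = overlap(t_start, t_end, best[0], best[1]) > 0
        recovered.append(best[2] if hit else "")
    return recovered

turns = [(1.0, 2.5), (3.0, 4.0)]
cues = [(0.9, 2.4, "I am the one who knocks."),
        (3.1, 4.2, "Winter is coming.")]
print(recover(turns, cues))
```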
Aesthetic image analysis is the study and assessment of the aesthetic properties of images. Current computational approaches to aesthetic image analysis provide results that are either accurate or interpretable. To obtain both accuracy and human interpretability, we advocate the use of learned, nameable visual attributes as mid-level features. For this purpose, we propose to discover and learn the visual appearance of attributes automatically, using a recently introduced database, called AVA, which contains more than 250,000 images together with their aesthetic scores and textual comments given by photography enthusiasts. We provide a detailed analysis of these annotations as well as the context in which they were given. We then describe how these three key components of AVA - images, scores, and comments - can be effectively leveraged to learn visual attributes. Lastly, we show that these learned attributes can be successfully used in three applications: aesthetic quality prediction, image tagging, and retrieval.
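One simple way to picture attribute discovery from comments is to rank terms by how much more often they appear in comments on high-scored images than on low-scored ones. The toy sketch below uses a plain smoothed frequency ratio; the score split and statistic are illustrative assumptions, whereas the actual work applies a more elaborate selection and then learns visual models for the attributes:

```python
from collections import Counter
import re

def term_counts(comments):
    # Count lowercase word occurrences across a set of comments.
    counts = Counter()
    for c in comments:
        counts.update(re.findall(r"[a-z]+", c.lower()))
    return counts

# Toy comments standing in for AVA comments on high- vs low-scored images.
high = ["great shallow depth of field", "lovely silhouette and framing"]
low = ["too blurry", "harsh lighting and noise"]

hi_c, lo_c = term_counts(high), term_counts(low)
# Laplace-smoothed ratio: large values suggest candidate attribute terms.
ratio = {t: (hi_c[t] + 1) / (lo_c[t] + 1) for t in set(hi_c) | set(lo_c)}
for term, r in sorted(ratio.items(), key=lambda kv: -kv[1])[:5]:
    print(term, round(r, 2))
```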