No Arabic abstract
For over a decade, TV series have been drawing increasing interest, both from the audience and from various academic fields. But while most viewers are hooked on the continuous plots of TV serials, the few annotated datasets available to researchers focus on standalone episodes of classical TV series. We aim at filling this gap by providing the multimedia/speech processing communities with Serial Speakers, an annotated dataset of 161 episodes from three popular American TV serials: Breaking Bad, Game of Thrones and House of Cards. Serial Speakers is suitable both for investigating multimedia retrieval in realistic use case scenarios, and for addressing lower level speech related tasks in especially challenging conditions. We publicly release annotations for every speech turn (boundaries, speaker) and scene boundary, along with annotations for shot boundaries, recurring shots, and interacting speakers in a subset of episodes. Because of copyright restrictions, the textual content of the speech turns is encrypted in the public version of the dataset, but we provide the users with a simple online tool to recover the plain text from their own subtitle files.
Speaker diarization may be difficult to achieve when applied to narrative films, where speakers usually talk in adverse acoustic conditions: background music, sound effects, wide variations in intonation may hide the inter-speaker variability and make audio-based speaker diarization approaches error prone. On the other hand, such fictional movies exhibit strong regularities at the image level, particularly within dialogue scenes. In this paper, we propose to perform speaker diarization within dialogue scenes of TV series by combining the audio and video modalities: speaker diarization is first performed by using each modality, the two resulting partitions of the instance set are then optimally matched, before the remaining instances, corresponding to cases of disagreement between both modalities, are finally processed. The results obtained by applying such a multi-modal approach to fictional films turn out to outperform those obtained by relying on a single modality.
Cable TV news reaches millions of U.S. households each day, meaning that decisions about who appears on the news and what stories get covered can profoundly influence public opinion and discourse. We analyze a data set of nearly 24/7 video, audio, and text captions from three U.S. cable TV networks (CNN, FOX, and MSNBC) from January 2010 to July 2019. Using machine learning tools, we detect faces in 244,038 hours of video, label each faces presented gender, identify prominent public figures, and align text captions to audio. We use these labels to perform screen time and word frequency analyses. For example, we find that overall, much more screen time is given to male-presenting individuals than to female-presenting individuals (2.4x in 2010 and 1.9x in 2019). We present an interactive web-based tool, accessible at https://tvnews.stanford.edu, that allows the general public to perform their own analyses on the full cable TV news data set.
Research on mid-level image representations has conventionally concentrated relatively obvious attributes and overlooked non-obvious attributes, i.e., characteristics that are not readily observable when images are viewed independently of their context or function. Non-obvious attributes are not necessarily easily nameable, but nonetheless they play a systematic role in people`s interpretation of images. Clusters of related non-obvious attributes, called interpretation dimensions, emerge when people are asked to compare images, and provide important insight on aspects of social images that are considered relevant. In contrast to aesthetic or affective approaches to image analysis, non-obvious attributes are not related to the personal perspective of the viewer. Instead, they encode a conventional understanding of the world, which is tacit, rather than explicitly expressed. This paper introduces a procedure for discovering non-obvious attributes using crowdsourcing. We discuss this procedure using a concrete example of a crowdsourcing task on Amazon Mechanical Turk carried out in the domain of fashion. An analysis comparing discovered non-obvious attributes with user tags demonstrated the added value delivered by our procedure.
With the emerging of various online video platforms like Youtube, Youku and LeTV, online TV series reviews become more and more important both for viewers and producers. Customers rely heavily on these reviews before selecting TV series, while producers use them to improve the quality. As a result, automatically classifying reviews according to different requirements evolves as a popular research topic and is essential in our daily life. In this paper, we focused on reviews of hot TV series in China and successfully trained generic classifiers based on eight predefined categories. The experimental results showed promising performance and effectiveness of its generalization to different TV series.
In this paper, we propose a two-stage ranking approach for recommending linear TV programs. The proposed approach first leverages user viewing patterns regarding time and TV channels to identify potential candidates for recommendation and then further leverages user preferences to rank these candidates given textual information about programs. To evaluate the method, we conduct empirical studies on a real-world TV dataset, the results of which demonstrate the superior performance of our model in terms of both recommendation accuracy and time efficiency.