In order to resonate with viewers, many video advertisements employ creative narrative techniques such as Freytag's pyramid, where a story begins with exposition, followed by rising action, then climax, and concludes with denouement. In the dramatic structure of ads in particular, the climax depends on changes in sentiment. We dedicate our study to understanding the dynamic structure of video ads automatically. To achieve this, we first crowdsource climax annotations on 1,149 videos from the Video Ads Dataset, which already provides sentiment annotations. We then use both unsupervised and supervised methods to predict the climax. Based on the predicted peak, low-level visual and audio cues, and semantically meaningful context features, we build a sentiment prediction model that outperforms the current state of the art in video ad sentiment prediction by 25%. In our ablation study, we show that using our context features and modeling dynamics with an LSTM are both crucial factors for the improved performance.
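To make the modeling concrete, below is a minimal sketch (not the authors' implementation) of the kind of sequence model the abstract describes: per-step visual, audio, and context features feed an LSTM whose final state predicts the ad's sentiment. The feature dimension, the 30-way sentiment label space, and all names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdSentimentLSTM(nn.Module):
    # Hypothetical model: an LSTM over per-frame features, with a linear
    # classifier over sentiment classes on the final hidden state.
    def __init__(self, feat_dim=512, hidden_dim=256, num_sentiments=30):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_sentiments)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim), the concatenated visual,
        # audio, and context features for each sampled time step
        _, (h_n, _) = self.lstm(frame_feats)
        return self.classifier(h_n[-1])  # logits over sentiment classes

model = AdSentimentLSTM()
clip = torch.randn(4, 60, 512)  # e.g. 4 ads, 60 one-second steps
logits = model(clip)            # (4, 30)
```

Using the last hidden state lets the classifier see the whole temporal arc of the ad, which is what makes the LSTM sensitive to the climax-driven sentiment dynamics the abstract highlights.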
There is more to images than their objective physical content: for example, advertisements are created to persuade a viewer to take a certain action. We propose the novel problem of automatic advertisement understanding. To enable research on this pr
Automated tagging of video advertisements has been a critical yet challenging problem, and it has drawn increasing interest in recent years as its applications are evident in many fields. Despite the sustained efforts that have been made, the tagging
Deformable convolution, originally proposed to adapt to geometric variations of objects, has recently shown compelling performance in aligning multiple frames and is increasingly adopted for video super-resolution. Despite its remarkable pe
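As a minimal sketch, under assumed shapes and names, the snippet below shows how deformable convolution is typically used to align a neighboring frame's features to a reference frame in video super-resolution: offsets are predicted from the two frames together, then the neighbor is resampled at the offset locations.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FrameAligner(nn.Module):
    # Hypothetical alignment module in the spirit described above.
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        # Offsets come from the concatenated reference and neighbor
        # features: 2 values (x, y) per kernel sample point.
        self.offset_conv = nn.Conv2d(channels * 2,
                                     2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2)

    def forward(self, ref_feat, nbr_feat):
        offset = self.offset_conv(torch.cat([ref_feat, nbr_feat], dim=1))
        # Sample the neighbor's features at the learned offsets,
        # warping them toward the reference frame.
        return self.deform_conv(nbr_feat, offset)

aligner = FrameAligner()
ref, nbr = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
aligned = aligner(ref, nbr)  # (1, 64, 32, 32)
```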
Video recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition - focusing on a highly specific video understanding task and thus leaving a signi
With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a con
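A common objective in this multimodal self-supervised literature is an InfoNCE-style contrastive loss: each clip embedding should match its paired audio or text embedding and reject the other pairs in the batch. The sketch below illustrates that formulation in general; all names, shapes, and the temperature value are assumptions, not details from the truncated abstract.

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    # Normalize so dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(len(v))      # positives lie on the diagonal
    # Symmetric loss: match video->text and text->video.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```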