Interactions between people are often governed by their relationships. Conversely, social relationships are built upon many interactions: two strangers are likely to greet and introduce themselves, and may become friends over time. We are fascinated by this interplay between interactions and relationships, and believe that it is an important aspect of understanding social situations. In this work, we propose neural models to learn and jointly predict interactions, relationships, and the pair of characters involved. We note that interactions are informed by a mixture of visual and dialog cues, and present a multimodal architecture to extract meaningful information from both. Localizing the pair of interacting characters in video is a time-consuming process; instead, we train our model to learn from clip-level weak labels. We evaluate our models on the MovieGraphs dataset, show the impact of modalities and of longer temporal context for predicting relationships, and achieve encouraging performance using weak labels compared with ground-truth labels. Code is online.
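As a rough illustration of the joint prediction described above, the following is a minimal sketch (not the authors' actual architecture) of a model that fuses clip-level visual and dialog features and predicts interaction and relationship labels with two heads. The feature dimensions, label counts, and fusion scheme are assumptions for the example.

```python
import torch
import torch.nn as nn

class InteractionRelationshipModel(nn.Module):
    """Illustrative joint predictor of interactions and relationships."""

    def __init__(self, visual_dim=2048, dialog_dim=768, hidden_dim=512,
                 n_interactions=101, n_relationships=15):
        super().__init__()
        # Project each modality into a shared space before fusion.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.dialog_proj = nn.Linear(dialog_dim, hidden_dim)
        # One head per task, trained jointly on clip-level labels.
        self.interaction_head = nn.Linear(hidden_dim, n_interactions)
        self.relationship_head = nn.Linear(hidden_dim, n_relationships)

    def forward(self, visual_feats, dialog_feats):
        # visual_feats: (batch, visual_dim) pooled clip-level visual features
        # dialog_feats: (batch, dialog_dim) pooled dialog/subtitle features
        fused = torch.relu(self.visual_proj(visual_feats)) + \
                torch.relu(self.dialog_proj(dialog_feats))
        return self.interaction_head(fused), self.relationship_head(fused)

# With clip-level weak labels, both heads can be trained with cross-entropy
# on the clip label even though the interacting pair is not localized.
model = InteractionRelationshipModel()
visual = torch.randn(4, 2048)
dialog = torch.randn(4, 768)
interaction_logits, relationship_logits = model(visual, dialog)
```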
We tackle the problem of learning the geometry of multiple categories of deformable objects jointly. Recent work has shown that it is possible to learn a unified dense pose predictor for several categories of related objects. However, training such m
While important advances were recently made towards temporally localizing and recognizing specific human actions or activities in videos, efficient detection and classification of long video chunks belonging to semantically defined categories such as
Knowledge bases of entities and relations (either constructed manually or automatically) are behind many real-world search engines, including those at Yahoo!, Microsoft, and Google. These knowledge bases can be viewed as graphs with nodes representin
Automatically writing stylized Chinese characters is an attractive yet challenging task due to its wide applicability. In this paper, we propose a novel framework named Style-Aware Variational Auto-Encoder (SA-VAE) to flexibly generate Chinese char
The goal of the IARAI traffic4cast competition was to predict the city-wide traffic status within a 15-minute time window, based on information from the previous hour. The traffic status was given as multi-channel images (one pixel roughly correspond
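To make the input/output structure of this task concrete, the following is a minimal sketch under assumed tensor shapes: the previous hour of multi-channel traffic frames is mapped to the frames covering the next 15 minutes. The frame resolution, 5-minute binning, channel count, and the simple convolutional network are illustrative assumptions, not the competition baseline.

```python
import torch
import torch.nn as nn

class TrafficForecaster(nn.Module):
    """Illustrative model: one hour of frames in, a 15-minute window out."""

    def __init__(self, in_frames=12, out_frames=3, channels=3):
        super().__init__()
        self.out_frames = out_frames
        self.channels = channels
        # Input frames are stacked along the channel axis: 12 frames x 3 channels.
        self.net = nn.Sequential(
            nn.Conv2d(in_frames * channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, out_frames * channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # x: (batch, in_frames * channels, height, width)
        y = self.net(x)
        b, _, h, w = y.shape
        # Reshape back to (batch, out_frames, channels, height, width).
        return y.view(b, self.out_frames, self.channels, h, w)

model = TrafficForecaster()
past_hour = torch.randn(2, 12 * 3, 128, 128)  # assumed 5-minute bins and frame size
prediction = model(past_hour)                 # three frames covering the next 15 minutes
```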