ﻻ يوجد ملخص باللغة العربية
Despite their irresistible success, deep learning algorithms still heavily rely on annotated data. On the other hand, unsupervised settings pose many challenges, especially about determining the right inductive bias in diverse scenarios. One scalable solution is to make the model generate the supervision for itself by leveraging some part of the input data, which is known as self-supervised learning. In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation. In addition to disentangling the notion of objects and the motion dynamics, our compositional structure explicitly handles occlusion and inpaints inferred objects and background for the composition of the predicted frame. With the aid of auxiliary loss functions that promote spatially and temporally consistent object representations, our self-supervised framework can be trained without the help of any manual annotation or pretrained network. Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
A large part of the current success of deep learning lies in the effectiveness of data -- more precisely: labelled data. Yet, labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain, r
Action recognition in videos has attracted a lot of attention in the past decade. In order to learn robust models, previous methods usually assume videos are trimmed as short sequences and require ground-truth annotations of each video frame/sequence
In robotic applications, we often face the challenge of discovering new objects while having very little or no labelled training data. In this paper we explore the use of self-supervision provided by a robot traversing an environment to learn represe
Training high-accuracy object detection models requires large and diverse annotated datasets. However, creating these data-sets is time-consuming and expensive since it relies on human annotators. We design, implement, and evaluate TagMe, a new appro
Deep-learning-based algorithms have led to impressive results in visual-saliency prediction, but the impact of noise in training gaze data has been largely overlooked. This issue is especially relevant for videos, where the gaze data tends to be inco