ﻻ يوجد ملخص باللغة العربية
The recent introduction of the AVA dataset for action detection has caused a renewed interest to this problem. Several approaches have been recently proposed that improved the performance. However, all of them have ignored the main difficulty of the AVA dataset - its realistic distribution of training and test examples. This dataset was collected by exhaustive annotation of human action in uncurated videos. As a result, the most common categories, such as `stand or `sit, contain tens of thousands of examples, whereas rare ones have only dozens. In this work we study the problem of action detection in a highly-imbalanced dataset. Differently from previous work on handling long-tail category distributions, we begin by analyzing the imbalance in the test set. We demonstrate that the standard AP metric is not informative for the categories in the tail, and propose an alternative one - Sampled AP. Armed with this new measure, we study the problem of transferring representations from the data-rich head to the rare tail categories and propose a simple but effective approach.
Detection and recognition of text in natural images are two main problems in the field of computer vision that have a wide variety of applications in analysis of sports videos, autonomous driving, industrial automation, to name a few. They face commo
Affordance detection refers to identifying the potential action possibilities of objects in an image, which is a crucial ability for robot perception and manipulation. To empower robots with this ability in unseen scenarios, we first study the challe
In this work, we present a novel audio-visual dataset for active speaker detection in the wild. A speaker is considered active when his or her face is visible and the voice is audible simultaneously. Although active speaker detection is a crucial pre
Generative Adversarial Networks (GANs) can generate realistic fake face images that can easily fool human beings.On the contrary, a common Convolutional Neural Network(CNN) discriminator can achieve more than 99.9% accuracyin discerning fake/real ima
A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the