Training deep-learning-based video classifiers for action recognition requires a large number of labeled videos. The labeling process is labor-intensive and time-consuming. On the other hand, vast numbers of weakly labeled images are uploaded to the Internet by users every day. To harness this rich and highly diverse set of Web images, a scalable approach is to crawl them to train deep-learning-based classifiers such as Convolutional Neural Networks (CNNs). However, due to the domain-shift problem, the performance of deep classifiers trained on Web images tends to degrade when they are directly deployed to videos. One way to address this problem is to fine-tune the trained models on videos, but a sufficient amount of annotated video is still required. In this work, we propose a novel approach to transfer knowledge from the image domain to the video domain. The proposed method can adapt to the target domain (i.e., video data) with a limited amount of training data. Our method maps video frames into a low-dimensional feature space using class-discriminative spatial attention maps for CNNs. We design a novel Siamese EnergyNet structure that learns energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground-truth concept has higher energy. We conduct extensive experiments on two challenging video recognition datasets (TVHI and UCF101) and demonstrate the efficacy of our proposed method.
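The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of one plausible reading: a weight-shared (Siamese) energy head over flattened attention maps, trained with a margin-ranking term (pushing the ground-truth map's energy higher) plus a classification term. All module names, sizes, and the exact form of the two jointly optimized losses are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyNet(nn.Module):
    """Scalar energy for a single flattened spatial attention map (assumed MLP head)."""
    def __init__(self, map_size=49, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(map_size, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, attn_map):                  # attn_map: (B, H*W)
        return self.mlp(attn_map).squeeze(-1)     # (B,)

class SiameseEnergyNet(nn.Module):
    """Two branches with shared weights: ground-truth map vs. distractor map."""
    def __init__(self, map_size=49):
        super().__init__()
        self.energy = EnergyNet(map_size)         # shared energy function

    def forward(self, pos_map, neg_map):
        return self.energy(pos_map), self.energy(neg_map)

def joint_loss(e_pos, e_neg, logits, labels, margin=1.0, alpha=0.5):
    """Margin-ranking term (ground-truth energy should exceed the distractor's
    by a margin) combined with a standard classification term; one possible
    instantiation of the paper's two jointly optimized losses."""
    rank = F.relu(margin - (e_pos - e_neg)).mean()
    cls = F.cross_entropy(logits, labels)
    return rank + alpha * cls
```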
Labeling objects at a subordinate level typically requires expert knowledge, which is not always available from random annotators. As such, learning directly from web images for fine-grained recognition has attracted broad attention. However, t
With the tremendous advances of Convolutional Neural Networks (ConvNets) in object recognition, we can now easily obtain sufficiently reliable machine-labeled annotations from the predictions of off-the-shelf ConvNets. In this work, we present an abstraction m
Video summarization aims to select representative frames that retain high-level information, which is usually done by predicting segment-wise importance scores via a softmax function. However, the softmax function suffers in retaining high-rank repres
Inspired by the observation that humans process videos efficiently by paying attention only where and when it is needed, we propose an interpretable and easy plug-in spatial-temporal attention mechanism for video action recognition. For s
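The preview above is cut off before the mechanism is described, so the sketch below is only one possible reading of a plug-in spatial-temporal attention block over per-frame CNN feature maps: a 1x1 convolution produces a spatial map ("where") and a linear head produces per-frame weights ("when"). All layer choices and shapes are assumptions.

```python
import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    """Assumed plug-in block: input frame features of shape (B, T, C, H, W),
    output a single attended clip feature of shape (B, C)."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)   # "where" head
        self.temporal = nn.Linear(channels, 1)                  # "when" head

    def forward(self, x):                                       # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        s = self.spatial(x.reshape(b * t, c, h, w))             # (B*T, 1, H, W)
        s = torch.softmax(s.reshape(b * t, -1), dim=-1).reshape(b, t, 1, h, w)
        x = x * s                                               # spatially weighted features
        frame_feat = x.sum(dim=(3, 4))                          # (B, T, C) per-frame features
        a = torch.softmax(self.temporal(frame_feat), dim=1)     # (B, T, 1) temporal weights
        return (frame_feat * a).sum(dim=1)                      # (B, C) clip-level feature
```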
Human action recognition is an active research area in computer vision. Although great progress has been made, previous methods mostly recognize actions from depth data at only one scale, and thus often neglect the multi-scale features that provi