ﻻ يوجد ملخص باللغة العربية
Prior work in multi-task learning has mainly focused on predictions on a single image. In this work, we present a new approach for multi-task learning from videos via efficient inter-frame local attention (MILA). Our approach contains a novel inter-frame attention module which allows learning of task-specific attention across frames. We embed the attention module in a ``slow-fast architecture, where the slower network runs on sparsely sampled keyframes and the light-weight shallow network runs on non-keyframes at a high frame rate. We also propose an effective adversarial learning strategy to encourage the slow and fast network to learn similar features. Our approach ensures low-latency multi-task learning while maintaining high quality predictions. Experiments show competitive accuracy compared to state-of-the-art on two multi-task learning benchmarks while reducing the number of floating point operations (FLOPs) by up to 70%. In addition, our attention based feature propagation method (ILA) outperforms prior work in terms of task accuracy while also reducing up to 90% of FLOPs.
Detecting manipulated images and videos is an important topic in digital media forensics. Most detection methods use binary classification to determine the probability of a query being manipulated. Another important topic is locating manipulated regi
Analyzing and understanding hand information from multimedia materials like images or videos is important for many real world applications and remains active in research community. There are various works focusing on recovering hand information from
Multi-task learning is an open and challenging problem in computer vision. The typical way of conducting multi-task learning with deep neural networks is either through handcrafted schemes that share all initial layers and branch out at an adhoc poin
Deep learning has made significant impacts on multi-view stereo systems. State-of-the-art approaches typically involve building a cost volume, followed by multiple 3D convolution operations to recover the input images pixel-wise depth. While such end
Frame reconstruction (current or future frame) based on Auto-Encoder (AE) is a popular method for video anomaly detection. With models trained on the normal data, the reconstruction errors of anomalous scenes are usually much larger than those of nor