Visual perception entails solving a wide set of tasks, e.g., object detection and depth estimation. The predictions made for multiple tasks from the same image are not independent and are therefore expected to be consistent. We propose a broadly applicable and fully computational method for augmenting learning with Cross-Task Consistency. The proposed formulation is based on inference-path invariance over a graph of arbitrary tasks. We observe that learning with cross-task consistency leads to more accurate predictions and better generalization to out-of-distribution inputs. The framework also yields an informative unsupervised quantity, called Consistency Energy, which measures the intrinsic consistency of the system. Consistency Energy correlates well with the supervised error (r=0.67), so it can be employed as an unsupervised confidence metric as well as for detecting out-of-distribution inputs (ROC-AUC=0.95). The evaluations are performed on multiple datasets, including Taskonomy, Replica, CocoDoom, and ApolloScape, and they benchmark cross-task consistency against various baselines, including conventional multi-task learning, cycle consistency, and analytical consistency.
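As a rough illustration of the inference-path invariance idea described above, here is a minimal PyTorch-style sketch: the prediction obtained directly from an image is penalized for disagreeing with the prediction obtained through an intermediate task. The tiny networks, the depth/normals task pairing, the L1 penalty, and the 0.1 weight are all illustrative assumptions, not the paper's actual architecture or loss.

```python
import torch
import torch.nn as nn

# Minimal sketch of a cross-task consistency penalty: the prediction obtained
# directly (image -> depth) should agree with the prediction obtained via an
# intermediate task (image -> normals -> depth). Network names and the L1
# penalty are illustrative assumptions, not the paper's exact formulation.

class ConvHead(nn.Module):
    """Tiny stand-in for a task network (feature map in, feature map out)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, out_ch, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

f_x_to_depth   = ConvHead(3, 1)   # direct path: image -> depth
f_x_to_normals = ConvHead(3, 3)   # image -> surface normals
f_n_to_depth   = ConvHead(3, 1)   # cross-task mapping: normals -> depth

def cross_task_consistency_loss(image):
    """Penalize disagreement between the direct and the indirect inference path."""
    direct   = f_x_to_depth(image)                   # x -> depth
    indirect = f_n_to_depth(f_x_to_normals(image))   # x -> normals -> depth
    return torch.mean(torch.abs(direct - indirect))  # inference-path invariance

if __name__ == "__main__":
    img = torch.randn(2, 3, 64, 64)                  # dummy batch of RGB images
    supervised = torch.mean((f_x_to_depth(img) - torch.randn(2, 1, 64, 64)) ** 2)
    loss = supervised + 0.1 * cross_task_consistency_loss(img)  # weight is a guess
    loss.backward()
    print(float(loss))
```

At test time, the same disagreement term can be computed without labels, which is the intuition behind using an unsupervised consistency measure as a confidence signal.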
In this work we present a novel, robust transition generation technique, based on adversarial recurrent neural networks, that can serve as a new tool for 3D animators. The system synthesizes high-quality motions that use temporally-sparse keyframes as animation constraints.
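The core mechanism named here, a recurrent generator that fills the gap between sparse keyframes, can be sketched as follows. The pose dimensionality, GRU cell, offset prediction, and time-remaining conditioning are assumptions made for illustration, and the adversarial training component is omitted entirely.

```python
import torch
import torch.nn as nn

# Sketch of keyframe-conditioned transition generation: an autoregressive GRU
# predicts the next pose given the current pose, the target keyframe, and the
# number of frames remaining until that keyframe. All design choices here are
# illustrative assumptions; the actual system is adversarially trained and
# considerably more elaborate.

class TransitionGenerator(nn.Module):
    def __init__(self, pose_dim=66, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(pose_dim * 2 + 1, hidden)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, start_pose, target_keyframe, num_frames):
        """Autoregressively fill the gap between two keyframes."""
        h = start_pose.new_zeros(start_pose.size(0), self.rnn.hidden_size)
        pose, frames = start_pose, []
        for t in range(num_frames):
            remaining = torch.full_like(pose[:, :1], float(num_frames - t))
            h = self.rnn(torch.cat([pose, target_keyframe, remaining], dim=-1), h)
            pose = pose + self.out(h)         # predict a pose offset
            frames.append(pose)
        return torch.stack(frames, dim=1)     # (batch, num_frames, pose_dim)

if __name__ == "__main__":
    gen = TransitionGenerator()
    transition = gen(torch.randn(2, 66), torch.randn(2, 66), num_frames=30)
    print(transition.shape)  # torch.Size([2, 30, 66])
```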
We introduce MotioNet, a deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video. Previous methods rely on either rigging or inverse kinematics (IK) to associate a consistent skeleton with temporally coherent joint rotations.
Fast appearance variations and the distractions of similar objects are two of the most challenging problems in visual object tracking. Unlike many existing trackers that focus on modeling only the target, in this work, we consider the transient …
Cross-modal retrieval is generally performed by projecting and aligning the data from two different modalities onto a shared representation space. This shared space often also acts as a bridge for translating between the modalities.
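To make the shared-space recipe concrete, the following is a minimal sketch of one common instantiation: each modality is projected into a joint embedding space and paired items are aligned with a symmetric contrastive objective. The encoder dimensions, linear projection heads, and InfoNCE-style loss are assumptions for illustration, not necessarily the method proposed in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of cross-modal retrieval via a shared embedding space: each modality
# gets its own projection head, and paired items are pulled together with a
# symmetric contrastive loss. Dimensions and the InfoNCE-style objective are
# assumptions for illustration.

image_proj = nn.Linear(2048, 256)   # e.g., CNN image features -> shared space
text_proj  = nn.Linear(768, 256)    # e.g., text encoder features -> shared space

def contrastive_alignment(img_feats, txt_feats, temperature=0.07):
    """Align paired image/text features in the shared space (both directions)."""
    z_img = F.normalize(image_proj(img_feats), dim=-1)
    z_txt = F.normalize(text_proj(txt_feats), dim=-1)
    logits = z_img @ z_txt.t() / temperature       # pairwise similarities
    targets = torch.arange(z_img.size(0))          # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    loss = contrastive_alignment(torch.randn(8, 2048), torch.randn(8, 768))
    loss.backward()
    # Retrieval then reduces to nearest-neighbour search in the shared space.
```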
We present a method that processes 3D point clouds by performing graph convolution operations across shapes. In this manner, point descriptors are learned by allowing interaction and propagation of feature representations within a shape collection.
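One plausible reading of graph convolution "across shapes" is sketched below: point descriptors from several shapes are pooled into a single graph, each point is connected to its nearest neighbours in feature space (within and across shapes), and features are propagated with a simple graph convolution. The k-NN construction and mean aggregation are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

# Sketch of propagating point descriptors across a shape collection: points
# from all shapes are placed in one graph, each point is linked to its k
# nearest neighbours in descriptor space (within and across shapes), and
# features are updated with a mean-aggregation graph convolution. The k-NN
# graph and the aggregation rule are illustrative assumptions.

class CrossShapeGraphConv(nn.Module):
    def __init__(self, dim, k=8):
        super().__init__()
        self.k = k
        self.lin_self = nn.Linear(dim, dim)
        self.lin_neigh = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: (N, dim) descriptors for points pooled from the whole collection
        dists = torch.cdist(feats, feats)                            # pairwise distances
        idx = dists.topk(self.k + 1, largest=False).indices[:, 1:]   # drop self-edge
        neigh = feats[idx].mean(dim=1)                               # aggregate neighbours
        return torch.relu(self.lin_self(feats) + self.lin_neigh(neigh))

if __name__ == "__main__":
    # Two shapes with 100 points each, 32-d descriptors, pooled into one tensor.
    collection = torch.cat([torch.randn(100, 32), torch.randn(100, 32)], dim=0)
    layer = CrossShapeGraphConv(dim=32, k=8)
    print(layer(collection).shape)  # torch.Size([200, 32])
```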