UNOC: Understanding Occlusion for Embodied Presence in Virtual Reality


Published by Mathias Parger

Publication date: 2020

Research field: Informatics Engineering

Language: English

Authors: Mathias Parger, Chengcheng Tang, Yuanlu Xu

Computer Vision and Pattern Recognition, Machine Learning


Tracking body and hand motions in 3D space is essential for social and self-presence in augmented and virtual environments. Unlike the popular 3D pose estimation setting, the problem is often formulated as inside-out tracking based on embodied perception (e.g., egocentric cameras, handheld sensors). In this paper, we propose a new data-driven framework for inside-out body tracking, targeting the challenges that omnipresent occlusions pose for optimization-based methods (e.g., inverse kinematics solvers). We first collect a large-scale motion capture dataset with both body and finger motions using optical markers and inertial sensors. This dataset focuses on social scenarios and captures ground truth poses under self-occlusions and body-hand interactions. We then simulate the occlusion patterns in head-mounted camera views on the captured ground truth using a ray casting algorithm and train a deep neural network to infer the occluded body parts. In the experiments, we show that the proposed method generates high-fidelity embodied poses on the tasks of real-time inside-out body tracking, finger motion synthesis, and 3-point inverse kinematics.
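The occlusion simulation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each body part is approximated by a sphere and the head-mounted camera by a single point, and it marks a joint as occluded when the line of sight from the camera to that joint passes through any other sphere. The names `segment_hits_sphere` and `occluded_joints`, the sphere approximation, and the example coordinates are all hypothetical.

```python
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def segment_hits_sphere(origin: Vec3, target: Vec3,
                        center: Vec3, radius: float) -> bool:
    """Return True if the segment origin->target passes through the sphere."""
    d = _sub(target, origin)
    seg_len_sq = _dot(d, d)
    if seg_len_sq == 0.0:
        return False
    # Parameter of the point on the segment closest to the sphere center,
    # clamped so that we never look beyond the target joint.
    t = _dot(_sub(center, origin), d) / seg_len_sq
    t = max(0.0, min(1.0, t))
    closest = (origin[0] + t * d[0], origin[1] + t * d[1], origin[2] + t * d[2])
    diff = _sub(center, closest)
    return _dot(diff, diff) < radius * radius


def occluded_joints(camera: Vec3,
                    joints: Dict[str, Tuple[Vec3, float]]) -> Dict[str, bool]:
    """Cast one ray per joint and flag joints hidden behind another sphere.

    Assumption: `joints` maps a joint name to (center, radius).
    """
    result = {}
    for name, (pos, _radius) in joints.items():
        result[name] = any(
            segment_hits_sphere(camera, pos, other_pos, other_radius)
            for other, (other_pos, other_radius) in joints.items()
            if other != name
        )
    return result


if __name__ == "__main__":
    camera = (0.0, 0.0, 0.0)
    joints = {
        "hand": ((0.0, 0.0, 1.0), 0.1),      # directly in front of the camera
        "elbow": ((0.0, 0.0, 2.0), 0.1),     # hidden behind the hand
        "shoulder": ((1.0, 0.0, 2.0), 0.1),  # off to the side, visible
    }
    print(occluded_joints(camera, joints))
```

In a pipeline like the one the abstract describes, per-joint visibility flags of this kind could serve as masks on ground-truth poses, so that a network is trained to infer the masked joints. A realistic version would cast rays against a body mesh and respect the camera's field of view.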


80 - Sajad Saeedi 2018

Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are (1) tools and methodology for systematic quantitative evaluation of SLAM algorithms, (2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives, (3) end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and (4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.

Computer Vision and Pattern Recognition, Machine Learning, Robotics

Federated Echo State Learning for Minimizing Breaks in Presence in Wireless Virtual Reality Networks

75 - Mingzhe Chen, Omid Semiari, Walid Saad 2018

In this paper, the problem of enhancing the virtual reality (VR) experience for wireless users is investigated by minimizing the occurrence of breaks in presence (BIP) that can detach the users from their virtual world. To measure the BIP for wireless VR users, a novel model that jointly considers the VR application type, transmission delay, VR video quality, and users' awareness of the virtual environment is proposed. In the developed model, the base stations (BSs) transmit VR videos to the wireless VR users using directional transmission links so as to provide high data rates for the VR users, thus reducing the number of BIP for each user. Since the body movements of a VR user may result in a blockage of its wireless link, the location and orientation of VR users must also be considered when minimizing BIP. The BIP minimization problem is formulated as an optimization problem which jointly considers the predictions of users' locations, orientations, and their BS association. To predict the orientations and locations of VR users, a distributed learning algorithm based on the machine learning framework of deep echo state networks (ESNs) is proposed. The proposed algorithm uses concepts from federated learning to enable multiple BSs to locally train their deep ESNs using their collected data and cooperatively build a learning model to predict all users' locations and orientations. Using these predictions, the user association policy that minimizes BIP is derived. Simulation results demonstrate that the developed algorithm reduces the users' BIP by up to 16% and 26%, respectively, compared to centralized ESN and deep learning algorithms.
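The cooperative model-building step can be illustrated with a generic federated averaging sketch. This is an assumption about how such an aggregation might look, not the algorithm from the paper: each base station is assumed to hold a locally trained weight vector (for an ESN, typically the readout weights), and the vectors are combined by an average weighted by the number of local samples. The function name `federated_average` and its parameters are hypothetical.

```python
from typing import List, Sequence


def federated_average(local_weights: Sequence[Sequence[float]],
                      sample_counts: Sequence[int]) -> List[float]:
    """Combine per-station weight vectors into one global vector.

    Each station's contribution is weighted by its local sample count,
    so raw user data never has to leave the station.
    """
    if not local_weights or len(local_weights) != len(sample_counts):
        raise ValueError("need one sample count per weight vector")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(local_weights[0])
    if any(len(w) != dim for w in local_weights):
        raise ValueError("all weight vectors must have the same length")
    return [
        sum(w[i] * n for w, n in zip(local_weights, sample_counts)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two base stations; the second one has seen three times as much data.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In a full training loop, the averaged vector would be sent back to the stations as the starting point for the next round of local training.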

Information Theory

Language (Re)modelling: Towards Embodied Language Understanding

190 - Ronen Tamari, Chen Shani, Tom Hope 2020

While natural language understanding (NLU) is advancing rapidly, today's technology differs from human-like language understanding in fundamental ways, notably in its inferior efficiency, interpretability, and generalization. This work proposes an approach to representation and learning based on the tenets of embodied cognitive linguistics (ECL). According to ECL, natural language is inherently executable (like programming languages), driven by mental simulation and metaphoric mappings over hierarchical compositions of structures and schemata learned through embodied interaction. This position paper argues that the use of grounding by metaphoric inference and simulation will greatly benefit NLU systems, and proposes a system architecture along with a roadmap towards realizing this vision.

Computation and Language, Machine Learning

YouRefIt: Embodied Reference Understanding with Language and Gesture

126 - Yixin Chen, Qing Li, Deqian Kong 2021

We study the understanding of embodied reference: one agent uses both language and gesture to refer to an object to another agent in a shared physical environment. Of note, this new visual task requires understanding multimodal cues with perspective-taking to identify which object is being referred to. To tackle this problem, we introduce YouRefIt, a new crowd-sourced dataset of embodied reference collected in various physical scenes; the dataset contains 4,195 unique reference clips in 432 indoor scenes. To the best of our knowledge, this is the first embodied reference dataset that allows us to study referring expressions in daily physical scenes to understand referential behavior, human communication, and human-robot interaction. We further devise two benchmarks for image-based and video-based embodied reference understanding. Comprehensive baselines and extensive experiments provide the very first results of machine perception on how referring expressions and gestures affect embodied reference understanding. Our results provide essential evidence that gestural cues are as critical as language cues in understanding the embodied reference.

Computer Vision and Pattern Recognition

Robust Egocentric Photo-realistic Facial Expression Transfer for Virtual Reality

209 - Amin Jourabloo, Fernando De la Torre, Jason Saragih 2021

Social presence, the feeling of being there with a real person, will fuel the next generation of communication systems driven by digital humans in virtual reality (VR). The best 3D video-realistic VR avatars that minimize the uncanny effect rely on person-specific (PS) models. However, these PS models are time-consuming to build and are typically trained with limited data variability, which results in poor generalization and robustness. Major sources of variability that affect the accuracy of facial expression transfer algorithms include using different VR headsets (e.g., camera configuration, slop of the headset), facial appearance changes over time (e.g., beard, make-up), and environmental factors (e.g., lighting, backgrounds). This is a major drawback for the scalability of these models in VR. This paper makes progress in overcoming these limitations by proposing an end-to-end multi-identity architecture (MIA) trained with specialized augmentation strategies. MIA drives the shape component of the avatar from three cameras in the VR headset (two eyes, one mouth), in untrained subjects, using minimal personalized information (i.e., neutral 3D mesh shape). Similarly, if the PS texture decoder is available, MIA is able to drive the full avatar (shape+texture), robustly outperforming PS models in challenging scenarios. Our key contribution to improving robustness and generalization is that our method implicitly decouples, in an unsupervised manner, the facial expression from nuisance factors (e.g., headset, environment, facial appearance). We demonstrate the superior performance and robustness of the proposed method versus state-of-the-art PS approaches in a variety of experiments.

Computer Vision and Pattern Recognition