Do you want to publish a course? Click here

Scene-Aware Audio for 360textdegree{} Videos

94   0   0.0 ( 0 )
 Added by Dingzeyu Li
 Publication date 2018
and research's language is English




Ask ChatGPT about the research

Although 360textdegree{} cameras ease the capture of panoramic footage, it remains challenging to add realistic 360textdegree{} audio that blends into the captured scene and is synchronized with the camera motion. We present a method for adding scene-aware spatial audio to 360textdegree{} videos in typical indoor scenes, using only a conventional mono-channel microphone and a speaker. We observe that the late reverberation of a rooms impulse response is usually diffuse spatially and directionally. Exploiting this fact, we propose a method that synthesizes the directional impulse response between any source and listening locations by combining a synthesized early reverberation part and a measured late reverberation tail. The early reverberation is simulated using a geometric acoustic simulation and then enhanced using a frequency modulation method to capture room resonances. The late reverberation is extracted from a recorded impulse response, with a carefully chosen time duration that separates out the late reverberation from the early reverberation. In our validations, we show that our synthesized spatial audio matches closely with recordings using ambisonic microphones. Lastly, we demonstrate the strength of our method in several applications.



rate research

Read More

A long-standing challenge in scene analysis is the recovery of scene arrangements under moderate to heavy occlusion, directly from monocular video. While the problem remains a subject of active research, concurrent advances have been made in the context of human pose reconstruction from monocular video, including image-space feature point detection and 3D pose recovery. These methods, however, start to fail under moderate to heavy occlusion as the problem becomes severely under-constrained. We approach the problems differently. We observe that people interact similarly in similar scenes. Hence, we exploit the correlation between scene object arrangement and motions performed in that scene in both directions: first, typical motions performed when interacting with objects inform us about possible object arrangements; and second, object arrangements, in turn, constrain the possible motions. We present iMapper, a data-driven method that focuses on identifying human-object interactions, and jointly reasons about objects and human movement over space-time to recover both a plausible scene arrangement and consistent human interactions. We first introduce the notion of characteristic interactions as regions in space-time when an informative human-object interaction happens. This is followed by a novel occlusion-aware matching procedure that searches and aligns such characteristic snapshots from an interaction database to best explain the input monocular video. Through extensive evaluations, both quantitative and qualitative, we demonstrate that iMapper significantly improves performance over both dedicated state-of-the-art scene analysis and 3D human pose recovery approaches, especially under medium to heavy occlusion.
Scene-aware dialog systems will be able to have conversations with users about the objects and events around them. Progress on such systems can be made by integrating state-of-the-art technologies from multiple research areas including end-to-end dialog systems visual dialog, and video description. We introduce the Audio Visual Scene Aware Dialog (AVSD) challenge and dataset. In this challenge, which is one track of the 7th Dialog System Technology Challenges (DSTC7) workshop1, the task is to build a system that generates responses in a dialog about an input video
We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
Spatial computing experiences are constrained by the real-world surroundings of the user. In such experiences, augmenting virtual objects to existing scenes require a contextual approach, where geometrical conflicts are avoided, and functional and plausible relationships to other objects are maintained in the target environment. Yet, due to the complexity and diversity of user environments, automatically calculating ideal positions of virtual content that is adaptive to the context of the scene is considered a challenging task. Motivated by this problem, in this paper we introduce SceneGen, a generative contextual augmentation framework that predicts virtual object positions and orientations within existing scenes. SceneGen takes a semantically segmented scene as input, and outputs positional and orientational probability maps for placing virtual content. We formulate a novel spatial Scene Graph representation, which encapsulates explicit topological properties between objects, object groups, and the room. We believe providing explicit and intuitive features plays an important role in informative content creation and user interaction of spatial computing settings, a quality that is not captured in implicit models. We use kernel density estimation (KDE) to build a multivariate conditional knowledge model trained using prior spatial Scene Graphs extracted from real-world 3D scanned data. To further capture orientational properties, we develop a fast pose annotation tool to extend current real-world datasets with orientational labels. Finally, to demonstrate our system in action, we develop an Augmented Reality application, in which objects can be contextually augmented in real-time.
The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven way of learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from a plethora of sensors that pound the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Our method differentiates in a data-driven manner useful signals from distracting ones using an attention mechanism. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that permit to outperform the current state-of-the-art by more than 20% on CIDEr.
comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا