
Optimality and limitations of audio-visual integration for cognitive systems

Posted by William Paul Boyce
Publication date: 2019
Research field: Informatics Engineering
Paper language: English





Multimodal integration is an important process in perceptual decision-making. In humans, this process has often been shown to be statistically optimal, or near optimal: sensory information is combined in a fashion that minimises the average error in the perceptual representation of stimuli. However, this optimisation sometimes comes at a cost, manifesting as illusory percepts. We review audio-visual facilitations and illusions that are products of multisensory integration, and the computational models that account for these phenomena. In particular, the same optimal computational model can lead to illusory percepts, and we argue that more studies are needed to detect and mitigate these illusions, which appear as artefacts in artificial cognitive systems. We provide cautionary considerations for designing artificial cognitive systems with a view to avoiding such artefacts. Finally, we suggest avenues of research towards solutions to potential pitfalls in system design. We conclude that a detailed understanding of multisensory integration and the mechanisms behind audio-visual illusions can benefit the design of artificial cognitive systems.
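To make the notion of statistical optimality concrete, the sketch below illustrates the standard maximum-likelihood (inverse-variance-weighted) cue-combination model commonly used to describe such integration. The Gaussian noise assumption, the variable names, and the example noise levels are illustrative assumptions for this sketch, not values or code from the paper.

```python
# Minimal sketch of maximum-likelihood audio-visual cue combination
# (standard inverse-variance weighting; the noise levels below are
# illustrative assumptions, not taken from the paper).

def combine_cues(x_audio, var_audio, x_visual, var_visual):
    """Fuse two noisy estimates of the same quantity.

    Each cue is weighted by its reliability (1 / variance); the fused
    estimate has lower variance than either cue alone.
    """
    w_a = (1.0 / var_audio) / (1.0 / var_audio + 1.0 / var_visual)
    w_v = 1.0 - w_a
    fused = w_a * x_audio + w_v * x_visual
    fused_var = 1.0 / (1.0 / var_audio + 1.0 / var_visual)
    return fused, fused_var

if __name__ == "__main__":
    # Hypothetical example: the visual location estimate is more reliable
    # than the auditory one, so the fused percept is pulled toward vision.
    fused, fused_var = combine_cues(x_audio=10.0, var_audio=4.0,
                                    x_visual=2.0, var_visual=1.0)
    print(f"fused estimate: {fused:.2f} deg, variance: {fused_var:.2f}")
```

When the two cues in fact originate from different sources, the same weighting still pulls the estimate toward the more reliable modality, which is one way illusory percepts such as the ventriloquist effect are commonly explained; this is the sense in which an optimal model can produce artefacts.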


Read also

Steve DiPaola, Liane Gabora (2018)
The common view that our creativity is what makes us uniquely human suggests that incorporating research on human creativity into generative deep learning techniques might be a fruitful avenue for making their outputs more compelling and human-like. Using an original synthesis of Deep Dream-based convolutional neural networks and cognition-based computational art rendering systems, we show how honing theory, intrinsic motivation, and the notion of a seed incident can be implemented computationally, and demonstrate their impact on the resulting generative art. Conversely, we discuss how explorations in deep learning convolutional neural net generative systems can inform our understanding of human creativity. We conclude with ideas for further cross-fertilization between AI-based computational creativity and the psychology of creativity.
Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf, restricted solely to their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to (1) discover elements of the geometry of the physical space indicated by the reverberating audio and (2) detect and follow sound-emitting targets. We further introduce SoundSpaces: a first-of-its-kind dataset of audio renderings based on geometrical acoustic simulations for two sets of publicly available 3D environments (Matterport3D and Replica), and we instrument Habitat to support the new sensor, making it possible to insert arbitrary sound sources in an array of real-world scanned environments. Our results show that audio greatly benefits embodied visual navigation in 3D spaces, and our work lays the groundwork for new research in embodied AI with audio-visual perception.
Runze Su, Fei Tao, Xudong Liu (2020)
Applications of short-term user-generated video (UGV), such as Snapchat and YouTube short videos, have boomed recently, giving rise to many multimodal machine learning tasks. Among them, learning the correspondence between audio and visual information from videos is a challenging one. Most previous work on audio-visual correspondence (AVC) learning only investigated constrained videos or simple settings, which may not fit the application of UGV. In this paper, we proposed new principles for AVC and introduced a new framework that takes videos' themes into account to facilitate AVC learning. We also released the KWAI-AD-AudVis corpus, which contains 85432 short advertisement videos (around 913 hours) made by users. We evaluated our proposed approach on this corpus, and it outperformed the baseline by 23.15% absolute.
Reinforcement learning (RL) agents in human-computer interaction applications require repeated user interactions before they can perform well. To address this cold-start problem, we propose a novel approach of using cognitive models to pre-train RL agents before they are applied to real users. After briefly reviewing relevant cognitive models, we present our general methodological approach, followed by two case studies from our previous and ongoing projects. We hope this position paper stimulates conversations between RL, HCI, and cognitive science researchers in order to explore the full potential of the approach.
Yapeng Tian, Chenliang Xu (2021)
In this paper, we propose to make a systematic study of machine multisensory perception under attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack audio, visual, and both modalities to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. To interpret the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that audio-visual models are susceptible to multimodal adversarial attacks; audio-visual integration could decrease model robustness rather than strengthen it under multimodal attacks; even a weakly-supervised sound source visual localization model can be successfully fooled; and our defense method can improve the invulnerability of audio-visual networks without significantly sacrificing clean model performance.
