Research papers and master's and doctoral theses published by Devi Parikh

Dance2Music: Automatic Dance-driven Music Generation

124 - Gunjan Aggarwal , Devi Parikh 2021

Dance and music typically go hand in hand. The complexities in dance, music, and their synchronisation make them fascinating to study from a computational creativity perspective. While several works have looked at generating dance for a given music, automatically generating music for a given dance remains under-explored. This capability could have several creative expression and entertainment applications. We present some early explorations in this direction. We present a search-based offline approach that generates music after processing the entire dance video and an online approach that uses a deep neural network to generate music on-the-fly as the video proceeds. We compare these approaches to a strong heuristic baseline via human studies and present our findings. We have integrated our online approach in a live demo! A video of the demo can be found here: https://sites.google.com/view/dance2music/live-demo.

Sound, Multimedia, Audio and Speech Processing

Visual Conceptual Blending with Large-scale Language and Vision Models

163 - Songwei Ge , Devi Parikh 2021

We ask the question: to what extent can recent large-scale language and image generation models blend visual concepts? Given an arbitrary object, we identify a relevant object and generate a single-sentence description of the blend of the two using a language model. We then generate a visual depiction of the blend using a text-based image generation model. Quantitative and qualitative evaluations demonstrate the superiority of language models over classical methods for conceptual blending, and of recent large-scale image generation models over prior models for the visual depiction.

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition

Building Bridges: Generative Artworks to Explore AI Ethics

274 - Ramya Srinivasan , Devi Parikh 2021

In recent years, there has been an increased emphasis on understanding and mitigating adverse impacts of artificial intelligence (AI) technologies on society. Across academia, industry, and government bodies, a variety of endeavours are being pursued towards enhancing AI ethics. A significant challenge in the design of ethical AI systems is that there are multiple stakeholders in the AI pipeline, each with their own set of constraints and interests. These different perspectives are often not understood, due in part to communication gaps. For example, AI researchers who design and develop AI models are not necessarily aware of the instability induced in consumers' lives by the compounded effects of AI decisions. Educating different stakeholders about their roles and responsibilities in the broader context becomes necessary. In this position paper, we outline some potential ways in which generative artworks can play this role by serving as accessible and powerful educational tools for surfacing different perspectives. We hope to spark interdisciplinary discussions about computational creativity broadly as a tool for enhancing AI ethics.

Computers and Society, Artificial Intelligence

KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA

149 - Kenneth Marino , Xinlei Chen , Devi Parikh 2020

One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image. In this work we study open-domain knowledge, the setting when the knowledge required to answer a question is not given/annotated, neither at training nor test time. We tap into two types of knowledge representations and reasoning. First, implicit knowledge which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models. Second, explicit, symbolic knowledge encoded in knowledge bases. Our approach combines both - exploiting the powerful implicit reasoning of transformer models for answer prediction, and integrating symbolic representations from a knowledge graph, while never losing their explicit semantics to an implicit embedding. We combine diverse sources of knowledge to cover the wide variety of knowledge needed to solve knowledge-based questions. We show our approach, KRISP (Knowledge Reasoning with Implicit and Symbolic rePresentations), significantly outperforms state-of-the-art on OK-VQA, the largest available dataset for open-domain knowledge-based VQA. We show with extensive ablations that while our model successfully exploits implicit knowledge reasoning, the symbolic answer module which explicitly connects the knowledge graph to the answer vocabulary is critical to the performance of our method and generalizes to rare answers.

Computer Vision and Pattern Recognition, Computation and Language

Neuro-Symbolic Generative Art: A Preliminary Study

347 - Gunjan Aggarwal , Devi Parikh 2020

There are two classes of generative art approaches: neural, where a deep model is trained to generate samples from a data distribution, and symbolic or algorithmic, where an artist designs the primary parameters and an autonomous system generates samples within these constraints. In this work, we propose a new hybrid genre: neuro-symbolic generative art. As a preliminary study, we train a generative deep neural network on samples from the symbolic approach. We demonstrate through human studies that subjects find the final artifacts and the creation process using our neuro-symbolic approach to be more creative than the symbolic approach 61% and 82% of the time respectively.

Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning

Are we pretraining it right? Digging deeper into visio-linguistic pretraining

131 - Amanpreet Singh , Vedanuj Goswami , Devi Parikh 2020

Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function design choices have received attention, the choice of pretraining datasets has received little attention. In this work, we question some of the default choices made in literature. For instance, we systematically study how varying similarity between the pretraining dataset domain (textual and visual) and the downstream domain affects performance. Surprisingly, we show that automatically generated data in a domain closer to the downstream task (e.g., VQA v2) is a better choice for pretraining than natural data but of a slightly different domain (e.g., Conceptual Captions). On the other hand, some seemingly reasonable choices of pretraining datasets were found to be entirely ineffective for some downstream tasks. This suggests that despite the numerous recent efforts, vision & language pretraining does not quite work out of the box yet. Overall, as a by-product of our study, we find that simple design choices in pretraining can help us achieve close to state-of-the-art results on downstream tasks without any architectural changes.

Computer Vision and Pattern Recognition, Computation and Language

SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions

170 - Ramprasaath R. Selvaraju , Purva Tendulkar , Devi Parikh 2020

Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing existence, properties, and spatial relationships of entities, a significant portion of questions pose challenges that correspond to reasoning tasks - tasks that can only be answered through a synthesis of perception and knowledge about the world, logic and/or reasoning. Analyzing performance across this distinction allows us to notice when existing VQA models have consistency issues; they answer the reasoning questions correctly but fail on associated low-level perception questions. For example, in Figure 1, models answer the complex reasoning question "Is the banana ripe enough to eat?" correctly, but fail on the associated perception question "Are the bananas mostly green or yellow?", indicating that the model likely answered the reasoning question correctly but for the wrong reason. We quantify the extent to which this phenomenon occurs by creating a new Reasoning split of the VQA dataset and collecting VQA-introspect, a new dataset which consists of 238K new perception questions which serve as sub-questions corresponding to the set of perceptual tasks needed to effectively answer the complex reasoning questions in the Reasoning split. Our evaluation shows that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems. To address this shortcoming, we propose an approach called Sub-Question Importance-aware Network Tuning (SQuINT), which encourages the model to attend to the same parts of the image when answering the reasoning question and the perception sub-question. We show that SQuINT improves model consistency by ~5%, also marginally improving performance on the Reasoning questions in VQA, while also displaying better attention maps.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
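The SQuINT abstract above says the model is encouraged to attend to the same image regions for a reasoning question and its perception sub-question. One plausible way to express that idea, sketched here purely for illustration, is a penalty on the difference between the two attention maps. The function name `attention_alignment_penalty` and the choice of mean squared difference are assumptions; the abstract does not specify the actual loss.

```python
from typing import Sequence


def attention_alignment_penalty(
    reasoning_attn: Sequence[float],
    sub_question_attn: Sequence[float],
) -> float:
    """Mean squared difference between two attention maps over image regions.

    Hypothetical illustration only: each argument holds one attention weight
    per image region, in the same region order.
    """
    if len(reasoning_attn) != len(sub_question_attn):
        raise ValueError("attention maps must cover the same image regions")
    if not reasoning_attn:
        raise ValueError("attention maps must not be empty")
    squared_diffs = [
        (a - b) ** 2 for a, b in zip(reasoning_attn, sub_question_attn)
    ]
    return sum(squared_diffs) / len(squared_diffs)


# Identical maps incur no penalty; disjoint maps incur a large one.
same = attention_alignment_penalty([0.5, 0.5], [0.5, 0.5])
disjoint = attention_alignment_penalty([1.0, 0.0], [0.0, 1.0])
print(same, disjoint)
```

In a training setup, a term like this would presumably be added to the usual answer-prediction loss so that the model is nudged toward consistent attention across the two question types.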

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

115 - Satwik Kottur , Jose M. F. Moura , Devi Parikh 2019

Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible as it requires prohibitively expensive complete annotation of the state of all images and dialogs. We develop CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling 4.25M question-answer pairs. We use CLEVR-Dialog to benchmark performance of standard visual dialog models; in particular, on visual coreference resolution (as a function of the coreference distance). This is the first analysis of its kind for visual dialog models that was not possible without this dataset. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog. Our dataset and code are publicly available.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
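The dataset size quoted in the CLEVR-Dialog abstract above follows from simple multiplication. The short check below uses only the figures stated in the abstract, treating "about 85k" as exactly 85,000 for the purpose of the arithmetic; the variable names are illustrative.

```python
# Figures taken from the CLEVR-Dialog abstract.
dialogs_per_image = 5      # "5 instances" of dialogs per image
rounds_per_dialog = 10     # each dialog has 10 question-answer rounds
num_images = 85_000        # "about 85k" CLEVR images (assumed exact here)

total_qa_pairs = dialogs_per_image * rounds_per_dialog * num_images
print(f"{total_qa_pairs:,}")  # 4,250,000, i.e. the 4.25M quoted in the abstract
```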

Towards Transparent AI Systems: Interpreting Visual Question Answering Models

245 - Yash Goyal , Akrit Mohapatra , Devi Parikh 2016

Deep neural networks have shown striking progress and obtained state-of-the-art results in many AI research fields in recent years. However, it is often unsatisfying to not know why they predict what they do. In this paper, we address the problem of interpreting Visual Question Answering (VQA) models. Specifically, we are interested in finding what part of the input (pixels in images or words in questions) the VQA model focuses on while answering the question. To tackle this problem, we use two visualization techniques, guided backpropagation and occlusion, to find important words in the question and important regions in the image. We then present qualitative and quantitative analyses of these importance maps. We found that even without explicit attention mechanisms, VQA models may sometimes be implicitly attending to relevant regions in the image, and often to appropriate words in the question.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
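The occlusion technique mentioned in the abstract above can be illustrated on question words: remove one word at a time and measure how much the model's score drops. The following is a minimal sketch, not the authors' implementation. `occlusion_importance` is a hypothetical helper, and `toy_score` is a made-up stand-in for a real VQA model's confidence in its answer.

```python
from typing import Callable, List, Sequence


def occlusion_importance(
    words: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> List[float]:
    """Importance of each word = drop in score when that word is removed."""
    base_score = score_fn(list(words))
    importances = []
    for i in range(len(words)):
        # Occlude word i by leaving it out of the question.
        occluded = [w for j, w in enumerate(words) if j != i]
        importances.append(base_score - score_fn(occluded))
    return importances


def toy_score(words: List[str]) -> float:
    """Hypothetical scorer: a real setup would query a VQA model instead."""
    weights = {"color": 0.5, "banana": 0.3}
    return sum(weights.get(w.lower(), 0.0) for w in words)


question = ["What", "color", "is", "the", "banana"]
scores = occlusion_importance(question, toy_score)
most_important = question[scores.index(max(scores))]
print(most_important)  # the word whose removal hurts the toy score most
```

The same loop applies to images if the words are replaced by image patches and "removal" by masking a patch, which is the usual form of occlusion for visual inputs.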
