Visual affordance grounding aims to segment all possible interaction regions between people and objects in an image or video, which benefits many applications such as robot grasping and action recognition. However, existing methods mainly rely on the appearance features of objects to segment each region of the image, and therefore face two problems: (i) an object may contain multiple regions that people interact with; and (ii) the same object region may support multiple human interactions. To address these problems, we propose a Hand-aided Affordance Grounding Network (HAG-Net) that leverages the auxiliary cues provided by the position and action of the hand in demonstration videos to resolve these ambiguities and better locate the interaction regions on the object. Specifically, HAG-Net has a dual-branch structure that processes the demonstration video and the object image. In the video branch, we introduce hand-aided attention to enhance the region around the hand in each video frame and then use an LSTM network to aggregate the action features. In the object branch, we introduce a semantic enhancement module (SEM) that makes the network focus on different parts of the object according to the action class, and we use a distillation loss to align the output features of the object branch with those of the video branch, transferring knowledge from the video branch to the object branch. Quantitative and qualitative evaluations on two challenging datasets show that our method achieves state-of-the-art results for affordance grounding. The source code will be made publicly available.
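To make the two main ingredients concrete, the following is a minimal sketch, not the authors' implementation. It assumes that hand-aided attention can be approximated by a Gaussian weight map centred on a hand bounding box, and that the distillation loss is a mean squared error between feature vectors; the abstract specifies neither form. The function names, the `sigma_scale` parameter, and the residual re-weighting `f * (1 + m)` are all hypothetical choices made for illustration.

```python
import math


def hand_attention_mask(height, width, hand_box, sigma_scale=0.5):
    """Assumed form: a Gaussian weight map peaked at the centre of the
    hand box, given as (x0, y0, x1, y1) in feature-map coordinates."""
    x0, y0, x1, y1 = hand_box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Spread is proportional to the box size; guard against zero-size boxes.
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    return [
        [
            math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def enhance(feature_map, mask):
    """Residual re-weighting (an assumption): features near the hand are
    amplified, features far from it are left almost unchanged."""
    return [
        [f * (1.0 + m) for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(feature_map, mask)
    ]


def distillation_loss(object_feat, video_feat):
    """Assumed form: mean squared error pulling the object-branch feature
    towards the video-branch feature."""
    if len(object_feat) != len(video_feat):
        raise ValueError("feature vectors must have the same length")
    return sum((o - v) ** 2 for o, v in zip(object_feat, video_feat)) / len(object_feat)


# Usage: a 5x5 single-channel feature map with the hand box centred at (2, 2).
mask = hand_attention_mask(5, 5, (1, 1, 3, 3))
features = [[1.0] * 5 for _ in range(5)]
enhanced = enhance(features, mask)
loss = distillation_loss([0.0, 0.0], [1.0, 1.0])
```

In a real system the enhanced per-frame features would be passed to a recurrent aggregator such as an LSTM, and the loss would be computed on learned feature tensors rather than plain lists.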
In this paper, we explore a novel task named visual Relation Grounding in Videos (vRGV). The task aims at spatio-temporally localizing the given relations in the form of subject-predicate-object in the videos, so as to provide supportive visual facts
Current methods for learning visually grounded language from videos often rely on text annotation, such as human-generated captions or machine-generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Langua
In this paper, we present a general framework for learning social affordance grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human interactions, and transfer the grammar to humanoids to enable real-time motion inference for h
Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task that facilitates cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which has access only to coarse video
There are rich synchronized audio and visual events in our daily life. Inside the events, audio scenes are associated with the corresponding visual objects; meanwhile, sounding objects can indicate and help to separate their individual sounds in the