Natural language is one of the most natural interfaces for humans to interact with robots. This paper presents a robot system that retrieves everyday objects from unconstrained natural language descriptions. A core issue for the system is semantic and spatial grounding, i.e., inferring objects and their spatial relationships from images and natural language expressions. We introduce a two-stage neural-network grounding pipeline that maps natural language referring expressions directly to objects in images. The first stage uses the visual descriptions in the referring expressions to generate a candidate set of relevant objects. The second stage examines all pairwise relationships between the candidates and predicts the most likely referred object according to the spatial descriptions in the referring expressions. A key feature of our system is that, by leveraging a large dataset of images labeled with text descriptions, it allows unrestricted object types and natural language referring expressions. Preliminary results indicate that our system outperforms a near-state-of-the-art object comprehension system on standard benchmark datasets. We also present a robot system that follows voice commands to pick and place previously unseen objects.
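To make the two-stage pipeline concrete, here is a minimal sketch under our own assumptions; `visual_describer`, `relation_scorer`, and the toy `similarity` function are hypothetical stand-ins, not the authors' actual code.

```python
def similarity(a, b):
    """Crude token-overlap score; stands in for a learned text matcher."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def ground_expression(image, object_boxes, expression,
                      visual_describer, relation_scorer, threshold=0.3):
    """Return the bounding box most likely referred to by `expression`."""
    # Stage 1: keep objects whose generated visual description matches
    # the expression well enough (semantic grounding).
    candidates = [box for box in object_boxes
                  if similarity(visual_describer.describe(image, box),
                                expression) > threshold]
    if not candidates:
        candidates = list(object_boxes)   # fall back to all detections

    # Stage 2: score every (referent, context) pair against the spatial
    # part of the expression and keep the best-scoring referent.
    best_box, best_score = None, float("-inf")
    for target in candidates:
        for context in candidates:
            if context is target:
                continue
            score = relation_scorer.score(image, target, context, expression)
            if score > best_score:
                best_box, best_score = target, score
    return best_box if best_box is not None else candidates[0]
```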
This paper presents INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects. The core issue here is grounding referring expressions: inferring objects and their relationships from input images and language expressions. INGRESS allows for unconstrained object categories and unconstrained language expressions, and it asks questions to disambiguate referring expressions interactively. To achieve this, we take a grounding-by-generation approach and propose a two-stage neural-network model for grounding. The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expression, and identifies a set of candidate objects. The second stage uses another neural network to examine all pairwise relations between the candidates and infers the most likely referred object. The same neural networks are used for both grounding and question generation for disambiguation. Experiments show that INGRESS outperformed a state-of-the-art method on the RefCOCO dataset and in robot experiments with humans.
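The following is a rough sketch, in the spirit of the grounding-by-generation idea, of reusing a description-generation model to ask disambiguating questions; `caption_model` and `ask_user` are hypothetical placeholders.

```python
def resolve_ambiguity(image, candidates, caption_model, ask_user):
    """Ask yes/no questions until a single candidate referent remains."""
    while len(candidates) > 1:
        target, rest = candidates[0], candidates[1:]
        # The same generation model used for grounding produces a
        # description of this candidate relative to the other candidates.
        description = caption_model.describe(image, target, context=rest)
        answer = ask_user(f"Do you mean the {description}?")
        if answer.strip().lower() in {"yes", "yeah", "yep", "correct"}:
            return target
        candidates = rest             # discard the rejected candidate
    return candidates[0]
```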
Intelligent robots designed to interact with humans in real scenarios need to be able to refer to entities actively in natural language. In spatial referring expression generation, ambiguity is unavoidable due to the diversity of reference frames, which leads to an understanding gap between humans and robots. To narrow this gap, we propose a novel perspective-corrected spatial referring expression generation (PcSREG) approach for human-robot interaction that accounts for the selection of reference frames. The task of referring expression generation is simplified into the process of generating diverse spatial relation units. First, we select the landmarks in these spatial relation units according to the entropy of preference and allow the selection to be updated through a stack model. Then all possible referring expressions are generated under different reference frame strategies. Finally, we evaluate every expression with a probabilistic referring expression resolution model and choose the best expression that satisfies both appropriateness and effectiveness. We implement the proposed approach on a robot system, and experiments show that it generates more effective spatial referring expressions for practical applications.
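A hedged sketch of the generate-and-evaluate loop this abstract outlines is given below; `unit_generator`, `realize`, and `resolution_model` are hypothetical placeholders for the paper's components, and the appropriateness score is omitted for brevity.

```python
REFERENCE_FRAMES = ("robot-centric", "human-centric", "object-centric")

def generate_referring_expression(scene, target, unit_generator,
                                  realize, resolution_model):
    """Pick the expression the listener most likely resolves to `target`."""
    best_expr, best_score = None, float("-inf")
    for frame in REFERENCE_FRAMES:
        # Enumerate spatial relation units, e.g. (target, "left of", landmark),
        # under this reference-frame strategy.
        for unit in unit_generator(scene, target, frame):
            expr = realize(unit, frame)                 # surface realization
            # Effectiveness: probability that a probabilistic resolution
            # model maps the expression back to the intended target.
            score = resolution_model.prob(scene, expr, target)
            if score > best_score:
                best_expr, best_score = expr, score
    return best_expr
```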
In this paper, we present an approach for robot learning of social affordances from human activity videos. We consider the problem in the context of human-robot interaction: our approach learns structural representations of human-human (and human-object-human) interactions, describing how the body parts of each agent move with respect to each other and what spatial relations they should maintain to complete each sub-event (i.e., sub-goal). This enables the robot to infer its own movement in reaction to the human body motion, allowing it to naturally replicate such interactions. We introduce the representation of social affordance and propose a generative model for its weakly supervised learning from human demonstration videos. Our approach discovers critical steps (i.e., latent sub-events) in an interaction and the typical motion associated with them, learning which body parts should be involved and how. The experimental results demonstrate that our Markov chain Monte Carlo (MCMC) based learning algorithm automatically discovers semantically meaningful interactive affordances from RGB-D videos, which allows us to generate appropriate full-body motion for an agent.
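As a generic illustration of the MCMC-based learning mentioned above, here is a Metropolis-Hastings skeleton for sampling segmentations of a demonstration into latent sub-events; `log_likelihood` and `propose_segmentation` are hypothetical placeholders, and the paper's actual generative model is considerably richer.

```python
import math
import random

def sample_sub_events(frames, init_segmentation, log_likelihood,
                      propose_segmentation, n_iters=1000):
    """Return the highest-scoring segmentation visited by the chain."""
    current = init_segmentation
    current_ll = log_likelihood(frames, current)
    best, best_ll = current, current_ll
    for _ in range(n_iters):
        proposal = propose_segmentation(current)   # e.g. shift one boundary
        proposal_ll = log_likelihood(frames, proposal)
        # Accept with probability min(1, p(proposal) / p(current)),
        # assuming a symmetric proposal distribution.
        accept_prob = math.exp(min(0.0, proposal_ll - current_ll))
        if random.random() < accept_prob:
            current, current_ll = proposal, proposal_ll
            if current_ll > best_ll:
                best, best_ll = current, current_ll
    return best
```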
We focus on grounding (i.e., localizing or linking) referring expressions in images, e.g., "largest elephant standing behind baby elephant". This is a general yet challenging vision-language task, since it requires not only the localization of objects but also multimodal comprehension of context: the visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help distinguish the referent from other objects, especially those of the same category. Due to the exponential complexity of modeling the context associated with multiple image regions, existing work oversimplifies this task to pairwise region modeling by multiple instance learning. In this paper, we propose a variational Bayesian method, called Variational Context, to solve the problem of complex context modeling in referring expression grounding. Specifically, our framework exploits the reciprocal relation between the referent and the context, i.e., either of them influences the estimation of the posterior distribution of the other, so that the search space of the context can be greatly reduced. In addition to reciprocity, our framework considers the semantic information of the context, i.e., the referring expression can be reproduced from the estimated context. We also extend the model to the unsupervised setting, where no annotation for the referent is available. Extensive experiments on various benchmarks show consistent improvement over state-of-the-art methods in both supervised and unsupervised settings.
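As a generic illustration of the variational treatment alluded to above (notation ours, not necessarily the paper's): let x denote the expression, y the referent region, and z the latent context region; marginalizing over z and applying Jensen's inequality yields the standard evidence lower bound that such a framework optimizes.

```latex
% Generic evidence lower bound over a latent context region z (notation ours):
\log p(y \mid x)
  = \log \sum_{z} p(y, z \mid x)
  \;\ge\; \mathbb{E}_{q(z \mid x, y)}\bigl[\log p(y \mid z, x)\bigr]
          - \mathrm{KL}\bigl(q(z \mid x, y) \,\|\, p(z \mid x)\bigr).
```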
Robot capabilities are maturing across domains, from self-driving cars, to bipeds and drones. As a result, robots will soon no longer be confined to safety-controlled industrial settings; instead, they will directly interact with the general public. The growing field of Human-Robot Interaction (HRI) studies various aspects of this scenario - from social norms to joint action to human-robot teams and more. Researchers in HRI have made great strides in developing models, methods, and algorithms for robots acting with and around humans, but these computational HRI models and algorithms generally do not come with formal guarantees and constraints on their operation. To enable human-interactive robots to move from the lab to real-world deployments, we must address this gap. This article provides an overview of verification, validation and synthesis techniques used to create demonstrably trustworthy systems, describes several HRI domains that could benefit from such techniques, and provides a roadmap for the challenges and the research needed to create formalized and guaranteed human-robot interaction.