A long-term goal of artificial intelligence is to have an agent execute commands communicated through natural language. In many cases the commands are grounded in a visual environment shared by the human who gives the command and the agent. Executing the command then requires mapping it into the physical visual space, after which the appropriate action can be taken. In this paper we consider the former, and more specifically we consider the problem in an autonomous driving setting, where a passenger requests an action that can be associated with an object found in a street scene. Our work presents the Talk2Car dataset, which is the first object referral dataset that contains commands written in natural language for self-driving cars. We provide a detailed comparison with related datasets such as ReferIt, RefCOCO, RefCOCO+, RefCOCOg, Cityscape-Ref and CLEVR-Ref. Additionally, we include a performance analysis using strong state-of-the-art models. The results show that the proposed object referral task is a challenging one for which the models show promising results but still require additional research in natural language processing, computer vision and the intersection of these fields. The dataset can be found on our website: http://macchina-ai.eu/
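A minimal sketch of the object referral task this dataset poses, assuming pre-extracted candidate region features and an encoded command (the encoder dimensions, the scoring model, and the random inputs below are placeholders, not the models evaluated in the paper): score every candidate box against the command and return the one it refers to.

```python
# Minimal sketch of the object-referral setup (not one of the evaluated models):
# given an encoded command and features for candidate boxes from a street scene,
# score each box against the command and return the one the command refers to.
# The feature dimensions and random inputs are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferralScorer(nn.Module):
    def __init__(self, text_dim=256, region_dim=2048, joint_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)     # command embedding -> joint space
        self.region_proj = nn.Linear(region_dim, joint_dim)  # region feature -> joint space

    def forward(self, command_emb, region_feats):
        # command_emb: (text_dim,), region_feats: (num_boxes, region_dim)
        t = F.normalize(self.text_proj(command_emb), dim=-1)
        r = F.normalize(self.region_proj(region_feats), dim=-1)
        return r @ t  # cosine similarity of each candidate box with the command

scorer = ReferralScorer()
command_emb = torch.randn(256)        # stand-in for an encoded command
region_feats = torch.randn(12, 2048)  # stand-in for 12 candidate region features
print("predicted referred box:", scorer(command_emb, region_feats).argmax().item())
```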
Current technology for autonomous cars primarily focuses on getting the passenger from point A to B. Nevertheless, it has been shown that passengers are afraid of taking a ride in self-driving cars. One way to alleviate this problem is by allowing the passenger to give natural language commands to the car. However, the car can misunderstand the issued command or the visual surroundings, which could lead to uncertain situations. It is desirable that the self-driving car detects these situations and interacts with the passenger to solve them. This paper proposes a model that detects uncertain situations when a command is given and finds the visual objects causing them. Optionally, a question generated by the system describing the uncertain objects is included. We argue that if the car could explain the objects in a human-like way, passengers could gain more confidence in the car's abilities. Thus, we investigate how to (1) detect uncertain situations and their underlying causes, and (2) generate clarifying questions for the passenger. When evaluating on the Talk2Car dataset, we show that the proposed model, \acrfull{pipeline}, improves $IoU_{.5}$ by \gls{m:ambiguous-absolute-increase} compared to not using \gls{pipeline}. Furthermore, we designed a referring expression generator (REG), \acrfull{reg_model}, tailored to a self-driving car setting, which yields relative improvements of \gls{m:meteor-relative} in METEOR and \gls{m:rouge-relative} in ROUGE-L compared with state-of-the-art REG models, and is three times faster.
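For reference, the $IoU_{.5}$ metric used above counts a prediction as correct when the intersection-over-union between the predicted and ground-truth bounding boxes reaches 0.5; a minimal sketch of the computation (the example boxes are illustrative, not Talk2Car data):

```python
# IoU_{.5}: a prediction counts as correct when the intersection-over-union
# between the predicted and ground-truth boxes is at least 0.5.
# Boxes are (x1, y1, x2, y2); the example boxes are illustrative.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def iou_at_05(predictions, ground_truths):
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

print(iou_at_05([(10, 10, 50, 50)], [(12, 12, 48, 52)]))  # -> 1.0 for this single pair
```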
Pre-trained Transformer-based neural architectures have consistently achieved state-of-the-art performance on the Natural Language Inference (NLI) task. Since NLI examples encompass a variety of linguistic, logical, and reasoning phenomena, it remains unclear which specific concepts the trained systems learn and where they can achieve strong generalization. To investigate this question, we propose a taxonomic hierarchy of categories that are relevant for the NLI task. We introduce TAXINLI, a new dataset that has 10k examples from the MNLI dataset (Williams et al., 2018) annotated with these taxonomic labels. Through various experiments on TAXINLI, we observe that whereas SOTA neural models have achieved near-perfect accuracies for certain taxonomic categories (a large jump over the previous models), some categories still remain difficult. Our work adds to the growing body of literature that shows the gaps in current NLI systems and datasets through a systematic presentation and analysis of reasoning categories.
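A minimal sketch of the kind of per-category analysis such taxonomic labels enable (the field names "categories", "gold", "pred" and the example labels are illustrative, not the TAXINLI schema): compute accuracy separately for each reasoning category.

```python
# Per-category accuracy over taxonomically labelled NLI examples.
# Field names ("categories", "gold", "pred") and the labels are illustrative,
# not the TAXINLI schema.
from collections import defaultdict

def per_category_accuracy(examples):
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        for cat in ex["categories"]:
            total[cat] += 1
            correct[cat] += int(ex["pred"] == ex["gold"])
    return {cat: correct[cat] / total[cat] for cat in total}

examples = [
    {"categories": ["lexical"], "gold": "entailment", "pred": "entailment"},
    {"categories": ["quantifier", "negation"], "gold": "contradiction", "pred": "neutral"},
]
print(per_category_accuracy(examples))  # e.g. {'lexical': 1.0, 'quantifier': 0.0, 'negation': 0.0}
```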
Learning to follow instructions is of fundamental importance to autonomous agents for vision-and-language navigation (VLN). In this paper, we study how an agent can navigate long paths when learning from a corpus that consists of shorter ones. We show that existing state-of-the-art agents do not generalize well. To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially. A specially designed memory buffer is used by the agent to turn its past experiences into contexts for future steps. The learning process is composed of two phases. In the first phase, the agent uses imitation learning from demonstrations to accomplish BabySteps. In the second phase, the agent uses curriculum-based reinforcement learning to maximize rewards on navigation tasks with increasingly longer instructions. We create two new benchmark datasets (of long navigation tasks) and use them in conjunction with existing ones to examine BabyWalk's generalization ability. Empirical results show that BabyWalk achieves state-of-the-art results on several metrics and, in particular, is able to follow long instructions better. The code and datasets are released on our project page https://github.com/Sha-Lab/babywalk.
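A minimal sketch of the decompose-and-execute idea described above (not the released implementation; the comma-based split, the agent interface, and the memory format are assumptions for illustration):

```python
# Decompose a long instruction into BabySteps, execute them sequentially,
# and carry a summary of past steps as context for the next one.
def split_instruction(instruction):
    # naive comma-based decomposition; BabyWalk derives BabySteps more carefully
    return [s.strip() for s in instruction.split(",") if s.strip()]

def follow(instruction, agent, env=None):
    memory = []  # past experiences turned into context for future steps
    for step in split_instruction(instruction):
        context = " ; ".join(memory)                     # what was already done
        trajectory = agent.execute(step, context, env)   # hypothetical agent call
        memory.append(f"{step} -> {trajectory[-1]}")     # remember where the step ended
    return memory

class DummyAgent:
    """Stand-in agent: pretends each BabyStep ends at a named waypoint."""
    def execute(self, step, context, env):
        return [f"node_after<{step}>"]

print(follow("go down the hall, turn left at the lamp, stop by the door", DummyAgent()))
```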
A rather simple car-driving simulator was created based on the available open-source engine TORCS and used to analyze the basic features of human behavior in car driving within car-following setups. Eight subjects with different levels of skill in driving real cars participated in these experiments. They were instructed to drive a virtual car without overtaking the lead car, which was driven by the computer at a fixed speed, and not to lose sight of it. Moreover, these experiments were conducted at four different speeds: 60 km/h, 80 km/h, 100 km/h, and 120 km/h. Based on the collected data, the distributions of the headway, velocity, acceleration, and jerk are constructed and compared with experimental data collected previously from the analysis of real traffic flow. A new car-following model is proposed to capture the observed properties. As the main result, we conclude that human behavior in car driving should be categorized as a generalized intermittent control with noise-driven activation of the active phase. Besides, we hypothesize that the extended phase space required for modeling human actions in car driving has to comprise four phase variables, namely, the headway distance, the car velocity, its acceleration, and the car jerk, i.e., the time derivative of the car acceleration. This time, the time pattern of pedal actuation and the distribution of the time derivative of the pedal position were utilized in addition to the previous variables. Moreover, the driving data of all subjects were categorized into driving styles according to their shapes.
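A minimal sketch of how the four phase variables discussed above can be computed from a logged trajectory sampled at a fixed time step (the synthetic lead/follower trajectories are illustrative, not the simulator data):

```python
# Phase variables of the car-following description, computed from a trajectory
# sampled at a fixed time step: headway, velocity, acceleration, and jerk
# (the time derivative of acceleration). The synthetic trajectories are illustrative.
import numpy as np

dt = 0.1                                                  # sampling step (s)
t = np.arange(0.0, 60.0, dt)
lead_pos = 30.0 * t                                       # lead car at ~108 km/h
follower_pos = 30.0 * t - 25.0 + 2.0 * np.sin(0.2 * t)    # follower oscillating behind

headway = lead_pos - follower_pos                         # gap to the lead car (m)
velocity = np.gradient(follower_pos, dt)                  # m/s
acceleration = np.gradient(velocity, dt)                  # m/s^2
jerk = np.gradient(acceleration, dt)                      # m/s^3

# empirical distributions of the four phase variables
for name, series in [("headway", headway), ("velocity", velocity),
                     ("acceleration", acceleration), ("jerk", jerk)]:
    density, edges = np.histogram(series, bins=20, density=True)
    mode = 0.5 * (edges[np.argmax(density)] + edges[np.argmax(density) + 1])
    print(f"{name}: mean={series.mean():.3f}, std={series.std():.3f}, mode~{mode:.3f}")
```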
With the growth of artificial intelligence and autonomous learning, the self-driving car has become one of the most promising areas of research and is emerging as a center of focus for the automobile industry. Behavioral cloning is the process of replicating human behavior via visuomotor policies by means of machine learning algorithms. In recent years, several deep learning-based behavioral cloning approaches have been developed in the context of self-driving cars, specifically based on the concept of transfer learning. In this context, the present paper proposes a transfer learning approach using the VGG16 architecture, which is fine-tuned by retraining the last block while keeping the other blocks non-trainable. The performance of the proposed architecture is further compared with the existing NVIDIA architecture and its pruned variants (pruned by 22.2% and 33.85% using 1x1 filters to decrease the total number of parameters). Experimental results show that the VGG16 transfer-learning architecture outperforms the other discussed approaches and converges faster.
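A minimal sketch of the described transfer-learning setup in Keras, assuming a steering-angle regression head and an input size that are illustrative rather than the paper's exact configuration: all VGG16 blocks are frozen except the last one (block5), which is retrained.

```python
# VGG16 transfer learning: freeze all convolutional blocks except the last one
# (block5), then attach a small regression head for the steering command.
# The input size and head layers are assumptions, not the paper's exact setup.
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(160, 320, 3))
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")  # retrain only the last block

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),                          # predicted steering angle
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```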