This work studies the problem of predicting the sequence of future actions for surround vehicles in real-world driving scenarios. To this end, we make three main contributions. The first contribution is an automatic method for converting trajectories recorded in real-world driving scenarios into action sequences with the help of HD maps. The method enables automatic dataset creation for this task from large-scale driving data. Our second contribution lies in applying the method to the well-known traffic agent tracking and prediction dataset Argoverse, resulting in 228,000 action sequences. Additionally, 2,245 action sequences were manually annotated for testing. The third contribution is a novel action sequence prediction method that integrates the past positions and velocities of the traffic agents, map information, and social context into a single end-to-end trainable neural network. Our experiments demonstrate the merit of the data creation method and the value of the resulting dataset: prediction performance improves consistently with dataset size, and our action prediction method outperforms competing models.
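As a rough illustration of the fusion described above, the following is a minimal sketch, not the authors' implementation, of an encoder that combines an agent's motion history with map and social-context features and decodes a fixed-length sequence of action logits; all module choices, dimensions, and the six-step horizon are assumptions.

```python
# Minimal sketch of an end-to-end action-sequence predictor that fuses
# motion history, map features, and social context (all sizes assumed).
import torch
import torch.nn as nn

class ActionSequencePredictor(nn.Module):
    def __init__(self, motion_dim=4, map_dim=64, social_dim=32,
                 hidden_dim=128, num_actions=8, horizon=6):
        super().__init__()
        self.horizon, self.num_actions = horizon, num_actions
        # Encode past positions and velocities of the target agent.
        self.motion_encoder = nn.LSTM(motion_dim, hidden_dim, batch_first=True)
        # Encode map context and social context (assumed to be precomputed vectors).
        self.map_encoder = nn.Sequential(nn.Linear(map_dim, hidden_dim), nn.ReLU())
        self.social_encoder = nn.Sequential(nn.Linear(social_dim, hidden_dim), nn.ReLU())
        # Decode a fixed-length sequence of future action logits.
        self.decoder = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, horizon * num_actions))

    def forward(self, motion_hist, map_feat, social_feat):
        _, (h, _) = self.motion_encoder(motion_hist)   # motion_hist: (B, T, 4)
        fused = torch.cat([h[-1], self.map_encoder(map_feat),
                           self.social_encoder(social_feat)], dim=-1)
        logits = self.decoder(fused)                   # (B, horizon * num_actions)
        return logits.view(-1, self.horizon, self.num_actions)

# Example: batch of 2 agents, 20 past timesteps of (x, y, vx, vy).
model = ActionSequencePredictor()
out = model(torch.randn(2, 20, 4), torch.randn(2, 64), torch.randn(2, 32))
print(out.shape)  # torch.Size([2, 6, 8])
```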
Visual localization and mapping is a crucial capability for addressing many challenges in mobile robotics. It constitutes a robust, accurate and cost-effective approach for local and global pose estimation within prior maps. Yet, in highly dynamic environments, like crowded city streets, problems arise as major parts of the image can be covered by dynamic objects. Consequently, visual odometry pipelines often diverge and localization systems malfunction because the detected features are not consistent with the precomputed 3D model. In this work, we present an approach to automatically detect dynamic object instances to improve the robustness of vision-based localization and mapping in crowded environments. By training a convolutional neural network model with a combination of synthetic and real-world data, dynamic object instance masks are learned in a semi-supervised way. The real-world data can be collected with a standard camera and requires minimal further post-processing. Our experiments show that a wide range of dynamic objects can be reliably detected using the presented method. Promising performance is demonstrated on our own as well as on publicly available datasets, which also indicates the generalization capabilities of this approach.
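One common way such instance masks can be exploited, shown below as a minimal sketch rather than the authors' pipeline, is to discard image features that fall on predicted dynamic objects before they reach the odometry or localization back end; the ORB detector and the mask format here are assumptions.

```python
# Minimal sketch: discard features that fall on predicted dynamic-object masks
# before feeding them to a localization pipeline (mask source is assumed).
import cv2
import numpy as np

def filter_static_keypoints(image_gray, dynamic_mask):
    """Detect ORB keypoints and keep only those outside the dynamic mask.

    dynamic_mask: uint8 array, nonzero where a dynamic instance was predicted.
    """
    orb = cv2.ORB_create(nfeatures=2000)
    keypoints = orb.detect(image_gray, None)
    static = [kp for kp in keypoints
              if dynamic_mask[int(kp.pt[1]), int(kp.pt[0])] == 0]
    # Compute descriptors only for keypoints on (presumably) static structure.
    static, descriptors = orb.compute(image_gray, static)
    return static, descriptors

# Toy usage with a synthetic image and an empty mask.
img = (np.random.rand(480, 640) * 255).astype(np.uint8)
mask = np.zeros_like(img)
kps, desc = filter_static_keypoints(img, mask)
print(len(kps))
```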
The COVID-19 pandemic has become a global challenge faced by people all over the world. Social distancing has proven to be an effective practice for reducing the spread of COVID-19. Against this backdrop, we propose that surveillance robots can not only monitor but also promote social distancing. Robots can be flexibly deployed and can take precautionary actions to remind people to practice social distancing. In this paper, we introduce a fully autonomous surveillance robot based on a quadruped platform that can promote social distancing in complex urban environments. Specifically, to achieve autonomy, we mount multiple cameras and a 3D LiDAR on the legged robot. The robot then uses an onboard real-time social distancing detection system to track nearby pedestrian groups. Next, the robot uses a crowd-aware navigation algorithm to move freely in highly dynamic scenarios. Finally, the robot uses a crowd-aware routing algorithm to effectively promote social distancing, sending suggestions to over-crowded pedestrian groups via human-friendly verbal cues. We demonstrate and validate that our robot can operate autonomously through several experiments in various urban scenarios.
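The social distancing detection step can be illustrated with a simple pairwise distance check over tracked pedestrian positions; the sketch below is illustrative only, and the 2 m threshold, the coordinate frame, and the track format are assumptions.

```python
# Minimal sketch: flag pedestrian pairs whose ground-plane distance is below
# a social-distancing threshold (positions and threshold are assumptions).
import itertools
import numpy as np

def find_violations(positions, threshold_m=2.0):
    """positions: dict mapping track id -> (x, y) in metres (e.g. from LiDAR).

    Returns the list of (id_a, id_b) pairs closer than the threshold.
    """
    violations = []
    for (id_a, pa), (id_b, pb) in itertools.combinations(positions.items(), 2):
        if np.linalg.norm(np.subtract(pa, pb)) < threshold_m:
            violations.append((id_a, id_b))
    return violations

tracks = {1: (0.0, 0.0), 2: (1.2, 0.5), 3: (6.0, 4.0)}
print(find_violations(tracks))  # [(1, 2)]
```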
This paper explores the use of a Bayesian non-parametric topic modeling technique for the purpose of anomaly detection in video data. We present results from two experiments. The first experiment shows that the proposed technique is able to automatically characterize the underlying terrain and detect anomalous flora in image data collected by an underwater robot. The second experiment shows that the same technique can be used on images from a static camera in a dynamic, unstructured environment. In the second dataset, consisting of video data from a static seafloor camera observing a busy coral reef, the proposed technique was able to detect all three instances of an underwater vehicle passing in front of the camera, amongst many other observations of fish, debris, lighting changes due to surface waves, and benthic flora.
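The anomaly detection idea can be illustrated by scoring each frame's bag-of-visual-words histogram under a fitted topic model and flagging low-likelihood frames; the sketch below substitutes scikit-learn's parametric LDA for the Bayesian non-parametric model used in the paper, and the data, threshold, and component count are assumptions.

```python
# Minimal sketch: score frames by their likelihood under a topic model fitted
# to bag-of-visual-words histograms; frames with unusually low likelihood are
# flagged as anomalies. sklearn's parametric LDA stands in here for the
# Bayesian non-parametric model used in the paper.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Fake bag-of-visual-words counts: 200 frames x 50 visual words.
frames = rng.poisson(lam=2.0, size=(200, 50))

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(frames)

# Per-frame approximate log-likelihood; low scores mark frames the model explains poorly.
scores = np.array([lda.score(f[None, :]) for f in frames])
threshold = scores.mean() - 2 * scores.std()
anomalies = np.where(scores < threshold)[0]
print(anomalies)
```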
Object recognition in unseen indoor environments remains a challenging problem for visual perception of mobile robots. In this letter, we propose the use of topologically persistent features, which rely on the objects' shape information, to address this challenge. In particular, we extract two kinds of features, namely, sparse persistence image (PI) and amplitude, by applying persistent homology to multi-directional height function-based filtrations of the cubical complexes representing the object segmentation maps. The features are then used to train a fully connected network for recognition. For performance evaluation, in addition to a widely used shape dataset and a benchmark indoor scenes dataset, we collect a new dataset comprising scene images from two different environments, namely, a living room and a mock warehouse. The scenes are captured using varying camera poses under different illumination conditions and include up to five different objects from a given set of fourteen objects. On the benchmark indoor scenes dataset, sparse PI features show better recognition performance in unseen environments than the features learned using the widely used ResNetV2-56 and EfficientNet-B4 models. Further, they provide slightly higher recall and accuracy values than Faster R-CNN, an end-to-end object detection method, and its state-of-the-art variant, Domain Adaptive Faster R-CNN. The performance of our methods also remains relatively unchanged from the training environment (living room) to the unseen environment (mock warehouse) in the new dataset. In contrast, the performance of the object detection methods drops substantially. We also implement the proposed method on a real-world robot to demonstrate its usefulness.
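The directional height-function filtration underlying these features can be sketched as follows; this is a toy illustration, not the authors' code, and the persistence diagram of the resulting cubical complex (from which the sparse PI and amplitude features are derived) would subsequently be computed with a TDA library such as giotto-tda or GUDHI.

```python
# Minimal sketch: build a directional height-function filtration of a binary
# object segmentation map; persistence would then be computed with a TDA library.
import numpy as np

def height_filtration(mask, direction):
    """Assign each object pixel its height along `direction`; background -> inf.

    mask: binary HxW array (1 = object), direction: 2-vector (dx, dy).
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    heights = xs * direction[0] + ys * direction[1]
    filtration = np.where(mask > 0, heights, np.inf)
    # Shift so the lowest object pixel enters the filtration at 0.
    finite = filtration[np.isfinite(filtration)]
    return filtration - finite.min() if finite.size else filtration

toy_mask = np.zeros((8, 8), dtype=int)
toy_mask[2:6, 3:6] = 1
print(height_filtration(toy_mask, direction=(1.0, 0.0)))
```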
One of the major challenges for autonomous vehicles in urban environments is to understand and predict other road users' actions, in particular, those of pedestrians at the point of crossing. The common approach to solving this problem is to use the motion history of the agents to predict their future trajectories. However, pedestrians exhibit highly variable actions, most of which cannot be understood without visual observation of the pedestrians themselves and their surroundings. To this end, we propose a solution to the problem of pedestrian action anticipation at the point of crossing. Our approach uses a novel stacked RNN architecture in which information collected from various sources, both scene dynamics and visual features, is gradually fused into the network at different levels of processing. We show, via extensive empirical evaluations, that the proposed algorithm achieves higher prediction accuracy than alternative recurrent network architectures. We also conduct experiments to investigate the impact of the length of observation, time to event, and types of features on the performance of the proposed method. Finally, we demonstrate how different data fusion strategies affect prediction accuracy.
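The gradual fusion idea can be sketched as a stack of recurrent layers, each receiving the previous level's output concatenated with one additional feature stream; the sketch below is a loose illustration, and the GRU cells, the order of the streams, and all dimensions are assumptions.

```python
# Minimal sketch of gradual fusion in a stacked RNN: each GRU level receives
# the previous level's output concatenated with one additional feature stream
# (dimensions and the order of streams are assumptions).
import torch
import torch.nn as nn

class StackedFusionRNN(nn.Module):
    def __init__(self, stream_dims=(16, 32, 512), hidden_dim=128, num_classes=2):
        super().__init__()
        layers, in_dim = [], 0
        for d in stream_dims:           # e.g. trajectory, scene dynamics, visual features
            layers.append(nn.GRU(in_dim + d, hidden_dim, batch_first=True))
            in_dim = hidden_dim
        self.levels = nn.ModuleList(layers)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, streams):
        # streams: list of tensors, each (batch, time, dim_i), one per level.
        x = streams[0]
        for level, gru in enumerate(self.levels):
            if level > 0:
                x = torch.cat([x, streams[level]], dim=-1)
            x, _ = gru(x)
        return self.classifier(x[:, -1])   # e.g. crossing vs. not crossing

model = StackedFusionRNN()
feats = [torch.randn(4, 15, d) for d in (16, 32, 512)]
print(model(feats).shape)  # torch.Size([4, 2])
```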