FuSSI-Net: Fusion of Spatio-temporal Skeletons for Intention Prediction Network

220 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Sohini Roychowdhury

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Francesco Piccoli - Rajarathnam Balakrishnan - Maria Jesus Perez

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي معالجة الصور والفيديو

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Pedestrian intention recognition is very important to develop robust and safe autonomous driving (AD) and advanced driver assistance systems (ADAS) functionalities for urban driving. In this work, we develop an end-to-end pedestrian intention framework that performs well on day- and night- time scenarios. Our framework relies on objection detection bounding boxes combined with skeletal features of human pose. We study early, late, and combined (early and late) fusion mechanisms to exploit the skeletal features and reduce false positives as well to improve the intention prediction performance. The early fusion mechanism results in AP of 0.89 and precision/recall of 0.79/0.89 for pedestrian intention classification. Furthermore, we propose three new metrics to properly evaluate the pedestrian intention systems. Under these new evaluation metrics for the intention prediction, the proposed end-to-end network offers accurate pedestrian intention up to half a second ahead of the actual risky maneuver.

قيم البحث

347 - Junting Pan , Siyu Chen , Mike Zheng Shou 2020

Localizing persons and recognizing their actions from videos is a challenging task towards high-level video understanding. Recent advances have been achieved by modeling direct pairwise relations between entities. In this paper, we take one step furt her, not only model direct relations between pairs but also take into account indirect higher-order relations established upon multiple elements. We propose to explicitly model the Actor-Context-Actor Relation, which is the relation between two actors based on their interactions with the context. To this end, we design an Actor-Context-Actor Relation Network (ACAR-Net) which builds upon a novel High-order Relation Reasoning Operator and an Actor-Context Feature Bank to enable indirect relation reasoning for spatio-temporal action localization. Experiments on AVA and UCF101-24 datasets show the advantages of modeling actor-context-actor relations, and visualization of attention maps further verifies that our model is capable of finding relevant higher-order relations to support action detection. Notably, our method ranks first in the AVA-Kineticsaction localization task of ActivityNet Challenge 2020, out-performing other entries by a significant margin (+6.71mAP). Training code and models will be available at https://github.com/Siyu-C/ACAR-Net.

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي معالجة الصور والفيديو

Spatio-Temporal Self-Attention Network for Video Saliency Prediction

169 - Ziqiang Wang , Zhi Liu , Gongyang Li 2021

3D convolutional neural networks have achieved promising results for video tasks in computer vision, including video saliency prediction that is explored in this paper. However, 3D convolution encodes visual representation merely on fixed local space time according to its kernel size, while human attention is always attracted by relational visual features at different time of a video. To overcome this limitation, we propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction, in which multiple Spatio-Temporal Self-Attention (STSA) modules are employed at different levels of 3D convolutional backbone to directly capture long-range relations between spatio-temporal features of different time steps. Besides, we propose an Attentional Multi-Scale Fusion (AMSF) module to integrate multi-level features with the perception of context in semantic and spatio-temporal subspaces. Extensive experiments demonstrate the contributions of key components of our method, and the results on DHF1K, Hollywood-2, UCF, and DIEM benchmark datasets clearly prove the superiority of the proposed model compared with all state-of-the-art models.

الرؤية الحاسوبية وتمييز الأنماط

Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction

218 - Cunjun Yu , Xiao Ma , Jiawei Ren 2020

Understanding crowd motion dynamics is critical to real-world applications, e.g., surveillance systems and autonomous driving. This is challenging because it requires effectively modeling the socially aware crowd spatial interaction and complex tempo ral dependencies. We believe attention is the most important factor for trajectory prediction. In this paper, we present STAR, a Spatio-Temporal grAph tRansformer framework, which tackles trajectory prediction by only attention mechanisms. STAR models intra-graph crowd interaction by TGConv, a novel Transformer-based graph convolution mechanism. The inter-graph temporal dependencies are modeled by separate temporal Transformers. STAR captures complex spatio-temporal interactions by interleaving between spatial and temporal Transformers. To calibrate the temporal prediction for the long-lasting effect of disappeared pedestrians, we introduce a read-writable external memory module, consistently being updated by the temporal Transformer. We show that with only attention mechanism, STAR achieves state-of-the-art performance on 5 commonly used real-world pedestrian prediction datasets.

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي علم الروبوتات

Exploiting Event Cameras for Spatio-Temporal Prediction of Fast-Changing Trajectories

84 - Marco Monforte , Ander Arriandiaga , Arren Glover 2020

This paper investigates trajectory prediction for robotics, to improve the interaction of robots with moving targets, such as catching a bouncing ball. Unexpected, highly-non-linear trajectories cannot easily be predicted with regression-based fittin g procedures, therefore we apply state of the art machine learning, specifically based on Long-Short Term Memory (LSTM) architectures. In addition, fast moving targets are better sensed using event cameras, which produce an asynchronous output triggered by spatial change, rather than at fixed temporal intervals as with traditional cameras. We investigate how LSTM models can be adapted for event camera data, and in particular look at the benefit of using asynchronously sampled data.

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي علم الروبوتات

STINet: Spatio-Temporal-Interactive Network for Pedestrian Detection and Trajectory Prediction

129 - Zhishuai Zhang , Jiyang Gao , Junhua Mao 2020

Detecting pedestrians and predicting future trajectories for them are critical tasks for numerous applications, such as autonomous driving. Previous methods either treat the detection and prediction as separate tasks or simply add a trajectory regres sion head on top of a detector. In this work, we present a novel end-to-end two-stage network: Spatio-Temporal-Interactive Network (STINet). In addition to 3D geometry modeling of pedestrians, we model the temporal information for each of the pedestrians. To do so, our method predicts both current and past locations in the first stage, so that each pedestrian can be linked across frames and the comprehensive spatio-temporal information can be captured in the second stage. Also, we model the interaction among objects with an interaction graph, to gather the information among the neighboring objects. Comprehensive experiments on the Lyft Dataset and the recently released large-scale Waymo Open Dataset for both object detection and future trajectory prediction validate the effectiveness of the proposed method. For the Waymo Open Dataset, we achieve a bird-eyes-view (BEV) detection AP of 80.73 and trajectory prediction average displacement error (ADE) of 33.67cm for pedestrians, which establish the state-of-the-art for both tasks.

الرؤية الحاسوبية وتمييز الأنماط