ﻻ يوجد ملخص باللغة العربية
3D multi-object tracking is an important component in robotic perception systems such as self-driving vehicles. Recent work follows a tracking-by-detection pipeline, which aims to match past tracklets with detections in the current frame. To avoid matching with false positive detections, prior work filters out detections with low confidence scores via a threshold. However, finding a proper threshold is non-trivial, which requires extensive manual search via ablation study. Also, this threshold is sensitive to many factors such as target object category so we need to re-search the threshold if these factors change. To ease this process, we propose to automatically select high-quality detections and remove the efforts needed for manual threshold search. Also, prior work often uses a single threshold per data sequence, which is sub-optimal in particular frames or for certain objects. Instead, we dynamically search threshold per frame or per object to further boost performance. Through experiments on KITTI and nuScenes, our method can filter out $45.7%$ false positives while maintaining the recall, achieving new S.O.T.A. performance and removing the need for manually threshold tuning.
Point clouds and RGB images are naturally complementary modalities for 3D visual understanding - the former provides sparse but accurate locations of points on objects, while the latter contains dense color and texture information. Despite this poten
The main challenge of online multi-object tracking is to reliably associate object trajectories with detections in each video frame based on their tracking history. In this work, we propose the Recurrent Autoregressive Network (RAN), a temporal gener
It is counter-intuitive that multi-modality methods based on point cloud and images perform only marginally better or sometimes worse than approaches that solely use point cloud. This paper investigates the reason behind this phenomenon. Due to the f
The ability to simultaneously track and reconstruct multiple objects moving in the scene is of the utmost importance for robotic tasks such as autonomous navigation and interaction. Virtually all of the previous attempts to map multiple dynamic objec
To reduce annotation labor associated with object detection, an increasing number of studies focus on transferring the learned knowledge from a labeled source domain to another unlabeled target domain. However, existing methods assume that the labele