ترغب بنشر مسار تعليمي؟ اضغط هنا

Rethinking Temporal Object Detection from Robotic Perspectives

65   0   0.0 ( 0 )
 نشر من قبل Xingyu Chen
 تاريخ النشر 2019
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Video object detection (VID) has been vigorously studied for years but almost all literature adopts a static accuracy-based evaluation, i.e., average precision (AP). From a robotic perspective, the importance of recall continuity and localization stability is equal to that of accuracy, but the AP is insufficient to reflect detectors performance across time. In this paper, non-reference assessments are proposed for continuity and stability based on object tracklets. These temporal evaluations can serve as supplements to static AP. Further, we develop an online tracklet refinement for improving detectors temporal performance through short tracklet suppression, fragment filling, and temporal location fusion. In addition, we propose a small-overlap suppression to extend VID methods to single object tracking (SOT) task so that a flexible SOT-by-detection framework is then formed. Extensive experiments are conducted on ImageNet VID dataset and real-world robotic tasks, where the superiority of our proposed approaches are validated and verified. Codes will be publicly available.



قيم البحث

اقرأ أيضاً

The convention standard for object detection uses a bounding box to represent each individual object instance. However, it is not practical in the industry-relevant applications in the context of warehouses due to severe occlusions among groups of in stances of the same categories. In this paper, we propose a new task, ie, simultaneously object localization and counting, abbreviated as Locount, which requires algorithms to localize groups of objects of interest with the number of instances. However, there does not exist a dataset or benchmark designed for such a task. To this end, we collect a large-scale object localization and counting dataset with rich annotations in retail stores, which consists of 50,394 images with more than 1.9 million object instances in 140 categories. Together with this dataset, we provide a new evaluation protocol and divide the training and testing subsets to fairly evaluate the performance of algorithms for Locount, developing a new benchmark for the Locount task. Moreover, we present a cascaded localization and counting network as a strong baseline, which gradually classifies and regresses the bounding boxes of objects with the predicted numbers of instances enclosed in the bounding boxes, trained in an end-to-end manner. Extensive experiments are conducted on the proposed dataset to demonstrate its significance and the analysis discussions on failure cases are provided to indicate future directions. Dataset is available at https://isrc.iscas.ac.cn/gitlab/research/locount-dataset.
Many modern robotics systems employ LiDAR as their main sensing modality due to its geometrical richness. Rolling shutter LiDARs are particularly common, in which an array of lasers scans the scene from a rotating base. Points are emitted as a stream of packets, each covering a sector of the 360{deg} coverage. Modern perception algorithms wait for the full sweep to be built before processing the data, which introduces an additional latency. For typical 10Hz LiDARs this will be 100ms. As a consequence, by the time an output is produced, it no longer accurately reflects the state of the world. This poses a challenge, as robotics applications require minimal reaction times, such that maneuvers can be quickly planned in the event of a safety-critical situation. In this paper we propose StrObe, a novel approach that minimizes latency by ingesting LiDAR packets and emitting a stream of detections without waiting for the full sweep to be built. StrObe reuses computations from previous packets and iteratively updates a latent spatial representation of the scene, which acts as a memory, as new evidence comes in, resulting in accurate low-latency perception. We demonstrate the effectiveness of our approach on a large scale real-world dataset, showing that StrObe far outperforms the state-of-the-art when latency is taken into account, and matches the performance in the traditional setting.
Object classification with 3D data is an essential component of any scene understanding method. It has gained significant interest in a variety of communities, most notably in robotics and computer graphics. While the advent of deep learning has prog ressed the field of 3D object classification, most work using this data type are solely evaluated on CAD model datasets. Consequently, current work does not address the discrepancies existing between real and artificial data. In this work, we examine this gap in a robotic context by specifically addressing the problem of classification when transferring from artificial CAD models to real reconstructed objects. This is performed by training on ModelNet (CAD models) and evaluating on ScanNet (reconstructed objects). We show that standard methods do not perform well in this task. We thus introduce a method that carefully samples object parts that are reproducible under various transformations and hence robust. Using graph convolution to classify the composed graph of parts, our method significantly improves upon the baseline.
DETR is a recently proposed Transformer-based method which views object detection as a set prediction problem and achieves state-of-the-art performance but demands extra-long training time to converge. In this paper, we investigate the causes of the optimization difficulty in the training of DETR. Our examinations reveal several factors contributing to the slow convergence of DETR, primarily the issues with the Hungarian loss and the Transformer cross attention mechanism. To overcome these issues we propose two solutions, namely, TSP-FCOS (Transformer-based Set Prediction with FCOS) and TSP-RCNN (Transformer-based Set Prediction with RCNN). Experimental results show that the proposed methods not only converge much faster than the original DETR, but also significantly outperform DETR and other baselines in terms of detection accuracy.
Recent advances in semi-supervised object detection (SSOD) are largely driven by consistency-based pseudo-labeling methods for image classification tasks, producing pseudo labels as supervisory signals. However, when using pseudo labels, there is a l ack of consideration in localization precision and amplified class imbalance, both of which are critical for detection tasks. In this paper, we introduce certainty-aware pseudo labels tailored for object detection, which can effectively estimate the classification and localization quality of derived pseudo labels. This is achieved by converting conventional localization as a classification task followed by refinement. Conditioned on classification and localization quality scores, we dynamically adjust the thresholds used to generate pseudo labels and reweight loss functions for each category to alleviate the class imbalance problem. Extensive experiments demonstrate that our method improves state-of-the-art SSOD performance by 1-2% and 4-6% AP on COCO and PASCAL VOC, respectively. In the limited-annotation regime, our approach improves supervised baselines by up to 10% AP using only 1-10% labeled data from COCO.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا