No Arabic abstract
Deploying deep learning models on embedded systems has been challenging due to limited computing resources. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, such as object detection, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects, and therefore, require specialized convolutions to aggregate spatial information. To address this need, recent work introduces dynamic deformable convolution to augment regular convolutions. However, this will lead to inefficient memory accesses of inputs with existing hardware. In this work, we harness the flexibility of FPGAs to develop a novel object detection pipeline with deformable convolutions. We show the speed-accuracy tradeoffs for a set of algorithm modifications including irregular-access versus limited-range and fixed-shape. We then Co-Design a Network CoDeNet with the modified deformable convolution and quantize it to 4-bit weights and 8-bit activations. With our high-efficiency implementation, our solution reaches 26.9 frames per second with a tiny model size of 0.76 MB while achieving 61.7 AP50 on the standard object detection dataset, Pascal VOC. With our higher accuracy implementation, our model gets to 67.1 AP50 on Pascal VOC with only 2.9 MB of parameters-20.9x smaller but 10% more accurate than Tiny-YOLO.
Deep learning (DL) based object detection has achieved great progress. These methods typically assume that large amount of labeled training data is available, and training and test data are drawn from an identical distribution. However, the two assumptions are not always hold in practice. Deep domain adaptive object detection (DDAOD) has emerged as a new learning paradigm to address the above mentioned challenges. This paper aims to review the state-of-the-art progress on deep domain adaptive object detection approaches. Firstly, we introduce briefly the basic concepts of deep domain adaptation. Secondly, the deep domain adaptive detectors are classified into five categories and detailed descriptions of representative methods in each category are provided. Finally, insights for future research trend are presented.
To achieve lightweight object detectors for deployment on the edge devices, an effective model compression pipeline is proposed in this paper. The compression pipeline consists of automatic channel pruning for the backbone, fixed channel deletion for the branch layers and knowledge distillation for the guidance learning. As results, the Resnet50-v1d is auto-pruned and fine-tuned on ImageNet to attain a compact base model as the backbone of object detector. Then, lightweight object detectors are implemented with proposed compression pipeline. For instance, the SSD-300 with model size=16.3MB, FLOPS=2.31G, and mAP=71.2 is created, revealing a better result than SSD-300-MobileNet.
In this paper we present a new approach for efficient regression based object tracking which we refer to as Deep- LK. Our approach is closely related to the Generic Object Tracking Using Regression Networks (GOTURN) framework of Held et al. We make the following contributions. First, we demonstrate that there is a theoretical relationship between siamese regression networks like GOTURN and the classical Inverse-Compositional Lucas & Kanade (IC-LK) algorithm. Further, we demonstrate that unlike GOTURN IC-LK adapts its regressor to the appearance of the currently tracked frame. We argue that this missing property in GOTURN can be attributed to its poor performance on unseen objects and/or viewpoints. Second, we propose a novel framework for object tracking - which we refer to as Deep-LK - that is inspired by the IC-LK framework. Finally, we show impressive results demonstrating that Deep-LK substantially outperforms GOTURN. Additionally, we demonstrate comparable tracking performance to current state of the art deep-trackers whilst being an order of magnitude (i.e. 100 FPS) computationally efficient.
In this paper, we address the problem of dynamic scene deblurring in the presence of motion blur. Restoration of images affected by severe blur necessitates a network design with a large receptive field, which existing networks attempt to achieve through simple increment in the number of generic convolution layers, kernel-size, or the scales at which the image is processed. However, these techniques ignore the non-uniform nature of blur, and they come at the expense of an increase in model size and inference time. We present a new architecture composed of region adaptive dense deformable modules that implicitly discover the spatially varying shifts responsible for non-uniform blur in the input image and learn to modulate the filters. This capability is complemented by a self-attentive module which captures non-local spatial relationships among the intermediate features and enhances the spatially-varying processing capability. We incorporate these modules into a densely connected encoder-decoder design which utilizes pre-trained Densenet filters to further improve the performance. Our network facilitates interpretable modeling of the spatially-varying deblurring process while dispensing with multi-scale processing and large filters entirely. Extensive comparisons with prior art on benchmark dynamic scene deblurring datasets clearly demonstrate the superiority of the proposed networks via significant improvements in accuracy and speed, enabling almost real-time deblurring.
We propose a universal and physically realizable adversarial attack on a cascaded multi-modal deep learning network (DNN), in the context of self-driving cars. DNNs have achieved high performance in 3D object detection, but they are known to be vulnerable to adversarial attacks. These attacks have been heavily investigated in the RGB image domain and more recently in the point cloud domain, but rarely in both domains simultaneously - a gap to be filled in this paper. We use a single 3D mesh and differentiable rendering to explore how perturbing the meshs geometry and texture can reduce the robustness of DNNs to adversarial attacks. We attack a prominent cascaded multi-modal DNN, the Frustum-Pointnet model. Using the popular KITTI benchmark, we showed that the proposed universal multi-modal attack was successful in reducing the models ability to detect a car by nearly 73%. This work can aid in the understanding of what the cascaded RGB-point cloud DNN learns and its vulnerability to adversarial attacks.