No Arabic abstract
Siamese-based trackers have achieved excellent performance on visual object tracking. However, the target template is not updated online, and the features of the target template and search image are computed independently in a Siamese architecture. In this paper, we propose Deformable Siamese Attention Networks, referred to as SiamAttn, by introducing a new Siamese attention mechanism that computes deformable self-attention and cross-attention. The self attention learns strong context information via spatial attention, and selectively emphasizes interdependent channel-wise features with channel attention. The cross-attention is capable of aggregating rich contextual inter-dependencies between the target template and the search image, providing an implicit manner to adaptively update the target template. In addition, we design a region refinement module that computes depth-wise cross correlations between the attentional features for more accurate tracking. We conduct experiments on six benchmarks, where our method achieves new state of-the-art results, outperforming the strong baseline, SiamRPN++ [24], by 0.464->0.537 and 0.415->0.470 EAO on VOT 2016 and 2018. Our code is available at: https://github.com/msight-tech/research-siamattn.
The current Siamese network based on region proposal network (RPN) has attracted great attention in visual tracking due to its excellent accuracy and high efficiency. However, the design of the RPN involves the selection of the number, scale, and aspect ratios of anchor boxes, which will affect the applicability and convenience of the model. Furthermore, these anchor boxes require complicated calculations, such as calculating their intersection-over-union (IoU) with ground truth bounding boxes.Due to the problems related to anchor boxes, we propose a simple yet effective anchor-free tracker (named Siamese corner networks, SiamCorners), which is end-to-end trained offline on large-scale image pairs. Specifically, we introduce a modified corner pooling layer to convert the bounding box estimate of the target into a pair of corner predictions (the bottom-right and the top-left corners). By tracking a target as a pair of corners, we avoid the need to design the anchor boxes. This will make the entire tracking algorithm more flexible and simple than anchorbased trackers. In our network design, we further introduce a layer-wise feature aggregation strategy that enables the corner pooling module to predict multiple corners for a tracking target in deep networks. We then introduce a new penalty term that is used to select an optimal tracking box in these candidate corners. Finally, SiamCorners achieves experimental results that are comparable to the state-of-art tracker while maintaining a high running speed. In particular, SiamCorners achieves a 53.7% AUC on NFS30 and a 61.4% AUC on UAV123, while still running at 42 frames per second (FPS).
Recently spiking neural networks (SNNs), the third-generation of neural networks has shown remarkable capabilities of energy-efficient computing, which is a promising alternative for deep neural networks (DNNs) with high energy consumption. SNNs have reached competitive results compared to DNNs in relatively simple tasks and small datasets such as image classification and MNIST/CIFAR, while few studies on more challenging vision tasks on complex datasets. In this paper, we focus on extending deep SNNs to object tracking, a more advanced vision task with embedded applications and energy-saving requirements, and present a spike-based Siamese network called SiamSNN. Specifically, we propose an optimized hybrid similarity estimation method to exploit temporal information in the SNNs, and introduce a novel two-status coding scheme to optimize the temporal distribution of output spike trains for further improvements. SiamSNN is the first deep SNN tracker that achieves short latency and low precision loss on the visual object tracking benchmarks OTB2013/2015, VOT2016/2018, and GOT-10k. Moreover, SiamSNN achieves notably low energy consumption and real-time on Neuromorphic chip TrueNorth.
Most of the existing trackers usually rely on either a multi-scale searching scheme or pre-defined anchor boxes to accurately estimate the scale and aspect ratio of a target. Unfortunately, they typically call for tedious and heuristic configurations. To address this issue, we propose a simple yet effective visual tracking framework (named Siamese Box Adaptive Network, SiamBAN) by exploiting the expressive power of the fully convolutional network (FCN). SiamBAN views the visual tracking problem as a parallel classification and regression problem, and thus directly classifies objects and regresses their bounding boxes in a unified FCN. The no-prior box design avoids hyper-parameters associated with the candidate boxes, making SiamBAN more flexible and general. Extensive experiments on visual tracking benchmarks including VOT2018, VOT2019, OTB100, NFS, UAV123, and LaSOT demonstrate that SiamBAN achieves state-of-the-art performance and runs at 40 FPS, confirming its effectiveness and efficiency. The code will be available at https://github.com/hqucv/siamban.
In this paper, we focus on improving online multi-object tracking (MOT). In particular, we introduce a region-based Siamese Multi-Object Tracking network, which we name SiamMOT. SiamMOT includes a motion model that estimates the instances movement between two frames such that detected instances are associated. To explore how the motion modelling affects its tracking capability, we present two variants of Siamese tracker, one that implicitly models motion and one that models it explicitly. We carry out extensive quantitative experiments on three different MOT datasets: MOT17, TAO-person and Caltech Roadside Pedestrians, showing the importance of motion modelling for MOT and the ability of SiamMOT to substantially outperform the state-of-the-art. Finally, SiamMOT also outperforms the winners of ACM MM20 HiEve Grand Challenge on HiEve dataset. Moreover, SiamMOT is efficient, and it runs at 17 FPS for 720P videos on a single modern GPU. Codes are available in url{https://github.com/amazon-research/siam-mot}.
Visual tracking plays an important role in perception system, which is a crucial part of intelligent transportation. Recently, Siamese network is a hot topic for visual tracking to estimate moving targets trajectory, due to its superior accuracy and simple framework. In general, Siamese tracking algorithms, supervised by logistic loss and triplet loss, increase the value of inner product between exemplar template and positive sample while reduce the value of inner product with background sample. However, the distractors from different exemplars are not considered by mentioned loss functions, which limit the feature models discrimination. In this paper, a new exemplar loss integrated with logistic loss is proposed to enhance the feature models discrimination by reducing inner products among exemplars. Without the bells and whistles, the proposed algorithm outperforms the methods supervised by logistic loss or triplet loss. Numerical results suggest that the newly developed algorithm achieves comparable performance in public benchmarks.