
Semantic tracking: Single-target tracking with inter-supervised convolutional networks

Added by Jingjing Xiao
Publication date: 2016
Language: English





This article presents a semantic tracker which simultaneously tracks a single target and recognises its category. In general, it is hard to design a tracking model suitable for all object categories; e.g., a rigid tracker for a car is not suitable for a deformable gymnast. Category-based trackers usually achieve superior tracking performance for objects of their specific category, but are difficult to generalise. We therefore propose a novel unified robust tracking framework which explicitly encodes both generic features and category-based features. The tracker consists of a shared convolutional network (NetS), which feeds into two parallel networks: NetC for classification and NetT for tracking. NetS is pre-trained on ImageNet to serve as a generic feature extractor across the different object categories for NetC and NetT. NetC utilises those features within fully connected layers to classify the object category. NetT has multiple branches, corresponding to multiple categories, to distinguish the tracked object from the background. Since each branch in NetT is trained on videos of a specific category or group of similar categories, NetT encodes category-based features for tracking. During online tracking, NetC and NetT jointly determine the target regions with the correct category and foreground labels for target estimation. To improve robustness and precision, NetC and NetT inter-supervise each other and trigger network adaptation when their outputs are ambiguous for the same image regions (i.e., when the category label contradicts the foreground/background classification). We have compared the performance of our tracker against other state-of-the-art trackers on a large-scale tracking benchmark (100 sequences); the results demonstrate the effectiveness of the proposed tracker, which outperformed 38 other state-of-the-art tracking algorithms.
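For readers who want a concrete picture of the NetS/NetC/NetT layout described above, here is a minimal PyTorch sketch. The layer sizes, the number of category branches, and the branch-selection logic are illustrative assumptions; only the overall structure (a shared backbone feeding a classifier and per-category foreground/background branches) follows the abstract.

```python
import torch
import torch.nn as nn

class SemanticTracker(nn.Module):
    def __init__(self, num_categories=7, feat_dim=512):
        super().__init__()
        # NetS: shared convolutional feature extractor (pre-trained on
        # ImageNet in the paper; randomly initialised here for brevity).
        self.net_s = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )
        # NetC: fully connected classifier over object categories.
        self.net_c = nn.Linear(feat_dim, num_categories)
        # NetT: one foreground/background branch per category
        # (or per group of similar categories).
        self.net_t = nn.ModuleList(
            nn.Linear(feat_dim, 2) for _ in range(num_categories)
        )

    def forward(self, patches):
        feat = self.net_s(patches)          # generic features
        cat_logits = self.net_c(feat)       # category scores
        branch = cat_logits.argmax(dim=1)   # pick a NetT branch per patch
        fg_logits = torch.stack(
            [self.net_t[b](f) for b, f in zip(branch.tolist(), feat)]
        )                                   # foreground/background scores
        # Inter-supervision cue: a patch whose category label and
        # foreground label disagree would trigger network adaptation.
        return cat_logits, fg_logits
```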



Related research

In this paper, we study a discriminatively trained deep convolutional network for the task of visual tracking. Our tracker utilizes both motion and appearance features extracted from a pre-trained dual-stream deep convolutional network. We show that the features extracted from our dual-stream network provide rich information about the target, which leads to competitive performance against state-of-the-art tracking methods on a visual tracking benchmark.
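A hedged sketch of the dual-stream idea: one small CNN over an appearance (RGB) patch and one over a motion input (e.g. stacked optical flow), with the two feature vectors concatenated for downstream tracking. The architectures and input shapes here are assumptions, not the paper's exact networks.

```python
import torch
import torch.nn as nn

def make_stream(in_ch):
    # One lightweight convolutional stream ending in a feature vector.
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

appearance = make_stream(3)   # RGB patch
motion = make_stream(2)       # x/y optical-flow channels

rgb = torch.randn(1, 3, 64, 64)
flow = torch.randn(1, 2, 64, 64)
# Concatenated appearance + motion feature for the tracker: (1, 128).
features = torch.cat([appearance(rgb), motion(flow)], dim=1)
```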
Accurate tracking is still a challenging task due to appearance variations, pose and view changes, and geometric deformations of the target in videos. Recent anchor-free trackers provide an efficient regression mechanism but fail to produce precise bounding box estimates. To address these issues, this paper repurposes a Transformer-like regression branch, termed Target Transformed Regression (TREG), for accurate anchor-free tracking. The core of TREG is to model the pairwise relations between elements in the target template and the search region, and to use the resulting target-enhanced visual representation for accurate bounding box regression. This target-contextualized representation enhances target-relevant information to help precisely locate the box boundaries, and deals with object deformation to some extent thanks to its local and dense matching mechanism. In addition, we devise a simple online template update mechanism to select reliable templates, increasing robustness to appearance variations and geometric deformations of the target over time. Experimental results on visual tracking benchmarks including VOT2018, VOT2019, OTB100, GOT10k, NFS, UAV123, LaSOT and TrackingNet demonstrate that TREG obtains state-of-the-art performance, achieving a success rate of 0.640 on LaSOT while running at around 30 FPS. The code and models will be made available at https://github.com/MCG-NJU/TREG.
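The dense template-to-search matching at the heart of TREG can be approximated with standard cross-attention. The sketch below is a simplified stand-in: the use of nn.MultiheadAttention, the feature dimensions, and the (l, t, r, b) regression head are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

dim = 256
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
reg_head = nn.Linear(dim, 4)  # per-location box offsets (l, t, r, b)

search = torch.randn(1, 32 * 32, dim)   # flattened search-region features
template = torch.randn(1, 8 * 8, dim)   # flattened template features

# Each search location attends to all template elements, yielding a
# target-enhanced representation used for anchor-free box regression.
enhanced, _ = attn(query=search, key=template, value=template)
boxes = reg_head(enhanced)              # (1, 1024, 4)
```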
Device-free human tracking is an essential ingredient of ubiquitous wireless sensing. Recent passive WiFi tracking systems face the challenges of inaccurately separating dynamic human components and time-consuming estimation of multi-dimensional signal parameters. In this work, we present a scheme named WiFi Doppler Frequency Shift (WiDFS), which achieves single-target real-time passive tracking using channel state information (CSI) collected from commercial-off-the-shelf (COTS) WiFi devices. We consider the typical system setup of a transmitter with a single antenna and a receiver with three antennas, although our scheme can readily be extended to other setups. To remove the impact of transceiver asynchrony, we first apply CSI cross-correlation between each RX antenna pair. We then combine the results to estimate a Doppler frequency shift (DFS) in a short time window. After that, we leverage the DFS estimate to separate dynamic human components from the CSI self-correlation terms of each antenna, thereby separately calculating the angle-of-arrival (AoA) and the human reflection distance for tracking. In addition, a hardware calibration algorithm is presented to refine the spacing between RX antennas and eliminate the hardware-related phase differences between them. A prototype demonstrates that WiDFS can achieve real-time tracking with a median position error of 72.32 cm in multipath-rich environments.
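To illustrate the cross-correlation step on synthetic data: multiplying the CSI of one RX antenna by the conjugate of another cancels the phase offset the transceivers share, leaving a series whose dominant frequency in a short window is the DFS. The sampling rate, signal model, and window length below are assumptions for illustration only.

```python
import numpy as np

fs = 1000                       # assumed CSI sampling rate (Hz)
t = np.arange(fs) / fs          # one-second window
offset = np.exp(1j * 2 * np.pi * 40 * t)           # shared sync offset
csi_a = offset * np.exp(1j * 2 * np.pi * 12 * t)   # antenna A: 12 Hz DFS
csi_b = offset * 1.0                               # antenna B: static path

# Per-sample cross-correlation removes the common offset term.
xcorr = csi_a * np.conj(csi_b)

spectrum = np.fft.fftshift(np.fft.fft(xcorr))
freqs = np.fft.fftshift(np.fft.fftfreq(len(xcorr), d=1 / fs))
dfs = freqs[np.argmax(np.abs(spectrum))]
print(f"estimated DFS: {dfs:.1f} Hz")   # ~12 Hz
```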
Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that can leverage unlabeled audio-visual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time. Since it is labor-intensive to manually annotate the correspondences between audio and object bounding boxes, we achieve this goal by using the co-occurrence of visual and audio streams in unlabeled videos as a form of self-supervision, without resorting to the collection of ground-truth annotations. In particular, we propose a framework that consists of a vision teacher network and a stereo-sound student network. During training, knowledge embodied in a well-established visual vehicle detection model is transferred to the audio domain using unlabeled videos as a bridge. At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera meta-data, without any visual input. Experimental results on a newly collected Auditory Vehicle Tracking dataset verify that our proposed approach outperforms several baseline approaches. We also demonstrate that our cross-modal auditory localization approach can assist in the visual localization of moving vehicles under poor lighting conditions.
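The teacher-student transfer reduces to a simple distillation loss: a frozen vision model produces localization targets from video frames, and the audio network is trained to match them. Both networks below are trivial stand-ins; only the training pattern reflects the abstract.

```python
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 4)).eval()
student = nn.Sequential(nn.Flatten(), nn.Linear(2 * 9600, 4))  # stereo audio

frames = torch.randn(8, 3, 64, 64)   # unlabeled video frames
audio = torch.randn(8, 2, 9600)      # time-aligned stereo clips

with torch.no_grad():                # vision teacher supplies pseudo-labels
    target_boxes = teacher(frames)
loss = nn.functional.mse_loss(student(audio), target_boxes)
loss.backward()                      # gradients flow to the student only
```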
We describe an end-to-end framework for learning the parameters of a min-cost flow multi-target tracking problem with quadratic trajectory interactions, including suppression of overlapping tracks and contextual cues about the co-occurrence of different objects. Our approach utilizes structured prediction with a tracking-specific loss function to learn the complete set of model parameters. Within this learning framework, we evaluate two different approaches to finding an optimal set of tracks under a quadratic model objective, one based on an LP relaxation and the other based on novel greedy variants of dynamic programming that handle pairwise interactions. We find that the greedy algorithms achieve almost equivalent accuracy to the LP relaxation while being up to 10x faster than a commercial LP solver. We evaluate the trained models on three challenging benchmarks. Surprisingly, we find that with proper parameter learning, our simple data association model without explicit appearance/motion reasoning is able to achieve comparable or better accuracy than many state-of-the-art methods that use far more complex motion features or appearance affinity metric learning.
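A toy sketch of the greedy idea under a quadratic objective: repeatedly commit the candidate track with the highest current score, then fold that track's pairwise terms (overlap suppression, co-occurrence boosts) into the scores of the remaining candidates. The scores and interaction matrix are made up for illustration.

```python
import numpy as np

unary = np.array([3.0, 2.5, 2.0, 1.0])   # per-track scores
pairwise = np.array([                    # interaction terms
    [0.0, -3.0, 0.0, 0.0],               # tracks 0 and 1 overlap (suppress)
    [-3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],                # tracks 2 and 3 co-occur (boost)
    [0.0, 0.0, 0.5, 0.0],
])

selected, scores = [], unary.copy()
while scores.max() > 0:
    i = int(scores.argmax())
    selected.append(i)
    scores += pairwise[i]                # propagate pairwise interactions
    scores[i] = -np.inf                  # never pick i again
print(selected)                          # [0, 2, 3]: track 1 is suppressed
```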
