Do you want to publish a course? Click here

STA-VPR: Spatio-temporal Alignment for Visual Place Recognition

78   0   0.0 ( 0 )
 Added by Feng Lu
 Publication date 2021
and research's language is English




Ask ChatGPT about the research

Recently, the methods based on Convolutional Neural Networks (CNNs) have gained popularity in the field of visual place recognition (VPR). In particular, the features from the middle layers of CNNs are more robust to drastic appearance changes than handcrafted features and high-layer features. Unfortunately, the holistic mid-layer features lack robustness to large viewpoint changes. Here we split the holistic mid-layer features into local features, and propose an adaptive dynamic time warping (DTW) algorithm to align local features from the spatial domain while measuring the distance between two images. This realizes viewpoint-invariant and condition-invariant place recognition. Meanwhile, a local matching DTW (LM-DTW) algorithm is applied to perform image sequence matching based on temporal alignment, which achieves further improvements and ensures linear time complexity. We perform extensive experiments on five representative VPR datasets. The results show that the proposed method significantly improves the CNN-based methods. Moreover, our method outperforms several state-of-the-art methods while maintaining good run-time performance. This work provides a novel way to boost the performance of CNN methods without any re-training for VPR. The code is available at https://github.com/Lu-Feng/STA-VPR.



rate research

Read More

104 - Zhe Xin , Yinghao Cai , Tao Lu 2019
We address the problem of visual place recognition with perceptual changes. The fundamental problem of visual place recognition is generating robust image representations which are not only insensitive to environmental changes but also distinguishable to different places. Taking advantage of the feature extraction ability of Convolutional Neural Networks (CNNs), we further investigate how to localize discriminative visual landmarks that positively contribute to the similarity measurement, such as buildings and vegetations. In particular, a Landmark Localization Network (LLN) is designed to indicate which regions of an image are used for discrimination. Detailed experiments are conducted on open source datasets with varied appearance and viewpoint changes. The proposed approach achieves superior performance against state-of-the-art methods.
This paper focuses on two key problems for audio-visual emotion recognition in the video. One is the audio and visual streams temporal alignment for feature level fusion. The other one is locating and re-weighting the perception attentions in the whole audio-visual stream for better recognition. The Long Short Term Memory Recurrent Neural Network (LSTM-RNN) is employed as the main classification architecture. Firstly, soft attention mechanism aligns the audio and visual streams. Secondly, seven emotion embedding vectors, which are corresponding to each classification emotion type, are added to locate the perception attentions. The locating and re-weighting process is also based on the soft attention mechanism. The experiment results on EmotiW2015 dataset and the qualitative analysis show the efficiency of the proposed two techniques.
In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, the prediction of objects just uses a simple fully-convolutional network, which estimates the corners of objects directly. The whole method is end-to-end, does not need any postprocessing steps such as cosine window and bounding box smoothing, thus largely simplifying existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks, while running at real-time speed, being 6x faster than Siam R-CNN. Code and models are open-sourced at https://github.com/researchmm/Stark.
Place recognition is indispensable for a drift-free localization system. Due to the variations of the environment, place recognition using single-modality has limitations. In this paper, we propose a bi-modal place recognition method, which can extract a compound global descriptor from the two modalities, vision and LiDAR. Specifically, we first build the elevation image generated from 3D points as a structural representation. Then, we derive the correspondences between 3D points and image pixels that are further used in merging the pixel-wise visual features into the elevation map grids. In this way, we fuse the structural features and visual features in the consistent bird-eye view frame, yielding a semantic representation, namely CORAL. And the whole network is called CORAL-VLAD. Comparisons on the Oxford RobotCar show that CORAL-VLAD has superior performance against other state-of-the-art methods. We also demonstrate that our network can be generalized to other scenes and sensor configurations on cross-city datasets.
We propose a methodology for robust, real-time place recognition using an imaging lidar, which yields image-quality high-resolution 3D point clouds. Utilizing the intensity readings of an imaging lidar, we project the point cloud and obtain an intensity image. ORB feature descriptors are extracted from the image and encoded into a bag-of-words vector. The vector, used to identify the point cloud, is inserted into a database that is maintained by DBoW for fast place recognition queries. The returned candidate is further validated by matching visual feature descriptors. To reject matching outliers, we apply PnP, which minimizes the reprojection error of visual features positions in Euclidean space with their correspondences in 2D image space, using RANSAC. Combining the advantages from both camera and lidar-based place recognition approaches, our method is truly rotation-invariant, and can tackle reverse revisiting and upside down revisiting. The proposed method is evaluated on datasets gathered from a variety of platforms over different scales and environments. Our implementation and datasets are available at https://git.io/image-lidar
comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا