ترغب بنشر مسار تعليمي؟ اضغط هنا

Motion Equivariance OF Event-based Camera Data with the Temporal Normalization Transform

62   0   0.0 ( 0 )
 نشر من قبل Ziyun Wang
 تاريخ النشر 2019
والبحث باللغة English
 تأليف Ziyun Wang




اسأل ChatGPT حول البحث

In this work, we focus on using convolution neural networks (CNN) to perform object recognition on the event data. In object recognition, it is important for a neural network to be robust to the variations of the data during testing. For traditional cameras, translations are well handled because CNNs are naturally equivariant to translations. However, because event cameras record the change of light intensity of an image, the geometric shape of event volumes will not only depend on the objects but also on their relative motions with respect to the camera. The deformation of the events caused by motions causes the CNN to be less robust to unseen motions during inference. To address this problem, we would like to explore the equivariance property of CNNs, a well-studied area that demonstrates to produce predictable deformation of features under certain transformations of the input image.



قيم البحث

اقرأ أيضاً

Event cameras, inspired by biological vision systems, provide a natural and data efficient representation of visual information. Visual information is acquired in the form of events that are triggered by local brightness changes. Each pixel location of the cameras sensor records events asynchronously and independently with very high temporal resolution. However, because most brightness changes are triggered by relative motion of the camera and the scene, the events recorded at a single sensor location seldom correspond to the same world point. To extract meaningful information from event cameras, it is helpful to register events that were triggered by the same underlying world point. In this work we propose a new model of event data that captures its natural spatio-temporal structure. We start by developing a model for aligned event data. That is, we develop a model for the data as though it has been perfectly registered already. In particular, we model the aligned data as a spatio-temporal Poisson point process. Based on this model, we develop a maximum likelihood approach to registering events that are not yet aligned. That is, we find transformations of the observed events that make them as likely as possible under our model. In particular we extract the camera rotation that leads to the best event alignment. We show new state of the art accuracy for rotational velocity estimation on the DAVIS 240C dataset. In addition, our method is also faster and has lower computational complexity than several competing methods.
The fundamental difficulty in person re-identification (ReID) lies in learning the correspondence among individual cameras. It strongly demands costly inter-camera annotations, yet the trained models are not guaranteed to transfer well to previously unseen cameras. These problems significantly limit the application of ReID. This paper rethinks the working mechanism of conventional ReID approaches and puts forward a new solution. With an effective operator named Camera-based Batch Normalization (CBN), we force the image data of all cameras to fall onto the same subspace, so that the distribution gap between any camera pair is largely shrunk. This alignment brings two benefits. First, the trained model enjoys better abilities to generalize across scenarios with unseen cameras as well as transfer across multiple training sets. Second, we can rely on intra-camera annotations, which have been undervalued before due to the lack of cross-camera information, to achieve competitive ReID performance. Experiments on a wide range of ReID tasks demonstrate the effectiveness of our approach. The code is available at https://github.com/automan000/Camera-based-Person-ReID.
Measuring the respiratory signal from a video based on body motion has been proposed and recently matured in products for video health monitoring. The core algorithm for this measurement is the estimation of tiny chest/abdominal motions induced by re spiration, and the fundamental challenge is motion sensitivity. Though prior arts reported on the validation with real human subjects, there is no thorough/rigorous benchmark to quantify the sensitivities and boundary conditions of motion-based core respiratory algorithms that measure sub-pixel displacement between video frames. In this paper, we designed a setup with a fully-controllable physical phantom to investigate the essence of core algorithms, together with a mathematical model incorporating two motion estimation strategies and three spatial representations, leading to six algorithmic combinations for respiratory signal extraction. Their promises and limitations are discussed and clarified via the phantom benchmark. The insights gained in this paper are intended to improve the understanding and applications of camera-based respiration measurement in health monitoring.
Recent learning-based approaches have achieved impressive results in the field of single-shot camera localization. However, how best to fuse multiple modalities (e.g., image and depth) and to deal with degraded or missing input are less well studied. In particular, we note that previous approaches towards deep fusion do not perform significantly better than models employing a single modality. We conjecture that this is because of the naive approaches to feature space fusion through summation or concatenation which do not take into account the different strengths of each modality. To address this, we propose an end-to-end framework, termed VMLoc, to fuse different sensor inputs into a common latent space through a variational Product-of-Experts (PoE) followed by attention-based fusion. Unlike previous multimodal variational works directly adapting the objective function of vanilla variational auto-encoder, we show how camera localization can be accurately estimated through an unbiased objective function based on importance weighting. Our model is extensively evaluated on RGB-D datasets and the results prove the efficacy of our model. The source code is available at https://github.com/Zalex97/VMLoc.
There has been significant amount of research work on human activity classification relying either on Inertial Measurement Unit (IMU) data or data from static cameras providing a third-person view. Using only IMU data limits the variety and complexit y of the activities that can be detected. For instance, the sitting activity can be detected by IMU data, but it cannot be determined whether the subject has sat on a chair or a sofa, or where the subject is. To perform fine-grained activity classification from egocentric videos, and to distinguish between activities that cannot be differentiated by only IMU data, we present an autonomous and robust method using data from both ego-vision cameras and IMUs. In contrast to convolutional neural network-based approaches, we propose to employ capsule networks to obtain features from egocentric video data. Moreover, Convolutional Long Short Term Memory framework is employed both on egocentric videos and IMU data to capture temporal aspect of actions. We also propose a genetic algorithm-based approach to autonomously and systematically set various network parameters, rather than using manual settings. Experiments have been performed to perform 9- and 26-label activity classification, and the proposed method, using autonomously set network parameters, has provided very promising results, achieving overall accuracies of 86.6% and 77.2%, respectively. The proposed approach combining both modalities also provides increased accuracy compared to using only egovision data and only IMU data.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا