Non-local NetVLAD Encoding for Video Classification

Published by: Yongyi Tang

Publication date: 2018

Research field: Informatics Engineering

Research language: English

Authors: Yongyi Tang, Xing Zhang, Jingwen Wang

Computer Vision and Pattern Recognition

This paper describes our solution for the 2$^\text{nd}$ YouTube-8M video understanding challenge organized by Google AI. Unlike video recognition benchmarks such as Kinetics and Moments, the YouTube-8M challenge provides pre-extracted visual and audio features instead of raw videos. In this challenge, the submitted model is restricted to 1GB, which encourages participants to focus on constructing one powerful single model rather than combining the results of many models. Our system fuses six different sub-models, categorized into three families, into one single computational graph. More specifically, the most effective family is the model with non-local operations following the NetVLAD encoding. The other two families are Soft-BoF and GRU. To further boost single-model performance, the model parameters of different checkpoints are averaged. Experimental results demonstrate that our proposed system can effectively perform the video classification task, achieving 0.88763 on the public test set and 0.88704 on the private set in terms of GAP@20. We finally ranked fourth in the YouTube-8M video understanding challenge.
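The checkpoint-averaging step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which checkpoints are combined or how they are weighted, so the uniform mean below, the function name `average_checkpoints`, and the dict-of-lists parameter layout are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]  # hypothetical layout: parameter name -> flat weights


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Return the element-wise uniform mean of the parameters of several checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")

    averaged: Checkpoint = {}
    for name in names:
        vectors = [ckpt[name] for ckpt in checkpoints]
        if any(len(v) != len(vectors[0]) for v in vectors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # zip(*vectors) groups the i-th weight of every checkpoint together.
        averaged[name] = [sum(vals) / len(vectors) for vals in zip(*vectors)]
    return averaged


if __name__ == "__main__":
    early = {"w": [1.0, 3.0], "b": [0.0]}
    late = {"w": [3.0, 5.0], "b": [2.0]}
    print(average_checkpoints([early, late]))
```

Averaging in parameter space keeps the result a single model of unchanged size, which is consistent with the 1GB limit the abstract describes.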

Itir Onal Ertugrul, Mete Ozay, Fatos T. Yarman Vural (2016)

In this work, we propose a novel framework to encode the local connectivity patterns of the brain, using Fisher Vectors (FV), Vector of Locally Aggregated Descriptors (VLAD) and Bag-of-Words (BoW) methods. We first obtain local descriptors, called Mesh Arc Descriptors (MADs), from fMRI data by forming local meshes around anatomical regions and estimating their relationship within a neighborhood. Then, we extract a dictionary of relationships, called the brain connectivity dictionary, by fitting a generative Gaussian mixture model (GMM) to a set of MADs and selecting the codewords at the mean of each component of the mixture. Codewords represent the connectivity patterns among anatomical regions. We also encode MADs by the VLAD and BoW methods using k-Means clustering. We classify the cognitive states of the Human Connectome Project (HCP) task fMRI dataset, where we train support vector machines (SVM) on the encoded MADs. Results demonstrate that FV encoding of MADs can be successfully employed for classification of cognitive tasks and outperforms the VLAD and BoW representations. Moreover, we identify the significant Gaussians in mixture models by computing the energy of their corresponding FV parts, and analyze their effect on classification accuracy. Finally, we suggest a new method to visualize the codewords of the brain connectivity dictionary.
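As a rough illustration of the VLAD encoding named in this abstract, the sketch below assigns each local descriptor to its nearest codeword and accumulates the residuals per codeword. The codewords are passed in directly (the abstract obtains them with k-Means), and the function name `vlad_encode`, the list-based data layout, and the optional L2 normalization are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def vlad_encode(
    descriptors: Sequence[Sequence[float]],
    codewords: Sequence[Sequence[float]],
    normalize: bool = True,
) -> List[float]:
    """Encode a set of local descriptors as one flat VLAD vector of length k * d."""
    k, d = len(codewords), len(codewords[0])
    residual_sums = [[0.0] * d for _ in range(k)]

    for x in descriptors:
        # Hard-assign the descriptor to its nearest codeword (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codewords[j])),
        )
        # Accumulate the residual between the descriptor and that codeword.
        for i in range(d):
            residual_sums[nearest][i] += x[i] - codewords[nearest][i]

    flat = [v for row in residual_sums for v in row]
    if normalize:
        norm = math.sqrt(sum(v * v for v in flat))
        if norm > 0:
            flat = [v / norm for v in flat]
    return flat


if __name__ == "__main__":
    words = [[0.0, 0.0], [10.0, 10.0]]
    local = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
    print(vlad_encode(local, words, normalize=False))
```

A BoW encoding would keep only the count of descriptors per codeword, whereas VLAD keeps the summed residuals, so it retains first-order information about how descriptors are distributed around each codeword.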

Computer Vision and Pattern Recognition, Machine Learning

Deep Multimodal Feature Encoding for Video Ordering

Vivek Sharma, Makarand Tapaswi, Rainer Stiefelhagen (2020)

True understanding of videos comes from a joint analysis of all their modalities: the video frames, the audio track, and any accompanying text such as closed captions. We present a way to learn a compact multimodal feature representation that encodes all these modalities. Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline. To this end, we create a new multimodal dataset for temporal ordering that consists of approximately 30K scenes (2-6 clips per scene) based on the Large Scale Movie Description Challenge. We analyze and evaluate the individual and joint modalities on three challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition. We demonstrate empirically that multimodal representations are indeed complementary, and can play a key role in improving the performance of many applications.

Computer Vision and Pattern Recognition, Machine Learning, Multimedia

Composable Augmentation Encoding for Video Representation Learning

Chen Sun, Arsha Nagrani, Yonglong Tian (2021)

We focus on contrastive methods for self-supervised video representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives. These methods implicitly assume a set of representational invariances to the view selection mechanism (e.g., sampling frames with temporal shifts), which may lead to poor performance on downstream tasks that violate these invariances (such as fine-grained video action recognition, which would benefit from temporal information). To overcome this limitation, we propose an augmentation-aware contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations (such as the values of the time shifts used to create data views) as composable augmentation encodings (CATE) to our model when projecting the video representations for contrastive learning. We show that representations learned by our method encode valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.

Computer Vision and Pattern Recognition

Memory-Augmented Non-Local Attention for Video Super-Resolution

Jiyang Yu, Jingen Liu, Liefeng Bo (2021)

In this paper, we propose a novel video super-resolution method that aims at generating high-fidelity high-resolution (HR) videos from low-resolution (LR) ones. Previous methods predominantly leverage temporal neighbor frames to assist the super-resolution of the current frame. Those methods achieve limited performance because they suffer from the difficulty of spatial frame alignment and the lack of useful information in similar LR neighbor frames. In contrast, we devise a cross-frame non-local attention mechanism that allows video super-resolution without frame alignment, making it more robust to large motions in the video. In addition, to acquire information beyond neighbor frames, we design a novel memory-augmented attention module to memorize general video details during super-resolution training. Experimental results indicate that our method can achieve superior performance on large-motion videos compared to the state-of-the-art methods, without aligning frames. Our source code will be released.

Computer Vision and Pattern Recognition

Circulant temporal encoding for video retrieval and temporal alignment

Matthijs Douze, Jérôme Revaud, Jakob Verbeek (2015)

We address the problem of specific video event retrieval. Given a query video of a specific event, e.g., a Madonna concert, the goal is to retrieve other videos of the same event that temporally overlap with the query. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to efficiently compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. The descriptors can be compressed in the frequency domain with a product quantizer adapted to complex numbers. In this case, video retrieval is performed without decompressing the descriptors. We also consider the temporal alignment of a set of videos. We exploit the matching confidence and an estimate of the temporal offset computed for all pairs of videos by our retrieval approach. Our robust algorithm aligns the videos on a global timeline by maximizing the set of temporally consistent matches. The global temporal alignment enables synchronous playback of the videos of a given scene.
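The idea of comparing two sequences at every temporal shift in the frequency domain can be sketched with a circular cross-correlation. This is a simplified illustration, not the paper's method: it assumes one scalar value per frame instead of a full frame descriptor, equal-length sequences, and a naive O(n²) DFT in place of an FFT. The names `dft` and `best_circular_offset` are hypothetical.

```python
import cmath
from typing import List, Sequence


def dft(signal: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive discrete Fourier transform (or its inverse) of a 1-D sequence."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def best_circular_offset(query: Sequence[float], candidate: Sequence[float]) -> int:
    """Return the circular shift of `query` that best matches `candidate`.

    The score for every shift is the circular cross-correlation, computed as
    the inverse DFT of conj(DFT(query)) * DFT(candidate).
    """
    if len(query) != len(candidate):
        raise ValueError("sequences must have the same length")
    q_hat = dft(query)
    c_hat = dft(candidate)
    product = [a.conjugate() * b for a, b in zip(q_hat, c_hat)]
    scores = [z.real for z in dft(product, inverse=True)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    query = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
    delayed = query[-3:] + query[:-3]  # the same signal delayed by 3 frames
    print(best_circular_offset(query, delayed))
```

With an FFT, the scores for all shifts cost O(n log n) rather than the O(n²) of comparing every shift directly, which is the kind of complexity gain the abstract refers to.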

Computer Vision and Pattern Recognition
