Constrained-size Tensorflow Models for YouTube-8M Video Understanding Challenge

Published by: Tianqi Liu

Publication date: 2018

Research field: Informatics Engineering

Paper language: English

Authors: Tianqi Liu - Bo Liu

Computer Vision and Pattern Recognition

This paper presents our 7th place solution to the second YouTube-8M video understanding competition, which challenges participants to build a constrained-size model to classify millions of YouTube videos into thousands of classes. Our final model consists of four single models aggregated into one TensorFlow graph. For each single model, we use the same network architecture as in the winning solution of the first YouTube-8M video understanding competition, namely Gated NetVLAD. We train the single models separately in TensorFlow's default float32 precision, then replace the weights with float16 precision and ensemble them in the evaluation and inference stages, achieving a 48.5% compression rate without loss of precision. Our best model achieved 88.324% GAP on the private leaderboard. The code is publicly available at https://github.com/boliu61/youtube-8m
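The float16 step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' TensorFlow code: the one-layer sigmoid "models", the array shapes, the equal-weight averaging, and the helper names `to_float16`, `nbytes`, and `ensemble_predict` are all assumptions made for illustration. It only shows that casting stored weights from float32 to float16 halves their size while changing the ensembled predictions very little.

```python
import numpy as np

def to_float16(weights):
    """Cast every array in a name -> array dict to float16 (hypothetical helper)."""
    return {name: w.astype(np.float16) for name, w in weights.items()}

def nbytes(weights):
    """Total number of bytes used by the arrays in one model."""
    return sum(w.nbytes for w in weights.values())

def ensemble_predict(models, features):
    """Average the sigmoid outputs of several toy one-layer models."""
    probs = []
    for weights in models:
        # Only the stored weights are half precision; the math runs in float32.
        kernel = weights["kernel"].astype(np.float32)
        bias = weights["bias"].astype(np.float32)
        logits = features @ kernel + bias
        probs.append(1.0 / (1.0 + np.exp(-logits)))
    # Equal-weight averaging is an assumption, not the paper's stated scheme.
    return np.mean(probs, axis=0)

rng = np.random.default_rng(0)

# Four toy "single models" with made-up sizes: 16 input features, 5 classes.
models32 = [
    {
        "kernel": rng.normal(size=(16, 5)).astype(np.float32),
        "bias": np.zeros(5, dtype=np.float32),
    }
    for _ in range(4)
]
models16 = [to_float16(m) for m in models32]

features = rng.normal(size=(8, 16)).astype(np.float32)

p32 = ensemble_predict(models32, features)
p16 = ensemble_predict(models16, features)

# Ratio of stored weight bytes after and before the cast.
size_ratio = sum(nbytes(m) for m in models16) / sum(nbytes(m) for m in models32)
# Largest change in any predicted probability caused by the cast.
max_diff = float(np.max(np.abs(p32 - p16)))

print(f"weight bytes after cast: {size_ratio:.0%} of original")
print(f"max probability change: {max_diff:.2e}")
```

In this toy setting the weight arrays shrink to exactly half their size; the 48.5% figure in the abstract refers to the authors' full model and is not reproduced here.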

Zhenzhi Wang, Liyu Wu, Zhimin Li (2021)

Multi-modal Ads Video Understanding Challenge is the first grand challenge aiming to comprehensively understand ads videos. Our challenge includes two tasks: video structuring in the temporal dimension and multi-modal video classification. It asks the participants to accurately predict both the scene boundaries and the multi-label categories of each scene based on a fine-grained and ads-related category hierarchy. Therefore, our task has four distinguishing features from previous ones: ads domain, multi-modal information, temporal segmentation, and multi-label classification. It will advance the foundation of ads video understanding and have a significant impact on many ads applications like video recommendation. This paper presents an overview of our challenge, including the background of ads videos, an elaborate description of task and dataset, evaluation protocol, and our proposed baseline. By ablating the key components of our baseline, we would like to reveal the main challenges of this task and provide useful guidance for future research of this area. In this paper, we give an extended version of our challenge overview. The dataset will be publicly available at https://algo.qq.com/.

Computer Vision and Pattern Recognition

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

Samuel Albanie, Yang Liu, Arsha Nagrani (2020)

We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020. The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval, the task of searching for content within a corpus of videos using natural language queries. This report summarizes the results of the first edition of the challenge together with the findings of the participants.

Computer Vision and Pattern Recognition

Adaptive Intermediate Representations for Video Understanding

Juhana Kangaspunta, AJ Piergiovanni, Rico Jonschkowski (2021)

A common strategy in video understanding is to incorporate spatial and motion information by fusing features derived from RGB frames and optical flow. In this work, we introduce a new way to leverage semantic segmentation as an intermediate representation for video understanding and use it in a way that requires no additional labeling. Second, we propose a general framework which learns the intermediate representations (optical flow and semantic segmentation) jointly with the final video understanding task and allows the adaptation of the representations to the end goal. Despite the use of intermediate representations within the network, during inference, no additional data beyond RGB sequences is needed, enabling efficient recognition with a single network. Finally, we present a way to find the optimal learning configuration by searching for the best loss weighting via evolution. We obtain more powerful visual representations for videos which lead to performance gains over the state-of-the-art.

Computer Vision and Pattern Recognition

Customized video filtering on YouTube

Vishal Anand, Ravi Shukla, Ashwani Gupta (2019)

Inappropriate and profane content on social media is exponentially increasing, and big corporations are becoming more aware of the type of content on which they are advertising and how it may affect their brand reputation. But with a huge surge in content being posted online, it becomes difficult to filter out related videos on which they can run their ads without compromising their brand name. Advertising on YouTube videos generates a huge amount of revenue for corporations. It becomes increasingly important for such corporations to advertise only on videos that don't hurt the feelings, community or harmony of the audience at large. In this paper, we propose a system to identify inappropriate content on YouTube and leverage it to perform a first-of-its-kind, large-scale, quantitative characterization that reveals some of the risks of YouTube ads consumption on inappropriate videos. Customization of the architecture has also been included to serve different requirements of corporations. Our analysis reveals that YouTube is still plagued by such disturbing videos and that its currently deployed countermeasures are ineffective in terms of detecting them in a timely manner. Our framework tries to fill this gap by providing a handy, add-on solution to filter the videos and help corporations and companies to push ads on the platform without worrying about the content on which the ads are displayed.

Machine Learning, Computers and Society, Information Retrieval

Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

Mathew Monfort, Kandan Ramakrishnan, Alex Andonian (2019)

An event happening in the world is often made of different activities and actions that can unfold simultaneously or sequentially within a few seconds. However, most large-scale datasets built to train models for action recognition provide a single label per video clip. Consequently, models can be incorrectly penalized for classifying actions that exist in the videos but are not explicitly labeled and do not learn the full spectrum of information that would be mandatory to more completely comprehend different events and eventually learn causality between them. Towards this goal, we augmented the existing video dataset, Moments in Time (MiT), to include over two million action labels for over one million three second videos. This multi-label dataset introduces novel challenges on how to train and analyze models for multi-action detection. Here, we present baseline results for multi-action recognition using loss functions adapted for long tail multi-label learning and provide improved methods for visualizing and interpreting models trained for multi-label action detection.

Computer Vision and Pattern Recognition, Machine Learning, Image and Video Processing

Read also