ﻻ يوجد ملخص باللغة العربية
This paper presents our approach to the third YouTube-8M video understanding competition that challenges par-ticipants to localize video-level labels at scale to the pre-cise time in the video where the label actually occurs. Ourmodel is an ensemble of frame-level models such as GatedNetVLAD and NeXtVLAD and various BERT models withtest-time augmentation. We explore multiple ways to ag-gregate BERT outputs as video representation and variousways to combine visual and audio information. We proposetest-time augmentation as shifting video frames to one leftor right unit, which adds variety to the predictions and em-pirically shows improvement in evaluation metrics. We firstpre-train the model on the 4M training video-level data, andthen fine-tune the model on 237K annotated video segment-level data. We achieve MAP@100K 0.7871 on private test-ing video segment data, which is ranked 9th over 283 teams.
Machine learning plays an increasingly significant role in many aspects of our lives (including medicine, transportation, security, justice and other domains), making the potential consequences of false predictions increasingly devastating. These con
State-of-the-art video action classifiers often suffer from overfitting. They tend to be biased towards specific objects and scene cues, rather than the foreground action content, leading to sub-optimal generalization performances. Recent data augmen
This paper introduces the system we developed for the Google Cloud & YouTube-8M Video Understanding Challenge, which can be considered as a multi-label classification problem defined on top of the large scale YouTube-8M Dataset. We employ a large set
We report on CMU Informedia Labs system used in Googles YouTube 8 Million Video Understanding Challenge. In this multi-label video classification task, our pipeline achieved 84.675% and 84.662% GAP on our evaluation split and the official test set. W
Convolutional Neural Networks (ConvNets) are trained offline using the few available data and may therefore suffer from substantial accuracy loss when ported on the field, where unseen input patterns received under unpredictable external conditions c