Given an untrimmed video and a natural language query, Natural Language Video Localization (NLVL) aims to identify the video moment described by the query. Existing methods can be roughly divided into two groups: 1) propose-and-rank models first define a set of hand-designed moment candidates and then select the best-matching one; 2) proposal-free models directly predict the two temporal boundaries of the referential moment from frame-level features. Currently, almost all propose-and-rank methods perform worse than their proposal-free counterparts. In this paper, we argue that the performance of propose-and-rank models is underestimated due to their predefined proposal schemes: 1) hand-designed rules can hardly guarantee complete coverage of the target segments; 2) densely sampled candidate moments cause redundant computation and degrade the performance of the ranking process. To this end, we propose a novel model termed LPNet (Learnable Proposal Network for NLVL) with a fixed set of learnable moment proposals. The positions and lengths of these proposals are dynamically adjusted during training. Moreover, a boundary-aware loss is proposed to leverage frame-level information and further improve performance. Extensive ablations on two challenging NLVL benchmarks demonstrate the effectiveness of LPNet over existing state-of-the-art methods.
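To make the learnable-proposal idea concrete, below is a minimal PyTorch sketch. It is not LPNet's actual implementation: the (center, width) parameterization, the number of proposals, and the initialization are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class LearnableProposals(nn.Module):
    """Minimal sketch: a fixed set of learnable moment proposals.

    Each proposal is a (center, width) pair in normalized video time
    [0, 1]. Both values are free parameters updated by the optimizer,
    so proposal positions and lengths adjust during training instead
    of being fixed by hand-designed sampling rules.
    """

    def __init__(self, num_proposals: int = 32):
        super().__init__()
        # Spread initial centers across the video; start with short widths.
        centers = torch.linspace(0.05, 0.95, num_proposals)
        widths = torch.full((num_proposals,), 0.10)
        # Store in logit space so sigmoid() recovers the initialization.
        self.logits = nn.Parameter(torch.logit(torch.stack([centers, widths], dim=-1)))

    def forward(self) -> torch.Tensor:
        center = torch.sigmoid(self.logits[:, 0])
        width = torch.sigmoid(self.logits[:, 1])
        start = (center - width / 2).clamp(0.0, 1.0)
        end = (center + width / 2).clamp(0.0, 1.0)
        return torch.stack([start, end], dim=-1)  # (num_proposals, 2)
```

Each forward pass yields the current (start, end) boundaries, which a ranking head would score against query features; because the set is small and learned, no dense candidate enumeration is needed.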
Live video comments, or ''danmu'', are an emerging feature on Asian online video platforms. Danmu are time-synchronized comments overlaid on the video playback, and they uniquely enrich the experience and engagement of users; indeed, they have become a determining factor in the popularity of videos. Similar to the ''cold start problem'' in recommender systems, a video only starts to attract attention once sufficient danmu comments have been posted on it. We study this video cold start problem and examine how new comments can be generated automatically for less-commented videos. We propose to predict danmu comments by exploiting a multi-modal combination of the video's visual content, subtitles, audio signals, and any surrounding comments (when they exist). Our method fuses these modalities in a transformer network, which is then trained for different comment-density scenarios. We evaluate the proposed system through both a retrieval-based evaluation and human judgement. Results show that our system improves significantly over state-of-the-art methods.
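As an illustration of the fusion step, here is a minimal sketch of one common way to combine several modality streams in a single transformer encoder: each modality's token sequence is tagged with a learned modality-type embedding, and the sequences are concatenated before joint self-attention. The dimensions, layer counts, and tagging scheme are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Sketch: joint transformer fusion of visual, subtitle, audio,
    and surrounding-comment token streams (illustrative only)."""

    def __init__(self, d_model: int = 256, n_modalities: int = 4):
        super().__init__()
        # Learned tag marking which modality each token came from.
        self.modality_embed = nn.Embedding(n_modalities, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, streams):
        # streams: list of (batch, seq_len_i, d_model) tensors, one per
        # modality, already projected into the shared embedding space.
        tagged = []
        for i, x in enumerate(streams):
            idx = torch.full(x.shape[:2], i, dtype=torch.long, device=x.device)
            tagged.append(x + self.modality_embed(idx))
        tokens = torch.cat(tagged, dim=1)  # concatenate along sequence axis
        return self.encoder(tokens)        # cross-modal self-attention
```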
Leveraging large-scale unlabeled web videos, such as instructional videos, for pre-training followed by task-specific fine-tuning has become the de facto approach for many video-and-language tasks. However, such instructional videos are very noisy: the accompanying ASR narrations are often incomplete and can be irrelevant to, or temporally misaligned with, the visual content, limiting the performance of models trained on such data. To address these issues, we propose an improved video-and-language pre-training method that first adds automatically extracted dense region captions from the video frames as auxiliary text input, providing informative visual cues for learning better video-language associations. Second, to alleviate the temporal misalignment issue, our method incorporates an entropy minimization-based constrained attention loss that encourages the model to automatically focus on the correct caption from a pool of candidate ASR captions. Our overall approach is named DeCEMBERT (Dense Captions and Entropy Minimization). Comprehensive experiments on three video-and-language tasks (text-to-video retrieval, video captioning, and video question answering) across five datasets demonstrate that our approach outperforms previous state-of-the-art methods. Ablation studies on pre-training and downstream tasks show that adding dense captions and the constrained attention loss helps improve model performance. Lastly, we provide attention visualizations to show the effect of applying the proposed constrained attention loss.
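The constrained attention loss can be pictured as an entropy penalty on the attention distribution over candidate ASR captions. The sketch below shows that idea in its simplest form; the exact formulation and constraints in DeCEMBERT may differ.

```python
import torch

def attention_entropy_loss(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy of the attention over a pool of candidate ASR captions.

    attn: (batch, num_candidates) weights, assumed to sum to 1 along
    the last dimension (e.g., after a softmax). Minimizing the entropy
    pushes each example toward attending to a single caption rather
    than spreading attention over misaligned candidates.
    """
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)
    return entropy.mean()

# Example: added to the pre-training objective with a small weight.
attn = torch.softmax(torch.randn(8, 5), dim=-1)  # 8 clips, 5 candidate captions
loss = attention_entropy_loss(attn)
```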
Listening comprehension is a fundamental stage in foreign language acquisition. To make teaching and learning this skill easier and more beneficial in the listening comprehension process, the audiovisual document is the medium that motivates students and prompts them to gradually construct the meaning of the audio message. In this article, we study the elements that aid comprehension of an audiovisual document, which provides access to meaning through non-verbal cues, particularly during the pre-listening stage, the first step toward understanding the audio message. With the aim of helping students improve their listening comprehension, we studied the importance of audiovisual documents in capturing students' attention. We also used these documents to motivate tenth-grade students, the sample to which we applied our teaching approach using an audiovisual document as a stimulus and support for listening comprehension, and we concluded by analyzing the results of the pupils who attended our lesson and comparing them with the results of those who did not.
In this work, our goal is to recognize human actions from video data. We first give an overview of human action recognition, including well-known methods and previous algorithms, and then propose an algorithm of our own together with its implementation in MATLAB.
Developing a 3D Online Game and an Interactive Virtual Environment Control System with the Unity game engine. This is a graduation project at Damascus University.
This study was conducted to estimate the losses due to mastitis at Fedio Station, Syria, using the direct microscopic method to determine the somatic cell count in bulk tank milk. About 52 bulk tank milk samples were examined, at a rate of one per week, over one year.
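The abstract does not spell out the counting arithmetic, so the sketch below assumes the classic Breed-type direct microscopic count, in which a fixed milk volume is spread over a known smear area and the average cells per field are scaled by a microscope (working) factor; all numeric values are illustrative, not the study's data.

```python
import math

def microscope_factor(field_diameter_mm: float,
                      smear_area_mm2: float = 100.0,   # 0.01 mL over 1 cm^2
                      milk_volume_ml: float = 0.01) -> float:
    """Cells-per-mL multiplier for a direct microscopic count.

    One microscope field covers a tiny fraction of the smear, so each
    counted cell represents (smear_area / field_area) / milk_volume
    cells per mL of milk.
    """
    field_area_mm2 = math.pi * (field_diameter_mm / 2) ** 2
    return smear_area_mm2 / (field_area_mm2 * milk_volume_ml)

# Illustrative numbers only: an average of 3.2 cells per field with a
# 0.16 mm field diameter gives roughly 1.6 million cells/mL.
mf = microscope_factor(field_diameter_mm=0.16)
avg_cells_per_field = 3.2
scc_per_ml = avg_cells_per_field * mf
print(f"SCC ≈ {scc_per_ml:,.0f} cells/mL")
```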