Multi-modal information is essential to describe what has happened in a video. In this work, we represent videos by various appearance, motion and audio information guided by the video topic. Following a multi-stage training strategy, our experiments show steady and significant improvements on the VATEX benchmark. This report presents an overview and comparative analysis of our system designed for both the Chinese and English tracks of the VATEX Captioning Challenge 2019.
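As a rough illustration of the idea (not code from the report), topic-guided fusion of appearance, motion and audio features could be sketched in PyTorch as below; the module name TopicGuidedFusion and all feature dimensions are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class TopicGuidedFusion(nn.Module):
    """Gate concatenated appearance/motion/audio features with a topic embedding.
    All dimensions are illustrative placeholders, not values from the report."""
    def __init__(self, d_app=2048, d_mot=1024, d_aud=128, d_topic=300, d_out=512):
        super().__init__()
        self.proj = nn.Linear(d_app + d_mot + d_aud, d_out)
        self.gate = nn.Sequential(nn.Linear(d_topic, d_out), nn.Sigmoid())

    def forward(self, app, mot, aud, topic):
        fused = self.proj(torch.cat([app, mot, aud], dim=-1))
        return fused * self.gate(topic)  # topic decides which fused channels pass

fusion = TopicGuidedFusion()
video_repr = fusion(torch.randn(4, 2048), torch.randn(4, 1024),
                    torch.randn(4, 128), torch.randn(4, 300))
print(video_repr.shape)  # torch.Size([4, 512])
```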
This report describes our solution for the VATEX Captioning Challenge 2020, which requires generating descriptions for the videos in both English and Chinese languages. We identified three crucial factors that improve the performance, namely: multi-view features, hybrid reward, and diverse ensemble.
Video captioning is in essence a complex natural process, affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper we build on the recent progress in using the encoder-decoder framework for video captioning…
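For readers unfamiliar with the setup, a minimal encoder-decoder video captioner can be sketched as follows; this is a generic GRU-based illustration under assumed dimensions, not the model proposed in the paper.

```python
import torch
import torch.nn as nn

class EncDecCaptioner(nn.Module):
    """Minimal GRU encoder-decoder captioner (illustrative only)."""
    def __init__(self, d_feat=1024, d_hid=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(d_feat, d_hid, batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_hid)
        self.decoder = nn.GRU(d_hid, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        _, h = self.encoder(frame_feats)                     # summarize frame sequence
        y, _ = self.decoder(self.embed(caption_tokens), h)   # teacher-forced decoding
        return self.out(y)                                   # per-step vocabulary logits

model = EncDecCaptioner()
logits = model(torch.randn(2, 16, 1024), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```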
While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning…
Medical image captioning automatically generates a medical description to describe the content of a given medical image. A traditional medical image captioning model creates a medical description based only on a single medical image input. Hence, an…
Video captioning aims to interpret complex visual content as text descriptions, which requires the model to fully understand video scenes, including objects and their interactions. Prevailing methods adopt off-the-shelf object detection networks…
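As a sketch of what "off-the-shelf object detection" means in practice, the snippet below uses torchvision's Faster R-CNN to turn a single frame into candidate object regions that a captioning model could attend over; the 0.5 score threshold is an arbitrary choice for illustration, and pretrained weights are downloaded on first use.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Off-the-shelf detector; a captioner would attend over the returned regions.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

frames = [torch.rand(3, 480, 640)]        # one RGB frame, pixel values in [0, 1]
with torch.no_grad():
    detections = detector(frames)

boxes = detections[0]["boxes"]            # (N, 4) candidate object regions
scores = detections[0]["scores"]          # detector confidence per region
object_regions = boxes[scores > 0.5]      # keep confident objects as caption cues
```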