Automatically generating a natural language description of an image is a task close to the heart of image understanding. In this paper, we present a multi-model neural network method, closely related to the human visual system, that automatically learns to describe the content of images. Our model consists of two sub-models: an object detection and localization model, which extracts information about objects and their spatial relationships in images, and a deep recurrent neural network (RNN) based on long short-term memory (LSTM) units with an attention mechanism for sentence generation. As each word of the description is generated, it is automatically aligned to different objects of the input image, similar to the attention mechanism of the human visual system. Experimental results on the COCO dataset showcase the merit of the proposed method, which outperforms previous benchmark models.
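To make the described pipeline concrete, the sketch below shows one way a word-level attention decoder can be wired on top of detected object features, so that every generated word carries a soft alignment over the objects in the image. This is a minimal illustration, not the authors' implementation: the dot-product attention variant, layer names, and all dimensions are assumptions.

```python
# Minimal sketch (assumed details, not the paper's code): an LSTM decoder that
# attends over per-object features from a detector, producing one attention
# distribution (object alignment) per generated word.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionLSTMDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=256, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn_proj = nn.Linear(hidden_dim, feat_dim)        # query projection
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, obj_feats, captions):
        # obj_feats: (B, N, feat_dim) features of N detected objects per image
        # captions:  (B, T) token ids of the caption (teacher forcing)
        B, T = captions.shape
        h = obj_feats.new_zeros(B, self.lstm.hidden_size)
        c = obj_feats.new_zeros(B, self.lstm.hidden_size)
        logits, alignments = [], []
        for t in range(T):
            # Soft attention: score each object against the current hidden state,
            # giving the per-word object alignment described in the abstract.
            query = self.attn_proj(h)                                        # (B, feat_dim)
            scores = torch.bmm(obj_feats, query.unsqueeze(2)).squeeze(2)     # (B, N)
            alpha = F.softmax(scores, dim=1)                                 # attention weights
            context = torch.bmm(alpha.unsqueeze(1), obj_feats).squeeze(1)    # (B, feat_dim)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
            alignments.append(alpha)
        return torch.stack(logits, dim=1), torch.stack(alignments, dim=1)


# Usage with random stand-ins for detector outputs and tokenized captions.
decoder = AttentionLSTMDecoder(vocab_size=1000)
obj_feats = torch.randn(2, 5, 256)         # 2 images, 5 detected objects each
captions = torch.randint(0, 1000, (2, 7))  # 2 captions of length 7
logits, alignments = decoder(obj_feats, captions)
print(logits.shape, alignments.shape)      # (2, 7, 1000) and (2, 7, 5)
```

Returning the stacked attention weights alongside the logits is what allows the generated words to be visualized against the detected objects, mirroring the human-visual-attention analogy drawn in the abstract.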
Knowledge distillation (KD) has witnessed its powerful ability in learning compact models in deep learning field, but it is still limited in distilling localization information for object detection. Existing KD methods for object detection mainly foc
Despite continuously improving performance, contemporary image captioning models are prone to hallucinating objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fu
Automatically generating a human-like description for a given image is a promising research direction in artificial intelligence that has attracted a great deal of attention recently. Most of the existing attention methods explore the mapping relationships betwee
Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect in captioning is the notion of attention: How to decide what to describe and in which order. Inspired by the successes i
Weakly-Supervised Object Detection (WSOD) and Localization (WSOL), i.e., detecting multiple and single instances with bounding boxes in an image using image-level labels, are long-standing and challenging tasks in the CV community. With the success o