This paper addresses the sensitivity of neural image caption generators to their visual input. A sensitivity analysis and an omission analysis based on image foils are reported, showing that the extent to which image captioning architectures retain, and are sensitive to, visual information varies with the type of word being generated and with its position in the caption as a whole. We motivate this work in the context of the field's broader goal of achieving greater explainability in AI.
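To make the foil-based omission analysis concrete, the sketch below scores each caption word by how much its probability drops when the original image is replaced with a foil. The `word_prob` callable, the zeroed-out foil features and the toy model are illustrative assumptions, not the setup used in the paper.

```python
import numpy as np

def omission_scores(word_prob, caption, image, foil):
    """Per-word omission scores for a caption.

    word_prob(prefix, word, image) is assumed to return the model's
    probability of `word` given the caption prefix and image features.
    The score is the drop in probability when the original image is
    replaced with a foil, so higher values mean the word depends more
    on the visual input.
    """
    scores = []
    prefix = []
    for word in caption:
        p_orig = word_prob(prefix, word, image)
        p_foil = word_prob(prefix, word, foil)
        scores.append(p_orig - p_foil)
        prefix.append(word)
    return scores

# Toy stand-in for a trained captioning model: the visual word "dog"
# is made image-dependent, function words are not.
def toy_word_prob(prefix, word, image):
    base = 0.2
    if word == "dog":
        base += 0.5 * float(image.mean())
    return base

original = np.ones(10)   # stand-in for CNN image features
foil = np.zeros(10)      # foil: visual information removed
caption = ["a", "dog", "on", "the", "grass"]
print(omission_scores(toy_word_prob, caption, original, foil))
```

In this toy run only the visual word "dog" receives a non-zero score, which is the kind of word-type dependence the analysis is designed to expose.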
A neural language model can be conditioned to generate descriptions of images by providing it with visual information in addition to the sentence prefix. This visual information can be incorporated into the language model through different points of entry, resulting in different neural architectures.
When a recurrent neural network language model is used for caption generation, the image information can be fed to the neural network either by incorporating it directly in the RNN -- conditioning the language model by 'injecting' image features -- or in a layer following the RNN -- conditioning the language model by 'merging' image features.
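A minimal PyTorch sketch of the two conditioning options follows; the class names, the choice of an init-inject variant (the image initialises the LSTM state) and the feature dimensions are assumptions made for illustration rather than the exact models compared in the paper.

```python
import torch
import torch.nn as nn

class InjectCaptioner(nn.Module):
    """'Inject': image features enter the RNN itself, here by
    initialising the LSTM hidden state from the image (init-inject)."""
    def __init__(self, vocab, embed=256, hidden=256, img=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.init_h = nn.Linear(img, hidden)
        self.init_c = nn.Linear(img, hidden)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, image, words):
        h0 = self.init_h(image).unsqueeze(0)
        c0 = self.init_c(image).unsqueeze(0)
        states, _ = self.lstm(self.embed(words), (h0, c0))
        return self.out(states)          # logits over next words

class MergeCaptioner(nn.Module):
    """'Merge': the RNN handles words only; image features are
    combined with its output in a later multimodal layer."""
    def __init__(self, vocab, embed=256, hidden=256, img=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.img_proj = nn.Linear(img, hidden)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, image, words):
        states, _ = self.lstm(self.embed(words))
        img = self.img_proj(image).unsqueeze(1).expand_as(states)
        return self.out(torch.cat([states, img], dim=-1))

# Toy usage: 2 images with 2048-d CNN features, word sequences of length 5
images = torch.randn(2, 2048)
words = torch.randint(0, 1000, (2, 5))
print(InjectCaptioner(1000)(images, words).shape)   # (2, 5, 1000)
print(MergeCaptioner(1000)(images, words).shape)    # (2, 5, 1000)
```

In the merge variant the RNN never sees the image, so its hidden state encodes purely linguistic context; in the inject variant visual and linguistic information are mixed inside the recurrent memory.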
Image caption generation systems are typically evaluated against reference outputs. We show that it is possible to predict output quality without generating the captions, based on the probability assigned by the neural model to the reference captions.
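The idea can be sketched as follows: run the model in teacher-forcing mode over the human reference captions and use the average log-probability it assigns to them as a cheap proxy for generation-based quality metrics. The `next_word_probs` callable and the toy uniform model are assumptions for illustration; the actual predictors and metrics used in the paper may differ.

```python
import math

def refs_logprob(next_word_probs, references):
    """Average per-token log-probability that a caption model assigns
    to human reference captions.  No captions are generated: the model
    is only scored on the references, and the result can serve as a
    cheap proxy for how the model would fare on generation-based metrics."""
    total, count = 0.0, 0
    for caption in references:
        prefix = ["<start>"]
        for word in caption + ["<end>"]:
            probs = next_word_probs(prefix)          # dict: word -> prob
            total += math.log(probs.get(word, 1e-12))
            count += 1
            prefix.append(word)
    return total / count

# Toy stand-in for a trained model: uniform distribution over a tiny vocabulary.
vocab = ["a", "dog", "runs", "<end>"]
uniform = lambda prefix: {w: 1.0 / len(vocab) for w in vocab}
print(refs_logprob(uniform, [["a", "dog", "runs"]]))
```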
Recurrent neural networks (RNNs) are widely used as a memory model for sequence-related problems. Many RNN variants have been proposed to address the gradient problems encountered when training RNNs and to process long sequences.
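As a toy illustration of those gradient problems, the sketch below (written in PyTorch, a library choice made here only for illustration) compares how much gradient from the last time step of a long sequence reaches the first time step in a vanilla RNN versus a gated variant (LSTM).

```python
import torch
import torch.nn as nn

def first_step_grad_norm(cell_type, seq_len=200, hidden=32):
    """Norm of the gradient w.r.t. the first input after backpropagating
    a loss defined on the last time step of a long sequence."""
    torch.manual_seed(0)
    rnn = cell_type(input_size=8, hidden_size=hidden, batch_first=True)
    x = torch.randn(1, seq_len, 8, requires_grad=True)
    out, _ = rnn(x)
    out[:, -1].sum().backward()          # loss depends on the last step only
    return x.grad[0, 0].norm().item()    # gradient reaching the first step

print("vanilla RNN:", first_step_grad_norm(nn.RNN))
print("LSTM       :", first_step_grad_norm(nn.LSTM))
```

With default initialisation the vanilla RNN's gradient at the first step is typically orders of magnitude smaller, which is the vanishing-gradient behaviour that gated variants were designed to mitigate.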
Entropy-based measures are an important tool for studying human gaze behavior under various conditions. In particular, gaze transition entropy (GTE) is a popular method for quantifying the predictability of fixation transitions.
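For reference, GTE treats fixation transitions between areas of interest (AOIs) as a first-order Markov chain and computes the conditional entropy of its transition matrix weighted by the stationary distribution. The sketch below is a minimal implementation under the simplifying assumption that the stationary distribution can be approximated by the empirical AOI frequencies.

```python
import numpy as np

def gaze_transition_entropy(aoi_sequence, n_aois):
    """Gaze transition entropy: entropy of a first-order Markov model of
    fixation transitions between AOIs, weighted by the (here empirically
    approximated) stationary distribution of the chain."""
    # Transition counts between consecutive fixations
    counts = np.zeros((n_aois, n_aois))
    for a, b in zip(aoi_sequence[:-1], aoi_sequence[1:]):
        counts[a, b] += 1
    # Row-normalise to transition probabilities (skip empty rows)
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    # Stationary distribution approximated by empirical AOI frequencies
    pi = np.bincount(aoi_sequence, minlength=n_aois) / len(aoi_sequence)
    # Conditional entropy of the transition matrix, weighted by pi
    with np.errstate(divide="ignore", invalid="ignore"):
        logs = np.where(probs > 0, np.log2(probs), 0.0)
    return float(-(pi[:, None] * probs * logs).sum())

# Example: fixations alternating strictly between two AOIs are fully predictable
print(gaze_transition_entropy([0, 1, 0, 1, 0, 1], n_aois=2))   # -> 0.0
```

A value of 0 means the next fixated AOI is fully determined by the current one; higher values indicate less predictable scanning.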