ﻻ يوجد ملخص باللغة العربية
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models, across different feature extractors. Despite its simplicity and effectiveness, seeking the best pooling function for different data modality and feature extractor is costly and tedious, especially when the size of features varies (e.g., text, video). Therefore, we propose a Generalized Pooling Operator (GPO), which learns to automatically adapt itself to the best pooling strategy for different features, requiring no manual tuning while staying effective and efficient. We extend the VSE model using this proposed GPO and denote it as VSE$infty$. Without bells and whistles, VSE$infty$ outperforms previous VSE methods significantly on image-text retrieval benchmarks across popular feature extractors. With a simple adaptation, variants of VSE$infty$ further demonstrate its strength by achieving the new state of the art on two video-text retrieval datasets. Comprehensive experiments and visualizations confirm that GPO always discovers the best pooling strategy and can be a plug-and-play feature aggregation module for standard VSE models. Code and pre-trained models are available at https://vse-infty.github.io.
For visual-semantic embedding, the existing methods normally treat the relevance between queries and candidates in a bipolar way -- relevant or irrelevant, and all irrelevant candidates are uniformly pushed away from the query by an equal margin in t
Person reidentification (ReID) is a very hot research topic in machine learning and computer vision, and many person ReID approaches have been proposed; however, most of these methods assume that the same person has the same clothes within a short ti
Self-supervised representation learning for visual pre-training has achieved remarkable success with sample (instance or pixel) discrimination and semantics discovery of instance, whereas there still exists a non-negligible gap between pre-trained mo
Few-shot learning aims at rapidly adapting to novel categories with only a handful of samples at test time, which has been predominantly tackled with the idea of meta-learning. However, meta-learning approaches essentially learn across a variety of f
In this work, we evaluate the use of superpixel pooling layers in deep network architectures for semantic segmentation. Superpixel pooling is a flexible and efficient replacement for other pooling strategies that incorporates spatial prior informatio