No Arabic abstract
Significant effort has been recently devoted to modeling visual relations. This has mostly addressed the design of architectures, typically by adding parameters and increasing model complexity. However, visual relation learning is a long-tailed problem, due to the combinatorial nature of joint reasoning about groups of objects. Increasing model complexity is, in general, ill-suited for long-tailed problems due to their tendency to overfit. In this paper, we explore an alternative hypothesis, denoted the Devil is in the Tails. Under this hypothesis, better performance is achieved by keeping the model simple but improving its ability to cope with long-tailed distributions. To test this hypothesis, we devise a new approach for training visual relationships models, which is inspired by state-of-the-art long-tailed recognition literature. This is based on an iterative decoupled training scheme, denoted Decoupled Training for Devil in the Tails (DT2). DT2 employs a novel sampling approach, Alternating Class-Balanced Sampling (ACBS), to capture the interplay between the long-tailed entity and predicate distributions of visual relations. Results show that, with an extremely simple architecture, DT2-ACBS significantly outperforms much more complex state-of-the-art methods on scene graph generation tasks. This suggests that the development of sophisticated models must be considered in tandem with the long-tailed nature of the problem.
Many machine vision applications, such as semantic segmentation and depth prediction, require predictions for every pixel of the input image. Models for such problems usually consist of encoders which decrease spatial resolution while learning a high-dimensional representation, followed by decoders who recover the original input resolution and result in low-dimensional predictions. While encoders have been studied rigorously, relatively few studies address the decoder side. This paper presents an extensive comparison of a variety of decoders for a variety of pixel-wise tasks ranging from classification, regression to synthesis. Our contributions are: (1) Decoders matter: we observe significant variance in results between different types of decoders on various problems. (2) We introduce new residual-like connections for decoders. (3) We introduce a novel decoder: bilinear additive upsampling. (4) We explore prediction artifacts.
We present a study of the effects of collisional dynamics on the formation and detectability of cold tidal streams. A semi-analytical model for the evolution of the stellar mass function was implemented and coupled to a fast stellar stream simulation code, as well as the synthetic cluster evolution code EMACSS for the mass evolution as a function of a globular cluster orbit. We find that the increase in the average mass of the escaping stars for clusters close to dissolution has a major effect on the observable stream surface density. As an example, we show that Palomar 5 would have undetectable streams (in an SDSS-like survey) if it was currently three times more massive, despite the fact that a more massive cluster loses stars at a higher rate. This bias due to the preferential escape of low-mass stars is an alternative explanation for the absence of tails near massive clusters, than a dark matter halo associated with the cluster. We explore the orbits of a large sample of Milky Way globular clusters and derive their initial masses and remaining mass fraction. Using properties of known tidal tails we explore regions of parameter space that favour the detectability of a stream. A list of high probability candidates is discussed
In this paper we show that by carefully making good choices for various detailed but important factors in a visual recognition framework using deep learning features, one can achieve a simple, efficient, yet highly accurate image classification system. We first list 5 important factors, based on both existing researches and ideas proposed in this paper. These important detailed factors include: 1) $ell_2$ matrix normalization is more effective than unnormalized or $ell_2$ vector normalization, 2) the proposed natural deep spatial pyramid is very effective, and 3) a very small $K$ in Fisher Vectors surprisingly achieves higher accuracy than normally used large $K$ values. Along with other choices (convolutional activations and multiple scales), the proposed DSP framework is not only intuitive and efficient, but also achieves excellent classification accuracy on many benchmark datasets. For example, DSPs accuracy on SUN397 is 59.78%, significantly higher than previous state-of-the-art (53.86%).
By design, average precision (AP) for object detection aims to treat all classes independently: AP is computed independently per category and averaged. On the one hand, this is desirable as it treats all classes, rare to frequent, equally. On the other hand, it ignores cross-category confidence calibration, a key property in real-world use cases. Unfortunately, we find that on imbalanced, large-vocabulary datasets, the default implementation of AP is neither category independent, nor does it directly reward properly calibrated detectors. In fact, we show that the default implementation produces a gameable metric, where a simple, nonsensical re-ranking policy can improve AP by a large margin. To address these limitations, we introduce two complementary metrics. First, we present a simple fix to the default AP implementation, ensuring that it is truly independent across categories as originally intended. We benchmark recent advances in large-vocabulary detection and find that many reported gains do not translate to improvements under our new per-class independent evaluation, suggesting recent improvements may arise from difficult to interpret changes to cross-category rankings. Given the importance of reliably benchmarking cross-category rankings, we consider a pooled version of AP (AP-pool) that rewards properly calibrated detectors by directly comparing cross-category rankings. Finally, we revisit classical approaches for calibration and find that explicitly calibrating detectors improves state-of-the-art on AP-pool by 1.7 points.
We investigate opinion dynamics in multi-agent networks when a bias toward one of two possible opinions exists; for example, reflecting a status quo vs a superior alternative. Starting with all agents sharing an initial opinion representing the status quo, the system evolves in steps. In each step, one agent selected uniformly at random adopts the superior opinion with some probability $alpha$, and with probability $1 - alpha$ it follows an underlying update rule to revise its opinion on the basis of those held by its neighbors. We analyze convergence of the resulting process under two well-known update rules, namely majority and voter. The framework we propose exhibits a rich structure, with a non-obvious interplay between topology and underlying update rule. For example, for the voter rule we show that the speed of convergence bears no significant dependence on the underlying topology, whereas the picture changes completely under the majority rule, where network density negatively affects convergence. We believe that the model we propose is at the same time simple, rich, and modular, affording mathematical characterization of the interplay between bias, underlying opinion dynamics, and social structure in a unified setting.