Across a wide range of applications, from autonomous vehicles to medical imaging, multi-spectral images provide an opportunity to extract additional information not present in color images. One of the most important steps in making this information readily available is the accurate estimation of dense correspondences between different spectra. Due to the nature of cross-spectral images, most correspondence-estimation techniques developed for the visual domain are simply not applicable. Furthermore, most cross-spectral techniques rely on spectra-specific characteristics to perform the alignment. In this work, we address the dense correspondence estimation problem in a way that generalizes to more than one spectrum. We do this by introducing a novel cycle-consistency metric that enables self-supervised training. This, combined with our spectra-agnostic loss functions, allows us to train the same network across multiple spectra. We demonstrate our approach on the challenging task of dense RGB-FIR correspondence estimation. We also evaluate the unmodified network on RGB-NIR and RGB-RGB correspondence, where it achieves higher accuracy than comparable self-supervised approaches. Our work shows that cross-spectral correspondence estimation can be solved in a common framework that learns to generalize alignment across spectra.
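The abstract does not spell out the metric itself, so the following is only a minimal sketch of forward-backward cycle consistency for dense flows, assuming PyTorch, flows stored as per-pixel (dx, dy) displacements, and hypothetical names throughout: a pixel carried A→B and back B→A should return to where it started, and the residual can supervise training without labels in either spectrum.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_error(flow_ab, flow_ba):
    """Per-pixel forward-backward cycle error between two dense flows.

    flow_ab, flow_ba: (B, 2, H, W) displacements in pixels, channel 0 = dx,
    channel 1 = dy. A pixel x in image A maps to x + flow_ab(x) in B;
    sampling flow_ba there and adding it should return close to x when
    the two flows are cycle-consistent.
    """
    b, _, h, w = flow_ab.shape
    dev = flow_ab.device
    # Base sampling grid in the normalised [-1, 1] coordinates expected
    # by grid_sample, ordered (x, y) along the last dimension.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=dev),
                            torch.linspace(-1, 1, w, device=dev),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)
    # Convert pixel displacements to normalised offsets and warp the grid.
    scale = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)], device=dev)
    warped = grid + flow_ab.permute(0, 2, 3, 1) * scale
    # Sample the backward flow at the forward-warped locations.
    flow_ba_at_b = F.grid_sample(flow_ba, warped, align_corners=True)
    # Residual displacement after the full A -> B -> A cycle.
    return (flow_ab + flow_ba_at_b).norm(dim=1)  # (B, H, W)
```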
Translations between the quantum circuit model and the measurement-based one-way model are useful for verification and optimisation of quantum computations. They make crucial use of a property known as gflow. While gflow is defined for one-way computations allowing measurements in three different planes of the Bloch sphere, most research so far has focused on computations containing only measurements in the XY-plane. Here, we give the first circuit-extraction algorithm to work for one-way computations containing measurements in all three planes and having gflow. The algorithm is efficient and the resulting circuits do not contain ancillae. One-way computations are represented using the ZX-calculus, hence the algorithm also represents the most general known procedure for extracting circuits from ZX-diagrams. In developing this algorithm, we generalise several concepts and results previously known for computations containing only XY-plane measurements. We bring together several known rewrite rules for measurement patterns and formalise them in a unified notation using the ZX-calculus. These rules are used to simplify measurement patterns by reducing the number of qubits while preserving both the semantics and the existence of gflow. The results can be applied to circuit optimisation by translating circuits to patterns and back again.
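As an illustration only (not the authors' own code), the open-source pyzx library implements a simplify-then-extract pipeline of exactly this shape for ZX-diagrams; the sketch below assumes a recent pyzx release exposing zx.extract_circuit and may need adjusting across versions.

```python
import pyzx as zx

# Random Clifford+T computation, represented directly as a ZX-diagram.
g = zx.generate.cliffordT(4, 50)
original = g.copy()

# Simplify the diagram; full_reduce preserves the existence of gflow,
# which is what guarantees a circuit can still be extracted afterwards.
zx.simplify.full_reduce(g)

# Extract an ancilla-free circuit back out of the simplified diagram.
c = zx.extract_circuit(g)

# Check semantic equivalence of the before/after computations.
print(zx.compare_tensors(original, c))
```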
We prove that the evidence lower bound (ELBO) employed by variational auto-encoders (VAEs) admits non-trivial solutions having constant posterior variances under certain mild conditions, removing the need to learn variances in the encoder. The proof follows from an unexpected journey through an array of topics: the closed-form optimal decoder for Gaussian VAEs, a proof that the decoder is always smooth, a proof that the ELBO at its stationary points equals the exact log evidence, and a demonstration that the posterior variance is merely part of a stochastic estimator of the decoder Hessian. The penalty incurred from using a constant posterior variance is small under mild conditions, and otherwise acts to discourage large variations in the decoder Hessian. From here we derive a simplified formulation of the ELBO as an expectation over a batch, which we call the Batch Information Lower Bound (BILBO). Despite the use of Gaussians, our analysis is broadly applicable -- it extends to any likelihood function that induces a Riemannian metric. Regarding learned likelihoods, we show that the ELBO is optimal in the limit as the likelihood variances approach zero, where it is equivalent to the change-of-variables formulation employed in normalizing flow networks. Standard optimization procedures are unstable in this limit, so we propose a bounded Gaussian likelihood that is invariant to the scale of the data, using a measure of the aggregate information in a batch, which we call Bounded Aggregate Information Sampling (BAGGINS). Combining the two formulations, we construct VAE networks with only half the outputs of ordinary VAEs (no learned variances), yielding improved ELBO scores and scale invariance in experiments. As we perform our analyses irrespective of any particular network architecture, our reformulations may apply to any VAE implementation.
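The exact BILBO and BAGGINS expressions are not reproduced here; the sketch below only illustrates the headline claim under stated assumptions (PyTorch, a Gaussian VAE, a shared constant posterior standard deviation sigma_q), with all names being illustrative: the encoder outputs a mean only, so it has half the outputs of an ordinary VAE encoder.

```python
import math
import torch

def neg_elbo_constant_variance(x, encoder, decoder, sigma_q=0.1, sigma_x=1.0):
    """Negative ELBO for a Gaussian VAE with a constant posterior variance.

    encoder(x) -> mu_z of shape (B, D): the posterior mean only.
    decoder(z) -> mu_x with the same shape as x: the likelihood mean.
    sigma_q: fixed posterior std; sigma_x: fixed Gaussian likelihood std.
    """
    mu_z = encoder(x)
    # Reparameterised sample using the shared constant posterior std.
    z = mu_z + sigma_q * torch.randn_like(mu_z)
    mu_x = decoder(z)
    # Gaussian reconstruction term, up to an additive constant.
    rec = ((x - mu_x) ** 2).flatten(1).sum(dim=1) / (2.0 * sigma_x ** 2)
    # KL( N(mu_z, sigma_q^2 I) || N(0, I) ) in closed form.
    d = mu_z.shape[1]
    kl = 0.5 * ((mu_z ** 2).sum(dim=1)
                + d * (sigma_q ** 2 - 1.0 - 2.0 * math.log(sigma_q)))
    return (rec + kl).mean()
```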
In this paper, we focus on self-supervised learning of visual correspondence from unlabeled videos in the wild. Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation. Intra-video learning transforms image content across the frames of a single video via pair-wise frame affinities. To obtain discriminative representations for instance-level separation, we go beyond intra-video analysis and construct an inter-video affinity to facilitate contrastive transformation across different videos. By enforcing transformation consistency between the intra- and inter-video levels, fine-grained correspondence associations are well preserved and instance-level feature discrimination is effectively reinforced. Our simple framework outperforms recent self-supervised correspondence methods on a range of visual tasks, including video object tracking (VOT), video object segmentation (VOS), and pose keypoint tracking. Notably, our method also surpasses fully-supervised affinity representations (e.g., ResNet) and performs competitively against recent fully-supervised algorithms designed for specific tasks (e.g., VOT and VOS).
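As a rough sketch of the frame pair-wise affinity operation at the heart of the intra-video term (the inter-video contrastive part is omitted, and all names and the temperature value are illustrative, assuming PyTorch):

```python
import torch
import torch.nn.functional as F

def affinity_transform(feat_a, feat_b, target_b, temperature=0.07):
    """Reconstruct frame A's content from frame B via a pair-wise affinity.

    feat_a, feat_b: (B, C, H, W) frame features from a shared encoder.
    target_b:       (B, K, H, W) content of frame B (e.g. colours or labels).
    Returns that content transported onto frame A's pixels.
    """
    b, c, h, w = feat_a.shape
    fa = F.normalize(feat_a.flatten(2), dim=1)               # (B, C, HW)
    fb = F.normalize(feat_b.flatten(2), dim=1)               # (B, C, HW)
    # Affinity from every pixel in A to every pixel in B.
    affinity = torch.softmax(fa.transpose(1, 2) @ fb / temperature, dim=-1)
    recon = affinity @ target_b.flatten(2).transpose(1, 2)   # (B, HW, K)
    return recon.transpose(1, 2).reshape(b, -1, h, w)
```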
This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between the two tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization helps reduce ambiguities in fine-grained matching by narrowing down the search region, while fine-grained matching provides bottom-up features that facilitate region-level localization. Our method outperforms state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.
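A minimal sketch of one affinity serving both granularities, under illustrative names and assuming PyTorch; given a row-softmaxed (B, HW, HW) inter-frame affinity such as the one computed in the previous sketch, the same matrix can propagate fine-grained labels pixel-by-pixel and, pooled over region masks, score candidate regions for localization:

```python
import torch

def propagate_labels(aff, labels_next):
    """Pixel-level use: transport labels (B, K, H, W) from the next
    frame back to the previous one through the shared affinity."""
    b, k, h, w = labels_next.shape
    out = aff @ labels_next.flatten(2).transpose(1, 2)      # (B, HW, K)
    return out.transpose(1, 2).reshape(b, k, h, w)

def region_score(aff, mask_prev, mask_next):
    """Region-level use: total transition mass between two binary
    region masks (B, H, W), usable to localise the tracked region in
    the next frame by scoring candidate boxes."""
    m_prev = mask_prev.flatten(1).float()                   # (B, HW)
    m_next = mask_next.flatten(1).float()
    return torch.einsum("bi,bij,bj->b", m_prev, aff, m_next)
```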
We tackle the problem of learning object detectors without supervision. Unlike weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to teach the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify objects by type, enumerate each instance of an object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors on the tasks of object detection and sound source localisation. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and that our method can learn to detect generic objects beyond instruments, such as airplanes and cats.
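The full framework (pseudo-boxes, detector training, class alignment) is not reproduced here; the following is only a sketch of a max-pooled audio-visual contrastive objective of the kind described, assuming PyTorch and illustrative names throughout: each audio clip should match some location in its own frame better than any location in other frames, and the per-location similarities double as a localisation map.

```python
import torch
import torch.nn.functional as F

def audio_visual_nce(vis_maps, aud_embs, temperature=0.07):
    """Symmetric InfoNCE between audio clips and spatial visual features.

    vis_maps: (B, C, H, W) visual feature maps, one per video frame.
    aud_embs: (B, C) audio embeddings for the corresponding clips.
    """
    b = vis_maps.shape[0]
    v = F.normalize(vis_maps.flatten(2), dim=1)      # (B, C, HW)
    a = F.normalize(aud_embs, dim=1)                 # (B, C)
    # Similarity of every clip to every location of every frame; the
    # max over locations both pools to a clip-frame score and marks
    # where in the frame the sounding object is.
    sim = torch.einsum("ic,jcp->ijp", a, v)          # (B, B, HW)
    logits = sim.max(dim=-1).values / temperature    # (B, B)
    labels = torch.arange(b, device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```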