No Arabic abstract
Learning generic representations with deep networks requires massive training samples and significant computer resources. To learn a new specific task, an important issue is to transfer the generic teachers representation to a student network. In this paper, we propose to use a metric between representations that is based on a functional view of neurons. We use optimal transport to quantify the match between two representations, yielding a distance that embeds some invariances inherent to the representation of deep networks. This distance defines a regularizer promoting the similarity of the students representation with that of the teacher. Our approach can be used in any learning context where representation transfer is applicable. We experiment here on two standard settings: inductive transfer learning, where the teachers representation is transferred to a student network of same architecture for a new related task, and knowledge distillation, where the teachers representation is transferred to a student of simpler architecture for the same task (model compression). Our approach also lends itself to solving new learning problems; we demonstrate this by showing how to directly transfer the teachers representation to a simpler architecture student for a new related task.
The ability to measure similarity between documents enables intelligent summarization and analysis of large corpora. Past distances between documents suffer from either an inability to incorporate semantic similarities between words or from scalability issues. As an alternative, we introduce hierarchical optimal transport as a meta-distance between documents, where documents are modeled as distributions over topics, which themselves are modeled as distributions over words. We then solve an optimal transport problem on the smaller topic space to compute a similarity score. We give conditions on the topics under which this construction defines a distance, and we relate it to the word movers distance. We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost.
Computing optimal transport maps between high-dimensional and continuous distributions is a challenging problem in optimal transport (OT). Generative adversarial networks (GANs) are powerful generative models which have been successfully applied to learn maps across high-dimensional domains. However, little is known about the nature of the map learned with a GAN objective. To address this problem, we propose a generative adversarial model in which the discriminators objective is the $2$-Wasserstein metric. We show that during training, our generator follows the $W_2$-geodesic between the initial and the target distributions. As a consequence, it reproduces an optimal map at the end of training. We validate our approach empirically in both low-dimensional and high-dimensional continuous settings, and show that it outperforms prior methods on image data.
Inverse optimal transport (OT) refers to the problem of learning the cost function for OT from observed transport plan or its samples. In this paper, we derive an unconstrained convex optimization formulation of the inverse OT problem, which can be further augmented by any customizable regularization. We provide a comprehensive characterization of the properties of inverse OT, including uniqueness of solutions. We also develop two numerical algorithms, one is a fast matrix scaling method based on the Sinkhorn-Knopp algorithm for discrete OT, and the other one is a learning based algorithm that parameterizes the cost function as a deep neural network for continuous OT. The novel framework proposed in the work avoids repeatedly solving a forward OT in each iteration which has been a thorny computational bottleneck for the bi-level optimization in existing inverse OT approaches. Numerical results demonstrate promising efficiency and accuracy advantages of the proposed algorithms over existing state-of-the-art methods.
The top-k operation, i.e., finding the k largest or smallest elements from a collection of scores, is an important model component, which is widely used in information retrieval, machine learning, and data mining. However, if the top-k operation is implemented in an algorithmic way, e.g., using bubble algorithm, the resulting model cannot be trained in an end-to-end way using prevalent gradient descent algorithms. This is because these implementations typically involve swapping indices, whose gradient cannot be computed. Moreover, the corresponding mapping from the input scores to the indicator vector of whether this element belongs to the top-k set is essentially discontinuous. To address the issue, we propose a smoothed approximation, namely the SOFT (Scalable Optimal transport-based diFferenTiable) top-k operator. Specifically, our SOFT top-k operator approximates the output of the top-k operation as the solution of an Entropic Optimal Transport (EOT) problem. The gradient of the SOFT operator can then be efficiently approximated based on the optimality conditions of EOT problem. We apply the proposed operator to the k-nearest neighbors and beam search algorithms, and demonstrate improved performance.
Variational Auto-Encoders enforce their learned intermediate latent-space data distribution to be a simple distribution, such as an isotropic Gaussian. However, this causes the posterior collapse problem and loses manifold structure which can be important for datasets such as facial images. A GAN can transform a simple distribution to a latent-space data distribution and thus preserve the manifold structure, but optimizing a GAN involves solving a Min-Max optimization problem, which is difficult and not well understood so far. Therefore, we propose a GAN-like method to transform a simple distribution to a data distribution in the latent space by solving only a minimization problem. This minimization problem comes from training a discriminator between a simple distribution and a latent-space data distribution. Then, we can explicitly formulate an Optimal Transport (OT) problem that computes the desired mapping between the two distributions. This means that we can transform a distribution without solving the difficult Min-Max optimization problem. Experimental results on an eight-Gaussian dataset show that the proposed OT can handle multi-cluster distributions. Results on the MNIST and the CelebA datasets validate the effectiveness of the proposed method.