When manipulating three-dimensional data, it is possible to ensure that rotational and translational symmetries are respected by applying so-called SE(3)-equivariant models. Protein structure prediction is a prominent example of a task that displays these symmetries. Recent work in this area has successfully made use of an SE(3)-equivariant model, applying an iterative SE(3)-equivariant attention mechanism. Motivated by this application, we implement an iterative version of the SE(3)-Transformer, an SE(3)-equivariant attention-based model for graph data. We address the additional complications which arise when applying the SE(3)-Transformer in an iterative fashion, and compare the iterative and single-pass versions.
We introduce the SE(3)-Transformer, a variant of the self-attention module for 3D point clouds and graphs, which is equivariant under continuous 3D roto-translations. Equivariance is important to ensure stable and predictable performance in the presence of nuisance transformations of the data input. A positive corollary of equivariance is increased weight-tying within the model. The SE(3)-Transformer leverages the benefits of self-attention to operate on large point clouds and graphs with varying numbers of points, while guaranteeing SE(3)-equivariance for robustness. We evaluate our model on a toy N-body particle simulation dataset, showcasing the robustness of the predictions under rotations of the input. We further achieve competitive performance on two real-world datasets, ScanObjectNN and QM9. In all cases, our model outperforms a strong, non-equivariant attention baseline and an equivariant model without attention.
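To make the equivariance property concrete, the following minimal sketch (not the SE(3)-Transformer itself) shows why attention weights computed from pairwise distances yield a rotation-equivariant vector output: rotating the input point cloud and then applying the layer gives the same result as applying the layer and then rotating its output. All names and shapes here are illustrative.

```python
import numpy as np

def equivariant_layer(x):
    """x: (N, 3) point cloud. Returns one 3D output vector per point.

    Attention weights depend only on pairwise distances (rotation-invariant),
    and the values are relative position vectors (rotation-equivariant),
    so the output rotates exactly with the input.
    """
    diff = x[:, None, :] - x[None, :, :]            # (N, N, 3) relative positions
    dist = np.linalg.norm(diff, axis=-1)            # (N, N) invariant features
    w = np.exp(-dist)
    w = w / w.sum(axis=-1, keepdims=True)           # softmax-style attention weights
    return (w[..., None] * diff).sum(axis=1)        # (N, 3) equivariant outputs

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))        # random rotation matrix
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1

rotate_after = equivariant_layer(x) @ Q.T           # apply layer, then rotate
rotate_before = equivariant_layer(x @ Q.T)          # rotate input, then apply layer
print(np.allclose(rotate_after, rotate_before))     # True: the layer is equivariant
```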
The task of mapping two or more distributions to a shared representation has many applications including fair representations, batch effect mitigation, and unsupervised domain adaptation. However, most existing formulations only consider the setting of two distributions, and moreover, do not have an identifiable, unique shared latent representation. We use optimal transport theory to consider a natural multiple-distribution extension of the Monge assignment problem, which we call the symmetric Monge map problem, and show that it is equivalent to the Wasserstein barycenter problem. Yet, the maps to the barycenter are challenging to estimate. Prior methods often ignore transportation cost, rely on adversarial methods, or only work for discrete distributions. Therefore, our goal is to estimate invertible maps between two or more distributions and their corresponding barycenter via a simple iterative flow method. Our method decouples each iteration into two subproblems: 1) estimate simple distributions and 2) estimate the invertible maps to the barycenter via known closed-form OT results. Our empirical results give evidence that this iterative algorithm approximates the maps to the barycenter.
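The closed-form results alluded to above are easiest to see in one dimension, where the W2 barycenter's quantile function is the average of the input quantile functions and the optimal map is the monotone rearrangement. The sketch below illustrates only this special case; it is not the authors' iterative flow method, and the sample data are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
samples = [rng.normal(-2.0, 1.0, n), rng.normal(3.0, 0.5, n)]   # two 1D distributions

# Empirical quantiles are just sorted samples; averaging them gives the barycenter.
barycenter = np.mean([np.sort(s) for s in samples], axis=0)

# Map each distribution to the barycenter by matching ranks (monotone rearrangement).
pushed = []
for s in samples:
    ranks = np.argsort(np.argsort(s))           # rank of each sample within its set
    pushed.append(barycenter[ranks])             # push sample forward to the barycenter

# Both pushed-forward samples now follow the same (barycenter) distribution.
print(np.allclose(np.sort(pushed[0]), np.sort(pushed[1])))       # True
```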
Iterative machine teaching is a method for selecting an optimal teaching example that enables a student to efficiently learn a target concept at each iteration. Existing studies on iterative machine teaching are based on supervised machine learning and assume that there are teachers who know the true answers of all teaching examples. In this study, we consider an unsupervised case where such teachers do not exist; that is, we cannot access the true answer of any teaching example. Students are given a teaching example at each iteration, but there is no guarantee that the corresponding label is correct. Recent studies on crowdsourcing have developed methods for estimating the true answers from crowdsourcing responses. We apply these methods to iterative machine teaching, estimating the true labels of teaching examples along with the student models that are used for teaching. Our method supports the collaborative learning of students without teachers. The experimental results show that our method is particularly effective for teaching low-level students.
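As a concrete illustration of the crowdsourcing-style label estimation step, the sketch below aggregates noisy student predictions into pseudo-true labels by simple majority vote. This is only the most basic of such estimators, not necessarily the one used in the work described above, and the response matrix is made up for the example.

```python
import numpy as np

def majority_vote(responses):
    """responses: (num_students, num_examples) matrix of integer class labels.
    Returns the most frequent label per teaching example."""
    num_classes = responses.max() + 1
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=num_classes), 0, responses
    )                                    # (num_classes, num_examples) vote counts
    return counts.argmax(axis=0)         # estimated label per example

responses = np.array([[0, 1, 1, 2],
                      [0, 1, 0, 2],
                      [1, 1, 1, 2]])     # three students, four teaching examples
print(majority_vote(responses))          # [0 1 1 2]
```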
Self-attention, as the key block of transformers, is a powerful mechanism for extracting features from the inputs. In essence, what self-attention does is infer the pairwise relations between the elements of the inputs and modify the inputs by propagating information between input pairs. As a result, it maps N inputs to N outputs and incurs a quadratic $O(N^2)$ memory and time complexity. We propose centroid attention, a generalization of self-attention that maps N inputs to M outputs $(M \leq N)$, such that the key information in the inputs is summarized in the smaller number of outputs (called centroids). We design centroid attention by amortizing the gradient descent update rule of a clustering objective function on the inputs, which reveals an underlying connection between attention and clustering. By compressing the inputs to the centroids, we extract the key information useful for prediction and also reduce the computation of the attention module and the subsequent layers. We apply our method to various applications, including abstractive text summarization, 3D vision, and image processing. Empirical results demonstrate the effectiveness of our method over the standard transformers.
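The connection to clustering can be pictured as follows: a single centroid-attention step resembles a soft k-means update in which each of the M centroids attends over the N inputs, costing O(NM) rather than O(N^2). The code below is a minimal illustration under these assumptions, not the learned module from the paper (which uses trained projections and amortizes the update inside a network).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def centroid_attention_step(inputs, centroids, tau=1.0):
    """One soft-clustering update: each centroid attends over all inputs.

    inputs:    (N, d) array of input features
    centroids: (M, d) array with M <= N
    """
    logits = centroids @ inputs.T / tau      # (M, N) centroid-to-input scores
    weights = softmax(logits, axis=-1)       # each centroid's attention over inputs
    return weights @ inputs                  # (M, d) updated centroids

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                          # N = 16 inputs
c = x[rng.choice(16, size=4, replace=False)]          # M = 4 centroids, seeded from inputs
for _ in range(3):
    c = centroid_attention_step(x, c)
print(c.shape)                                        # (4, 8) summarized outputs
```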
Transformer models have advanced the state of the art in many Natural Language Processing (NLP) tasks. In this paper, we present a new Transformer architecture, Extended Transformer Construction (ETC), that addresses two key challenges of standard Transformer architectures, namely scaling input length and encoding structured inputs. To scale attention to longer inputs, we introduce a novel global-local attention mechanism between global tokens and regular input tokens. We also show that combining global-local attention with relative position encodings and a Contrastive Predictive Coding (CPC) pre-training objective allows ETC to encode structured inputs. We achieve state-of-the-art results on four natural language datasets requiring long and/or structured inputs.
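One way to picture the global-local mechanism is as an attention mask in which a small set of global tokens attends to (and is attended by) every token, while regular long-input tokens attend only within a local radius. The sketch below builds such a mask; the layout and parameter names are assumptions for illustration, not the ETC implementation.

```python
import numpy as np

def global_local_mask(num_global, num_long, radius):
    """Boolean (T, T) mask, with T = num_global + num_long.
    True means the query token (row) may attend to the key token (column)."""
    T = num_global + num_long
    mask = np.zeros((T, T), dtype=bool)

    mask[:num_global, :] = True              # global tokens attend to every token
    mask[:, :num_global] = True              # every token attends to the global tokens

    idx = np.arange(num_long)
    local = np.abs(idx[:, None] - idx[None, :]) <= radius
    mask[num_global:, num_global:] = local   # long tokens: sliding local window
    return mask

mask = global_local_mask(num_global=2, num_long=8, radius=1)
print(mask.astype(int))
# Cost grows roughly with num_global * T + num_long * radius rather than T**2.
```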