Sum-Product-Attention Networks: Leveraging Self-Attention in Probabilistic Circuits

116 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Zhongjie Yu

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Zhongjie Yu - Devendra Singh Dhami - Kristian Kersting

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Probabilistic circuits (PCs) have become the de-facto standard for learning and inference in probabilistic modeling. We introduce Sum-Product-Attention Networks (SPAN), a new generative model that integrates probabilistic circuits with Transformers. SPAN uses self-attention to select the most relevant parts of a probabilistic circuit, here sum-product networks, to improve the modeling capability of the underlying sum-product network. We show that while modeling, SPAN focuses on a specific set of independent assumptions in every product layer of the sum-product network. Our empirical evaluations show that SPAN outperforms state-of-the-art probabilistic generative models on various benchmark data sets as well is an efficient generative image model.

قيم البحث

اقرأ أيضاً

Sum-Product-Quotient Networks

62 - Or Sharir , Amnon Shashua 2017

We present a novel tractable generative model that extends Sum-Product Networks (SPNs) and significantly boosts their power. We call it Sum-Product-Quotient Networks (SPQNs), whose core concept is to incorporate conditional distributions into the mod el by direct computation using quotient nodes, e.g. $P(A|B) = frac{P(A,B)}{P(B)}$. We provide sufficient conditions for the tractability of SPQNs that generalize and relax the decomposable and complete tractability conditions of SPNs. These relaxed conditions give rise to an exponential boost to the expressive efficiency of our model, i.e. we prove that there are distributions which SPQNs can compute efficiently but require SPNs to be of exponential size. Thus, we narrow the gap in expressivity between tractable graphical models and other Neural Network-based generative models.

التعلم الآلي الحوسبة العصبية والتطورية التعلم الالي

Self-Attention Graph Pooling

93 - Junhyun Lee , Inyeop Lee , Jaewoo Kang 2019

Advanced methods of applying deep learning to structured data such as graphs have been proposed in recent years. In particular, studies have focused on generalizing convolutional neural networks to graph data, which includes redefining the convolutio n and the downsampling (pooling) operations for graphs. The method of generalizing the convolution operation to graphs has been proven to improve performance and is widely used. However, the method of applying downsampling to graphs is still difficult to perform and has room for improvement. In this paper, we propose a graph pooling method based on self-attention. Self-attention using graph convolution allows our pooling method to consider both node features and graph topology. To ensure a fair comparison, the same training procedures and model architectures were used for the existing pooling methods and our method. The experimental results demonstrate that our method achieves superior graph classification performance on the benchmark datasets using a reasonable number of parameters.

التعلم الآلي التعلم الالي

Capsules with Inverted Dot-Product Attention Routing

74 - Yao-Hung Hubert Tsai , Nitish Srivastava , Hanlin Goh 2020

We introduce a new routing algorithm for capsule networks, in which a child capsule is routed to a parent based only on agreement between the parents state and the childs vote. The new mechanism 1) designs routing via inverted dot-product attention; 2) imposes Layer Normalization as normalization; and 3) replaces sequential iterative routing with concurrent iterative routing. When compared to previously proposed routing algorithms, our method improves performance on benchmark datasets such as CIFAR-10 and CIFAR-100, and it performs at-par with a powerful CNN (ResNet-18) with 4x fewer parameters. On a different task of recognizing digits from overlayed digit images, the proposed capsule model performs favorably against CNNs given the same number of layers and neurons per layer. We believe that our work raises the possibility of applying capsule networks to complex real-world tasks. Our code is publicly available at: https://github.com/apple/ml-capsules-inverted-attention-routing An alternative implementation is available at: https://github.com/yaohungt/Capsules-Inverted-Attention-Routing/blob/master/README.md

التعلم الآلي التعلم الالي

Sparse Graph Attention Networks

145 - Yang Ye , , Shihao Ji 2019

Graph Neural Networks (GNNs) have proved to be an effective representation learning framework for graph-structured data, and have achieved state-of-the-art performance on many practical predictive tasks, such as node classification, link prediction a nd graph classification. Among the variants of GNNs, Graph Attention Networks (GATs) learn to assign dense attention coefficients over all neighbors of a node for feature aggregation, and improve the performance of many graph learning tasks. However, real-world graphs are often very large and noisy, and GATs are prone to overfitting if not regularized properly. Even worse, the local aggregation mechanism of GATs may fail on disassortative graphs, where nodes within local neighborhood provide more noise than useful information for feature aggregation. In this paper, we propose Sparse Graph Attention Networks (SGATs) that learn sparse attention coefficients under an $L_0$-norm regularization, and the learned sparse attentions are then used for all GNN layers, resulting in an edge-sparsified graph. By doing so, we can identify noisy/task-irrelevant edges, and thus perform feature aggregation on most informative neighbors. Extensive experiments on synthetic and real-world graph learning benchmarks demonstrate the superior performance of SGATs. In particular, SGATs can remove about 50%-80% edges from large assortative graphs, while retaining similar classification accuracies. On disassortative graphs, SGATs prune majority of noisy edges and outperform GATs in classification accuracies by significant margins. Furthermore, the removed edges can be interpreted intuitively and quantitatively. To the best of our knowledge, this is the first graph learning algorithm that shows significant redundancies in graphs and edge-sparsified graphs can achieve similar or sometimes higher predictive performances than original graphs.

التعلم الآلي التعلم الالي

Bayesian Attention Belief Networks

84 - Shujian Zhang , Xinjie Fan , Bo Chen 2021

Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks. Most such models use deterministic attention while stochastic attention is less explored due to the optimization difficulties or complicated model design . This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights with a hierarchy of gamma distributions, and an encoder network by stacking Weibull distributions with a deterministic-upward-stochastic-downward structure to approximate the posterior. The resulting auto-encoding networks can be optimized in a differentiable way with a variational lower bound. It is simple to convert any models with deterministic attention, including pretrained ones, to the proposed Bayesian attention belief networks. On a variety of language understanding tasks, we show that our method outperforms deterministic attention and state-of-the-art stochastic attention in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks. We further demonstrate the general applicability of our method on neural machine translation and visual question answering, showing great potential of incorporating our method into various attention-related tasks.

التعلم الآلي الحساب واللغة التعلم الالي