
Attention that does not Explain Away

Posted by Nan Ding
Publication date: 2020
Research language: English





Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. Following a probabilistic view of the attention via the Gaussian mixture model, we find empirical evidence that the Transformer attention tends to explain away certain input neurons. To compensate for this, we propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the explaining away effect without introducing significant computational or memory cost. Empirically, we show that the new attention schemes result in improved performance on several well-known benchmarks.
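To make the explaining-away effect concrete, below is a minimal NumPy sketch contrasting standard softmax attention with one plausible reading of the doubly-normalized scheme: normalize the exponentiated scores over queries first, then over keys. The two-step normalization, the toy shapes, and the column-sum diagnostic are illustrative assumptions, not the paper's verified formulation.

```python
# Hedged sketch: standard softmax attention vs. an assumed doubly-normalized variant.
import numpy as np

def softmax_attention(Q, K):
    """Standard Transformer weights: normalized over keys for each query."""
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])             # (n_queries, n_keys)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def doubly_normalized_attention(Q, K):
    """Assumed doubly-normalized weights: normalize over queries, then over keys."""
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max())
    a = e / e.sum(axis=0, keepdims=True)                   # step 1: each key's column sums to 1
    return a / a.sum(axis=-1, keepdims=True)               # step 2: each query's row sums to 1

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))

# Total attention each key (input neuron) receives, summed over queries.
# A near-zero column sum under softmax attention is the "explaining away" the
# abstract describes; the doubly-normalized weights spread this total more evenly.
print(softmax_attention(Q, K).sum(axis=0))
print(doubly_normalized_attention(Q, K).sum(axis=0))
```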


Read also

135 - Joao Magueijo, Lee Smolin 2018
In this paper we propose that cosmological time is a quantum observable that does not commute with other quantum operators essential for the definition of cosmological states, notably the cosmological constant. This is inspired by properties of a measure of time, the Chern-Simons time, and the fact that in some theories it appears as a conjugate to the cosmological constant, with the two promoted to non-commuting quantum operators. Thus, the Universe may be delocalised in time: it does not "know" the time, a property which opens up new cosmological scenarios, as well as invalidating several paradoxes, such as the timelike tower of turtles associated with an omnipresent time line. Alternatively, a Universe with a sharply defined clock time must have an indeterminate cosmological constant. The challenge then is to explain how islands of localized time may emerge, and give rise to localized histories. In some scenarios this is achieved by backward transitions in quantum time, cycling the Universe in something akin to a time machine cycle, with classical flow and quantum ebbing. The emergence of matter in a sea of Lambda probably provides the ballast behind classical behaviour.
175 - H. Nikolic 2016
We present a simple general proof that the Casimir force cannot originate from the vacuum energy of the electromagnetic (EM) field. The full QED Hamiltonian consists of 3 terms: the pure electromagnetic term $H_{\rm em}$, the pure matter term $H_{\rm matt}$ and the interaction term $H_{\rm int}$. The $H_{\rm em}$ term commutes with all matter fields because it does not have any explicit dependence on matter fields. As a consequence, $H_{\rm em}$ cannot generate any forces on matter. Since it is precisely this term that generates the vacuum energy of the EM field, it follows that the vacuum energy does not generate the forces. The misleading statements in the literature that vacuum energy generates the Casimir force can be boiled down to the fact that $H_{\rm em}$ attains an implicit dependence on matter fields through the use of the equations of motion, and to the illegitimate treatment of that implicit dependence as if it were explicit. The true origin of the Casimir force is the van der Waals force generated by $H_{\rm int}$.
Although deep neural networks generally have fixed network structures, the concept of dynamic mechanisms has drawn more and more attention in recent years. Attention mechanisms compute input-dependent dynamic attention weights for aggregating a sequence of hidden states. Dynamic network configuration in convolutional neural networks (CNNs) selectively activates only part of the network at a time for different inputs. In this paper, we combine the two dynamic mechanisms for text classification tasks. Traditional attention mechanisms attend to the whole sequence of hidden states for an input sentence, while in most cases not all attention is needed, especially for long sequences. We propose a novel method called Gated Attention Network (GA-Net) to dynamically select a subset of elements to attend to using an auxiliary network, and compute attention weights to aggregate the selected elements. It avoids a significant amount of unnecessary computation on unattended elements, and allows the model to pay attention to important parts of the sequence. Experiments on various datasets show that the proposed method achieves better performance compared with all baseline models with global or local attention, while requiring less computation and achieving better interpretability. It is also promising to extend the idea to more complex attention-based models, such as transformers and seq-to-seq models. (A hedged code sketch of this gated-selection idea appears after this list.)
Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets that can be clustered well. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set, if we go by any of these popular criteria. In contrast, the practitioner's perspective is much more optimistic. Our explanation for this disparity of opinions is that complexity theory concentrates on the worst case, whereas in reality we only care about data sets that can be clustered well. We introduce a theoretical framework of clustering in metric spaces that revolves around a notion of good clustering. We show that if a good clustering exists, then in many cases it can be efficiently found. Our conclusion is that, contrary to popular belief, clustering should not be considered a hard task.
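For the GA-Net abstract above, the following is a minimal NumPy sketch of the gated-selection idea as described: an auxiliary scorer gates which hidden states are candidates, and attention weights are then computed only over the gated subset. The linear gate, the sigmoid threshold of 0.5, and the single pooled output are illustrative assumptions, not that paper's exact architecture.

```python
# Hedged sketch of gated attention: select a subset of hidden states with an
# auxiliary scorer, then compute attention weights only over that subset.
import numpy as np

def gated_attention(H, w_gate, w_attn, threshold=0.5):
    """H: (seq_len, d) hidden states; returns one pooled context vector."""
    gate_logits = H @ w_gate                       # auxiliary network (here just linear)
    gates = 1.0 / (1.0 + np.exp(-gate_logits))     # selection probability per element
    selected = np.where(gates > threshold)[0]      # hard selection of a subset
    if selected.size == 0:                         # fall back to attending everywhere
        selected = np.arange(H.shape[0])
    scores = H[selected] @ w_attn                  # attention scores on the subset only
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ H[selected]                   # aggregate only the selected states

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 16))
context = gated_attention(H, rng.normal(size=16), rng.normal(size=16))
```

Skipping the unselected states is where the computational saving described in the abstract comes from; a trainable version of this hard selection typically needs a differentiable relaxation, which this sketch omits.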
