Models based on the Transformer architecture have achieved higher accuracy than models based on competing architectures across a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows free information flow at arbitrary distances. Adopting a probabilistic view of attention via the Gaussian mixture model, we find empirical evidence that Transformer attention tends to explain away certain input neurons. To compensate for this, we propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the explaining-away effect without introducing significant computational or memory cost. Empirically, we show that the new attention scheme results in improved performance on several well-known benchmarks.
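The abstract does not spell out the scheme, but the gist of double normalization can be sketched in NumPy: first normalize the exponentiated scores over queries, so every key must spend a unit of attention mass and cannot be explained away, then renormalize over keys so each query's weights still sum to one. The function name and the exact order of the two normalizations are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def doubly_normalized_attention(scores):
    """Sketch of a doubly-normalized attention weighting.

    scores: (num_queries, num_keys) raw attention logits.
    Standard softmax attention normalizes each row (over keys) only,
    which lets some keys receive near-zero total attention.
    """
    e = np.exp(scores - scores.max())            # numerically stable exp
    col = e / e.sum(axis=0, keepdims=True)       # 1) normalize over queries
    w = col / col.sum(axis=1, keepdims=True)     # 2) renormalize over keys
    return w                                     # rows sum to 1, no key zeroed out
```

After step 1 every key contributes exactly one unit of mass across queries, so no key can be entirely ignored; step 2 restores the usual convex-combination property per query.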
In this paper we propose that cosmological time is a quantum observable that does not commute with other quantum operators essential for the definition of cosmological states, notably the cosmological constant. This is inspired by properties of a measure of time---the Chern-Simons time---and the fact that in some theories it appears as a conjugate to the cosmological constant, with the two promoted to non-commuting quantum operators. Thus, the Universe may be delocalised in time: it does not {\it know} the time, a property which opens up new cosmological scenarios and invalidates several paradoxes, such as the timelike tower of turtles associated with an omnipresent time line. Alternatively, a Universe with a sharply defined clock time must have an indeterminate cosmological constant. The challenge then is to explain how islands of localized time may emerge and give rise to localized histories. In some scenarios this is achieved by backward transitions in quantum time, cycling the Universe in something akin to a time-machine cycle, with classical flow and quantum ebbing. The emergence of matter in a sea of Lambda probably provides the ballast behind classical behaviour.
We present a simple general proof that the Casimir force cannot originate from the vacuum energy of the electromagnetic (EM) field. The full QED Hamiltonian consists of three terms: the pure electromagnetic term $H_{\rm em}$, the pure matter term $H_{\rm matt}$, and the interaction term $H_{\rm int}$. The $H_{\rm em}$ term commutes with all matter fields because it does not have any explicit dependence on them. As a consequence, $H_{\rm em}$ cannot generate any forces on matter. Since it is precisely this term that generates the vacuum energy of the EM field, it follows that the vacuum energy does not generate the forces. The misleading statements in the literature that vacuum energy generates the Casimir force can be boiled down to the fact that $H_{\rm em}$ attains an implicit dependence on matter fields through the use of the equations of motion, and to the illegitimate treatment of this implicit dependence as if it were explicit. The true origin of the Casimir force is the van der Waals force generated by $H_{\rm int}$.
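The commutation argument can be written compactly (a sketch of the standard Heisenberg-picture reasoning, not a derivation from the paper): for any observable $A$ built from matter fields,

```latex
i\hbar\,\frac{dA}{dt} = [A, H]
  = [A, H_{\rm em}] + [A, H_{\rm matt}] + [A, H_{\rm int}],
\qquad [A, H_{\rm em}] = 0,
```

since $H_{\rm em}$ contains no matter fields explicitly. Hence $H_{\rm em}$, and with it the EM vacuum energy, contributes nothing to the dynamics of matter; only $H_{\rm matt}$ and $H_{\rm int}$ can.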
We construct a finitely generated group that does not satisfy the generalized Burghelea conjecture.
Although deep neural networks generally have fixed network structures, dynamic mechanisms have drawn increasing attention in recent years. Attention mechanisms compute input-dependent dynamic attention weights for aggregating a sequence of hidden states. Dynamic network configuration in convolutional neural networks (CNNs) selectively activates only part of the network at a time for different inputs. In this paper, we combine the two dynamic mechanisms for text classification tasks. Traditional attention mechanisms attend to the whole sequence of hidden states for an input sentence, but in most cases not all of this attention is needed, especially for long sequences. We propose a novel method called Gated Attention Network (GA-Net) that dynamically selects a subset of elements to attend to using an auxiliary network, and computes attention weights to aggregate the selected elements. It avoids a significant amount of unnecessary computation on unattended elements, and allows the model to pay attention to the important parts of the sequence. Experiments on various datasets show that the proposed method achieves better performance than all baseline models with global or local attention, while requiring less computation and offering better interpretability. The idea is also promising to extend to more complex attention-based models, such as transformers and sequence-to-sequence models.
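As a rough sketch of the gating idea: an auxiliary network (assumed given here) emits per-position gate logits, hard gates mask out unselected positions, and attention is then computed only over the surviving elements. The function name, the hard-threshold gate, and the single-query dot-product scoring are illustrative assumptions, not GA-Net's actual architecture.

```python
import numpy as np

def gated_attention(hidden, gate_logits, query):
    """Attend only to positions selected by an auxiliary gate network.

    hidden:      (seq_len, d) hidden states.
    gate_logits: (seq_len,) outputs of the auxiliary network (assumed given).
    query:       (d,) attention query vector.
    """
    gates = gate_logits > 0                         # binary selection of positions
    scores = hidden @ query                         # (seq_len,) raw scores
    scores = np.where(gates, scores, -np.inf)       # mask unselected positions
    e = np.exp(scores - scores[gates].max())        # stable exp; masked -> 0
    weights = e / e.sum()                           # softmax over selected only
    return weights @ hidden                         # (d,) aggregated context
```

In a full model the hard gate would be trained with a relaxation (e.g. Gumbel-softmax) so gradients can flow through the discrete selection; the sketch assumes at least one gate is open.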
Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets {\em that can be clustered well}. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set under any of these popular criteria. In contrast, the practitioner's perspective is much more optimistic. Our explanation for this disparity of opinions is that complexity theory concentrates on the worst case, whereas in reality we only care about data sets that can be clustered well. We introduce a theoretical framework of clustering in metric spaces that revolves around a notion of good clustering. We show that if a good clustering exists, then in many cases it can be found efficiently. Our conclusion is that, contrary to popular belief, clustering should not be considered a hard task.