Attention Can Reflect Syntactic Structure (If You Let It)

Added by Artur Kulmizev

Publication date 2021

Fields Informatics Engineering

Research language English

Authors Vinit Ravishankar - Artur Kulmizev - Mostafa Abdou

Computation and Language

Since the popularization of the Transformer as a general-purpose feature encoder for NLP, many studies have attempted to decode linguistic structure from its novel multi-head attention mechanism. However, much of this work has focused almost exclusively on English, a language with rigid word order and a lack of inflectional morphology. In this study, we present decoding experiments for multilingual BERT (mBERT) across 18 languages in order to test the generalizability of the claim that dependency syntax is reflected in attention patterns. We show that full trees can be decoded above baseline accuracy from single attention heads, and that individual relations are often tracked by the same heads across languages. Furthermore, in an attempt to address recent debates about the status of attention as an explanatory mechanism, we experiment with fine-tuning mBERT on a supervised parsing objective while freezing different series of parameters. In steering the objective to learn explicit linguistic structure, we find much of the same structure represented in the resulting attention patterns, with notable differences depending on which parameters are frozen.
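
As a rough illustration of what decoding syntax from a single attention head can look like, the sketch below picks, for each token, the other token it attends to most and scores the result against gold dependency heads. This is a simplified greedy decoder written for illustration, not the procedure used in the study: it does not guarantee a well-formed tree (a full-tree decoder would typically use a maximum spanning tree algorithm), and the function names, the toy attention matrix, and the gold heads are all hypothetical.

```python
def decode_heads(attention):
    """For each token i, predict as its head the token j != i with the highest weight in row i."""
    heads = []
    for i, row in enumerate(attention):
        # Exclude self-attention so a token cannot be its own head.
        candidates = [(weight, j) for j, weight in enumerate(row) if j != i]
        _, best_j = max(candidates, key=lambda c: c[0])
        heads.append(best_j)
    return heads


def unlabeled_attachment_score(predicted, gold):
    """Fraction of non-root tokens whose predicted head equals the gold head (root marked as -1)."""
    scored = [(p, g) for p, g in zip(predicted, gold) if g != -1]
    correct = sum(1 for p, g in scored if p == g)
    return correct / len(scored)


# Hypothetical attention weights of one head for "the dog barks" (row = attending token).
attention = [
    [0.1, 0.8, 0.1],  # "the" attends mostly to "dog"
    [0.2, 0.1, 0.7],  # "dog" attends mostly to "barks"
    [0.3, 0.4, 0.3],  # "barks" is the root and is skipped when scoring
]
gold_heads = [1, 2, -1]  # hypothetical gold heads as token indices

predicted = decode_heads(attention)
uas = unlabeled_attachment_score(predicted, gold_heads)
```

In this toy case the head recovers both gold attachments; on real data one would compare such scores per head and per language against a positional baseline.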

You Can Do Better! If You Elaborate the Reason When Making Prediction

Dongfang Li, Jingcong Tao, Qingcai Chen 2021

Neural predictive models have achieved remarkable performance improvements in various natural language processing tasks. However, most neural predictive models suffer from the lack of explainability of predictions, limiting their practical utility. This paper proposes a neural predictive approach to make a prediction and generate its corresponding explanation simultaneously. It leverages the knowledge entailed in explanations as an additional distillation signal for more efficient learning. We conduct a preliminary study on Chinese medical multiple-choice question answering, English natural language inference, and commonsense question answering tasks. The experimental results show that the proposed approach can generate reasonable explanations for its predictions even with a small-scale training corpus. The proposed method also achieves improved prediction accuracy on three datasets, which indicates that making predictions can benefit from generating the explanation in the decision process.

Computation and Language Artificial Intelligence

Fastformer: Additive Attention Can Be All You Need

Chuhan Wu, Fangzhao Wu, Tao Qi 2021

Transformer is a powerful model for text understanding. However, it is inefficient because its complexity is quadratic in the input sequence length. Although many methods for Transformer acceleration exist, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts, and then further transform each token representation based on its interaction with global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models while achieving comparable or even better long-text modeling performance.

Computation and Language
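
The two-step idea in the Fastformer abstract (pool the sequence into a global context with additive attention, then let each token interact with that context) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published architecture: it omits the learned query, key, and value projections and any multi-head structure, and the names `additive_pool`, `mix_with_global`, and the scoring vector `w` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_pool(vectors, w):
    """Summarize n token vectors into one global vector in O(n * d) time."""
    dim = len(w)
    # One scalar score per token, so no n-by-n pairwise matrix is ever built.
    scores = [sum(wk * vk for wk, vk in zip(w, vec)) / math.sqrt(dim) for vec in vectors]
    alphas = softmax(scores)
    return [sum(a * vec[k] for a, vec in zip(alphas, vectors)) for k in range(dim)]


def mix_with_global(vectors, global_vec):
    """Element-wise product of every token vector with the global context vector."""
    return [[g * v for g, v in zip(global_vec, vec)] for vec in vectors]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token representations
w = [0.0, 0.0]  # hypothetical scoring vector; all zeros gives uniform attention weights

global_context = additive_pool(tokens, w)
mixed = mix_with_global(tokens, global_context)
```

Both steps touch each token once, which is where the linear cost in sequence length comes from; standard self-attention would instead score every pair of tokens.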

Direct scattering transform: catch soliton if you can

Andrey Gelash, Rustam Mullyadzhanov 2019

Direct scattering transform of nonlinear wave fields with solitons may lead to anomalous numerical errors of soliton phase and position parameters. With the focusing one-dimensional nonlinear Schrödinger equation serving as a model, we investigate this fundamental issue theoretically. Using the dressing method, we find the landscape of soliton scattering coefficients in the plane of the complex spectral parameter for multi-soliton wave fields truncated within a finite domain, allowing us to capture the nature of particular numerical errors. They depend on the size of the computational domain $L$, leading to a counterintuitive exponential divergence when increasing $L$ in the presence of a small uncertainty in soliton eigenvalues. In contrast to classical textbooks, we reveal how one of the scattering coefficients loses its analytical properties due to the lack of compact support of the wave field in the case of $L \to \infty$. Finally, we demonstrate that despite this inherent feature of the direct scattering transform, wave fields of arbitrary complexity can be reliably analysed.

Exactly Solvable and Integrable Systems Pattern Formation and Solitons

Can You Put it All Together: Evaluating Conversational Agents' Ability to Blend Skills

Eric Michael Smith, Mary Williamson, Kurt Shuster 2020

Being engaging, knowledgeable, and empathetic are all desirable general qualities in a conversational agent. Previous work has introduced tasks and datasets that aim to help agents learn those qualities in isolation and gauge how well they can express them. But rather than being specialized in one single quality, a good open-domain conversational agent should be able to seamlessly blend them all into one cohesive conversational flow. In this work, we investigate several ways to combine models trained towards isolated capabilities, ranging from simple model aggregation schemes that require minimal additional training, to various forms of multi-task training that encompass several skills at all training stages. We further propose a new dataset, BlendedSkillTalk, to analyze how these capabilities would mesh together in a natural conversation, and compare the performance of different architectures and training schemes. Our experiments show that multi-tasking over several tasks that focus on particular capabilities results in better blended conversation performance compared to models trained on a single skill, and that both unified and two-stage approaches perform well if they are constructed to avoid unwanted bias in skill selection or are fine-tuned on our new task.

Computation and Language

Trick Me If You Can: Human-in-the-loop Generation of Adversarial Examples for Question Answering

Eric Wallace, Pedro Rodriguez, Shi Feng 2018

Adversarial evaluation stress tests a model's understanding of natural language. While past approaches expose superficial patterns, the resulting adversarial examples are limited in complexity and diversity. We propose human-in-the-loop adversarial generation, where human authors are guided to break models. We aid the authors with interpretations of model predictions through an interactive user interface. We apply this generation framework to a question answering task called Quizbowl, where trivia enthusiasts craft adversarial questions. The resulting questions are validated via live human-computer matches: although the questions appear ordinary to humans, they systematically stump neural and information retrieval models. The adversarial questions cover diverse phenomena from multi-hop reasoning to entity type distractors, exposing open challenges in robust question answering.

Computation and Language