
Laughing Heads: Can Transformers Detect What Makes a Sentence Funny?

Added by Maxime Peyrard
Publication date: 2021
Research language: English





The automatic detection of humor poses a grand challenge for natural language processing. Transformer-based systems have recently achieved remarkable results on this task, but they usually (1) were evaluated in setups where serious vs. humorous texts came from entirely different sources, and (2) focused on benchmarking performance without providing insights into how the models work. We make progress in both respects by training and analyzing transformer-based humor recognition models on a recently introduced dataset consisting of minimal pairs of aligned sentences, one serious, the other humorous. We find that, although our aligned dataset is much harder than previous datasets, transformer-based models recognize the humorous sentence in an aligned pair with high accuracy (78%). In a careful error analysis, we characterize easy vs. hard instances. Finally, by analyzing attention weights, we obtain important insights into the mechanisms by which transformers recognize humor. Most remarkably, we find clear evidence that one single attention head learns to recognize the words that make a test sentence humorous, even without access to this information at training time.
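The attention analysis described above can be reproduced in spirit with standard tooling. The following is a minimal sketch, not the authors' code, assuming a HuggingFace BERT-style classifier; the checkpoint name and the layer/head indices are placeholders, since in the paper the relevant head is identified empirically after fine-tuning on the aligned pairs. The sketch extracts per-head attention weights and ranks the tokens of a sentence by the attention mass a chosen head assigns to them.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"   # placeholder; the paper fine-tunes its own humor classifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, output_attentions=True)
model.eval()

sentence = "I used to be a banker, but I lost interest."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, num_heads, seq_len, seq_len)
layer, head = 10, 7   # hypothetical indices; the relevant head is found empirically
attn = outputs.attentions[layer][0, head]      # (seq_len, seq_len) attention matrix
received = attn.sum(dim=0)                     # total attention each token receives

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in sorted(zip(tokens, received.tolist()), key=lambda x: -x[1])[:5]:
    print(f"{tok:>12s}  {score:.3f}")

In practice one would first fine-tune the classifier on the aligned serious/humorous pairs and then scan all layers and heads for the one whose top-attended tokens coincide with the humor-bearing words.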




Related research

The huge size of the widely used BERT-family models has led to recent efforts at model distillation. The main goal of distillation is to create a task-agnostic pre-trained model that can be fine-tuned on downstream tasks in place of its full-sized version. Despite the progress of distillation, to what degree and for what reason a task-agnostic model can be created by distillation has not been well studied, nor are the mechanisms behind transfer learning in these BERT models well investigated. This work therefore analyzes how much performance can acceptably be sacrificed during distillation, in order to guide future distillation procedures. Specifically, we first inspect the prunability of the Transformer heads in RoBERTa and ALBERT using the head-importance estimation proposed by Michel et al. (2019), and then check whether the important heads are consistent between the pre-training task and downstream tasks. From these results, the acceptable drop in performance on the pre-training task when distilling a model can be derived, and we further compare the behavior of the pruned model before and after fine-tuning. Our study provides guidance for future directions in BERT-family model distillation.
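For reference, the head-importance score of Michel et al. (2019) is the expected absolute gradient of the loss with respect to a per-head gating mask. The sketch below is a simplified illustration under that definition, not the study's own code: the checkpoint name and the two-example batch are placeholders, and it relies on the head_mask argument exposed by HuggingFace BERT-family models.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-base"   # an ALBERT checkpoint could be substituted
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)
importance = torch.zeros(n_layers, n_heads)

batch = [("a sample sentence", 1), ("another sample sentence", 0)]   # stand-in for a real dataset
for text, label in batch:
    inputs = tokenizer(text, return_tensors="pt")
    out = model(**inputs, labels=torch.tensor([label]), head_mask=head_mask)
    importance += torch.autograd.grad(out.loss, head_mask)[0].abs()
importance /= len(batch)

print(importance)   # low-scoring heads are candidates for pruning,
                    # e.g. model.prune_heads({0: [3, 5]}) removes heads 3 and 5 of layer 0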
Junjie Hu, Yu Cheng, Zhe Gan (2019)
Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr. In this paper, we re-examine this problem from a different angle, by looking deep into what defines a realistically-natural and topically-coherent story. To this end, we propose three assessment criteria: relevance, coherence and expressiveness, which we observe through empirical analysis could constitute a high-quality story to the human eye. Following this quality guideline, we propose a reinforcement learning framework, ReCo-RL, with reward functions designed to capture the essence of these quality criteria. Experiments on the Visual Storytelling Dataset (VIST) with both automatic and human evaluations demonstrate that our ReCo-RL model achieves better performance than state-of-the-art baselines on both traditional metrics and the proposed new criteria.
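The abstract does not spell out the concrete reward functions, so the following is only an illustrative sketch of how three scalar scorers for relevance, coherence, and expressiveness could be combined into a single reward for a policy-gradient update; all function names and weights are hypothetical, not the ReCo-RL definitions.

def composite_reward(story, images, relevance_fn, coherence_fn, expressiveness_fn,
                     weights=(1.0, 1.0, 1.0)):
    # Weighted sum of the three quality criteria; the scorers and weights are hypothetical.
    w_rel, w_coh, w_exp = weights
    return (w_rel * relevance_fn(story, images)
            + w_coh * coherence_fn(story)
            + w_exp * expressiveness_fn(story))

# REINFORCE-style objective for a story sampled from the generator with log-probability log_prob:
#     loss = -(composite_reward(...) - baseline) * log_prob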
GPT-3 has attracted lots of attention due to its superior performance across a wide range of NLP tasks, especially with its powerful and versatile in-context few-shot learning ability. Despite its success, we found that the empirical results of GPT-3 depend heavily on the choice of in-context examples. In this work, we investigate whether there are more effective strategies for judiciously selecting in-context examples (relative to random sampling) that better leverage GPT-3's few-shot capabilities. Inspired by the recent success of leveraging a retrieval module to augment large-scale neural network models, we propose to retrieve examples that are semantically similar to a test sample to formulate its corresponding prompt. Intuitively, the in-context examples selected with such a strategy may serve as more informative inputs to unleash GPT-3's extensive knowledge. We evaluate the proposed approach on several natural language understanding and generation benchmarks, where the retrieval-based prompt selection approach consistently outperforms the random baseline. Moreover, it is observed that sentence encoders fine-tuned on task-related datasets yield even more helpful retrieval results. Notably, significant gains are observed on tasks such as table-to-text generation (41.9% on the ToTTo dataset) and open-domain question answering (45.5% on the NQ dataset). We hope our investigation could help understand the behaviors of GPT-3 and large-scale pre-trained LMs in general and enhance their few-shot capabilities.
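A minimal sketch of the retrieval-based selection strategy is given below, assuming the sentence-transformers library as the encoder; the checkpoint name, the toy example pool, and the prompt template are placeholders (the paper additionally reports stronger results with encoders fine-tuned on task-related data).

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # generic encoder chosen for illustration

train_pool = [("question 1 ...", "answer 1 ..."),
              ("question 2 ...", "answer 2 ...")]    # stand-in for a real training set
pool_emb = encoder.encode([q for q, _ in train_pool], normalize_embeddings=True)

def build_prompt(test_question, k=2):
    q_emb = encoder.encode([test_question], normalize_embeddings=True)[0]
    scores = pool_emb @ q_emb                        # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores)[:k]
    demos = "\n\n".join(f"Q: {train_pool[i][0]}\nA: {train_pool[i][1]}" for i in top)
    return f"{demos}\n\nQ: {test_question}\nA:"

print(build_prompt("a new test question ..."))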
We compare the Spectral Energy Distribution (SED) of radio-loud and radio-quiet AGNs in three different samples observed with SDSS: radio-loud AGNs (RLAGNs), Low Luminosity AGNs (LLAGNs) and AGNs in isolated galaxies (IG-AGNs). All these galaxies have similar optical spectral characteristics. The median SED of the RLAGNs is consistent with the characteristic SED of quasars, while those of the LLAGNs and IG-AGNs are consistent with the SED of LINERs, with a lower luminosity in the IG-AGNs than in the LLAGNs. We infer the masses of the black holes (BHs) from the bulge masses. These increase from the IG-AGNs to the LLAGNs and are highest for the RLAGNs. All these AGNs show accretion rates near or slightly below 10% of the Eddington limit, the differences in luminosity being solely due to different BH masses. Our results suggest that there are two types of AGNs, radio quiet and radio loud, differing only by the mass of their bulges or BHs.
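For reference, the Eddington limit mentioned above is the standard quantity $L_{\rm Edd} = 4\pi G M m_p c / \sigma_T \approx 1.26 \times 10^{38}\,(M/M_\odot)\ \mathrm{erg\,s^{-1}}$, so an accretion rate near 10% of this limit corresponds to an Eddington ratio $\lambda_{\rm Edd} = L_{\rm bol}/L_{\rm Edd} \approx 0.1$.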
We view a complex liquid as a network of bonds connecting each particle to its nearest neighbors; the dynamics of this network is a chain of discrete events signaling particle rearrangements. Within this picture, we studied a two-dimensional complex liquid and found a stretched-exponential decay of the network memory and a power law for the distribution of the times for which a particle keeps its nearest neighbors; the dependence of this distribution on temperature suggests a possible dynamical critical point. We identified and quantified the underlying spatio-temporal phenomena. The equilibrium liquid represents a hierarchical structure, a mosaic of long-living crystallites partially separated by less-ordered regions. The long-time dynamics of this structure is dominated by particle redistribution between dynamically and structurally different regions. We argue that these are generic features of locally ordered but globally disordered complex systems. In particular, these features must be taken into account by any coarse-grained theory of the dynamics of complex fluids and glasses.
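In standard notation (the abstract does not report the specific exponents), the stretched-exponential decay of the network memory takes the form $C(t) \propto \exp[-(t/\tau)^{\beta}]$ with $0 < \beta < 1$, and the power-law distribution of nearest-neighbor retention times takes the form $P(t) \propto t^{-\alpha}$.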
