Framing has significant but subtle effects on public opinion and policy. We propose an NLP framework to measure entity-centric frames. We use it to understand media coverage of police violence in the United States in a new Police Violence Frames Corpus of 82k news articles spanning 7k police killings. Our work uncovers more than a dozen framing devices and reveals significant differences in the way liberal and conservative news sources frame both the issue of police violence and the entities involved. Conservative sources emphasize when the victim is armed or attacking an officer and are more likely to mention the victim's criminal record. Liberal sources focus more on the underlying systemic injustice, highlighting the victim's race and that they were unarmed. We discover temporary spikes in these injustice frames near high-profile shooting events, and finally, we show that protest volume correlates with and precedes media framing decisions.
While contextualized word representations have improved state-of-the-art benchmarks in many NLP tasks, their potential usefulness for social-oriented tasks remains largely unexplored. We show how contextualized word embeddings can be used to capture affect dimensions in portrayals of people. We evaluate our methodology quantitatively, on held-out affect lexicons, and qualitatively, through case examples. We find that contextualized word representations do encode meaningful affect information, but they are heavily biased towards their training data, which limits their usefulness to in-domain analyses. We ultimately use our method to examine differences in portrayals of men and women.
In open-domain table-to-text generation, we notice that unfaithful generations usually contain hallucinated content that cannot be aligned to any input table record. We thus evaluate generation faithfulness with two entity-centric metrics: table record coverage and the ratio of hallucinated entities in the text, both of which are shown to have strong agreement with human judgements. Based on these metrics, we then quantitatively analyze the correlation between training data quality and generation fidelity, which indicates the potential usefulness of entity information for faithful generation. Motivated by these findings, we propose two methods for faithful generation: 1) augmented training that incorporates auxiliary entity information, including both an augmented plan-based model and an unsupervised model, and 2) training instance selection based on faithfulness ranking. We show that these approaches improve generation fidelity in both the full-dataset and few-shot learning settings, according to both automatic and human evaluations.
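As a rough illustration of the two entity-centric metrics named above, the following minimal sketch computes a record coverage score and a hallucinated-entity ratio. The function names, the case-insensitive string matching, and the assumption that entities have already been extracted from the generated text are all illustrative choices, not the authors' definitions, which may differ.

```python
def record_coverage(record_values, text):
    """Fraction of table record values that appear verbatim in the text.

    Assumption: a record counts as covered if its value is a
    case-insensitive substring of the generated text.
    """
    if not record_values:
        return 0.0
    lowered = text.lower()
    covered = sum(1 for value in record_values if value.lower() in lowered)
    return covered / len(record_values)


def hallucinated_entity_ratio(text_entities, record_values):
    """Fraction of entities in the text that match no table record.

    Assumption: `text_entities` were extracted beforehand (for example
    by an NER tagger); matching is exact, ignoring case.
    """
    if not text_entities:
        return 0.0
    known = {value.lower() for value in record_values}
    hallucinated = [e for e in text_entities if e.lower() not in known]
    return len(hallucinated) / len(text_entities)


# Hypothetical example: the generated text replaces "London" with "Paris".
records = ["Ada Lovelace", "1815", "London"]
generated = "Ada Lovelace was born in 1815 in Paris."
entities = ["Ada Lovelace", "1815", "Paris"]

coverage = record_coverage(records, generated)
halluc = hallucinated_entity_ratio(entities, records)
```

In this toy case two of the three records are covered and one of the three entities is unsupported by the table, so a ranking based on such scores would place this instance below a fully faithful one.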
I present here an experimental system for identifying and annotating metaphor in corpora. It is designed to plug in to Metacorps, an experimental web app for annotating metaphor. As Metacorps users annotate metaphors, the system will use user annotations as training data. When the system is confident, it will suggest an identification and an annotation. Once approved by the user, this becomes more training data. This naturally allows for transfer learning, where the system can, with some known degree of reliability, classify one class of metaphor after only being trained on another class of metaphor. For example, in our metaphorical violence project, metaphors may be classified by the network they were observed on, the grammatical subject or object of the violence metaphor, or the violent word used (hit, attack, beat, etc.).
Recent information extraction approaches have relied on training deep neural models. However, such models can easily overfit to noisy labels and suffer from performance degradation. While it is very costly to filter noisy labels in large learning resources, recent studies show that such labels take more training steps to be memorized and are more frequently forgotten than clean labels, and are therefore identifiable during training. Motivated by these properties, we propose a simple co-regularization framework for entity-centric information extraction, which consists of several neural models with identical structures but different parameter initializations. These models are jointly optimized with the task-specific losses and are regularized to generate similar predictions through an agreement loss, which prevents overfitting to noisy labels. Extensive experiments on two widely used but noisy benchmarks for information extraction, TACRED and CoNLL03, demonstrate the effectiveness of our framework. We release our code to the community for future research.
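The agreement-based regularization described above can be sketched for a single training example as follows. This is a minimal sketch under stated assumptions: the agreement loss is taken here to be the mean KL divergence from the averaged prediction to each model's prediction, and the weight `alpha` and all function names are hypothetical. The released code may define these terms differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def agreement_loss(all_logits):
    """Mean KL divergence from the averaged prediction to each model.

    `all_logits` holds one logit vector per model for the same example.
    """
    probs = [softmax(logits) for logits in all_logits]
    num_classes = len(probs[0])
    mean = [sum(p[i] for p in probs) / len(probs) for i in range(num_classes)]
    return sum(kl_divergence(mean, p) for p in probs) / len(probs)


def joint_loss(task_losses, all_logits, alpha=1.0):
    """Average task loss plus the weighted agreement term (alpha is assumed)."""
    return sum(task_losses) / len(task_losses) + alpha * agreement_loss(all_logits)


# Two models that agree incur no penalty; two that disagree do.
same = agreement_loss([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
different = agreement_loss([[3.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
```

The intuition is that models with different initializations tend to disagree on examples they have not yet memorized, so penalizing disagreement discourages any single model from fitting a noisy label on its own.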
In biomedical literature, entity boundaries often do not align with word boundaries. Effective identification of entity spans therefore requires approaches capable of considering tokens that are smaller than words. We introduce a novel subword approach for named entity recognition (NER) that uses byte-pair encodings (BPE) in combination with convolutional and recurrent neural networks to produce byte-level tags of entities. We present experimental results on several standard biomedical datasets, namely the BioCreative VI Bio-ID, JNLPBA, and GENETAG datasets. We demonstrate competitive performance while bypassing the specialized domain expertise needed to create biomedical text tokenization rules.
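To make the subword idea concrete, the sketch below learns a few byte-pair-encoding merges from a toy word list and uses them to segment an unseen token. It is a character-level simplification of the standard BPE merge procedure, not the authors' byte-level tagging pipeline, and the word list and function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Apply the learned merges, in order, to split a word into subwords."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus of hypothetical gene-like tokens.
corpus = ["IL-2", "IL-2", "IL-4", "IL-2R"]
merges = learn_bpe(corpus, num_merges=3)
pieces = segment("IL-2R", merges)
```

Because the segmentation is learned from frequency statistics rather than hand-written rules, a tagger operating on these pieces can place an entity boundary inside a whitespace-delimited word without any domain-specific tokenizer.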