Catchphrase: Automatic Detection of Cultural References


Added by Nir Sweed

Publication date 2021

Fields Informatics Engineering

Research language English

Authors Nir Sweed - Dafna Shahaf

Computation and Language


A snowclone is a customizable phrasal template that can be realized in multiple, instantly recognized variants. For example, "* is the new *" (Orange is the new black, 40 is the new 30). Snowclones are extensively used in social media. In this paper, we study snowclones originating from pop-culture quotes; our goal is to automatically detect cultural references in text. We introduce a new, publicly available data set of pop-culture quotes and their corresponding snowclone usages and train models on them. We publish code for Catchphrase, an internet browser plugin to automatically detect and mark references in real time, and examine its performance via a user study. Aside from helping people better comprehend cultural references, we hope that detecting snowclones can complement work on paraphrasing and help to tackle long-standing questions in social science about the dynamics of information propagation.
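To make the template idea concrete, here is a minimal sketch that turns a snowclone template such as "* is the new *" into a regular expression and extracts the slot fillers. It is only an illustration: the abstract says the authors train models, and this regex matcher is not their method. The function names and the one-word-per-slot rule are assumptions made for this example.

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    """Compile a snowclone template into a regex.

    Assumption for this sketch: every '*' slot is filled by exactly one word.
    """
    literal_parts = [re.escape(part) for part in template.split("*")]
    body = r"(\w+)".join(literal_parts)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def find_snowclone(template: str, text: str):
    """Return the slot fillers of the first match, or None if nothing matches."""
    match = template_to_regex(template).search(text)
    return match.groups() if match else None

if __name__ == "__main__":
    for sentence in ["Orange is the new black", "I hear 40 is the new 30, honestly."]:
        print(sentence, "->", find_snowclone("* is the new *", sentence))
```

A fixed pattern like this would miss looser variants (multi-word fillers, reordered or paraphrased quotes), which is presumably part of why a learned model is attractive for the task.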


Deception detection in text and its relation to the cultural dimension of individualism/collectivism

Katerina Papantoniou, Panagiotis Papadakos, Theodore Patkos 2021

Deception detection is a task with many applications both in direct physical and in computer-mediated communication. Our focus is on automatic deception detection in text across cultures. We view culture through the prism of the individualism/collectivism dimension and we approximate culture by using country as a proxy. Having as a starting point recent conclusions drawn from the social psychology discipline, we explore whether differences in the usage of specific linguistic features of deception across cultures can be confirmed and attributed to norms with respect to the individualism/collectivism divide. We also investigate whether a universal feature set for cross-cultural text deception detection tasks exists. We evaluate the predictive power of different feature sets and approaches. We create culture/language-aware classifiers by experimenting with a wide range of n-gram features based on phonology, morphology and syntax, other linguistic cues like word and phoneme counts, pronoun use, etc., and token embeddings. We conducted our experiments over 11 datasets from five languages (English, Dutch, Russian, Spanish and Romanian) and six countries (US, Belgium, India, Russia, Mexico and Romania), and we applied two classification methods, i.e., logistic regression and fine-tuned BERT models. The results showed that our task is fairly complex and demanding. There are indications that some linguistic cues of deception have cultural origins, and are consistent in the context of diverse domains and dataset settings for the same language. This is more evident for the usage of pronouns and the expression of sentiment in deceptive language. The results of this work show that automatic deception detection across cultures and languages cannot be handled in a unified manner, and that such approaches should be augmented with knowledge about cultural differences and the domains of interest.
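As a rough illustration of the kind of surface features the abstract mentions (word n-grams, word counts, pronoun use), the sketch below builds a feature dictionary for one text. The tokenizer, the English-only pronoun list, and the feature names are all assumptions for this example, not the authors' feature set; such a dictionary could then be vectorized and passed to a classifier such as logistic regression.

```python
import re
from collections import Counter

# Hypothetical, English-only list of first-person pronouns (an assumption).
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}

def tokenize(text: str) -> list:
    """Lowercase the text and keep alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())

def extract_features(text: str, n: int = 2) -> dict:
    """Return word n-gram counts plus two simple linguistic cues."""
    tokens = tokenize(text)
    # zip over n shifted copies of the token list yields consecutive n-grams.
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    features = {"ngram=" + " ".join(gram): count for gram, count in ngrams.items()}
    features["word_count"] = len(tokens)
    first_person = sum(token in FIRST_PERSON for token in tokens)
    features["first_person_rate"] = first_person / len(tokens) if tokens else 0.0
    return features

if __name__ == "__main__":
    for name, value in extract_features("I love my job and I love my team").items():
        print(name, value)
```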

Computation and Language

Improving Simultaneous Translation with Pseudo References

Junkun Chen, Renjie Zheng, Atsuhito Kita 2020

Simultaneous translation is vastly different from full-sentence translation, in the sense that it starts translation before the source sentence ends, with a delay of only a few words. However, due to the lack of large-scale, publicly available simultaneous translation datasets, most simultaneous translation systems still train with ordinary full-sentence parallel corpora, which are not suitable for the simultaneous scenario due to the existence of unnecessary long-distance reorderings. Instead of expensive, time-consuming annotation, we propose a novel method that rewrites the target side of existing full-sentence corpora into simultaneous-style translation. Experiments on Chinese-to-English translation demonstrate improvements of about +2.7 BLEU with the addition of newly generated pseudo references.

Computation and Language

Exploring Methods for the Automatic Detection of Errors in Manual Transcription

Xiaofei Wang, Jinyi Yang, Ruizhi Li 2019

Quality of data plays an important role in most deep learning tasks. In the speech community, transcription of speech recordings is indispensable. Since the transcription is usually generated artificially, automatically finding errors in manual transcriptions not only saves time and labor but also benefits the performance of tasks that require a training process. Inspired by the success of hybrid automatic speech recognition using both a language model and an acoustic model, two approaches to automatic error detection in transcriptions have been explored in this work. A previous study using a biased language model approach, relying on a strong transcription-dependent language model, has been reviewed. In this work, we propose a novel acoustic-model-based approach, focusing on the phonetic sequence of speech. Both methods have been evaluated on a completely real dataset, which was originally transcribed with errors and strictly corrected manually afterwards.

Computation and Language Sound Audio and Speech Processing

Automatic Detection of Generated Text is Easiest when Humans are Fooled

Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch 2019

Recent advancements in neural language modelling make it possible to rapidly generate vast amounts of human-sounding text. The capabilities of humans and automatic discriminators to detect machine-generated text have been a large source of research interest, but humans and machines rely on different cues to make their decisions. Here, we perform careful benchmarking and analysis of three popular sampling-based decoding strategies (top-k, nucleus sampling, and untruncated random sampling) and show that improvements in decoding methods have primarily optimized for fooling humans. This comes at the expense of introducing statistical abnormalities that make detection easy for automatic systems. We also show that though both human and automatic detector performance improve with longer excerpt length, even multi-sentence excerpts can fool expert human raters over 30% of the time. Our findings reveal the importance of using both human and automatic detectors to assess the humanness of text generation systems.
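For readers unfamiliar with the three decoding strategies named above, the following sketch shows how each one reshapes a next-token distribution before sampling. It uses a toy probability list in plain Python; the function names and the toy numbers are assumptions for illustration and are not taken from the paper's code.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.5, 0.3, 0.15, 0.05]          # toy next-token distribution
    untruncated = dict(enumerate(probs))    # untruncated random sampling
    print("top-k (k=2):", top_k_filter(probs, 2))
    print("nucleus (p=0.9):", nucleus_filter(probs, 0.9))
    print("untruncated draw:", sample(untruncated, rng))
```

Both truncation methods discard the low-probability tail, which is one plausible source of the statistical abnormalities the abstract says automatic detectors can exploit.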

Computation and Language

References of References: How Far is the Knowledge Ancestry

Chao Min, Jiawei Xu, Tao Han 2021

Scientometrics studies have extended from direct citations to high-order citations, as simple citation count is found to tell only part of the story regarding scientific impact. This extension is deemed to be beneficial in scenarios like research evaluation, science history modeling, and information retrieval. In contrast to citations of citations (forward citation generations), references of references (backward citation generations), the other side of high-order citations, are relatively less explored. We adopt a series of metrics for measuring the unfolding of backward citations of a focal paper, tracing back to its knowledge ancestors generation by generation. Two sub-fields in Physics are subject to such analysis on a large-scale citation network. Preliminary results show that (1) most papers in our dataset can be traced to their knowledge ancestry; (2) the size distribution of backward citation generations presents a decreasing-and-then-increasing shape; and (3) citations more than one generation away are still relevant to the focal paper, from either a forward or backward perspective; yet, backward citation generations are higher in topic relevance to the paper of interest. Furthermore, the backward citation generations shed light on literature recommendation, science evaluation, and sociology of science studies.
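The notion of tracing knowledge ancestors generation by generation can be sketched as a breadth-first traversal of a reference graph. The example below assumes a hypothetical dictionary mapping each paper to the papers it cites, and assigns every ancestor to the earliest generation in which it is reached; the paper's own metrics may count generations differently.

```python
def backward_generations(references, focal):
    """Group the ancestors of `focal` by backward citation generation.

    `references` maps a paper id to the list of papers it cites.
    Assumption: an ancestor reachable by several paths is placed in the
    earliest generation in which it appears.
    """
    seen = {focal}
    frontier = [focal]
    generations = []
    while frontier:
        next_frontier = []
        for paper in frontier:
            for cited in references.get(paper, []):
                if cited not in seen:
                    seen.add(cited)
                    next_frontier.append(cited)
        if next_frontier:
            generations.append(next_frontier)
        frontier = next_frontier
    return generations

if __name__ == "__main__":
    # Toy reference graph with made-up paper ids.
    toy = {"P": ["A", "B"], "A": ["C"], "B": ["C", "D"], "C": ["E"]}
    for depth, papers in enumerate(backward_generations(toy, "P"), start=1):
        print("generation", depth, papers)
```

The length of each returned list gives the size of that generation, which is the quantity behind the size distribution described in result (2).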

Digital Libraries