How Lexical Gold Standards Have Effects On The Usefulness Of Text Analysis Tools For Digital Scholarship

63 0 0.0 ( 0 )

Download Cite

Added by Jussi Karlgren

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Jussi Karlgren

Computation and Language

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

This paper describes how the current lexical similarity and analogy gold standards are built to conform to certain ideas about what the models they are designed to evaluate are used for. Topical relevance has always been the most important target notion for information access tools and related language technology technologies, and while this has proven a useful starting point for much of what information technology is used for, it does not always align well with other uses to which technologies are being put, most notably use cases from digital scholarship in the humanities or social sciences. This paper argues for more systematic formulation of requirements from the digital humanities and social sciences and more explicit description of the assumptions underlying model design.

rate research

Rethinking Text Attribute Transfer: A Lexical Analysis

174 - Yao Fu , Hao Zhou , Jiaze Chen 2019

Text attribute transfer is modifying certain linguistic attributes (e.g. sentiment, style, authorship, etc.) of a sentence and transforming them from one type to another. In this paper, we aim to analyze and interpret what is changed during the transfer process. We start from the observation that in many existing models and datasets, certain words within a sentence play important roles in determining the sentence attribute class. These words are referred to as textit{the Pivot Words}. Based on these pivot words, we propose a lexical analysis framework, textit{the Pivot Analysis}, to quantitatively analyze the effects of these words in text attribute classification and transfer. We apply this framework to existing datasets and models and show that: (1) the pivot words are strong features for the classification of sentence attributes; (2) to change the attribute of a sentence, many datasets only requires to change certain pivot words; (3) consequently, many transfer models only perform the lexical-level modification, while leaving higher-level sentence structures unchanged. Our work provides an in-depth understanding of linguistic attribute transfer and further identifies the future requirements and challenges of this taskfootnote{Our code can be found at https://github.com/FranxYao/pivot_analysis}.

Computation and Language Machine Learning

Comparative Probing of Lexical Semantics Theories for Cognitive Plausibility and Technological Usefulness

124 - Antonio Branco , Jo~ao Rodrigues , Ma{l}gorzata Salawa 2020

Lexical semantics theories differ in advocating that the meaning of words is represented as an inference graph, a feature mapping or a vector space, thus raising the question: is it the case that one of these approaches is superior to the others in representing lexical semantics appropriately? Or in its non antagonistic counterpart: could there be a unified account of lexical semantics where these approaches seamlessly emerge as (partial) renderings of (different) aspects of a core semantic knowledge base? In this paper, we contribute to these research questions with a number of experiments that systematically probe different lexical semantics theories for their levels of cognitive plausibility and of technological usefulness. The empirical findings obtained from these experiments advance our insight on lexical semantics as the feature-based approach emerges as superior to the other ones, and arguably also move us closer to finding answers to the research questions above.

Computation and Language

User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization

96 - Shohei Higashiyama , Masao Utiyama , Taro Watanabe 2021

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.

Computation and Language

What Have We Achieved on Text Summarization?

197 - Dandan Huang , Leyang Cui , Sen Yang 2020

Deep learning has led to significant improvement in text summarization with various methods investigated and improved ROUGE scores reported over the years. However, gaps still exist between summaries produced by automatic summarizers and human professionals. Aiming to gain more understanding of summarization systems with respect to their strengths and limits on a fine-grained syntactic and semantic level, we consult the Multidimensional Quality Metric(MQM) and quantify 8 major sources of errors on 10 representative summarization models manually. Primarily, we find that 1) under similar settings, extractive summarizers are in general better than their abstractive counterparts thanks to strength in faithfulness and factual-consistency; 2) milestone techniques such as copy, coverage and hybrid extractive/abstractive methods do bring specific improvements but also demonstrate limitations; 3) pre-training techniques, and in particular sequence-to-sequence pre-training, are highly effective for improving text summarization, with BART giving the best results.

Computation and Language

How Can BERT Help Lexical Semantics Tasks?

142 - Yile Wang , Leyang Cui , Yue Zhang 2019

Contextualized embeddings such as BERT can serve as strong input representations to NLP tasks, outperforming their static embeddings counterparts such as skip-gram, CBOW and GloVe. However, such embeddings are dynamic, calculated according to a sentence-level context, which limits their use in lexical semantics tasks. We address this issue by making use of dynamic embeddings as word representations in training static embeddings, thereby leveraging their strong representation power for disambiguating context information. Results show that this method leads to improvements over traditional static embeddings on a range of lexical semantics tasks, obtaining the best reported results on seven datasets.

Computation and Language