Research papers, master and doctoral theses about نصوص

Understanding Model Robustness to User-generated Noisy Texts

398 - Association for Computation Linguistics 2021 مقالة

Sensitivity of deep-neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage artificially nois ed data. However, the amount and type of generated noise has so far been determined arbitrarily. We therefore propose to model the errors statistically from grammatical-error-correction corpora. We present a thorough evaluation of several state-of-the-art NLP systems' robustness in multiple languages, with tasks including morpho-syntactic analysis, named entity recognition, neural machine translation, a subset of the GLUE benchmark and reading comprehension. We also compare two approaches to address the performance drop: a) training the NLP models with noised data generated by our framework; and b) reducing the input noise with external system for natural language correction. The code is released at https://github.com/ufal/kazitext.

user-generated noisy texts noisy texts user-generated noisy النصوص الناتجة عن المستخدم نصوص صاخبة صاخبة التي تم إنشاؤها صناعة حمض الفوسفور المزيد..

SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts

294 - Association for Computation Linguistics 2021 مقالة

In this paper, we propose an annotated sentiment analysis dataset made of informally written Bangla texts. This dataset comprises public comments on news and videos collected from social media covering 13 different domains, including politics, educat ion, and agriculture. These comments are labeled with one of the polarity labels, namely positive, negative, and neutral. One significant characteristic of the dataset is that each of the comments is noisy in terms of the mix of dialects and grammatical incorrectness. Our experiments to develop a benchmark classification system show that hand-crafted lexical features provide superior performance than neural network and pretrained language models. We have made the dataset and accompanying models presented in this paper publicly available at https://git.io/JuuNB.

noisy bangla texts bangla texts written bangla texts نصوص البنغالية الصاخبة نصوص البنغالية نصوص البنغالية المكتوبة صناعة حمض الفوسفور المزيد..

From None to Severe: Predicting Severity in Movie Scripts

139 - Association for Computation Linguistics 2021 مقالة

In this paper, we introduce the task of predicting severity of age-restricted aspects of movie content based solely on the dialogue script. We first investigate categorizing the ordinal severity of movies on 5 aspects: Sex, Violence, Profanity, Subst ance consumption, and Frightening scenes. The problem is handled using a siamese network-based multitask framework which concurrently improves the interpretability of the predictions. The experimental results show that our method outperforms the previous state-of-the-art model and provides useful information to interpret model predictions. The proposed dataset and source code are publicly available at our GitHub repository.

predicting severity movie scripts movie content based التنبؤ بشدة نصوص الفيلم محتوى الفيلم على أساس صناعة حمض الفوسفور المزيد..

Eliciting Bias in Question Answering Models through Ambiguity

195 - Association for Computation Linguistics 2021 مقالة

Question answering (QA) models use retriever and reader systems to answer questions. Reliance on training data by QA systems can amplify or reflect inequity through their responses. Many QA models, such as those for the SQuAD dataset, are trained and tested on a subset of Wikipedia articles which encode their own biases and also reproduce real-world inequality. Understanding how training data affects bias in QA systems can inform methods to mitigate inequity. We develop two sets of questions for closed and open domain questions respectively, which use ambiguous questions to probe QA models for bias. We feed three deep-learning-based QA systems with our question sets and evaluate responses for bias via the metrics. Using our metrics, we find that open-domain QA models amplify biases more than their closed-domain counterparts and propose that biases in the retriever surface more readily due to greater freedom of choice.

نصوص question answering models سؤال الرد النماذج التباس صناعة حمض الفوسفور

Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models

206 - Association for Computation Linguistics 2021 مقالة

Open-domain extractive question answering works well on textual data by first retrieving candidate texts and then extracting the answer from those candidates. However, some questions cannot be answered by text alone but require information stored in tables. In this paper, we present an approach for retrieving both texts and tables relevant to a question by jointly encoding texts, tables and questions into a single vector space. To this end, we create a new multi-modal dataset based on text and table datasets from related work and compare the retrieval performance of different encoding schemata. We find that dense vector embeddings of transformer models outperform sparse embeddings on four out of six evaluation datasets. Comparing different dense embedding models, tri-encoders with one encoder for each question, text and table increase retrieval performance compared to bi-encoders with one encoder for the question and one for both text and tables. We release the newly created multi-modal dataset to the community so that it can be used for training and evaluation.

tables texts الجداول نصوص صناعة حمض الفوسفور

Translationese in Russian Literary Texts

364 - Association for Computation Linguistics 2021 مقالة

The paper reports the results of a translationese study of literary texts based on translated and non-translated Russian. We aim to find out if translations deviate from non-translated literary texts, and if the established differences can be attribu ted to typological relations between source and target languages. We expect that literary translations from typologically distant languages should exhibit more translationese, and the fingerprints of individual source languages (and their families) are traceable in translations. We explore linguistic properties that distinguish non-translated Russian literature from translations into Russian. Our results show that non-translated fiction is different from translations to the degree that these two language varieties can be automatically classified. As expected, language typology is reflected in translations of literary texts. We identified features that point to linguistic specificity of Russian non-translated literature and to shining-through effects. Some of translationese features cut across all language pairs, while others are characteristic of literary translations from languages belonging to specific language families.

literary texts russian literary texts russian نصوص أدبية النصوص الأدبية الروسية الروسية صناعة حمض الفوسفور المزيد..

SELFEXPLAIN: A Self-Explaining Architecture for Neural Text Classifiers

213 - Association for Computation Linguistics 2021 مقالة

We introduce SelfExplain, a novel self-explaining model that explains a text classifier's predictions using phrase-based concepts. SelfExplain augments existing neural classifiers by adding (1) a globally interpretable layer that identifies the most influential concepts in the training set for a given sample and (2) a locally interpretable layer that quantifies the contribution of each local input concept by computing a relevance score relative to the predicted label. Experiments across five text-classification datasets show that SelfExplain facilitates interpretability without sacrificing performance. Most importantly, explanations from SelfExplain show sufficiency for model predictions and are perceived as adequate, trustworthy and understandable by human judges compared to existing widely-used baselines.

neural text classifiers self-explaining architecture text classifier predictions نصوص النص العصبي بنية شرح ذاتيا تنبؤات تصنيف النص صناعة حمض الفوسفور المزيد..

EmoPars: A Collection of 30K Emotion-Annotated Persian Social Media Texts

397 - Association for Computation Linguistics 2021 مقالة

The wide reach of social media platforms, such as Twitter, have enabled many users to share their thoughts, opinions and emotions on various topics online. The ability to detect these emotions automatically would allow social scientists, as well as, businesses to better understand responses from nations and costumers. In this study we introduce a dataset of 30,000 Persian Tweets labeled with Ekman's six basic emotions (Anger, Fear, Happiness, Sadness, Hatred, and Wonder). This is the first publicly available emotion dataset in the Persian language. In this paper, we explain the data collection and labeling scheme used for the creation of this dataset. We also analyze the created dataset, showing the different features and characteristics of the data. Among other things, we investigate co-occurrence of different emotions in the dataset, and the relationship between sentiment and emotion of textual instances. The dataset is publicly available at https://github.com/nazaninsbr/Persian-Emotion-Detection.

social media texts persian social media media texts نصوص وسائل التواصل الاجتماعي وسائل التواصل الاجتماعي الفارسي نصوص وسائل الإعلام صناعة حمض الفوسفور المزيد..

A Review on Document Information Extraction Approaches

639 - Association for Computation Linguistics 2021 مقالة

Information extraction from documents has become great use of novel natural language processing areas. Most of the entity extraction methodologies are variant in a context such as medical area, financial area, also come even limited to the given lang uage. It is better to have one generic approach applicable for any document type to extract entity information regardless of language, context, and structure. Also, another issue in such research is structural analysis while keeping the hierarchical, semantic, and heuristic features. Another problem identified is that usually, it requires a massive training corpus. Therefore, this research focus on mitigating such barriers. Several approaches have been identifying towards building document information extractors focusing on different disciplines. This research area involves natural language processing, semantic analysis, information extraction, and conceptual modelling. This paper presents a review of the information extraction mechanism to construct a generic framework for document extraction with aim of providing a solid base for upcoming research.

نصوص وسائل الإعلام استخراج المعلومات صناعة حمض الفوسفور

FrenLyS: A Tool for the Automatic Simplification of French General Language Texts

205 - Association for Computation Linguistics 2021 مقالة

Lexical simplification (LS) aims at replacing words considered complex in a sentence by simpler equivalents. In this paper, we present the first automatic LS service for French, FrenLys, which offers different techniques to generate, select and rank substitutes. The paper describes the different methods proposed by our tool, which includes both classical approaches (e.g. generation of candidates from lexical resources, frequency filter, etc.) and more innovative approaches such as the exploitation of CamemBERT, a model for French based on the RoBERTa architecture. To evaluate the different methods, a new evaluation dataset for French is introduced.

general language texts french general language نصوص اللغة العامة اللغة الفرنسية اللغة العامة صناعة حمض الفوسفور

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد