Data Science Kitchen at GermEval 2021: A Fine Selection of Hand-Picked Features, Delivered Fresh from the Oven

Published by Christopher Schymura

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Niclas Hildebrandt, Benedikt Boenninghoff, Dennis Orth, and Christopher Schymura

Computation and Language, Artificial Intelligence, Machine Learning

Abstract

This paper presents the contribution of the Data Science Kitchen at the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments. The task aims at extending the identification of offensive language by including additional subtasks that identify comments which should be prioritized for fact-checking by moderators and community managers. Our contribution focuses on a feature-engineering approach with a conventional classification backend. We combine semantic and writing-style embeddings derived from pre-trained deep neural networks with additional numerical features specifically designed for this task. Ensembles of Logistic Regression classifiers and Support Vector Machines are used to derive predictions for each subtask via a majority voting scheme. Our best submission achieved macro-averaged F1-scores of 66.8%, 69.9%, and 72.5% for the identification of toxic, engaging, and fact-claiming comments, respectively.
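The majority voting scheme and the macro-averaged F1-score mentioned in the abstract can be illustrated with a minimal sketch. The function names `majority_vote` and `macro_f1`, the toy predictions, and the choice of three voters are assumptions made for illustration only; they are not taken from the paper. The per-class F1 average shown here is one common definition of macro-F1, and the official scorer of the shared task may compute it differently.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label predictions from several classifiers by majority vote.

    `predictions` is a list with one list of labels per classifier.
    An odd number of classifiers avoids ties for binary labels.
    """
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(clf_preds[i] for clf_preds in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1-scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical outputs of three classifiers on five comments (1 = toxic).
clf_a = [1, 0, 1, 1, 0]
clf_b = [1, 1, 0, 1, 0]
clf_c = [0, 0, 1, 1, 1]

y_pred = majority_vote([clf_a, clf_b, clf_c])
y_true = [1, 0, 0, 1, 0]
score = macro_f1(y_true, y_pred)
```

In this toy case the vote yields one false positive, so both classes reach an F1 of 0.8 and the macro average is 0.8 as well.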

68 - Skye Morgan , Tharindu Ranasinghe , Marcos Zampieri 2021

This paper addresses the identification of toxic, engaging, and fact-claiming comments on social media. We used the dataset made available by the organizers of the GermEval-2021 shared task containing over 3,000 manually annotated Facebook comments in German. Considering the relatedness of the three tasks, we approached the problem using large pre-trained transformer models and multitask learning. Our results indicate that multitask learning achieves performance superior to the more common single task learning approach in all three tasks. We submit our best systems to GermEval-2021 under the team name WLV-RIT.

Computation and Language, Artificial Intelligence, Neural and Evolutionary Computing

A Fresh Look at the Hot Hand Paradox

144 - S. Redner 2019

We discuss the hot hand paradox within the framework of the backward Kolmogorov equation. We use this approach to understand the apparently paradoxical features of the statistics of fixed-length sequences of heads and tails upon repeated fair coin flips. In particular, we compute the average waiting time for the appearance of specific sequences. For sequences of length 2, the average time until the appearance of the sequence HH (heads-heads) equals 6, while the waiting time for the sequence HT (heads-tails) equals 4. These results require a few simple calculational steps by the Kolmogorov approach. We also give complete results for sequences of lengths 3, 4, and 5; the extension to longer sequences is straightforward (albeit more tedious). Finally, we compute the waiting times $T_{n\mathrm{H}}$ for arbitrary-length sequences of all heads and $T_{n\mathrm{(HT)}}$ for the sequence of alternating heads and tails. For large $n$, $T_{2n\mathrm{H}} \sim 3\, T_{n\mathrm{(HT)}}$.
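The waiting times of 6 for HH and 4 for HT quoted above can be checked with a short sketch. It does not use the backward Kolmogorov equation of the paper; instead it assumes the standard overlap (Conway leading-number) formula for a fair coin, in which every length $k$ at which the pattern's prefix equals its suffix contributes $2^k$ flips. The function name `expected_waiting_time` is hypothetical.

```python
def expected_waiting_time(pattern):
    """Expected number of fair-coin flips until `pattern` first appears.

    Uses the overlap formula: sum 2**k over every k for which the
    first k symbols of the pattern equal its last k symbols.
    """
    n = len(pattern)
    return sum(
        2 ** k
        for k in range(1, n + 1)
        if pattern[:k] == pattern[-k:]
    )


hh = expected_waiting_time("HH")  # overlaps at k = 1 and k = 2
ht = expected_waiting_time("HT")  # overlap only at k = 2
```

HH overlaps with itself at both lengths, which is why it takes longer on average to appear than HT even though both sequences are equally likely in any two given flips.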

Popular Physics, Statistical Mechanics, History of Mathematics

FH-SWF SG at GermEval 2021: Using Transformer-Based Language Models to Identify Toxic, Engaging, & Fact-Claiming Comments

67 - Christian Gawron , Sebastian Schmidt 2021

In this paper we describe the methods we used for our submissions to the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments. For all three subtasks we fine-tuned freely available transformer-based models from the Huggingface model hub. We evaluated the performance of various pre-trained models after fine-tuning on 80% of the training data with different hyperparameters and submitted predictions of the two best performing resulting models. We found that this approach worked best for subtask 3, for which we achieved an F1-score of 0.736.

Computation and Language

TransWiC at SemEval-2021 Task 2: Transformer-based Multilingual and Cross-lingual Word-in-Context Disambiguation

72 - Hansi Hettiarachchi , Tharindu Ranasinghe 2021

Identifying whether a word carries the same meaning or a different meaning in two contexts is an important research area in natural language processing which plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. Most of the previous work in this area relies on language-specific resources, making it difficult to generalise across languages. Considering this limitation, our approach to SemEval-2021 Task 2 is based only on pretrained transformer models and does not use any language-specific processing and resources. Despite that, our best model achieves 0.90 accuracy for the English-English subtask, which is very competitive compared to the best result of the subtask, 0.93 accuracy. Our approach also achieves satisfactory results in other monolingual and cross-lingual language pairs.

Computation and Language, Artificial Intelligence, Machine Learning

Evaluating the Utility of Hand-crafted Features in Sequence Labelling

70 - Minghao Wu , Fei Liu , Trevor Cohn 2018

Conventional wisdom is that hand-crafted features are redundant for deep learning models, as they already learn adequate representations of text automatically from corpora. In this work, we test this claim by proposing a new method for exploiting handcrafted features as part of a novel hybrid learning approach, incorporating a feature auto-encoder loss component. We evaluate on the task of named entity recognition (NER), where we show that including manual features for part-of-speech, word shapes and gazetteers can improve the performance of a neural CRF model. We obtain an $F_1$ of 91.89 for the CoNLL-2003 English shared task, which significantly outperforms a collection of highly competitive baseline models. We also present an ablation study showing the importance of auto-encoding over using features as either inputs or outputs alone, and moreover show that including the autoencoder components reduces training requirements to 60%, while retaining the same predictive accuracy.

Computation and Language