We report on our submission to Task 1 of the GermEval 2021 challenge: toxic comment classification. We investigate different ways of bolstering scarce training data to improve the performance of an off-the-shelf model on this task. To help address the limitations of a small dataset, we use data synthetically generated by a German GPT-2 model. Synthetic data has only recently gained traction as a possible remedy for training data sparseness in NLP, and initial results are promising. However, our model did not see measurable improvement from the use of synthetic data. We discuss possible reasons for this finding and explore directions for future work in the field.
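The abstract does not spell out how the synthetic comments were produced or labeled, so the following is only a minimal sketch of one plausible augmentation loop, not the authors' pipeline. It assumes that each real comment supplies a short prompt, that a text generator continues the prompt, and that the generated comment inherits the label of the comment it was seeded from. The names `augment_with_synthetic`, `stub_generator`, `per_example`, and `prompt_words` are hypothetical and introduced here for illustration.

```python
import random
from typing import Callable, List, Tuple

# (comment text, label), where 1 = toxic and 0 = non-toxic (assumed encoding)
Example = Tuple[str, int]


def augment_with_synthetic(
    train: List[Example],
    generate: Callable[[str], str],
    per_example: int = 1,
    prompt_words: int = 5,
    seed: int = 0,
) -> List[Example]:
    """Return the real examples plus synthetic ones, shuffled.

    Assumption: a synthetic comment inherits the label of the real
    comment whose opening words were used as its prompt.
    """
    rng = random.Random(seed)
    synthetic: List[Example] = []
    for text, label in train:
        # Use the first few words of the real comment as the prompt.
        prompt = " ".join(text.split()[:prompt_words])
        for _ in range(per_example):
            candidate = generate(prompt).strip()
            # Skip empty outputs and exact copies of the source comment.
            if candidate and candidate != text:
                synthetic.append((candidate, label))
    combined = train + synthetic
    rng.shuffle(combined)
    return combined


def stub_generator(prompt: str) -> str:
    """Placeholder for a language model; it only appends an ellipsis."""
    return prompt + " ..."


if __name__ == "__main__":
    real = [
        ("Das ist ein sehr guter Kommentar hier", 0),
        ("Du bist so dumm und blöd ehrlich", 1),
    ]
    for text, label in augment_with_synthetic(real, stub_generator, per_example=2):
        print(label, text)
```

In a real setup, `stub_generator` would be replaced by a call to a German GPT-2 model, for example through a Hugging Face `transformers` text-generation pipeline. Passing the generator in as a plain callable keeps the augmentation logic independent of any particular checkpoint.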