This paper describes the HEL-LJU submissions to the MultiLexNorm shared task on multilingual lexical normalization. Our system is based on a BERT token classification preprocessing step, in which the type of transformation required for each token is predicted (none, uppercase, lowercase, capitalize, modify), followed by a character-level SMT step in which the text is translated from its original to its normalized form under the BERT-predicted transformation constraints. For some languages, depending on the results on development data, the training data was extended by back-translating OpenSubtitles data. In the final ranking of the ten participating teams, the HEL-LJU team took second place, scoring better than the previous state of the art.
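To make the preprocessing step concrete, below is a minimal sketch of how the five transformation labels named in the abstract could be derived from original/normalized token pairs when building training data for the BERT classifier. The function name, label strings, and the precedence among the overlapping casing labels are illustrative assumptions, not the authors' actual code; tokens labeled "modify" would then be handed to the character-level SMT step.

```python
# Hypothetical label-derivation logic for the token classification step.
# Each token is tagged with the type of transformation needed to reach
# its normalized form (none, uppercase, lowercase, capitalize, modify).

def transformation_label(original: str, normalized: str) -> str:
    """Return one of: none, uppercase, lowercase, capitalize, modify."""
    if normalized == original:
        return "none"
    # Precedence among casing labels is an assumption of this sketch.
    if normalized == original.upper():
        return "uppercase"
    if normalized == original.lower():
        return "lowercase"
    if normalized == original.capitalize():
        return "capitalize"
    # Anything beyond a pure casing change is left to the character-level
    # SMT step, constrained by this predicted label.
    return "modify"

# Example pairs in the style of a lexical-normalization corpus:
pairs = [("lol", "lol"), ("nyc", "NYC"), ("Hellooo", "hello"), ("im", "I'm")]
for orig, norm in pairs:
    print(f"{orig} -> {norm}: {transformation_label(orig, norm)}")
# lol -> lol: none
# nyc -> NYC: uppercase
# Hellooo -> hello: modify
# im -> I'm: modify
```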