Traditionally, character-level transduction problems have been solved with finite-state models designed to encode structural and linguistic knowledge of the underlying process, whereas recent approaches rely on the power and flexibility of sequence-to-sequence models with attention. Focusing on the less explored unsupervised learning scenario, we compare the two model classes side by side and find that they tend to make different types of errors even when achieving comparable performance. We analyze the distributions of different error classes using two unsupervised tasks as testbeds: converting informally romanized text into the native script of its language (for Russian, Arabic, and Kannada) and translating between a pair of closely related languages (Serbian and Bosnian). Finally, we investigate how combining finite-state and sequence-to-sequence models at decoding time affects the output quantitatively and qualitatively.