We develop a minimally-supervised model for spelling correction and evaluate its performance on three datasets annotated for spelling errors in Russian. The first corpus is a dataset of Russian social media data that was recently used in a shared task on Russian spelling correction. The other two corpora contain texts produced by learners of Russian as a foreign language. Evaluating on three diverse datasets allows for a cross-corpus comparison. We compare the performance of the minimally-supervised model to two baseline models that do not use context for candidate re-ranking, as well as to a character-level statistical machine translation system with context-based re-ranking. We show that the minimally-supervised model outperforms all of the other models. We also present an analysis of the spelling errors and discuss the difficulty of the task compared to the spelling correction problem in English.
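To make the contrast concrete, a context-free baseline of the kind the abstract describes can be sketched as below: candidates are generated within edit distance one and ranked purely by corpus frequency, with no sentence context. This is an illustrative Norvig-style sketch, not the paper's actual baseline; the toy vocabulary and frequencies are invented for demonstration.

```python
# A minimal context-free spelling-correction baseline (Norvig-style sketch):
# generate all edit-distance-1 candidates and rank them by corpus frequency.
# CORPUS_FREQ is a toy vocabulary, not data from the paper.
from collections import Counter

CORPUS_FREQ = Counter({"spelling": 10, "correction": 8, "model": 5, "spell": 3})
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away (delete, transpose, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Pick the most frequent in-vocabulary candidate; no context is used."""
    candidates = [w for w in edits1(word) | {word} if w in CORPUS_FREQ]
    return max(candidates, key=CORPUS_FREQ.get) if candidates else word
```

A context-aware re-ranker, by contrast, would score candidates with a language model over the surrounding sentence rather than with unigram frequency alone, which is the gap the minimally-supervised model is meant to close.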