Recent state-of-the-art (SOTA) neural network methods and fine-tuning approaches based on pre-trained models (PTMs) have achieved strong results on Chinese word segmentation (CWS). However, previous works train their models on a fixed corpus at every iteration, discarding the intermediate segmentation results, which are also valuable. Moreover, the robustness of previous neural methods is limited by the scale of the annotated data, and the annotated corpora contain some noise. Previous studies have made limited efforts to address these problems. In this work, we propose a self-supervised CWS approach with a straightforward and effective architecture. First, we train a word segmentation model and use it to generate segmentation results. Then, we use a revised masked language model (MLM) to evaluate the quality of those segmentation results based on the MLM's predictions. Finally, we leverage the evaluations to guide the training of the segmenter through improved minimum risk training. Experimental results show that our approach outperforms previous methods on 9 CWS datasets under both single-criterion and multi-criteria training, and achieves better robustness.
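The final step above couples the MLM's quality evaluations to the segmenter via minimum risk training. A minimal sketch of that objective, under stated assumptions: `seg_logprobs` are the segmenter's log-probabilities for a set of candidate segmentations of one sentence, `mlm_quality` are the (hypothetical) quality scores the revised MLM assigns to those candidates, and the risk of a candidate is taken as its negative quality. All names and numbers here are illustrative, not the paper's implementation.

```python
import math

def mrt_loss(seg_logprobs, mlm_quality, alpha=0.5):
    """Expected risk over a sampled candidate set (standard MRT form).

    seg_logprobs: segmenter log-probabilities of the candidate segmentations.
    mlm_quality:  MLM-based quality scores for the same candidates (assumed in [0, 1]).
    alpha:        sharpness hyper-parameter for the renormalized distribution.
    """
    # Sharpen and renormalize the segmenter distribution over the candidates.
    weights = [math.exp(alpha * lp) for lp in seg_logprobs]
    z = sum(weights)
    probs = [w / z for w in weights]
    # Risk is negative quality: better MLM evaluations lower the expected risk.
    risks = [-q for q in mlm_quality]
    return sum(p * r for p, r in zip(probs, risks))

# Toy example with three candidate segmentations.
loss = mrt_loss(seg_logprobs=[-1.0, -2.0, -3.0], mlm_quality=[0.9, 0.6, 0.2])
print(round(loss, 4))  # → -0.6774
```

Minimizing this expectation shifts probability mass toward segmentations the MLM evaluator rates highly, which is the mechanism by which the evaluations "aid the training of the segmenter" without extra annotation.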