Prior methods for text segmentation mostly operate at the token level. Despite their adequacy, this limits their full potential to capture long-term dependencies among segments. In this work, we propose a novel framework that incrementally segments natural language sentences at the segment level. At each segmentation step, it recognizes the leftmost segment of the remaining sequence. Our implementation employs the LSTM-minus technique to construct phrase representations and a recurrent neural network (RNN) to model the iterative determination of leftmost segments. We have conducted extensive experiments on syntactic chunking and Chinese part-of-speech (POS) tagging across 3 datasets, demonstrating that our method significantly outperforms all previous baselines and achieves new state-of-the-art results. Moreover, qualitative analysis and a study on segmenting long sentences verify its effectiveness in modeling long-term dependencies.
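The core of the LSTM-minus technique is that any span's representation is derived by subtracting two hidden states from a single pass over the sentence, so all O(n²) candidate spans reuse the same O(n) sequence of states. The sketch below illustrates only this subtraction idea with dummy hidden-state vectors in place of a real LSTM; the function name `span_minus` and the toy values are ours, not from the paper.

```python
# Minimal sketch of the LSTM-minus span representation (forward direction only).
# A real system would obtain `hidden` from a (bi)LSTM; here we use dummy values.

def span_minus(hidden, i, j):
    """Represent the span of tokens i..j (1-indexed) as h[j] - h[i-1].

    hidden[0] is the initial state before any token is read, so the
    subtraction isolates exactly the contribution of tokens i..j.
    """
    return [a - b for a, b in zip(hidden[j], hidden[i - 1])]

# Toy forward hidden states for a 4-token sentence.
hidden = [
    [0.0, 0.0],  # h0: initial state
    [0.1, 0.2],  # h1: after token 1
    [0.3, 0.5],  # h2: after token 2
    [0.6, 0.9],  # h3: after token 3
    [1.0, 1.4],  # h4: after token 4
]

# Representation of the span covering tokens 2..4: h4 - h1.
rep = span_minus(hidden, 2, 4)
print(rep)
```

In a bidirectional setting, the same subtraction is typically applied in the backward direction as well and the two difference vectors are concatenated.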