Word segmentation, the problem of finding word boundaries in speech, is of interest for a range of tasks. Previous papers have suggested that for sequence-to-sequence models trained on tasks such as speech translation or speech recognition, attention can be used to locate and segment the words. We show, however, that even on monolingual data this approach is brittle. In our experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed in the task. Models trained to predict words from either phones or speech (i.e., the opposite direction, which is what is needed to generalize to new data) yield much worse results, suggesting that attention-based segmentation is only useful in limited scenarios.
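To make the attention-based segmentation idea concrete, here is a minimal sketch of one common recipe: assign each input frame to the output token whose cross-attention weight on it is largest, then place boundaries where the assignment changes. This is an illustrative assumption about the procedure, not the exact algorithm of this paper; the function name and the monotonicity heuristic are hypothetical.

```python
import numpy as np

def attention_to_boundaries(attention, enforce_monotonic=True):
    """Derive input-side segment boundaries from a cross-attention matrix.

    A minimal sketch: each input frame is assigned to the output token
    that attends to it most strongly; a boundary is placed wherever the
    assignment changes. `attention` has shape
    (n_output_tokens, n_input_frames).
    """
    # Hard alignment: for each input frame, the argmax output token.
    assignment = attention.argmax(axis=0)  # shape: (n_input_frames,)

    if enforce_monotonic:
        # Speech-to-text alignments are roughly monotonic, so smooth
        # out backward jumps with a running maximum (a heuristic).
        assignment = np.maximum.accumulate(assignment)

    # A boundary falls between consecutive frames aligned to different tokens.
    boundaries = np.flatnonzero(np.diff(assignment)) + 1
    return boundaries.tolist()

# Toy example: 3 output tokens attending over 8 input frames.
rng = np.random.default_rng(0)
attn = rng.random((3, 8))
print(attention_to_boundaries(attn))
```

The brittleness reported above would show up in how noisy the argmax assignment is: when the model is trained in the words-to-phones direction the attention is sharp enough for this recipe, but in the opposite direction the weights are too diffuse for the boundaries to be reliable.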