Research papers, master's theses, and doctoral theses about vocabulary

Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary Capabilities and Robustness of Char-Based Models

339 - Association for Computational Linguistics 2021 Article

This work explores the capacities of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC) with a strong focus on exploring the limits of such approaches to handle productive UGC phenomena, which, almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact on translation performance of various user-generated content phenomena on a small annotated dataset we developed and then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple, yet insightful, copy task experiment and highlight the importance of reducing the vocabulary size hyper-parameter to increase the robustness of character-based models for machine translation.

revisiting open-vocabulary capabilities, revisiting open-vocabulary, open-vocabulary capabilities
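The unknown-letter problem described in the abstract above can be illustrated at the vocabulary-lookup step alone. The following is a minimal sketch, not the authors' model: the helper names (`build_char_vocab`, `encode`, `unk_rate`), the `<unk>` symbol, and the toy training lines are all assumptions made for illustration. It only shows how characters never seen in training collapse into a single unknown id, and how a smaller vocabulary size pushes rare training characters into that same id.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical symbol for any character outside the vocabulary


def build_char_vocab(corpus, max_size):
    """Keep the (max_size - 1) most frequent characters; id 0 is reserved for UNK."""
    counts = Counter(ch for line in corpus for ch in line)
    kept = [ch for ch, _ in counts.most_common(max_size - 1)]
    return {ch: idx for idx, ch in enumerate([UNK] + kept)}


def encode(text, vocab):
    """Map each character to its id, falling back to the UNK id."""
    return [vocab.get(ch, vocab[UNK]) for ch in text]


def unk_rate(text, vocab):
    """Fraction of characters in `text` that map to UNK."""
    ids = encode(text, vocab)
    return sum(i == vocab[UNK] for i in ids) / len(ids) if ids else 0.0


# Toy training data (an assumption, not the paper's corpus).
train = ["hello world", "character models"]
vocab = build_char_vocab(train, max_size=50)
small_vocab = build_char_vocab(train, max_size=5)

clean = "hello"
noisy = "h€llo ☺"  # two characters that never occur in `train`

print(unk_rate(clean, vocab))        # every character was seen in training
print(unk_rate(noisy, vocab))        # 2 of 7 characters are unknown
print(unk_rate(clean, small_vocab))  # rare training characters now map to UNK too
```

With the smaller vocabulary, the unknown id already occurs in the training data, so a model trained on it has at least seen that symbol; whether this is the exact mechanism behind the reported robustness gain is not stated in the abstract.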

AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain

194 - Association for Computational Linguistics 2021 Article

During the fine-tuning phase of transfer learning, the pretrained vocabulary remains unchanged, while model parameters are updated. The vocabulary generated based on the pretrained data is suboptimal for downstream data when domain discrepancy exists. We propose to consider the vocabulary as an optimizable parameter, allowing us to update the vocabulary by expanding it with domain-specific vocabulary based on a tokenization statistic. Furthermore, we keep the embeddings of the added words from overfitting to downstream data by utilizing knowledge learned from a pretrained language model with a regularization term. Our method achieved consistent performance improvements on diverse domains (i.e., biomedical, computer science, news, and reviews).

strategy for adapting, adapting vocabulary, vocabulary

Allocating Large Vocabulary Capacity for Cross-Lingual Language Model Pre-Training

273 - Association for Computational Linguistics 2021 Article

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down the pre-training speed. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side-effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.

allocating large vocabulary, large vocabulary capacity, allocating large

Robust Open-Vocabulary Translation from Visual Text Representations

153 - Association for Computational Linguistics 2021 Article

Machine translation models have discrete vocabularies and commonly use subword segmentation techniques to achieve an open vocabulary. This approach relies on consistent and correct underlying Unicode sequences, and makes models susceptible to degradation from common types of noise and variation. Motivated by the robustness of human language processing, we propose the use of visual text representations, which dispense with a finite set of text embeddings in favor of continuous vocabularies created by processing visually rendered text with sliding windows. We show that models using visual text representations approach or match performance of traditional text models on small and larger datasets. More importantly, models with visual embeddings demonstrate significant robustness to varied types of noise, achieving, e.g., 25.9 BLEU on a character-permuted German-English task where subword models degrade to 1.9.

robust open-vocabulary translation, visual text representations, robust open-vocabulary
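The sliding-window step mentioned in the abstract above can be sketched on its own. This is a rough illustration under stated assumptions, not the authors' pipeline: the rendered sentence is stood in for by a toy grid of pixel values, and the function name `sliding_windows`, the window width, and the stride are hypothetical choices. Rendering text to an image and encoding each window into an embedding are outside the scope of the sketch.

```python
def sliding_windows(image, window, stride):
    """Cut a rendered-text image (a list of pixel rows) into overlapping slices.

    Each slice spans the full height and `window` columns; consecutive
    slices start `stride` columns apart. An image narrower than `window`
    yields a single, narrower slice.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    slices = []
    for start in range(0, max(width - window, 0) + 1, stride):
        slices.append([row[start:start + window] for row in image])
    return slices


# Toy stand-in for a rendered sentence: 2 pixel rows, 10 pixel columns.
image = [list(range(10)) for _ in range(2)]

windows = sliding_windows(image, window=4, stride=2)
print(len(windows))   # number of slices fed to the (omitted) encoder
print(windows[1][0])  # first pixel row of the second slice
```

Because each slice is a block of pixels rather than an id from a fixed table, a misspelled or permuted word still produces input that resembles the clean rendering, which is the intuition behind the robustness claim.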

Word Discriminations for Vocabulary Inventory Prediction

204 - Association for Computational Linguistics 2021 Article

The aim of vocabulary inventory prediction is to predict a learner's whole vocabulary based on a limited sample of query words. This paper approaches the problem starting from the 2-parameter Item Response Theory (IRT) model, giving each word in the vocabulary a difficulty and discrimination parameter. The discrimination parameter is evaluated on the sub-problem of question item selection, familiar from the fields of Computerised Adaptive Testing (CAT) and active learning. Next, the effect of the discrimination parameter on prediction performance is examined, both in a binary classification setting, and in an information retrieval setting. Performance is compared with baselines based on word frequency. A number of different generalisation scenarios are examined, including generalising word difficulty and discrimination using word embeddings with a predictor network and testing on out-of-dataset data.

vocabulary inventory prediction, inventory prediction, vocabulary inventory
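For readers unfamiliar with the 2-parameter IRT model named in the abstract above, the standard form gives the probability that a learner with ability theta knows a word with difficulty b and discrimination a as sigmoid(a * (theta - b)). The sketch below implements that formula together with the usual 2PL item information, a^2 * p * (1 - p), and uses maximum information as a query-word selection rule. Maximum-information selection is a common Computerised Adaptive Testing heuristic and is assumed here for illustration; the abstract does not say which selection rule the paper uses. The words and parameter values are invented.

```python
import math


def p_known(ability, difficulty, discrimination):
    """2PL IRT: probability that the learner knows the word."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def item_information(ability, difficulty, discrimination):
    """Fisher information of a 2PL item at the given ability level."""
    p = p_known(ability, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)


def select_next_word(ability, items):
    """Pick the word whose answer is most informative at the current ability estimate."""
    return max(items, key=lambda word: item_information(ability, *items[word]))


# Hypothetical word parameters: word -> (difficulty, discrimination).
items = {
    "cat": (-2.0, 1.0),
    "ubiquitous": (0.0, 2.0),
    "sesquipedalian": (3.0, 1.5),
}

print(p_known(0.0, 0.0, 2.0))        # 0.5: ability equals difficulty
print(select_next_word(0.0, items))  # ubiquitous
```

A word with high discrimination and difficulty close to the current ability estimate separates learners sharply, which is why it is preferred as a query item under this heuristic.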

Game-theoretic Vocabulary Selection via the Shapley Value and Banzhaf Index

118 - Association for Computational Linguistics 2021 Article

The input vocabulary and the representations learned are crucial to the performance of neural NLP models. Using the full vocabulary results in less explainable and more memory-intensive models, with the embedding layer often constituting the majority of model parameters. It is thus common to use a smaller vocabulary to lower memory requirements and construct more interpretable models. We propose a vocabulary selection method that views words as members of a team trying to maximize the model's performance. We apply power indices from cooperative game theory, including the Shapley value and Banzhaf index, that measure the relative importance of individual team members in accomplishing a joint task. We approximately compute these indices to identify the most influential words. Our empirical evaluation examines multiple NLP tasks, including sentence and document classification, question answering and textual entailment. We compare to baselines that select words based on frequency, TF-IDF and regression coefficients under L1 regularization, and show that this game-theoretic vocabulary selection outperforms all baselines on a range of different tasks and datasets.

Banzhaf index, game-theoretic vocabulary selection, vocabulary selection
