The mapping of lexical meanings to wordforms is a major feature of natural languages. While usage pressures might assign short words to frequent meanings (Zipf's law of abbreviation), the need for a productive and open-ended vocabulary, local constraints on sequences of symbols, and various other factors all shape the lexicons of the world's languages. Despite their importance in shaping lexical structure, the relative contributions of these factors have not been fully quantified. Taking a coding-theoretic view of the lexicon and making use of a novel generative statistical model, we define upper bounds for the compressibility of the lexicon under various constraints. Examining corpora from 7 typologically diverse languages, we use those upper bounds to quantify the lexicon's optimality and to explore the relative costs of major constraints on natural codes. We find that (compositional) morphology and graphotactics can sufficiently account for most of the complexity of natural codes---as measured by code length.
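To make the coding-theoretic framing concrete, the following is a minimal sketch, not the paper's generative model: it compares a lexicon's frequency-weighted average wordform length against Shannon's source-coding lower bound over a fixed alphabet. The function name `coding_efficiency`, the toy corpus, and the `alphabet_size` parameter are all illustrative assumptions; natural lexicons are not prefix-free codes, so the ratio is only a rough optimality proxy of the kind the upper bounds above make precise.

```python
import math
from collections import Counter

def coding_efficiency(corpus_tokens, alphabet_size=26):
    """Toy coding-theoretic view of a lexicon (illustrative, not the
    paper's model): compare the frequency-weighted average wordform
    length with the Shannon lower bound H(W) / log2(alphabet_size),
    which holds for any uniquely decodable code over that alphabet."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    probs = {w: c / total for w, c in counts.items()}

    # Expected code length of the natural lexicon, in characters.
    actual = sum(p * len(w) for w, p in probs.items())

    # Shannon entropy of the unigram word distribution, in bits.
    entropy = -sum(p * math.log2(p) for p in probs.values())

    # Lower bound on expected length, in characters, for any
    # uniquely decodable code over `alphabet_size` symbols.
    bound = entropy / math.log2(alphabet_size)

    # Ratio near 1 would indicate a near-optimal (maximally
    # compressed) lexicon under this idealized view.
    return actual, bound, bound / actual

# Hypothetical usage on a toy corpus:
tokens = "the cat sat on the mat and the cat sat again".split()
actual, bound, efficiency = coding_efficiency(tokens)
print(f"actual={actual:.2f} chars, bound={bound:.2f} chars, "
      f"efficiency={efficiency:.2f}")
```

The gap between `actual` and `bound` in such a sketch is exactly the kind of slack the constraints discussed above (a productive vocabulary, morphology, graphotactics) are meant to explain.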