For interpreting the behavior of a probabilistic model, it is useful to measure the model's calibration, that is, the extent to which it produces reliable confidence scores. We address the open problem of calibration for tagging models with sparse tagsets, and recommend strategies to measure and reduce calibration error (CE) in such models. We show that several post-hoc recalibration techniques all reduce calibration error across the marginal distribution for two existing sequence taggers. Moreover, we propose tag frequency grouping (TFG) as a way to measure calibration error in different frequency bands. Further, recalibrating each group separately promotes a more equitable reduction of calibration error across the tag frequency spectrum.
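As a rough illustration of measuring calibration error per frequency band, the sketch below computes a binned expected calibration error and then reports it separately for groups of tags bucketed by how often they occur. It is not the authors' implementation. The abstract does not specify the estimator, the number of bins, or how bands are defined, so the equal-width binning, the grouping by gold-tag count, the `boundaries` thresholds, and all function names here are assumptions made for illustration.

```python
from collections import Counter


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned calibration error: weighted mean of |accuracy - mean confidence|.

    Assumption: equal-width bins over [0, 1]; the source does not name an estimator.
    """
    if not confidences:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        error += (len(members) / total) * abs(accuracy - avg_conf)
    return error


def group_by_tag_frequency(gold_tags, boundaries=(10, 100)):
    """Map each tag to a band index: the number of thresholds its count reaches.

    Assumption: bands are defined by raw counts of gold tags in the evaluated data.
    """
    counts = Counter(gold_tags)
    return {tag: sum(count >= b for b in boundaries) for tag, count in counts.items()}


def calibration_error_per_group(gold_tags, predicted_tags, confidences,
                                boundaries=(10, 100), n_bins=10):
    """Compute the binned calibration error separately for each frequency band."""
    band_of = group_by_tag_frequency(gold_tags, boundaries)
    per_band = {}
    for gold, pred, conf in zip(gold_tags, predicted_tags, confidences):
        confs, oks = per_band.setdefault(band_of[gold], ([], []))
        confs.append(conf)
        oks.append(pred == gold)
    return {band: expected_calibration_error(confs, oks, n_bins)
            for band, (confs, oks) in sorted(per_band.items())}
```

Calling `calibration_error_per_group(gold, predicted, confidences)` returns one error value per band, with band 0 holding the rarest tags. Under the same assumptions, group-wise recalibration would fit a separate post-hoc calibrator on each band's predictions rather than a single one on the pooled data.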
References used
https://aclanthology.org/