Style is an integral part of natural language. However, evaluation methods for style measures are rare, are often task-specific, and usually do not control for content. We propose the modular, fine-grained and content-controlled similarity-based STyle EvaLuation framework (STEL) to test the performance of any model that can compare two sentences on style. We illustrate STEL with two general dimensions of style (formal/informal and simple/complex) as well as two specific characteristics of style (contrac'tion and numb3r substitution). We find that BERT-based methods outperform simple versions of commonly used style measures such as 3-grams, punctuation frequency and LIWC-based approaches. We invite the addition of further tasks and task instances to STEL and hope to facilitate the improvement of style-sensitive measures.
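To make concrete what "a model that can compare two sentences on style" might look like, the following is a minimal sketch of two simple baselines of the kind named above: character 3-gram similarity and punctuation-frequency similarity. It is not the authors' implementation. The function names (`char_ngrams`, `punctuation_counts`, `style_similarity`, `pick_same_style`), the use of cosine similarity, and the anchor-versus-candidates setup are all assumptions made for illustration; the abstract does not specify the STEL task format.

```python
from collections import Counter
import math
import string


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumed featurizer, n=3 by default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def punctuation_counts(text):
    """Count ASCII punctuation marks (assumed featurizer)."""
    return Counter(ch for ch in text if ch in string.punctuation)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_similarity(s1, s2, featurizer=char_ngrams):
    """Score how similar two sentences are under the chosen featurizer."""
    return cosine(featurizer(s1), featurizer(s2))


def pick_same_style(anchor, candidates, featurizer=char_ngrams):
    """Hypothetical task: return the index of the candidate closest to the anchor."""
    scores = [style_similarity(anchor, c, featurizer) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```

For example, `pick_same_style("I can't, I won't!", ["I cannot and I will not", "Don't, please!"], punctuation_counts)` selects the second candidate, because the first contains no punctuation at all. Any other scoring function, such as one built on sentence embeddings, could be passed in the same way as long as it maps a sentence to a count-like dictionary.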
References used
https://aclanthology.org/