In the last few years, several methods have been proposed to build meta-embeddings. The general aim was to obtain new representations integrating complementary knowledge from different source pre-trained embeddings, thereby improving their overall quality. However, previous meta-embeddings have been evaluated using a variety of methods and datasets, which makes it difficult to draw meaningful conclusions regarding the merits of each approach. In this paper we propose a unified common framework, including both intrinsic and extrinsic tasks, for a fair and objective meta-embeddings evaluation. Furthermore, we present a new method to generate meta-embeddings, outperforming previous work on a large number of intrinsic evaluation benchmarks. Our evaluation framework also allows us to conclude that previous extrinsic evaluations of meta-embeddings have been overestimated.
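The abstract does not describe how its meta-embeddings are constructed. As a point of reference only, two common baselines from the meta-embedding literature are concatenation and dimension-padded averaging of the source vectors; the sketch below illustrates those baselines, not the paper's new method, and the toy vocabularies are purely illustrative assumptions.

```python
import numpy as np

def concat_meta_embedding(word, sources):
    """Concatenate a word's vectors from several source embedding dicts.
    `sources` is a list of dicts mapping word -> 1-D numpy array; missing
    words fall back to a zero vector of that source's dimension."""
    parts = []
    for emb in sources:
        dim = len(next(iter(emb.values())))
        parts.append(emb.get(word, np.zeros(dim)))
    return np.concatenate(parts)

def average_meta_embedding(word, sources):
    """Average a word's vectors after zero-padding them to a common dimension."""
    dims = [len(next(iter(emb.values()))) for emb in sources]
    target = max(dims)
    padded = []
    for emb, dim in zip(sources, dims):
        vec = emb.get(word, np.zeros(dim))
        padded.append(np.pad(vec, (0, target - dim)))
    return np.mean(padded, axis=0)

# Toy source embeddings (illustrative only).
glove_like = {"bank": np.array([0.1, 0.3])}
fasttext_like = {"bank": np.array([0.2, -0.1, 0.4])}

print(concat_meta_embedding("bank", [glove_like, fasttext_like]))   # 5-dim vector
print(average_meta_embedding("bank", [glove_like, fasttext_like]))  # 3-dim vector
```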
This paper offers a comparative evaluation of four commercial ASR systems, assessed according to the post-editing effort required to reach "publishable" quality and according to the number of errors they produce. For the error annotation task, an original error typology for transcription errors is proposed. This study also seeks to examine whether there is a difference in the performance of these systems between native and non-native English speakers. The experimental results suggest that among the four systems, Trint obtains the best scores. It is also observed that most systems perform noticeably better with native speakers and that all systems are most prone to fluency errors.
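The study relies on post-editing effort and its own error typology rather than on an automatic metric; as generic background only, word error rate (WER) is the standard automatic measure of transcription errors. The sketch below is a plain Levenshtein-based WER computation and does not reflect the paper's annotation scheme or any system's actual output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative transcript pair (not from the study's data).
print(word_error_rate("the speaker described the results",
                      "the speaker describe results"))  # 0.4
```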
Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.
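The core mechanic described here is that an example enters the dataset only if it fools the target model while a human validator still agrees with the intended label. The loop below is a conceptual sketch of that acceptance rule, not Dynabench's actual API; `target_model` and `human_label` are hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

def collect_adversarial_examples(
    candidates: List[Tuple[str, str]],      # (text, annotator's intended label)
    target_model: Callable[[str], str],     # hypothetical model-in-the-loop
    human_label: Callable[[str], str],      # hypothetical second-annotator check
) -> List[Tuple[str, str]]:
    """Keep only examples the target model gets wrong but a human validator confirms."""
    accepted = []
    for text, intended in candidates:
        model_wrong = target_model(text) != intended
        human_agrees = human_label(text) == intended
        if model_wrong and human_agrees:
            accepted.append((text, intended))
    return accepted

# Toy usage with stubs standing in for the real model and validator.
stub_model = lambda text: "positive"   # a model that always predicts "positive"
stub_human = lambda text: "negative"   # a validator who confirms the intended label
pool = [("the plot was a complete mess", "negative")]
print(collect_adversarial_examples(pool, stub_model, stub_human))
```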
Post-hoc explanation methods are an important class of approaches that help understand the rationale underlying a trained model's decision. But how useful are they to an end-user trying to accomplish a given task? In this vision paper, we argue the need for a benchmark to facilitate evaluations of the utility of post-hoc explanation methods. As a first step to this end, we enumerate desirable properties that such a benchmark should possess for the task of debugging text classifiers. Additionally, we highlight that such a benchmark facilitates not only assessing the effectiveness of explanations but also their efficiency.
The introduction of transformer-based language models has been a revolutionary step for natural language processing (NLP) research. These models, such as BERT, GPT and ELECTRA, led to state-of-the-art performance in many NLP tasks. Most of these models were initially developed for English, with models for other languages following later. Recently, several Arabic-specific models started emerging. However, there are limited direct comparisons between these models. In this paper, we evaluate the performance of 24 of these models on Arabic sentiment and sarcasm detection. Our results show that the models achieving the best performance are those that are trained on only Arabic data, including dialectal Arabic, and use a larger number of parameters, such as the recently released MARBERT. However, we noticed that AraELECTRA is one of the top-performing models while being much more computationally efficient. Finally, the experiments on AraGPT2 variants showed low performance compared to BERT models, which indicates that they might not be suitable for classification tasks.
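The abstract does not spell out the evaluation pipeline. A minimal sketch of how such a model comparison is typically run with the Hugging Face transformers library is shown below, assuming placeholder Hub identifiers and a tiny toy dev set; the paper's actual 24 models, datasets and fine-tuning setup are not reproduced here.

```python
# A minimal sketch (assumptions: the listed Hub ids and the toy dev set are
# illustrative; in a real comparison each classification head would first be
# fine-tuned on the task's training split before scoring).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CANDIDATE_MODELS = [
    "UBC-NLP/MARBERT",                           # assumed Hub id
    "aubmindlab/araelectra-base-discriminator",  # assumed Hub id
]

dev_texts = ["الخدمة ممتازة", "التجربة كانت سيئة جدا"]  # toy examples, not the study's data
dev_labels = [1, 0]                                     # 1 = positive, 0 = negative

def accuracy_of(model_name: str) -> float:
    """Load a checkpoint with a classification head and score it on the toy dev set."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    model.eval()
    inputs = tokenizer(dev_texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        preds = model(**inputs).logits.argmax(dim=-1).tolist()
    return sum(p == y for p, y in zip(preds, dev_labels)) / len(dev_labels)

for name in CANDIDATE_MODELS:
    print(name, accuracy_of(name))
```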
The research mainly aims to study the Benchmarking method as a means of continuous quality improvement and the possibility of applying it in Syrian banks, and to identify any obstacles to such an application in order to find the right solutions.
The research mainly aims to study the Benchmarking method and the possibility of applying it in Syrian banks, and to identify any obstacles to such an application in order to find the right solutions; to this end, the researcher conducted an extensive theoretical study.