Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation

PHOMT: مجموعة بيانات معيار عالية الجودة وعالية المستوى للترجمة الفيتنامية-الإنجليزية

870 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

vietnamese-english machine translation benchmark vietnamese-english machine الترجمة الفيتنامية-الإنجليزية القياس الفيتنامية الآلة الإنجليزية صناعة حمض الفوسفور

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We introduce a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs, which is 2.9M pairs larger than the benchmark Vietnamese-English machine translation corpus IWSLT15. We conduct experiments comparing strong neural baselines and well-known automatic translation engines on our dataset and find that in both automatic and human evaluations: the best performance is obtained by fine-tuning the pre-trained sequence-to-sequence denoising auto-encoder mBART. To our best knowledge, this is the first large-scale Vietnamese-English machine translation study. We hope our publicly available dataset and study can serve as a starting point for future research and applications on Vietnamese-English machine translation. We release our dataset at: https://github.com/VinAIResearch/PhoMT

References used

https://aclanthology.org/

rate research

Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation

872 - Association for Computation Linguistics 2021 مقالة

Recent works have found evidence of gender bias in models of machine translation and coreference resolution using mostly synthetic diagnostic datasets. While these quantify bias in a controlled experiment, they often do so on a small scale and consis t mostly of artificial, out-of-distribution sentences. In this work, we find grammatical patterns indicating stereotypical and non-stereotypical gender-role assignments (e.g., female nurses versus male dancers) in corpora from three domains, resulting in a first large-scale gender bias dataset of 108K diverse real-world English sentences. We manually verify the quality of our corpus and use it to evaluate gender bias in various coreference resolution and machine translation models. We find that all tested models tend to over-rely on gender stereotypes when presented with natural inputs, which may be especially harmful when deployed in commercial systems. Finally, we show that our dataset lends itself to finetuning a coreference resolution model, finding it mitigates bias on a held out set. Our dataset and models are publicly available at github.com/SLAB-NLP/BUG. We hope they will spur future research into gender bias evaluation mitigation techniques in realistic settings.

لغة إزالة السموم صناعة حمض الفوسفور

A Large-Scale Dataset for Empathetic Response Generation

729 - Association for Computation Linguistics 2021 مقالة

Recent development in NLP shows a strong trend towards refining pre-trained models with a domain-specific dataset. This is especially the case for response generation where emotion plays an important role. However, existing empathetic datasets remain small, delaying research efforts in this area, for example, the development of emotion-aware chatbots. One main technical challenge has been the cost of manually annotating dialogues with the right emotion labels. In this paper, we describe a large-scale silver dataset consisting of 1M dialogues annotated with 32 fine-grained emotions, eight empathetic response intents, and the Neutral category. To achieve this goal, we have developed a novel data curation pipeline starting with a small seed of manually annotated data and eventually scaling it to a satisfactory size. We compare its quality against a state-of-the-art gold dataset using both offline experiments and visual validation methods. The resultant procedure can be used to create similar datasets in the same domain as well as in other domains.

قواعد المنطق empathetic response استجابة متعاطف صناعة حمض الفوسفور

FANG-COVID: A New Large-Scale Benchmark Dataset for Fake News Detection in German

766 - Association for Computation Linguistics 2021 مقالة

As the world continues to fight the COVID-19 pandemic, it is simultaneously fighting an infodemic' -- a flood of disinformation and spread of conspiracy theories leading to health threats and the division of society. To combat this infodemic, there i s an urgent need for benchmark datasets that can help researchers develop and evaluate models geared towards automatic detection of disinformation. While there are increasing efforts to create adequate, open-source benchmark datasets for English, comparable resources are virtually unavailable for German, leaving research for the German language lagging significantly behind. In this paper, we introduce the new benchmark dataset FANG-COVID consisting of 28,056 real and 13,186 fake German news articles related to the COVID-19 pandemic as well as data on their propagation on Twitter. Furthermore, we propose an explainable textual- and social context-based model for fake news detection, compare its performance to black-box'' models and perform feature ablation to assess the relative importance of human-interpretable features in distinguishing fake news from authentic news.

large-scale benchmark dataset benchmark dataset benchmark dataset fang-covid مجموعة البيانات القياسية واسعة النطاق معيار DataSet. معيار DataSet Fang-Covid صناعة حمض الفوسفور المزيد..

A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection

825 - Association for Computation Linguistics 2021 مقالة

In this paper, we introduce a new English Twitter-based dataset for cyberbullying detection and online abuse. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabiliti es of various forms of bullying and offensive content, including insult, trolling, profanity, sarcasm, threat, porn and exclusion. We recruited a pool of 17 annotators to perform fine-grained annotation on the dataset with each tweet annotated by three annotators. All our annotators are high school educated and frequent users of social media. Inter-rater agreement for the dataset as measured by Krippendorff's Alpha is 0.67. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.

online abuse detection large-scale english multi-label english multi-label twitter اكتشاف إساءة الاستخدام عبر الإنترنت الترمية الإنجليزية متعددة الواسعة الإنجليزية متعددة التسمية تويتر صناعة حمض الفوسفور المزيد..

MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

849 - Association for Computation Linguistics 2021 مقالة

This paper introduces MediaSum, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and employ the overview and topic description s as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of televised and radioed interviews. We also show that MediaSum can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.

large-scale media interview media interview dataset large-scale media مقابلة وسائط واسعة النطاق بيانات مقابلة مع وسائل الإعلام وسائل الإعلام واسعة النطاق صناعة حمض الفوسفور المزيد..

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation

PHOMT: مجموعة بيانات معيار عالية الجودة وعالية المستوى للترجمة الفيتنامية-الإنجليزية

Ask ChatGPT about the research

Read More

suggested questions