
Itihasa: A large-scale corpus for Sanskrit to English translation


Publication date: 2021
Language: English





This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
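
To make the corpus concrete, the following is a minimal sketch of loading it and inspecting shloka/translation lengths. It assumes the data is mirrored on the Hugging Face Hub under the id rahular/itihasa with a "translation" field keyed by "sn" and "en"; these identifiers are assumptions rather than details stated above, so adapt them to however you actually obtain the corpus.

```python
# Minimal sketch: load the Itihasa corpus and compute rough length statistics.
# Assumptions (not stated in the abstract above): the data is available on the
# Hugging Face Hub as "rahular/itihasa", and each row has a "translation" dict
# with "sn" (Sanskrit shloka) and "en" (English rendering) keys.
from datasets import load_dataset

ds = load_dataset("rahular/itihasa")  # hypothetical hub id

def add_lengths(example):
    # Whitespace token counts; enough to show how long and free-form the
    # English rendering of a single shloka can be.
    return {
        "sn_len": len(example["translation"]["sn"].split()),
        "en_len": len(example["translation"]["en"].split()),
    }

train = ds["train"].map(add_lengths)
print(len(train), "training pairs")
print("avg Sanskrit tokens:", sum(train["sn_len"]) / len(train))
print("avg English tokens:", sum(train["en_len"]) / len(train))
```

From such a loader, a standard sequence-to-sequence fine-tuning setup can reproduce the kind of transformer baselines whose poor performance the paper reports.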



Related research

Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including, i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains and iv) human evaluation scores. All models, scripts, and data will be released to the public.
Recent development in NLP shows a strong trend towards refining pre-trained models with a domain-specific dataset. This is especially the case for response generation where emotion plays an important role. However, existing empathetic datasets remain small, delaying research efforts in this area, for example, the development of emotion-aware chatbots. One main technical challenge has been the cost of manually annotating dialogues with the right emotion labels. In this paper, we describe a large-scale silver dataset consisting of 1M dialogues annotated with 32 fine-grained emotions, eight empathetic response intents, and the Neutral category. To achieve this goal, we have developed a novel data curation pipeline starting with a small seed of manually annotated data and eventually scaling it to a satisfactory size. We compare its quality against a state-of-the-art gold dataset using both offline experiments and visual validation methods. The resultant procedure can be used to create similar datasets in the same domain as well as in other domains.
This paper describes the construction of a new large-scale English-Japanese Simultaneous Interpretation (SI) corpus and presents the results of its analysis. A portion of the corpus contains SI data from three interpreters with different amounts of experience. Some of the SI data were manually aligned with the source speeches at the sentence level. Their latency, quality, and word order aspects were compared among the SI data themselves as well as against offline translations. The results showed that (1) interpreters with more experience controlled the latency and quality better, and (2) large latency hurt the SI quality.
Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.
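
As an illustration of the contrast drawn above between surface-overlap metrics and automatic metrics based on pre-trained embeddings, the sketch below scores a toy hypothesis/reference pair with BLEU (via sacrebleu) and BERTScore. It is not the MQM procedure itself, which is a human annotation scheme, and the example sentences are invented.

```python
# Minimal sketch: score the same output with a surface-overlap metric (BLEU)
# and an embedding-based metric (BERTScore), the two metric families
# contrasted in the study above. The sentences are made up for illustration.
import sacrebleu
from bert_score import score

hypotheses = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# sacrebleu expects a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print("BLEU:", round(bleu.score, 2))

# BERTScore compares contextual embeddings rather than n-gram overlap.
P, R, F1 = score(hypotheses, references, lang="en")
print("BERTScore F1:", round(F1.mean().item(), 4))
```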
This paper introduces MediaSum, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and employ the overview and topic descriptions as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of televised and radioed interviews. We also show that MediaSum can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.
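
As a rough illustration of the dialogue summarization task that MediaSum targets, the sketch below runs an off-the-shelf abstractive summarizer on an invented interview snippet. The checkpoint and text are illustrative placeholders, not artifacts from the paper; transfer learning on MediaSum, as the authors describe, would add a further fine-tuning step on top of such a baseline.

```python
# Minimal sketch: abstractive summarization of an interview-style transcript
# with an off-the-shelf checkpoint. Model choice and transcript are assumed
# placeholders, not taken from the MediaSum paper.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcript = (
    "HOST: Today we're talking about large-scale dialogue datasets. "
    "GUEST: Right, interview transcripts are long and multi-party, which makes "
    "summarizing them harder than summarizing news articles."
)

summary = summarizer(transcript, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```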
