This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
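To make the benchmarking setup concrete, the sketch below shows one common way to score a translation model's outputs against reference English translations using sacreBLEU. This is an illustrative pipeline, not the paper's exact evaluation code, and the file names are hypothetical.

```python
# Minimal sketch of corpus-level BLEU scoring with sacreBLEU.
# Assumes two hypothetical plain-text files with one translation per line:
# hypotheses.en (model outputs) and references.en (gold English translations).
import sacrebleu

with open("hypotheses.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

with open("references.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the hypothesis list and a list of reference streams
# (one stream per reference set; here, a single reference per sentence).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```

A low corpus BLEU under such a setup is one signal of the dataset's difficulty, consistent with the abstract's claim that even state-of-the-art transformers perform poorly.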