Demo of Sanskrit-Hindi SMT System

65 0 0.0 ( 0 )

Download Cite

Added by Atul Kr. Ojha Mr.

Publication date 2018

fields Informatics Engineering

and research's language is English

Authors Rajneesh Pandey - Atul Kr. Ojha - Girish Nath Jha

Computation and Language

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

The demo proposal presents a Phrase-based Sanskrit-Hindi (SaHiT) Statistical Machine Translation system. The system has been developed on Moses. 43k sentences of Sanskrit-Hindi parallel corpus and 56k sentences of a monolingual corpus in the target language (Hindi) have been used. This system gives 57 BLEU score.

rate research

Automatic Language Identification System for Hindi and Magahi

128 - Priya Rani , Atul Kr. Ojha , Girish Nath Jha 2018

Language identification has become a prerequisite for all kinds of automated text processing systems. In this paper, we present a rule-based language identifier tool for two closely related Indo-Aryan languages: Hindi and Magahi. This system has currently achieved an accuracy of approx 86.34%. We hope to improve this in the future. Automatic identification of languages will be significant in the accuracy of output of Web Crawlers.

Computation and Language

English-Bhojpuri SMT System: Insights from the Karaka Model

115 - Atul Kr. Ojha 2019

This thesis has been divided into six chapters namely: Introduction, Karaka Model and it impacts on Dependency Parsing, LT Resources for Bhojpuri, English-Bhojpuri SMT System: Experiment, Evaluation of EB-SMT System, and Conclusion. Chapter one introduces this PhD research by detailing the motivation of the study, the methodology used for the study and the literature review of the existing MT related work in Indian Languages. Chapter two talks of the theoretical background of Karaka and Karaka model. Along with this, it talks about previous related work. It also discusses the impacts of the Karaka model in NLP and dependency parsing. It compares Karaka dependency and Universal Dependency. It also presents a brief idea of the implementation of these models in the SMT system for English-Bhojpuri language pair.

Computation and Language

Evaluating Neural Word Embeddings for Sanskrit

68 - Jivnesh Sandhan , Om Adideva , Digumarthi Komal 2021

Recently, the supervised learning paradigms surprisingly remarkable performance has garnered considerable attention from Sanskrit Computational Linguists. As a result, the Sanskrit community has put laudable efforts to build task-specific labeled data for various downstream Natural Language Processing (NLP) tasks. The primary component of these approaches comes from representations of word embeddings. Word embedding helps to transfer knowledge learned from readily available unlabelled data for improving task-specific performance in low-resource setting. Last decade, there has been much excitement in the field of digitization of Sanskrit. To effectively use such readily available resources, it is very much essential to perform a systematic study on word embedding approaches for the Sanskrit language. In this work, we investigate the effectiveness of word embeddings. We classify word embeddings in broad categories to facilitate systematic experimentation and evaluate them on four intrinsic tasks. We investigate the efficacy of embeddings approaches (originally proposed for languages other than Sanskrit) for Sanskrit along with various challenges posed by language.

Computation and Language

Hostility Detection Dataset in Hindi

268 - Mohit Bhardwaj , Md Shad Akhtar , Asif Ekbal 2020

In this paper, we present a novel hostility detection dataset in Hindi language. We collect and manually annotate ~8200 online posts. The annotated dataset covers four hostility dimensions: fake news, hate speech, offensive, and defamation posts, along with a non-hostile label. The hostile posts are also considered for multi-label tags due to a significant overlap among the hostile classes. We release this dataset as part of the CONSTRAINT-2021 shared task on hostile post detection.

Computation and Language

Bollyrics: Automatic Lyrics Generator for Romanised Hindi

104 - Naman Jain , Ankush Chauhan , Atharva Chewale 2020

Song lyrics convey a meaningful story in a creative manner with complex rhythmic patterns. Researchers have been successful in generating and analyisng lyrics for poetry and songs in English and Chinese. But there are no works which explore the Hindi language datasets. Given the popularity of Hindi songs across the world and the ambiguous nature of romanized Hindi script, we propose Bollyrics, an automatic lyric generator for romanized Hindi songs. We propose simple techniques to capture rhyming patterns before and during the model training process in Hindi language. The dataset and codes are available publicly at https://github.com/lingo-iitgn/Bollyrics.

Computation and Language