Automatic Language Identification System for Hindi and Magahi

129 0 0.0 ( 0 )

Download Cite

Added by Atul Kr. Ojha Mr.

Publication date 2018

fields Informatics Engineering

and research's language is English

Authors Priya Rani - Atul Kr. Ojha - Girish Nath Jha

Computation and Language

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Language identification has become a prerequisite for all kinds of automated text processing systems. In this paper, we present a rule-based language identifier tool for two closely related Indo-Aryan languages: Hindi and Magahi. This system has currently achieved an accuracy of approx 86.34%. We hope to improve this in the future. Automatic identification of languages will be significant in the accuracy of output of Web Crawlers.

rate research

Bollyrics: Automatic Lyrics Generator for Romanised Hindi

104 - Naman Jain , Ankush Chauhan , Atharva Chewale 2020

Song lyrics convey a meaningful story in a creative manner with complex rhythmic patterns. Researchers have been successful in generating and analyisng lyrics for poetry and songs in English and Chinese. But there are no works which explore the Hindi language datasets. Given the popularity of Hindi songs across the world and the ambiguous nature of romanized Hindi script, we propose Bollyrics, an automatic lyric generator for romanized Hindi songs. We propose simple techniques to capture rhyming patterns before and during the model training process in Hindi language. The dataset and codes are available publicly at https://github.com/lingo-iitgn/Bollyrics.

Computation and Language

All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media

54 - Jasabanta Patro , Bidisha Samanta , Saurabh Singh 2017

In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on the signals from social media. In terms of Spearman correlation coefficient values, our methods perform more than two times better (nearly 0.62) in predicting the borrowing likeliness compared to the best performing baseline (nearly 0.26) reported in literature. Based on this likeliness estimate we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88 percent of cases the annotators felt that the foreign language tag should be replaced by native language tag, thus indicating a huge scope for improvement of automatic language identification systems.

Computation and Language

Automatic Parallel Corpus Creation for Hindi-English News Translation Task

70 - Aditya Kumar Pathak , Priyankit Acharya , Dilpreet Kaur 2019

The parallel corpus for multilingual NLP tasks, deep learning applications like Statistical Machine Translation Systems is very important. The parallel corpus of Hindi-English language pair available for news translation task till date is of very limited size as per the requirement of the systems are concerned. In this work we have developed an automatic parallel corpus generation system prototype, which creates Hindi-English parallel corpus for news translation task. Further to verify the quality of generated parallel corpus we have experimented by taking various performance metrics and the results are quite interesting.

Computation and Language

Demo of Sanskrit-Hindi SMT System

64 - Rajneesh Pandey , Atul Kr. Ojha , Girish Nath Jha 2018

The demo proposal presents a Phrase-based Sanskrit-Hindi (SaHiT) Statistical Machine Translation system. The system has been developed on Moses. 43k sentences of Sanskrit-Hindi parallel corpus and 56k sentences of a monolingual corpus in the target language (Hindi) have been used. This system gives 57 BLEU score.

Computation and Language

Hostility Detection in Hindi leveraging Pre-Trained Language Models

120 - Ojasv Kamal , Adarsh Kumar , Tejas Vaidhya 2021

Hostile content on social platforms is ever increasing. This has led to the need for proper detection of hostile posts so that appropriate action can be taken to tackle them. Though a lot of work has been done recently in the English Language to solve the problem of hostile content online, similar works in Indian Languages are quite hard to find. This paper presents a transfer learning based approach to classify social media (i.e Twitter, Facebook, etc.) posts in Hindi Devanagari script as Hostile or Non-Hostile. Hostile posts are further analyzed to determine if they are Hateful, Fake, Defamation, and Offensive. This paper harnesses attention based pre-trained models fine-tuned on Hindi data with Hostile-Non hostile task as Auxiliary and fusing its features for further sub-tasks classification. Through this approach, we establish a robust and consistent model without any ensembling or complex pre-processing. We have presented the results from our approach in CONSTRAINT-2021 Shared Task on hostile post detection where our model performs extremely well with 3rd runner up in terms of Weighted Fine-Grained F1 Score.

Computation and Language