Do you want to publish a course? Click here

Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts

79   0   0.0 ( 0 )
 Added by Hossein Hassani
 Publication date 2020
and research's language is English




Ask ChatGPT about the research

Segmentation is a fundamental step for most Natural Language Processing tasks. The Kurdish language is a multi-dialect, under-resourced language which is written in different scripts. The lack of various segmented corpora is one of the major bottlenecks in Kurdish language processing. We used Punkt, an unsupervised machine learning method, to segment a Kurdish corpus of Sorani dialect, written in Persian-Arabic script. According to the literature, studies on using Punkt on non-Latin data are scanty. In our experiment, we achieved an F1 score of 91.10% and had an Error Rate of 16.32%. The high Error Rate is mainly due to the situation of abbreviations in Kurdish and partly because of ordinal numerals. The data is publicly available at https://github.com/KurdishBLARK/ KTC-Segmented for non-commercial use under the CC BY-NC-SA 4.0 licence.

rate research

Read More

62 - Sina Ahmadi 2021
Sorani Kurdish, also known as Central Kurdish, has a complex morphology, particularly due to the patterns in which morphemes appear. Although several aspects of Kurdish morphology have been studied, such as pronominal endoclitics and Izafa constructions, Sorani Kurdish morphology has received trivial attention in computational linguistics. Moreover, some morphemes, such as the emphasis endoclitic =^ic{s}, and derivational morphemes have not been previously studied. To tackle the complex morphology of Sorani, we provide a thorough description of Sorani Kurdish morphological and morphophonological constructions in a formal way such that they can be used as finite-state transducers for morphological analysis and synthesis.
90 - Sina Ahmadi 2021
Spell checking and morphological analysis are two fundamental tasks in text and natural language processing and are addressed in the early stages of the development of language technology. Despite the previous efforts, there is no progress in open-source to create such tools for Sorani Kurdish, also known as Central Kurdish, as a less-resourced language. In this paper, we present our efforts in annotating a lexicon with morphosyntactic tags and also, extracting morphological rules of Sorani Kurdish to build a morphological analyzer, a stemmer and a spell-checking system using Hunspell. This implementation can be used for further developments in the field by researchers and also, be integrated into text editors under a publicly available license.
The present paper aims at presenting a lemmatization and a word-level error correction system for Sorani Kurdish. We propose a hybrid approach based on the morphological rules and a n-gram language model. We have called our lemmatization and error correction systems Peyv and R^en^us respectively, which are the first tools presented for Sorani Kurdish to the best of our knowledge. The Peyv lemmatizer has shown 86.7% accuracy. As for R^en^us, using a lexicon, we have obtained 96.4% accuracy while without a lexicon, the correction system has 87% accuracy. As two fundamental text processing tools, these tools can pave the way for further researches on more natural language processing applications for Sorani Kurdish.
We present an experimental dataset, Basic Dataset for Sorani Kurdish Automatic Speech Recognition (BD-4SK-ASR), which we used in the first attempt in developing an automatic speech recognition for Sorani Kurdish. The objective of the project was to develop a system that automatically could recognize simple sentences based on the vocabulary which is used in grades one to three of the primary schools in the Kurdistan Region of Iraq. We used CMUSphinx as our experimental environment. We developed a dataset to train the system. The dataset is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.
105 - Qingyu Chen , Yifan Peng , 2018
Sentence embeddings have become an essential part of todays natural language processing (NLP) systems, especially together advanced deep learning methods. Although pre-trained sentence encoders are available in the general domain, none exists for biomedical texts to date. In this work, we introduce BioSentVec: the first open set of sentence embeddings trained with over 30 million documents from both scholarly articles in PubMed and clinical notes in the MIMIC-III Clinical Database. We evaluate BioSentVec embeddings in two sentence pair similarity tasks in different text genres. Our benchmarking results demonstrate that the BioSentVec embeddings can better capture sentence semantics compared to the other competitive alternatives and achieve state-of-the-art performance in both tasks. We expect BioSentVec to facilitate the research and development in biomedical text mining and to complement the existing resources in biomedical word embeddings. BioSentVec is publicly available at https://github.com/ncbi-nlp/BioSentVec
comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا