Gender Prediction Based on Vietnamese Names with Machine Learning Techniques

114 0 0.0 ( 0 )

Download Cite

Added by Huy Quoc To

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Huy Quoc To - Kiet Van Nguyen - Ngan Luu-Thuy Nguyen

Computation and Language

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

As biological gender is one of the aspects of presenting individual human, much work has been done on gender classification based on people names. The proposals for English and Chinese languages are tremendous; still, there have been few works done for Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotated with genders. This dataset is available on our website for research purposes. In addition, this paper describes six machine learning algorithms (Support Vector Machine, Multinomial Naive Bayes, Bernoulli Naive Bayes, Decision Tree, Random Forrest and Logistic Regression) and a deep learning model (LSTM) with fastText word embedding for gender prediction on Vietnamese names. We create a dataset and investigate the impact of each name component on detecting gender. As a result, the best F1-score that we have achieved is up to 96% on LSTM model and we generate a web API based on our trained model.

rate research

Sentence Extraction-Based Machine Reading Comprehension for Vietnamese

118 - Phong Nguyen-Thuan Do , Nhat Duy Nguyen , Tin Van Huynh 2021

The development of natural language processing (NLP) in general and machine reading comprehension in particular has attracted the great attention of the research community. In recent years, there are a few datasets for machine reading comprehension tasks in Vietnamese with large sizes, such as UIT-ViQuAD and UIT-ViNewsQA. However, the datasets are not diverse in answers to serve the research. In this paper, we introduce UIT-ViWikiQA, the first dataset for evaluating sentence extraction-based machine reading comprehension in the Vietnamese language. The UIT-ViWikiQA dataset is converted from the UIT-ViQuAD dataset, consisting of comprises 23.074 question-answers based on 5.109 passages of 174 Wikipedia Vietnamese articles. We propose a conversion algorithm to create the dataset for sentence extraction-based machine reading comprehension and three types of approaches for sentence extraction-based machine reading comprehension in Vietnamese. Our experiments show that the best machine model is XLM-R_Large, which achieves an exact match (EM) of 85.97% and an F1-score of 88.77% on our dataset. Besides, we analyze experimental results in terms of the question type in Vietnamese and the effect of context on the performance of the MRC models, thereby showing the challenges from the UIT-ViWikiQA dataset that we propose to the language processing community.

Computation and Language

IoT Security Techniques Based on Machine Learning

256 - Liang Xiao , Xiaoyue Wan , Xiaozhen Lu 2018

Internet of things (IoT) that integrate a variety of devices into networks to provide advanced and intelligent services have to protect user privacy and address attacks such as spoofing attacks, denial of service attacks, jamming and eavesdropping. In this article, we investigate the attack model for IoT systems, and review the IoT security solutions based on machine learning techniques including supervised learning, unsupervised learning and reinforcement learning. We focus on the machine learning based IoT authentication, access control, secure offloading and malware detection schemes to protect data privacy. In this article, we discuss the challenges that need to be addressed to implement these machine learning based security schemes in practical IoT systems.

Cryptography and Security

Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension

86 - Kiet Van Nguyen , Khiem Vinh Tran , Son T. Luu 2020

Although Vietnamese is the 17th most popular native-speaker language in the world, there are not many research studies on Vietnamese machine reading comprehension (MRC), the task of understanding a text and answering questions about it. One of the reasons is because of the lack of high-quality benchmark datasets for this task. In this work, we construct a dataset which consists of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts which are commonly used for teaching reading comprehension for elementary school pupils. In addition, we propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text. We compare the performance of the proposed model with several baseline lexical-based and neural network-based models. Our proposed method achieves 61.81% by accuracy, which is 5.51% higher than the best baseline model. We also measure human performance on our dataset and find that there is a big gap between machine-model and human performances. This indicates that significant progress can be made on this task. The dataset is freely available on our website for research purposes.

Computation and Language

Automated Agriculture Commodity Price Prediction System with Machine Learning Techniques

119 - Zhiyuan Chen , Howe Seng Goh , Kai Ling Sin 2021

The intention of this research is to study and design an automated agriculture commodity price prediction system with novel machine learning techniques. Due to the increasing large amounts historical data of agricultural commodity prices and the need of performing accurate prediction of price fluctuations, the solution has largely shifted from statistical methods to machine learning area. However, the selection of proper set from historical data for forecasting still has limited consideration. On the other hand, when implementing machine learning techniques, finding a suitable model with optimal parameters for global solution, nonlinearity and avoiding curse of dimensionality are still biggest challenges, therefore machine learning strategies study are needed. In this research, we propose a web-based automated system to predict agriculture commodity price. In the two series experiments, five popular machine learning algorithms, ARIMA, SVR, Prophet, XGBoost and LSTM have been compared with large historical datasets in Malaysia and the most optimal algorithm, LSTM model with an average of 0.304 mean-square error has been selected as the prediction engine of the proposed system.

Machine Learning

Tokamak disruption prediction using different machine learning techniques

299 - Joost Croonen , Jorge Amaya , Giovanni Lapenta 2020

Disruption prediction and mitigation is of key importance in the development of sustainable tokamakreactors. Machine learning has become a key tool in this endeavour. In this paper multiple machinelearning models will be tested and compared. A particular focus has been placed on their portability.This describes how easily the models can be used with data from new devices. The methods used inthis paper are support vector machine, 2-tiered support vector machine, random forest, gradient boostedtrees and long-short term memory. The results show that the support vector machine performanceis marginally better among the standard models, while the gradient boosted trees performed the worst.The portable variant of each model had lower performance. Random forest obtained the highest portableperformance. Results also suggest that disruptions can be detected as early as 600ms before the event.An analysis of the computational cost showed all models run in less than 1ms, allowing sufficient timefor disruption mitigation.

Plasma Physics Signal Processing