Statistical patterns of word frequency suggesting the probabilistic nature of human languages

Published by Haitao Liu

Publication date: 2020

Research fields: Informatics Engineering, Physics

Paper language: English

Authors: Shuiyuan Yu - Chunshan Xu - Haitao Liu

Computation and Language, Computational Physics

Abstract

Traditional linguistic theories have largely regarded language as a formal system composed of rigid rules. However, their failures in processing real language, the recent successes of statistical natural language processing, and the findings of many psychological experiments suggest that language may be more a probabilistic system than a formal system, and thus cannot be faithfully modeled with the either/or rules of formal linguistic theory. The present study, based on authentic language data, confirms that important linguistic issues such as linguistic universals, diachronic drift, and language variation can be translated into probability and frequency patterns in parole. These findings suggest that human languages may well be probabilistic systems by nature, and that statistical patterns may well constitute inherent properties of human languages.

Read also

Learning Word Vectors for 157 Languages

Edouard Grave, Piotr Bojanowski, Prakhar Gupta (2018)

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the common crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.

Computation and Language, Machine Learning

Inferring human mobility using communication patterns

Vasyl Palchykov, Marija Mitrovic, Hang-Hyun Jo (2014)

Understanding the patterns of mobility of individuals is crucial for a number of reasons, from city planning to disaster management. There are two common ways of quantifying the amount of travel between locations: by direct observations that often involve privacy issues, e.g., tracking mobile phone locations, or by estimations from models. Typically, such models build on accurate knowledge of the population size at each location. However, when this information is not readily available, their applicability is rather limited. As mobile phones are ubiquitous, our aim is to investigate if mobility patterns can be inferred from aggregated mobile phone call data alone. Using data released by Orange for Ivory Coast, we show that human mobility is well predicted by a simple model based on the frequency of mobile phone calls between two locations and their geographical distance. We argue that the strength of the model comes from directly incorporating the social dimension of mobility. Furthermore, as only aggregated call data is required, the model helps to avoid potential privacy problems.

Physics and Society, Computational Physics, Data Analysis, Statistics and Probability

Analyzing the Surprising Variability in Word Embedding Stability Across Languages

Laura Burdick, Jonathan K. Kummerfeld, Rada Mihalcea (2020)

Word embeddings are powerful representations that form the foundation of many natural language processing architectures, both in English and in other languages. To gain further insight into word embeddings, we explore their stability (e.g., overlap between the nearest neighbors of a word in different embedding spaces) in diverse languages. We discuss linguistic properties that are related to stability, drawing out insights about correlations with affixing, language gender systems, and other features. This has implications for embedding use, particularly in research that uses them to study language trends.

Computation and Language
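
The stability notion mentioned in the Burdick et al. abstract above (overlap between the nearest neighbors of a word in different embedding spaces) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the toy two-dimensional vectors, and the choice of `k` are all assumptions made for illustration, and both spaces are assumed to share the same vocabulary and contain no zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, space, k):
    """Return the set of the k words most similar to `word` in one space."""
    target = space[word]
    others = [w for w in space if w != word]
    others.sort(key=lambda w: cosine(target, space[w]), reverse=True)
    return set(others[:k])


def stability(word, space_a, space_b, k=2):
    """Fraction of the k nearest neighbors of `word` shared by both spaces."""
    neighbors_a = nearest_neighbors(word, space_a, k)
    neighbors_b = nearest_neighbors(word, space_b, k)
    return len(neighbors_a & neighbors_b) / k


# Hypothetical toy embedding spaces over a shared four-word vocabulary.
space_a = {"cat": [1.0, 0.1], "dog": [0.9, 0.2],
           "car": [0.1, 1.0], "bus": [0.2, 0.9]}
space_b = {"cat": [0.2, 1.0], "dog": [0.3, 0.9],
           "car": [1.0, 0.1], "bus": [1.0, 0.3]}
space_c = {"cat": [1.0, 0.0], "dog": [0.0, 1.0],
           "car": [1.0, 0.1], "bus": [0.5, 0.5]}

print(stability("cat", space_a, space_b, k=1))
print(stability("cat", space_a, space_c, k=2))
```

A value of 1.0 means the word keeps all of its `k` nearest neighbors across the two spaces, and 0.0 means it keeps none. Real embedding spaces would use far larger vocabularies and higher-dimensional vectors.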

Debiasing Multilingual Word Embeddings: A Case Study of Three Indian Languages

Srijan Bansal, Vishal Garimella, Ayush Suhane (2021)

In this paper, we advance the current state-of-the-art method for debiasing monolingual word embeddings so as to generalize well in a multilingual setting. We consider different methods to quantify bias and different debiasing approaches for monolingual as well as multilingual settings. We demonstrate the significance of our bias-mitigation approach on downstream NLP applications. Our proposed methods establish state-of-the-art performance for debiasing multilingual embeddings for three Indian languages (Hindi, Bengali, and Telugu) in addition to English. We believe that our work will open up new opportunities in building unbiased downstream NLP applications that are inherently dependent on the quality of the word embeddings used.

Computation and Language

Word Familiarity and Frequency

Kumiko Tanaka-Ishii, Hiroshi Terada (2018)

Word frequency is assumed to correlate with word familiarity, but the strength of this correlation has not been thoroughly investigated. In this paper, we report on our analysis of the correlation between a word familiarity rating list obtained through a psycholinguistic experiment and the log-frequency obtained from various corpora of different kinds and sizes (up to the terabyte scale) for English and Japanese. Major findings are threefold: First, for a given corpus, familiarity is necessary for a word to achieve high frequency, but familiar words are not necessarily frequent. Second, correlation increases with the corpus data size. Third, a corpus of spoken language correlates better than one of written language. These findings suggest that cognitive familiarity ratings are correlated with frequency, but more highly with that of spoken rather than written language.

Computation and Language
