
Quantitative Entropy Study of Language Complexity

Published by: Weibing Deng
Publication date: 2016
Paper language: English





We study the entropy of Chinese and English texts, based on characters in the case of Chinese texts and on words for both languages. Significant differences are found between the languages and between the personal styles of different debating partners. The entropy analysis points in the direction of lower entropy, that is, of higher complexity. Such a text analysis could be applied to individuals of different styles, to a single individual at different ages, as well as to different groups of the population.
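As a rough illustration (not the authors' code) of what a character-based versus word-based entropy estimate looks like, the following Python sketch computes the Shannon entropy of a text from empirical unit frequencies; the whitespace tokenization and the toy sample sentence are assumptions made purely for illustration.

from collections import Counter
from math import log2

def shannon_entropy(units):
    """Entropy in bits per unit, estimated from empirical unit frequencies."""
    counts = Counter(units)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def word_entropy(text):
    # Word-based estimate: units are whitespace-separated tokens.
    return shannon_entropy(text.split())

def character_entropy(text):
    # Character-based estimate: units are individual characters, spaces removed.
    return shannon_entropy(list(text.replace(" ", "")))

sample = "the quick brown fox jumps over the lazy dog"  # toy example, not from the paper
print(f"word entropy:      {word_entropy(sample):.3f} bits per word")
print(f"character entropy: {character_entropy(sample):.3f} bits per character")

Comparing the two estimates on the same text is the kind of contrast the abstract describes between character-based (Chinese) and word-based (Chinese and English) analyses.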


Read also

59 - S. Wichmann 2008
A simple review by a linguist, citing many articles by physicists: quantitative methods, agent-based computer simulations, language dynamics, language typology, and historical linguistics.
Do mass media influence people's opinions of other countries? Using BERT, a deep neural network-based natural language processing model, we analyze a large corpus of 267,907 China-related articles published by The New York Times since 1970. We then compare our output from The New York Times to a longitudinal data set constructed from 101 cross-sectional surveys of the American public's views on China. We find that the reporting of The New York Times on China in one year explains 54% of the variance in American public opinion on China in the next. Our result confirms hypothesized links between media and public opinion and helps shed light on how mass media can influence public opinion of foreign countries.
445 - Edward Grefenstette 2013
This thesis is about the problem of compositionality in distributional semantics. Distributional semantics presupposes that the meanings of words are a function of their occurrences in textual contexts. It models words as distributions over these contexts and represents them as vectors in high dimensional spaces. The problem of compositionality for such models concerns itself with how to produce representations for larger units of text by composing the representations of smaller units of text. This thesis focuses on a particular approach to this compositionality problem, namely using the categorical framework developed by Coecke, Sadrzadeh, and Clark, which combines syntactic analysis formalisms with distributional semantic representations of meaning to produce syntactically motivated composition operations. This thesis shows how this approach can be theoretically extended and practically implemented to produce concrete compositional distributional models of natural language semantics. It furthermore demonstrates that such models can perform on par with, or better than, other competing approaches in the field of natural language processing. There are three principal contributions to computational linguistics in this thesis. The first is to extend the DisCoCat framework on the syntactic front and semantic front, incorporating a number of syntactic analysis formalisms and providing learning procedures allowing for the generation of concrete compositional distributional models. The second contribution is to evaluate the models developed from the procedures presented here, showing that they outperform other compositional distributional models present in the literature. The third contribution is to show how using category theory to solve linguistic problems forms a sound basis for research, illustrated by examples of work on this topic, that also suggest directions for future research.
End-to-end spoken language understanding (SLU) models are a class of model architectures that predict semantics directly from speech. Because of their input and output types, we refer to them as speech-to-interpretation (STI) models. Previous works have successfully applied STI models to targeted use cases, such as recognizing home automation commands; however, no study has yet addressed how these models generalize to broader use cases. In this work, we analyze the relationship between the performance of STI models and the difficulty of the use case to which they are applied. We introduce empirical measures of dataset semantic complexity to quantify the difficulty of the SLU tasks. We show that near-perfect performance metrics for STI models reported in the literature were obtained with datasets that have low semantic complexity values. We perform experiments where we vary the semantic complexity of a large, proprietary dataset and show that STI model performance correlates with our semantic complexity measures, such that performance increases as complexity values decrease. Our results show that it is important to contextualize an STI model's performance with the complexity values of its training dataset to reveal the scope of its applicability.
The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than $3 \times 10^9$ word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on 3 different levels of granularity (raw text, time series of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
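As a rough sketch of the coarsest SPGC granularity level mentioned above (counts of words), the snippet below reduces a raw text file to word-token counts; the filename and the crude lowercase alphabetic tokenizer are assumptions made for illustration and are not part of the SPGC pipeline itself.

import re
from collections import Counter

def word_counts(path):
    """Read a raw UTF-8 text file and return counts of lowercased word tokens."""
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z]+", f.read().lower())  # crude alphabetic tokenizer
    return Counter(tokens)

counts = word_counts("PG2701_raw.txt")  # hypothetical filename, not an actual SPGC path
for word, n in counts.most_common(10):
    print(word, n)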