
Understanding Zipf's law of word frequencies through sample-space collapse in sentence formation

Published by Bernat Corominas-Murtra BCM
Publication date: 2014
Language of the paper: English





The formation of sentences is a highly structured and history-dependent process. The probability of using a specific word in a sentence strongly depends on the history of word usage earlier in that sentence. We study a simple history-dependent model of text generation, assuming that the sample-space of word usage reduces, on average, along sentence formation. We first show that the model explains the approximate Zipf's law found in word frequencies as a direct consequence of sample-space reduction. We then empirically quantify the amount of sample-space reduction in the sentences of ten famous English books by analysing the corresponding word-transition tables, which capture which words can follow any given word in a text. We find a highly nested structure in these transition tables and show that this 'nestedness' is tightly related to the power-law exponents of the observed word frequency distributions. With the proposed model it is possible to understand that the nestedness of a text can be the origin of the actual scaling exponent, and that deviations from the exact Zipf's law can be understood by variations of the degree of nestedness on a book-by-book basis. On a theoretical level we are able to show that in the case of weak nesting, Zipf's law breaks down in a fast transition. Unlike previous attempts to understand Zipf's law in language, the sample-space reducing model is not based on assumptions of multiplicative, preferential, or self-organised critical mechanisms behind language formation, but simply uses the empirically quantifiable parameter 'nestedness' to understand the statistics of word frequencies.
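As a rough illustration of what a word-transition table records, the following minimal sketch maps every word to the set of words that directly follow it within a sentence. The function name `transition_table`, the whitespace tokenisation, and the lowercasing are assumptions made for this example; the abstract does not specify the authors' preprocessing or how they compute nestedness from the table.

```python
from collections import defaultdict


def transition_table(sentences):
    """Map each word to the set of words observed directly after it.

    Hypothetical helper: tokenisation is a plain lowercase whitespace
    split, and transitions never cross sentence boundaries.
    """
    followers = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with its immediate successor in the sentence.
        for current, nxt in zip(words, words[1:]):
            followers[current].add(nxt)
    return dict(followers)


table = transition_table(["the cat sat", "the dog sat down"])
print(sorted(table["the"]))  # ['cat', 'dog']
print(sorted(table["sat"]))  # ['down']
```

In such a table, a text would be strongly nested if the follower sets of many words are contained in the follower sets of others; how that containment is scored is left open here.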


Read also

It has been shown recently that a specific class of path-dependent stochastic processes, which reduce their sample space as they unfold, lead to exact scaling laws in frequency and rank distributions. Such Sample Space Reducing processes (SSRP) offer an alternative mechanism to understand the emergence of scaling in countless processes. The corresponding power-law exponents were shown to be related to noise levels in the process. Here we show that the emergence of scaling is not limited to the simplest SSRPs, but holds for a huge domain of stochastic processes that are characterized by non-uniform prior distributions. We demonstrate mathematically that in the absence of noise the scaling exponents converge to $-1$ (Zipf's law) for almost all prior distributions. As a consequence it becomes possible to fully understand targeted diffusion on weighted directed networks and its associated scaling laws in node visit distributions. The presence of cycles can be properly interpreted as playing the same role as noise in SSRPs and, accordingly, determines the scaling exponents. The result that Zipf's law emerges as a generic feature of diffusion on networks, regardless of its details, and that the exponent of visiting times is related to the amount of cycles in a network could be relevant for a series of applications in traffic, transport and supply chain management.
History-dependent processes are ubiquitous in natural and social systems. Many such stochastic processes, especially those that are associated with complex systems, become more constrained as they unfold, meaning that their sample-space, or their set of possible outcomes, reduces as they age. We demonstrate that these sample-space reducing (SSR) processes necessarily lead to Zipf's law in the rank distributions of their outcomes. We show that by adding noise to SSR processes the corresponding rank distributions remain exact power-laws, $p(x)\sim x^{-\lambda}$, where the exponent directly corresponds to the mixing ratio of the SSR process and noise. This allows us to give a precise meaning to the scaling exponent in terms of the degree to which a given process reduces its sample-space as it unfolds. Noisy SSR processes further allow us to explain a wide range of scaling exponents in frequency distributions, ranging from $\alpha = 2$ to $\infty$. We discuss several applications showing how SSR processes can be used to understand Zipf's law in word frequencies, and how they are related to diffusion processes in directed networks, or ageing processes such as in fragmentation processes. SSR processes provide a new alternative to understand the origin of scaling in complex systems without recourse to multiplicative, preferential, or self-organised critical processes.
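The mechanism can be sketched with a small simulation. This is a toy version under stated assumptions, not the authors' code: states are labelled 1..N, an SSR step jumps uniformly to a strictly lower state, a noise step jumps uniformly to any state, and the process also restarts uniformly after reaching state 1. The name `simulate_ssr`, the parameter `mix` (playing the role of the mixing ratio $\lambda$), and the uniform restart rule are all choices made for this example.

```python
import random
from collections import Counter


def simulate_ssr(n_states, n_steps, mix=1.0, seed=0):
    """Count state visits of a noisy sample-space reducing process.

    Assumed toy rules (not taken from the abstract):
      - with probability `mix`, jump uniformly to a strictly lower state;
      - otherwise, or when already at state 1, jump uniformly to any
        state in 1..n_states (noise step or restart).
    """
    rng = random.Random(seed)
    counts = Counter()
    state = n_states
    for _ in range(n_steps):
        if state == 1 or rng.random() >= mix:
            state = rng.randint(1, n_states)   # noise step or restart
        else:
            state = rng.randint(1, state - 1)  # sample-space reduction
        counts[state] += 1
    return counts


counts = simulate_ssr(n_states=100, n_steps=200_000, mix=1.0)
for i in (1, 2, 4, 8):
    # Without noise, i * counts[i] should be roughly constant (p(i) ~ 1/i).
    print(i, counts[i], i * counts[i])
```

Under these rules the stationary visit probabilities satisfy $p(i)/p(i+1) = 1 + \lambda/i$, which gives $p(i) \propto 1/i$ for $\lambda = 1$ and approaches $i^{-\lambda}$ for large $i$ otherwise, so lowering `mix` flattens the distribution.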
We propose hypotheses describing the empirical finding of an association between the exponents of urban GDP scaling and Zipf's law for cities. These hypotheses represent various combinations of directional or reciprocal causal links between the two phenomena and include inter- and intra-city processes. Future theories and models can be motivated with and categorized according to these hypotheses. This paper intends to stimulate the discussion around the processes behind these phenomena and pave the way to a Unified Urban Theory.
Zipf's law describes the empirical size distribution of the components of many systems in natural and social sciences and humanities. We show, by solving a statistical model, that Zipf's law co-occurs with the maximization of the diversity of the component sizes. The law ruling the increase of such diversity with the total dimension of the system is derived and its relation with Heaps' law is discussed. As an example, we show that our analytical results compare very well with linguistics datasets.
Liang Liu, Lili Yu (2015)
The hierarchy of classical Chinese poetry has been broadly acknowledged by a number of studies in Chinese literature. However, quantitative investigations of the evolutionary linkages of classical Chinese poetry are limited. The primary goal of this study is to provide quantitative evidence of the evolutionary linkages, with emphasis on character usage, among different period genres of classical Chinese poetry. Specifically, various statistical analyses are performed to find and compare the patterns of character usage in the poems of nine period genres, including shi jing, chu ci, Han shi, Jin shi, Tang shi, Song shi, Yuan shi, Ming shi, and Qing shi. The result of the analysis indicates that each of the nine period genres has unique patterns of character usage, with some Chinese characters that are preferably used in the poems of a particular period genre. The analysis of the general pattern of character preference implies a decreasing trend in the use of Chinese characters that rarely occur in modern Chinese literature along the timeline of dynastic types of classical Chinese poetry. The phylogenetic analysis based on the distance matrix suggests that the evolutionary linkages of different types of classical Chinese poetry are congruent with their chronological order, suggesting that character frequencies contain phylogenetic information that is useful for inferring evolutionary linkages among various types of classical Chinese poetry. The estimated phylogenetic tree identifies four groups: (shi jing, chu ci), (Han shi, Jin shi), (Tang shi, Song shi, Yuan shi), and (Ming shi, Qing shi). The statistical analyses conducted in this study can be generalized to analyze the data sets of general Chinese literature. Such analyses can provide quantitative insights about the evolutionary linkages of general Chinese literature.