Similarity of symbol frequency distributions with heavy tails

Added by Martin Gerlach

Publication date: 2015

Fields: Physics, Informatics, Engineering

Research language: English

Authors: Martin Gerlach, Francesc Font-Clos, Eduardo G. Altmann

Physics and Society Computation and Language Data Analysis Statistics and Probability

Quantifying the similarity between symbolic sequences is a traditional problem in information theory that requires comparing the frequencies of symbols in different sequences. In numerous modern applications, ranging from DNA and music to texts, the distribution of symbol frequencies is heavy-tailed (e.g., Zipf's law). The large number of low-frequency symbols in these distributions poses major difficulties for estimating the similarity between sequences; for example, it hinders an accurate finite-size estimation of entropies. Here we show analytically how the systematic (bias) and statistical (fluctuations) errors in these estimations depend on the sample size $N$ and on the exponent $\gamma$ of the heavy-tailed distribution. Our results are valid for the Shannon entropy ($\alpha=1$), its corresponding similarity measures (e.g., the Jensen-Shannon divergence), and also for measures based on the generalized entropy of order $\alpha$. For small $\alpha$, including $\alpha=1$, the errors decay more slowly than the $1/N$ decay observed in short-tailed distributions. For $\alpha$ larger than a critical value $\alpha^* = 1+1/\gamma \leq 2$, the $1/N$ decay is recovered. We show the practical significance of our results by quantifying the evolution of the English language over the last two centuries using a complete $\alpha$-spectrum of measures. We find that frequent words change more slowly than less frequent words and that $\alpha=2$ provides the most robust measure to quantify language change.
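The finite-size effect described in the abstract can be illustrated with a small numerical sketch. The abstract does not spell out its formulas, so the code below makes several assumptions: the generalized entropy of order $\alpha$ is taken in the Tsallis-type form $H_\alpha(p) = (\sum_i p_i^\alpha - 1)/(1-\alpha)$, the divergence is the Jensen-Shannon construction $H_\alpha((p+q)/2) - [H_\alpha(p) + H_\alpha(q)]/2$, and the heavy-tailed distribution is a truncated rank-frequency law $p_r \propto r^{-\gamma}$. The function names and parameter values are illustrative only and are not taken from the paper.

```python
import numpy as np


def generalized_entropy(p, alpha):
    """Entropy of order alpha (Tsallis-type form, assumed here); Shannon at alpha=1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # symbols with zero probability contribute nothing
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))  # Shannon entropy in nats
    return float((np.sum(p ** alpha) - 1.0) / (1.0 - alpha))


def generalized_jsd(p, q, alpha):
    """Jensen-Shannon-type divergence built from the order-alpha entropy."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # equal-weight mixture of the two distributions
    return generalized_entropy(m, alpha) - 0.5 * (
        generalized_entropy(p, alpha) + generalized_entropy(q, alpha)
    )


def zipf_distribution(n_symbols, gamma):
    """Truncated rank-frequency law p_r proportional to r**(-gamma) (assumed form)."""
    ranks = np.arange(1, n_symbols + 1, dtype=float)
    weights = ranks ** (-gamma)
    return weights / weights.sum()


def plugin_jsd(sample_a, sample_b, n_symbols, alpha):
    """Plug-in estimate: insert empirical frequencies into the divergence."""
    p_hat = np.bincount(sample_a, minlength=n_symbols) / len(sample_a)
    q_hat = np.bincount(sample_b, minlength=n_symbols) / len(sample_b)
    return generalized_jsd(p_hat, q_hat, alpha)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_symbols, gamma = 10_000, 1.5  # illustrative values, not from the paper
    p = zipf_distribution(n_symbols, gamma)
    for alpha in (1.0, 2.0):
        for n in (1_000, 10_000, 100_000):
            # Both samples come from the same distribution, so the true
            # divergence is zero and any positive estimate is pure bias.
            a = rng.choice(n_symbols, size=n, p=p)
            b = rng.choice(n_symbols, size=n, p=p)
            print(f"alpha={alpha}, N={n}: estimate={plugin_jsd(a, b, n_symbols, alpha):.3e}")
```

Because both samples are drawn from the same distribution, the printed estimates show only the finite-size error of the plug-in estimator. Comparing how quickly they shrink with $N$ for $\alpha=1$ and $\alpha=2$ gives a rough, single-realization picture of the behavior the abstract describes; averaging over many realizations would be needed to separate bias from fluctuations.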

Probability-turbulence divergence: A tunable allotaxonometric instrument for comparing heavy-tailed categorical distributions

P. S. Dodds, J. R. Minot, M. V. Arnold (2020)

Real-world complex systems often comprise many distinct types of elements as well as many more types of networked interactions between elements. When the relative abundances of types can be measured well, we further observe heavy-tailed categorical distributions for type frequencies. For the comparison of type frequency distributions of two systems, or of a system with itself at different points in time (a facet of allotaxonometry), a great range of probability divergences are available. Here, we introduce and explore 'probability-turbulence divergence', a tunable, straightforward, and interpretable instrument for comparing normalizable categorical frequency distributions. We model probability-turbulence divergence (PTD) after rank-turbulence divergence (RTD). While probability-turbulence divergence is more limited in application than rank-turbulence divergence, it is more sensitive to changes in type frequency. We build allotaxonographs to display probability turbulence, incorporating a way to visually accommodate zero probabilities for 'exclusive types', which are types that appear in only one system. We explore comparisons of example distributions taken from literature, social media, and ecology. We show how probability-turbulence divergence either explicitly or functionally generalizes many existing kinds of distances and measures, including, as special cases, $L^{(p)}$ norms, the Sørensen-Dice coefficient (the $F_1$ statistic), and the Hellinger distance. We discuss similarities with the generalized entropies of Rényi and Tsallis, and the diversity indices (or Hill numbers) from ecology. We close with thoughts on open problems concerning the optimization of the tuning of rank- and probability-turbulence divergence.

Physics and Society Information Retrieval Data Analysis Statistics and Probability

Problems with classification, hypothesis testing, and estimator convergence in the analysis of degree distributions in networks

Pim van der Hoorn, Ivan Voitalov, Remco van der Hofstad (2020)

In their recent work "Scale-free networks are rare", Broido and Clauset address the problem of analyzing degree distributions in networks in order to classify them as scale-free at different strengths of scale-freeness. Over the last two decades, a multitude of papers in network science have reported that the degree distributions in many real-world networks follow power laws. Such networks were then referred to as scale-free. However, due to the lack of a precise definition, the term has evolved to mean a range of different things, leading to confusion and contradictory claims regarding the scale-freeness of a given network. Recognizing this problem, the authors of "Scale-free networks are rare" try to fix it. They attempt to develop a versatile and statistically principled approach to remove the scale-free ambiguity accumulated in the network science literature. Although their paper presents a fair attempt to address this fundamental problem, we must bring attention to some important issues in it.

Physics and Society Social and Information Networks Data Analysis Statistics and Probability

Equilibrium (Zipf) and Dynamic (Grasseberg-Procaccia) method based analyses of human texts. A comparison of natural (english) and artificial (esperanto) languages

M. Ausloos (2008)

A comparison of two English texts by Lewis Carroll, one (Alice in Wonderland) also translated into Esperanto, the other (Through the Looking-Glass), is discussed in order to observe whether natural and artificial languages significantly differ from each other. One-dimensional time-series-like signals are constructed using only word frequencies (FTS) or word lengths (LTS). The data is studied through (i) a Zipf method for sorting out correlations in the FTS and (ii) a Grassberger-Procaccia (GP) technique based method for finding correlations in the LTS. Features are compared: different power laws are observed, with characteristic exponents for the ranking properties and the phase space attractor dimensionality. The Zipf exponent can take values much less than unity (ca. 0.50 or 0.30) depending on how a sentence is defined. This non-universality is conjectured to be a measure of the author's style. Moreover, the attractor dimension $r$ is a simple function of the so-called phase space dimension $n$, i.e., $r = n^{\lambda}$, with $\lambda = 0.79$. Such an exponent is also conjectured to be a measure of the author's creativity. However, even though there are quantitative differences between the original English text and its Esperanto translation, the qualitative differences are very minute, indicating that in this case the translation respects relatively well, along the lines of our analysis, the content of the author's writing.

Physics and Society Computation and Language Data Analysis Statistics and Probability

Superstatistics with cut-off tails for financial time series

Yusuke Uchiyama, Takanori Kadoya (2018)

Financial time series have been found to follow fat-tailed distributions. Further, an empirical probability distribution sometimes shows cut-off shapes in its tails. To describe this stylized fact, we incorporate the cut-off effect in superstatistics. We then confirm that the presented stochastic model is capable of describing the statistical properties of real financial time series. In addition, we present an option pricing formula with respect to superstatistics.

Statistical Finance Statistics Theory Data Analysis Statistics and Probability

Understanding the complexity of the Levy-walk nature of human mobility with a multi-scale cost/benefit model

Nicola Scafetta (2012)

Probability distributions of human displacements have been fit with exponentially truncated Levy flights or fat-tailed Pareto inverse power law probability distributions. Thus, people usually stay within a given location (for example, the city of residence), but with a non-vanishing frequency they visit nearby or far locations too. Herein, we show that an important empirical distribution of human displacements (range: from 1 to 1000 km) can be well fit by three consecutive Pareto distributions with simple integer exponents equal to 1, 2 and ($\gtrapprox$) 3. These three exponents correspond to three displacement range zones of about 1 km $\lesssim \Delta r \lesssim$ 10 km, 10 km $\lesssim \Delta r \lesssim$ 300 km and 300 km $\lesssim \Delta r \lesssim$ 1000 km, respectively. These three zones can be geographically and physically well determined as displacements within a city, visits to nearby cities that may occur within just one-day trips, and visits to far locations that may require multi-day trips. The incremental integer values of the three exponents can be easily explained with a three-scale mobility cost/benefit model for human displacements based on simple geometrical constraints. Essentially, people would divide the space into three major regions (close, medium and far distances) and would assume that the travel benefits are randomly/uniformly distributed mostly only within specific urban-like areas.

Physics and Society Statistical Mechanics Data Analysis Statistics and Probability
