Real-world complex systems often comprise many distinct types of elements as well as many more types of networked interactions between elements. When the relative abundances of types can be measured well, we further observe heavy-tailed categorical distributions for type frequencies. For the comparison of type frequency distributions of two systems, or of a system with itself at different points in time -- a facet of allotaxonometry -- a great range of probability divergences are available. Here, we introduce and explore `probability-turbulence divergence', a tunable, straightforward, and interpretable instrument for comparing normalizable categorical frequency distributions. We model probability-turbulence divergence (PTD) after rank-turbulence divergence (RTD). While probability-turbulence divergence is more limited in application than rank-turbulence divergence, it is more sensitive to changes in type frequency. We build allotaxonographs to display probability turbulence, incorporating a way to visually accommodate zero probabilities for `exclusive types', which are types that appear in only one system. We explore comparisons of example distributions taken from literature, social media, and ecology. We show how probability-turbulence divergence either explicitly or functionally generalizes many existing kinds of distances and measures, including, as special cases, $L^{(p)}$ norms, the Sørensen-Dice coefficient (the $F_1$ statistic), and the Hellinger distance. We discuss similarities with the generalized entropies of Rényi and Tsallis, and the diversity indices (or Hill numbers) from ecology. We close with thoughts on open problems concerning the optimization of the tuning of rank- and probability-turbulence divergence.
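As an illustration of the instrument just described, the following is a minimal Python sketch of an unnormalized probability-turbulence divergence, patterned on a core sum of terms of the form $\left| p_{\tau,1}^{\alpha} - p_{\tau,2}^{\alpha} \right|^{1/(\alpha+1)}$ over types $\tau$. The normalization prefactor is omitted, $\alpha > 0$ is assumed, and the function name ptd and the toy distributions are our own choices, not a reference implementation from the paper.

    def ptd(p1, p2, alpha=1.0):
        """Unnormalized probability-turbulence divergence, assuming alpha > 0.

        p1, p2: dicts mapping types to probabilities; exclusive types
        (those appearing in only one system) enter with probability zero.
        """
        types = set(p1) | set(p2)
        total = 0.0
        for tau in types:
            term1 = p1.get(tau, 0.0) ** alpha
            term2 = p2.get(tau, 0.0) ** alpha
            total += abs(term1 - term2) ** (1.0 / (alpha + 1.0))
        return total

    # Toy comparison: 'b' and 'c' are exclusive types on either side.
    print(ptd({"a": 0.6, "b": 0.4}, {"a": 0.5, "c": 0.5}, alpha=1.0))

As with rank-turbulence divergence, tuning $\alpha$ shifts emphasis between rare and common types: small $\alpha$ boosts the relative contribution of low-probability types, while large $\alpha$ concentrates the divergence on the most frequent ones.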
Complex systems often comprise many kinds of components which vary over many orders of magnitude in size: Populations of cities in countries, individual and corporate wealth in economies, species abundance in ecologies, word frequency in natural language, and node degree in complex networks. Comparisons of component size distributions for two complex systems---or a system with itself at two different time points---generally employ information-theoretic instruments, such as Jensen-Shannon divergence. We argue that these methods lack transparency and adjustability, and should not be applied when component probabilities are non-sensible or are problematic to estimate. Here, we introduce `allotaxonometry' along with `rank-turbulence divergence', a tunable instrument for comparing any two (Zipfian) ranked lists of components. We analytically develop our rank-based divergence in a series of steps, and then establish a rank-based allotaxonograph which pairs a map-like histogram for rank-rank pairs with an ordered list of components according to divergence contribution. We explore the performance of rank-turbulence divergence for a series of distinct settings including: Language use on Twitter and in books, species abundance, baby name popularity, market capitalization, performance in sports, mortality causes, and job titles. We provide a series of supplementary flipbooks which demonstrate the tunability and storytelling power of rank-based allotaxonometry.
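A companion Python sketch for rank-turbulence divergence, again unnormalized: the sum over types of $\left| r_{\tau,1}^{-\alpha} - r_{\tau,2}^{-\alpha} \right|^{1/(\alpha+1)}$ follows the paper's construction, while the handling of types missing from one list (assigning them a rank one past that list's last rank) is a simplification of the tied-rank treatment used there.

    def rtd(ranks1, ranks2, alpha=1.0):
        """Unnormalized rank-turbulence divergence over two rank dicts."""
        fill1 = max(ranks1.values()) + 1   # rank for types absent from list 1
        fill2 = max(ranks2.values()) + 1   # rank for types absent from list 2
        total = 0.0
        for tau in set(ranks1) | set(ranks2):
            inv1 = ranks1.get(tau, fill1) ** -alpha
            inv2 = ranks2.get(tau, fill2) ** -alpha
            total += abs(inv1 - inv2) ** (1.0 / (alpha + 1.0))
        return total

    # Toy comparison of two three-element ranked lists.
    print(rtd({"x": 1, "y": 2, "z": 3}, {"y": 1, "x": 2, "w": 3}))

Because only ranks enter, the instrument remains usable precisely in the regime flagged above, where component probabilities are non-sensible or hard to estimate.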
Quantifying the similarity between symbolic sequences is a traditional problem in information theory which requires comparing the frequencies of symbols in different sequences. In numerous modern applications, ranging from DNA and music to texts, the distribution of symbol frequencies is characterized by heavy-tailed distributions (e.g., Zipf's law). The large number of low-frequency symbols in these distributions poses major difficulties for the estimation of the similarity between sequences; e.g., they hinder an accurate finite-size estimation of entropies. Here we show analytically how the systematic (bias) and statistical (fluctuation) errors in these estimations depend on the sample size $N$ and on the exponent $\gamma$ of the heavy-tailed distribution. Our results are valid for the Shannon entropy $(\alpha=1)$, its corresponding similarity measures (e.g., the Jensen-Shannon divergence), and also for measures based on the generalized entropy of order $\alpha$. For small values of $\alpha$, including $\alpha=1$, the errors decay more slowly than the $1/N$ decay observed in short-tailed distributions. For $\alpha$ larger than a critical value $\alpha^* = 1 + 1/\gamma \leq 2$, the $1/N$ decay is recovered. We show the practical significance of our results by quantifying the evolution of the English language over the last two centuries using a complete $\alpha$-spectrum of measures. We find that frequent words change more slowly than less frequent words and that $\alpha=2$ provides the most robust measure to quantify language change.
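The finite-size effects described above are easy to reproduce numerically. The following Python sketch computes a naive plug-in estimate of the order-$\alpha$ (Rényi-type) generalized entropy from samples of a Zipf distribution; the support size, the exponent $\gamma = 1$, and the sample sizes are illustrative assumptions, and the plug-in estimator is the simplest choice, not the paper's analysis.

    import numpy as np

    def renyi_plugin(counts, alpha):
        """Plug-in estimator of the order-alpha generalized entropy from counts."""
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]
        if np.isclose(alpha, 1.0):                  # Shannon limit, alpha -> 1
            return -np.sum(p * np.log(p))
        return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

    rng = np.random.default_rng(0)
    V = 10_000                                      # support size (assumption)
    p_true = np.arange(1, V + 1) ** -1.0            # Zipf with gamma = 1
    p_true /= p_true.sum()

    exact = renyi_plugin(p_true, 0.5)
    for N in (10**3, 10**4, 10**5):
        sample = rng.choice(V, size=N, p=p_true)
        counts = np.bincount(sample, minlength=V)
        print(N, renyi_plugin(counts, 0.5) - exact) # slow decay for small alpha

Repeating this over a grid of $\alpha$ values should exhibit the slow error decay below $\alpha^* = 1 + 1/\gamma$ (here $\alpha^* = 2$) and the $1/N$ behavior above it.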
Intervals between discrete events representing human activities, as well as other types of events, often obey heavy-tailed distributions, and their impacts on collective dynamics on networks, such as contagion processes, have been intensively studied. Empirical evidence supports that such heavy-tailed distributions are present for inter-event times associated with both individual nodes and individual edges in networks. However, the simultaneous presence of heavy-tailed distributions of inter-event times for nodes and edges is a non-trivial phenomenon, and its origin has been elusive. In the present study, we propose a generative model and its variants to explain this phenomenon. We assume that each node independently transits between a high-activity and a low-activity state according to a continuous-time two-state Markov process and that, in the main model, events on an edge occur at a high rate if and only if both end nodes of the edge are in the high-activity state. In other words, two nodes interact frequently only when both nodes prefer to interact with others. The model produces distributions of inter-event times for both individual nodes and edges that resemble heavy-tailed distributions across some scales. It also produces positive correlations between consecutive inter-event times, which is another stylized observation in empirical data of human activity. We expect that our modeling framework provides a useful benchmark for investigating dynamics on temporal networks driven by non-Poissonian event sequences.
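A minimal discrete-time rendering of the main model for a single edge, with step dt approximating the continuous-time dynamics described above; all rate values here are illustrative assumptions rather than the paper's parameters.

    import numpy as np

    rng = np.random.default_rng(1)
    dt, T = 0.01, 10_000.0
    up, down = 0.05, 0.5            # low->high, high->low switching rates (assumed)
    rate_hi, rate_lo = 5.0, 0.0     # edge event rates: both nodes high vs. otherwise

    state = np.zeros(2, dtype=int)  # activity states of the edge's two end nodes
    events = []
    for step in range(int(T / dt)):
        flips = rng.random(2) < np.where(state == 1, down, up) * dt
        state[flips] ^= 1           # independent two-state Markov switching
        rate = rate_hi if state.all() else rate_lo
        if rng.random() < rate * dt:
            events.append(step * dt)

    inter_event = np.diff(events)   # inter-event times on the edge
    print(len(events), inter_event.max())

Histogramming inter_event on doubly logarithmic axes should show the heavy-tail-like stretch across some scales that the abstract describes, driven by the intermittent both-high periods.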
We offer a survey of recent results on covariance estimation for heavy-tailed distributions. By unifying ideas scattered in the literature, we propose user-friendly methods that facilitate practical implementation. Specifically, we introduce element-wise and spectrum-wise truncation operators, as well as their $M$-estimator counterparts, to robustify the sample covariance matrix. In contrast to the classical notion of robustness, which is characterized by the breakdown property, we focus on tail robustness, which is evidenced by the connection between nonasymptotic deviation and confidence level. The key observation is that the estimators need to adapt to the sample size, the dimensionality of the data, and the noise level to achieve an optimal tradeoff between bias and robustness. Furthermore, to facilitate practical use, we propose data-driven procedures that automatically calibrate the tuning parameters. We demonstrate their applications to a series of structured models in high dimensions, including bandable and low-rank covariance matrices and sparse precision matrices. Numerical studies lend strong support to the proposed methods.
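As a pointer to what element-wise truncation means in practice, here is a minimal Python sketch: each entry of the per-sample outer products is clipped at a level $\tau$ before averaging. The calibration $\tau \sim \sqrt{n/\log d}$ (up to the noise scale) mirrors the bias-robustness tradeoff described above, while the fixed constant and the data-generating choice are illustrative assumptions standing in for the survey's data-driven calibration.

    import numpy as np

    def truncated_cov(X, tau):
        """Element-wise truncated covariance for centered data X of shape (n, d)."""
        prods = np.einsum("ni,nj->nij", X, X)   # per-sample outer products
        return np.clip(prods, -tau, tau).mean(axis=0)

    rng = np.random.default_rng(2)
    n, d = 500, 20
    X = rng.standard_t(df=3, size=(n, d))       # heavy-tailed observations
    tau = np.sqrt(n / np.log(d))                # illustrative calibration level
    S = truncated_cov(X - X.mean(axis=0), tau)
    print(S.shape)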
Large deviation theory and instanton calculus for stochastic systems are widely used to gain insight into the evolution and probability of rare events. At its core lies the realization that rare events are, under the right circumstances, dominated by their least unlikely realization. Their computation through a saddle-point approximation of the path integral for the corresponding stochastic field theory then reduces an inefficient stochastic sampling problem to a deterministic optimization problem: finding the path of smallest action, the instanton. In the presence of heavy tails, though, standard algorithms to compute the instanton critically fail to converge. The reason for this failure is the divergence of the scaled cumulant generating function (CGF) due to a non-convex large deviation rate function. We propose a solution to this problem by convexifying the rate function through a nonlinear reparametrization of the observable, which allows us to compute instantons even in the presence of super-exponential or algebraic tail decay. The approach is generalizable to other situations where the existence of the CGF is required, such as exponential tilting in importance sampling for Monte Carlo algorithms. We demonstrate the proposed formalism by applying it to rare events in several stochastic systems with heavy tails, including extreme power spikes in fiber optics induced by soliton formation.
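A toy numerical illustration of the convexification step, in Python: for a non-convex rate function $I(a) = \sqrt{|a|}$ (mimicking stretched-exponential tails), the Legendre transform $\sup_a [k a - I(a)]$ diverges for every $k \neq 0$, but after the reparametrization $z = g(a) = \mathrm{sign}(a)\sqrt{|a|}$ the pushed-forward rate function $I(g^{-1}(z)) = |z|$ is convex and its transform is finite for $|k| < 1$. The specific $g$ and rate function are our own illustrative choices in the spirit of the approach, not the paper's recipe for any particular system.

    import numpy as np

    a = np.linspace(-50.0, 50.0, 20001)
    rate_a = np.sqrt(np.abs(a))              # non-convex rate function I(a)
    z = np.sign(a) * np.sqrt(np.abs(a))      # reparametrized observable z = g(a)
    rate_z = np.abs(z)                       # pushed-forward rate, now convex
    for k in (0.5, 0.9):
        cgf_a = np.max(k * a - rate_a)       # grows without bound as the grid widens
        cgf_z = np.max(k * z - rate_z)       # stays finite for |k| < 1
        print(k, cgf_a, cgf_z)

With the transform finite after reparametrization, saddle-point machinery such as exponential tilting becomes available again, which is the mechanism the abstract exploits for instanton computations.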