
Distinct word length frequencies: distributions and symbol entropies

Published by: Reginald Smith
Publication date: 2012
Language: English
Author: Reginald D. Smith





The distribution of frequency counts of distinct words by length in a language's vocabulary is analyzed using two methods. The first examines the empirical distributions of several languages and derives a distribution that reasonably explains the number of distinct words as a function of length. From it, the frequency count, mean word length, and variance of word length can be derived from the marginal probabilities of letters and spaces. The second, based on information theory, demonstrates that the conditional entropies can also be used to estimate the frequency of distinct words of a given length in a language. In addition, it is shown how these techniques can be applied to estimate higher-order entropies using vocabulary word length.
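As a rough illustration of the empirical quantities the abstract refers to, the following minimal Python sketch tallies the number of distinct words at each length and computes the mean and variance of word length over that vocabulary. It is not the paper's derivation: the function names, the tokenization rule (whitespace splitting with surrounding punctuation stripped), and the sample sentence are assumptions made for this example only.

```python
from collections import Counter


def distinct_length_counts(text):
    """Count distinct words (types, not tokens) by length.

    Assumption: words are whitespace-separated, lowercased, and stripped
    of surrounding punctuation. A real study would use a dictionary or corpus.
    """
    tokens = (tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    vocabulary = {w for w in tokens if w}
    return Counter(len(w) for w in vocabulary)


def mean_and_variance(counts):
    """Mean and population variance of word length over distinct words."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty vocabulary")
    mean = sum(n * c for n, c in counts.items()) / total
    var = sum(c * (n - mean) ** 2 for n, c in counts.items()) / total
    return mean, var


# Hypothetical sample text, for illustration only.
sample = "the cat and the dog saw a bird and a fish"
counts = distinct_length_counts(sample)
mean, var = mean_and_variance(counts)
```

Here repeated tokens such as "the" and "and" are counted once, since the quantity of interest is the number of distinct words at each length rather than how often each word occurs in running text.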




Read also

This note derives the various forms of entropy of systems subject to Olbert distributions (generalized Lorentzian probability distributions known as $\kappa$-distributions), which are frequently observed, particularly in high-temperature plasmas. The general expression of the partition function in such systems is given as well, in a form similar to the Boltzmann-Gibbs probability distribution, including a possible exponential high-energy truncation. We find the representation of the mean energy as a function of probability, and provide the implicit form of Olbert (Lorentzian) entropy as well as its high-temperature limit. The relation to phase space density of states is obtained. We then find the entropy as a function of probability, an expression which is fundamental to statistical mechanics and here to its Olbertian version. Lorentzian systems, through internal collective interactions, cause correlations which add to the entropy. Fermi systems do not obey Olbert statistics, while Bose systems might at temperatures sufficiently far from zero.
With the increasing popularity and availability of digital text data, the authorship of digital texts cannot be taken for granted, owing to the ease of copying and parsing. This paper presents a new text style analysis called natural frequency zoned word distribution analysis (NFZ-WDA), and then a basic authorship attribution scheme and an open authorship attribution scheme for digital texts based on the analysis. NFZ-WDA is based on the observation that all authors leave distinct intrinsic word usage traces on the texts they write, and that these intrinsic styles can be identified and employed to analyze authorship. The intrinsic word usage styles can be estimated through the analysis of word distribution within a text, which goes beyond normal word frequency analysis and can be expressed as: which groups of words are used in the text; how frequently each group of words occurs; and how the occurrences of each group of words are distributed in the text. Next, the basic authorship attribution scheme and the open authorship attribution scheme provide solutions for both closed and open authorship attribution problems. Through analysis and extensive experimental studies, this paper demonstrates the efficiency of the proposed method for authorship attribution.
A measure called Physical Complexity is established and calculated for a population of sequences, based on statistical physics, automata theory, and information theory. It is a measure of the quantity of information in an organism's genome. It is based on Shannon's entropy, measuring the information in a population evolved in its environment, by using entropy to estimate the randomness in the genome. It is calculated from the difference between the maximal entropy of the population and the actual entropy of the population when in its environment, estimated by counting the number of fixed loci in the sequences of a population. Up to now, Physical Complexity has only been formulated for populations of sequences with the same length. Here, we investigate an extension to support variable-length populations. We then build upon this to construct a measure for the efficiency of information storage, which we later use in understanding clustering within populations. Finally, we investigate our extended Physical Complexity through simulations, showing it to be consistent with the original.
In a recent paper [\textit{M. Cristelli, A. Zaccaria and L. Pietronero, Phys. Rev. E 85, 066108 (2012)}], Cristelli \textit{et al.} analysed the relation between skewness and kurtosis for complex dynamical systems and identified two power-law regimes of non-Gaussianity, one of which scales with an exponent of 2 and the other with $4/3$. Finally, the authors concluded that the observed relation is a universal fact in complex dynamical systems. Here, we test the proposed universal relation between skewness and kurtosis with a large amount of synthetic data and show that in fact it is not universal and originates only from the small number of data points in the data sets considered. The proposed relation is tested using two different non-Gaussian distributions, namely $q$-Gaussian and Lévy distributions. We clearly show that this relation disappears for sufficiently large data sets provided that the second moment of the distribution is finite. We find that, contrary to the claims of Cristelli \textit{et al.} regarding a power-law scaling regime, kurtosis saturates to a single value, which is of course different from the Gaussian case ($K=3$), as the number of data points is increased. On the other hand, if the second moment of the distribution is infinite, then the kurtosis seems never to converge to a single value. The converged kurtosis value for the finite second moment distributions, and the number of data points needed to reach this value, depend on the deviation of the original distribution from the Gaussian case. We also argue that using kurtosis to compare distributions and decide which one deviates more from the Gaussian can lead to incorrect results even for finite second moment distributions for small data sets, whereas it is totally misleading for infinite second moment distributions, where the difference depends on $N$ for all finite $N$.
Social impacts and degrees of organization inherent to opinion formation for interacting agents on networks present interesting questions of general interest from physics to sociology. We present a quantitative analysis of a case implying an evolving small-size network, i.e. that inherent to the ongoing debate between modern creationists (most are Intelligent Design (ID) proponents (IDP)) and Darwin's theory of Evolution Defenders (DED). This study is carried out by analyzing the structural properties of the citation network unfolded in recent decades by published works belonging to members of the two communities. With the aim of capturing the dynamical aspects of the interaction between the IDP and DED groups, we focus on two key quantities, namely, the \textit{degree of activity} of each group and the corresponding \textit{degree of impact} on the intellectual community at large. A representative measure of the former is provided by the \textit{rate of production of publications} (RPP), whilst the latter can be assimilated to the \textit{rate of increase in citations} (RIC). These quantities are determined, respectively, by the slope of the time series obtained for the number of publications accumulated per year and by the slope of a similar time series obtained for the corresponding citations. The results indicate that in this case, the dynamics can be seen as geared by triggered or damped competition. The network is a specific example of marked heterogeneity in exchange of information activity in and between the communities, particularly demonstrated through the nodes having a high connectivity degree, i.e. opinion leaders.