It is well known that online behavior is long-tailed, with most cascaded actions being short and a few being very long. A prominent drawback of generative models for online events is their inability to describe unpopular items well. This work addresses this shortcoming by proposing dual mixture self-exciting processes that learn jointly from groups of cascades. We start from the observation that the maximum likelihood estimates for content virality and influence decay are separable in a Hawkes process. Our proposed model, which combines a Borel mixture model and a kernel mixture model, then jointly models the unfolding of a heterogeneous set of cascades. When applied to cascades of the same online item, the model directly characterizes the item's spread dynamics and supplies interpretable quantities, such as content virality and content influence decay, as well as methods for predicting final content popularity. On two retweet cascade datasets (one relating to YouTube videos and the other to controversial news articles), we show that our models capture the differences between online items at the granularity of items, publishers and categories. In particular, we are able to distinguish between far-right, conspiracy, controversial and reputable online news articles based on how they diffuse through social media, achieving an F1 score of 0.945. On holdout datasets, we show that the dual mixture model generalizes better for reshare diffusion cascades, especially unpopular ones, and yields accurate popularity predictions for online items.
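As a rough illustration of why a Borel distribution is a natural building block here: in a subcritical branching process with Poisson offspring and branching factor mu, the total cascade size follows a Borel distribution, and the maximum likelihood estimate of mu depends only on the cascade sizes. The sketch below fits a single Borel component, not the mixture described in the abstract; the function names and the sample sizes are hypothetical.

```python
import math
from typing import Sequence


def borel_log_pmf(n: int, mu: float) -> float:
    """Log-probability that a cascade has total size n under Borel(mu), 0 < mu < 1."""
    if n < 1:
        return float("-inf")
    # P(N = n) = exp(-mu * n) * (mu * n)^(n - 1) / n!
    return -mu * n + (n - 1) * math.log(mu * n) - math.lgamma(n + 1)


def fit_branching_factor(sizes: Sequence[int]) -> float:
    """Closed-form MLE of mu from observed cascade sizes: 1 - 1 / mean(sizes)."""
    total = sum(sizes)
    return (total - len(sizes)) / total


# Hypothetical cascade sizes (each cascade counts its initial post).
sizes = [1, 1, 2, 1, 5, 1, 3, 1, 1, 4]
mu_hat = fit_branching_factor(sizes)

# Expected final size of a new cascade under the fitted model.
expected_size = 1.0 / (1.0 - mu_hat)
log_likelihood = sum(borel_log_pmf(n, mu_hat) for n in sizes)
```

Because the estimate uses sizes only, event timestamps can be reserved for fitting the decay kernel, which is one way to read the separability the abstract mentions.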
Epidemic models and self-exciting processes are two types of models used to describe diffusion phenomena online and offline. These models were originally developed in different scientific communities, and their commonalities are under-explored. This work establishes, for the first time, a general connection between the two model classes via three new mathematical components. The first is a generalized version of the stochastic Susceptible-Infected-Recovered (SIR) model with arbitrary recovery time distributions; the second is the relationship between the (latent and arbitrary) recovery time distribution, the recovery hazard function, and the infection kernel of self-exciting processes; the third comprises methods for simulating, fitting, evaluating and predicting the generalized process. On three large Twitter diffusion datasets, we conduct goodness-of-fit tests and holdout log-likelihood evaluation of self-exciting processes with three infection kernels: exponential, power-law and Tsallis Q-exponential. We show that the modeling performance of the infection kernels varies with the temporal structure of the diffusions and with user behavior, such as the likelihood of being a bot. We further improve popularity prediction by combining two models that the goodness-of-fit tests identify as complementary.
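For readers unfamiliar with self-exciting processes, the following minimal sketch simulates a single cascade from an ordinary Hawkes process with an exponential infection kernel using Ogata's thinning algorithm. It is not the generalized SIR-linked process of the abstract; the parameterization (branching factor `n_star`, decay rate `theta`) and the function name are assumptions chosen for illustration.

```python
import math
import random
from typing import List


def simulate_cascade(n_star: float, theta: float, t_max: float, seed: int = 0) -> List[float]:
    """Simulate event times on [0, t_max) with one initial event at t = 0.

    Assumed intensity: lambda(t) = sum_i n_star * theta * exp(-theta * (t - t_i)).
    """
    rng = random.Random(seed)
    events = [0.0]
    t = 0.0

    def intensity(s: float) -> float:
        return sum(n_star * theta * math.exp(-theta * (s - ti)) for ti in events)

    while True:
        # The intensity only decays between events, so its current value
        # is a valid upper bound until the next accepted event.
        lam_bar = intensity(t)
        if lam_bar <= 1e-12:
            break
        t += rng.expovariate(lam_bar)
        if t >= t_max:
            break
        # Accept the candidate with probability lambda(t) / lam_bar.
        if rng.random() * lam_bar <= intensity(t):
            events.append(t)
    return events


cascade = simulate_cascade(n_star=0.5, theta=1.0, t_max=50.0, seed=42)
print(len(cascade), "events")
```

Swapping in a power-law or other kernel would only change the `intensity` function, provided the kernel is non-increasing so that the thinning bound remains valid.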
Modeling online discourse dynamics is a core activity in understanding the spread of information, both offline and online, and emergent online behavior. There is currently a disconnect between the practitioners of online social media analysis (usually social, political and communication scientists) and the availability of tools capable of examining online user discussions. Here we present evently, a tool for modeling online reshare cascades, and particularly retweet cascades, using self-exciting processes. It provides a comprehensive set of functionalities for processing raw data from Twitter public APIs, modeling the temporal dynamics of processed retweet cascades and characterizing online users with a wide range of diffusion measures. The tool is designed for researchers with varying levels of computing expertise, and it includes tutorials and detailed documentation. We illustrate the usage of evently with an end-to-end analysis of online user behavior on a topical dataset relating to COVID-19. We show that, by characterizing users solely based on how their content spreads online, we can distinguish influential users from online bots.
We propose a new multifractional stochastic process which allows for self-exciting behavior, similar to what can be seen, for example, in earthquakes and other self-organizing phenomena. The process can be seen as an extension of multifractional Brownian motion in which the Hurst function depends on the past of the process. We define the process through a stochastic Volterra equation, prove existence and uniqueness of its solution, and give bounds on the p-order moments for all p >= 1. We show convergence of an Euler-Maruyama scheme for the process and give the rate of convergence, which depends on the self-exciting dynamics of the process. Moreover, we discuss different applications of this process and give examples of different functions for modeling self-exciting behavior.
The paper discusses multivariate self- and cross-exciting processes. We define a class of multivariate point processes via their corresponding stochastic intensity processes, which are driven by stochastic jumps. Essentially, there is a jump in an intensity process whenever the corresponding point process records an event. A feature of our modelling class is that it records not only the occurrence of each jump but also its magnitude. This allows large jumps to influence the intensity to a larger degree than smaller jumps. We give conditions which guarantee that the process is stable, in the sense that it does not explode, and provide a detailed discussion of when the subclass of linear models is stable. Finally, we fit our model to financial time series data from the S&P 500 and Nikkei 225 indices. We conclude that a nonlinear variant from our modelling class fits the data best. This supports the observation that in times of crisis (high intensity) jumps tend to arrive in clusters, whereas there are typically longer times between jumps when the markets are calmer. Moreover, we observe more variability in jump sizes when the intensity is high than when it is low.
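To make the role of jump magnitudes concrete, here is a minimal univariate sketch of a linear, magnitude-dependent intensity with exponential decay. It is an illustrative special case, not the multivariate or nonlinear class studied in the paper; the form lambda(t) = base + sum_i alpha * m_i * exp(-beta * (t - t_i)), the parameter names, and the sample data are all assumptions.

```python
import math
from typing import List, Sequence


def intensity_after_events(
    times: Sequence[float],
    marks: Sequence[float],
    base: float,
    alpha: float,
    beta: float,
) -> List[float]:
    """Return the intensity immediately after each event.

    Each event at time t_i with magnitude m_i raises the intensity by
    alpha * m_i; the excess over `base` then decays at rate `beta`.
    """
    excess = 0.0
    previous_time = None
    result = []
    for t, m in zip(times, marks):
        if previous_time is not None:
            # Decay the accumulated excitation since the previous event.
            excess *= math.exp(-beta * (t - previous_time))
        excess += alpha * m  # larger jumps excite the intensity more
        result.append(base + excess)
        previous_time = t
    return result


# Hypothetical jump times and magnitudes.
path = intensity_after_events(
    times=[0.0, 1.0], marks=[1.0, 2.0], base=0.1, alpha=0.5, beta=1.0
)
```

For this assumed linear form, the usual non-explosion condition is that the expected number of events triggered by one event, alpha * E[m] / beta, stays below one.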
Recently there have been many research efforts to develop generative models for self-exciting point processes, partly because of their broad real-world applicability. However, we can rarely quantify how well a generative model captures the ground truth, since the ground truth is usually unknown. The challenge lies in the fact that generative models typically provide, at most, good approximations to the ground truth (e.g., through the rich representational power of neural networks), but they cannot be precisely the ground truth. We thus cannot use the classic goodness-of-fit (GOF) test framework to evaluate their performance. In this paper, we develop a GOF test for generative models of self-exciting processes by connecting this problem to the classical statistical theory of the quasi-maximum-likelihood estimator (QMLE). We present a non-parametric self-normalizing statistic for the GOF test, the Generalized Score (GS) statistic, and explicitly capture the model misspecification when establishing the asymptotic distribution of the GS statistic. Numerical simulations and real-data experiments validate our theory and demonstrate the good performance of the proposed GS test.