Traffic flow count data in networks arise in many applications, such as automobile or aviation transportation, certain directed social network contexts, and Internet studies. Using an example of Internet browser traffic flow through site-segments of an international news website, we present Bayesian analyses of two linked classes of models that, in tandem, allow fast, scalable, and interpretable Bayesian inference. We first develop flexible state-space models for streaming count data, able to adaptively characterize and quantify network dynamics efficiently in real time. We then use these models as emulators of more structured, time-varying gravity models that allow formal dissection of network dynamics. This yields interpretable inferences on traffic flow characteristics and on dynamics in interactions among network nodes. Bayesian monitoring theory defines a strategy for sequential model assessment and adaptation in cases where network flow data deviate from model-based predictions. Exploratory and sequential monitoring analyses of evolving traffic on a network of site-segments in e-commerce demonstrate the utility of this coupled Bayesian emulation approach to the analysis of streaming network count data.
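As a rough illustration of the gravity-model decomposition described above (not the authors' exact specification), the following Python sketch simulates Poisson flow counts whose log-intensity separates into time-varying origin, destination, and pairwise interaction effects; all names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_times = 4, 100

# Hypothetical time-varying effects following slow random walks.
alpha = np.cumsum(rng.normal(0, 0.05, (n_times, n_nodes)), axis=0)           # origin effects
beta = np.cumsum(rng.normal(0, 0.05, (n_times, n_nodes)), axis=0)            # destination effects
gamma = np.cumsum(rng.normal(0, 0.02, (n_times, n_nodes, n_nodes)), axis=0)  # interactions
mu = 2.0  # baseline log-intensity

# Gravity-style decomposition: log-intensity of flow i -> j at time t.
log_lam = mu + alpha[:, :, None] + beta[:, None, :] + gamma

# Streaming network counts: Poisson flows on each directed node pair.
flows = rng.poisson(np.exp(log_lam))
print(flows.shape)  # (n_times, n_nodes, n_nodes)
```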
Bayesian networks are a versatile and powerful tool for modeling complex phenomena and the interplay of their components in a probabilistically principled way. Moving beyond the comparatively simple case of completely observed, static data, which has received the most attention in the literature, in this paper we review how Bayesian networks can model dynamic data and data with incomplete observations. Such data are the norm at the forefront of research and in practical applications, and Bayesian networks are uniquely positioned to model them due to their explainability and interpretability.
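As a minimal, hypothetical illustration of the dynamic setting the review covers, the sketch below encodes a two-time-slice dynamic Bayesian network as conditional probability tables and forward-samples a hidden state over time; the variables and probabilities are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-time-slice DBN: hidden state X_t in {0, 1}, noisy observation Y_t.
P_x0 = np.array([0.6, 0.4])      # P(X_0)
P_trans = np.array([[0.9, 0.1],  # P(X_t | X_{t-1})
                    [0.2, 0.8]])
P_obs = np.array([[0.8, 0.2],    # P(Y_t | X_t)
                  [0.3, 0.7]])

# Forward sampling: unroll the two-slice structure over T steps.
T = 10
x = rng.choice(2, p=P_x0)
for t in range(T):
    y = rng.choice(2, p=P_obs[x])
    print(f"t={t}: hidden={x}, observed={y}")
    x = rng.choice(2, p=P_trans[x])
```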
Several methods have been proposed in the spatial statistics literature for the analysis of big data sets in continuous domains. However, new methods for analyzing high-dimensional areal data are still scarce. Here, we propose a scalable Bayesian modeling approach for smoothing mortality (or incidence) risks in high-dimensional data, that is, when the number of small areas is very large. The method is implemented in the R add-on package bigDM. Model fitting and inference are based on the idea of divide and conquer, using integrated nested Laplace approximations and numerical integration. We analyze the proposal's empirical performance in a comprehensive simulation study that considers two model-free settings. Finally, the methodology is applied to male colorectal cancer mortality in Spanish municipalities, showing its benefits over the standard approach in terms of goodness of fit and computational time.
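The divide-and-conquer idea can be sketched schematically; this is not the bigDM algorithm, and the simple Poisson-gamma smoother below is an invented stand-in for the INLA-based submodel fits. Areas are partitioned into blocks, each block is smoothed independently, and the block-level results are assembled into one risk map.

```python
import numpy as np

rng = np.random.default_rng(2)
n_areas = 10_000
observed = rng.poisson(5, n_areas)  # hypothetical death counts per area
expected = np.full(n_areas, 5.0)    # hypothetical expected counts

def smooth_block(obs, exp, a=10.0, b=10.0):
    # Poisson-gamma posterior-mean smoother as a stand-in for the
    # submodel fitted within each partition.
    return (obs + a) / (exp + b)

# Divide: partition the areas into blocks; conquer: smooth each block;
# combine: concatenate the block-level smoothed risks.
blocks = np.array_split(np.arange(n_areas), 10)
risk = np.concatenate([smooth_block(observed[idx], expected[idx]) for idx in blocks])
print(risk[:5])
```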
Modeling correlation (and covariance) matrices can be challenging due to the positive-definiteness constraint and potential high-dimensionality. Our approach is to decompose the covariance matrix into the correlation and variance matrices and propose a novel Bayesian framework based on modeling the correlations as products of unit vectors. By specifying a wide range of distributions on a sphere (e.g., the squared-Dirichlet distribution), the proposed approach induces flexible prior distributions for covariance matrices that go beyond the commonly used inverse-Wishart prior. For modeling real-life spatio-temporal processes with complex dependence structures, we extend our method to dynamic cases and introduce unit-vector Gaussian process priors in order to capture the evolution of correlation among the components of a multivariate time series. To handle the intractability of the resulting posterior, we introduce the adaptive $\Delta$-Spherical Hamiltonian Monte Carlo. We demonstrate the validity and flexibility of the proposed framework in a simulation study of periodic processes and an analysis of rats' local field potential activity in a complex sequence memory task.
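A minimal sketch of the core construction, assuming only what the abstract states (correlations as products of unit vectors): normalizing Gaussian rows gives points on a sphere, and their Gram matrix is automatically a valid correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
p, d = 5, 5  # p variables, unit vectors on the (d-1)-sphere

# Draw unit vectors by normalizing Gaussian rows (uniform on the sphere;
# other sphere distributions, e.g. squared-Dirichlet, would swap in here).
U = rng.normal(size=(p, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

# Correlations as products of unit vectors: R[i, j] = <u_i, u_j>.
R = U @ U.T
print(np.allclose(np.diag(R), 1.0))             # unit diagonal by construction
print(np.all(np.linalg.eigvalsh(R) >= -1e-10))  # positive semi-definite
```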
In this article, motivated by biosurveillance and censoring sensor networks, we investigate the problem of distributed monitoring of large-scale data streams, where an undesired event may occur at some unknown time and affect only a few unknown data streams. We propose scalable global monitoring schemes that run local detection procedures in parallel and combine them into a global decision using SUM-shrinkage techniques. Our approach is illustrated in two concrete examples: the nonhomogeneous case, in which the pre-change and post-change local distributions are given, and the homogeneous case of monitoring a large number of independent $N(0,1)$ data streams in which the means of some streams may shift to unknown positive or negative values. Numerical simulation studies demonstrate the usefulness of the proposed schemes.
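For the homogeneous $N(0,1)$ example, a hedged sketch of the scheme's flavor: each stream runs a local two-sided CUSUM in parallel, and a global statistic sums the local statistics after soft-thresholding. The shrinkage rule and tuning constants below are illustrative choices, not the paper's calibrated ones.

```python
import numpy as np

rng = np.random.default_rng(4)
K, T = 1000, 200                   # number of streams, time horizon
drift, b, thresh = 0.5, 5.0, 50.0  # illustrative tuning constants

w_pos = np.zeros(K)  # local CUSUMs for upward mean shifts
w_neg = np.zeros(K)  # local CUSUMs for downward mean shifts
for t in range(T):
    x = rng.normal(size=K)
    if t == 100:
        x[:5] += 2.0  # shift the means of 5 unknown streams
    w_pos = np.maximum(w_pos + x - drift, 0.0)
    w_neg = np.maximum(w_neg - x - drift, 0.0)
    local = np.maximum(w_pos, w_neg)
    # SUM-shrinkage: only streams with large local evidence contribute.
    global_stat = np.sum(np.maximum(local - b, 0.0))
    if global_stat > thresh:
        print(f"global alarm at t={t}")
        break
```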
We propose a novel approach to estimating the precision matrix of multivariate Gaussian data that relies on decomposing it into a low-rank and a diagonal component. Such decompositions are very popular for modeling large covariance matrices, as they admit a latent-factor representation that allows easy inference. The same is not true for precision matrices, owing to the lack of a computationally convenient representation, which restricts their use to low- to moderate-dimensional problems. We address this remarkable gap in the literature by introducing a novel latent variable representation for such a decomposition of precision matrices as well. The construction leads to an efficient Gibbs sampler that scales very well to high-dimensional problems, far beyond the limits of the current state of the art. The ability to efficiently explore the full posterior space allows model uncertainty to be easily assessed. The decomposition also crucially allows us to adapt sparsity-inducing priors to shrink insignificant entries of the precision matrix toward zero, making the approach adaptable to high-dimensional, small-sample-size, sparse settings. Exact zeros in the matrix, encoding the underlying conditional independence graph, are then determined via a novel posterior false discovery rate control procedure. We evaluate the method's empirical performance through synthetic experiments and illustrate its practical utility on data sets from two different application domains.
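To make the decomposition concrete (a sketch under invented dimensions, not the paper's sampler): a precision matrix Omega = Lambda Lambda' + D with low-rank Lambda and diagonal D admits a closed-form covariance via the Woodbury identity, which is what keeps computation cheap in large dimensions.

```python
import numpy as np

rng = np.random.default_rng(5)
p, k = 200, 5  # dimension and (low) rank; illustrative sizes

Lam = rng.normal(size=(p, k))      # low-rank factor
d = rng.uniform(1.0, 2.0, size=p)  # diagonal component
Omega = Lam @ Lam.T + np.diag(d)   # precision = low-rank + diagonal

# Woodbury identity: invert Omega with a k x k solve instead of p x p.
Dinv = 1.0 / d
core = np.linalg.inv(np.eye(k) + (Lam.T * Dinv) @ Lam)
Sigma = np.diag(Dinv) - (Dinv[:, None] * Lam) @ core @ (Lam.T * Dinv)

print(np.allclose(Sigma, np.linalg.inv(Omega)))  # True
```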