Fine particulate matter (PM2.5) is a mixture of air pollutants that has adverse effects on human health. Understanding the health effects of the PM2.5 mixture and its individual species has been a research priority over the past two decades. However, the limited availability of speciated PM2.5 measurements continues to be a major challenge in exposure assessment for large-scale population-based epidemiological studies. The PM2.5 species have complex spatiotemporal and cross-dependence structures that should be accounted for when estimating the spatiotemporal distribution of each component. Two major sources of air quality data are commonly used for deriving exposure estimates: point-level monitoring data and gridded output from numerical computer model simulations, such as the Community Multiscale Air Quality (CMAQ) model. We propose a statistical method that combines these two data sources to estimate speciated PM2.5 concentrations. Our method models the complex relationships between monitoring measurements and the numerical model output at different spatial resolutions, as well as the spatial dependence and cross-dependence among PM2.5 species. We apply the method to combine CMAQ model output with measurements of major PM2.5 species in the contiguous United States in 2011.
Change point detection algorithms have numerous applications in fields of scientific and economic importance. We consider the problem of change point detection on compositional multivariate data (each sample is a probability mass function), which is a practically important subclass of general multivariate data. While change point detection is well studied in the univariate setting, and a few viable implementations exist for general multivariate data, the existing methods do not perform well on compositional data. In this paper, we propose a parametric approach for change point detection in compositional data. Moreover, by applying simple transformations to the data, we extend our approach to handle general multivariate data. Experimentally, we show that our method performs significantly better on compositional data and is competitive on general data compared with the available state-of-the-art implementations.
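To make the setting concrete, the sketch below locates a single change point in a series of compositional samples. It is not the parametric method proposed in the abstract, which is not specified there. Instead, it substitutes a simple stand-in: a centered log-ratio (CLR) transform followed by a least-squares scan for one mean shift. The function names `clr` and `single_change_point`, the smoothing constant `eps`, and the synthetic Dirichlet data are all assumptions made for illustration.

```python
import numpy as np

def clr(x, eps=1e-9):
    """Centered log-ratio transform of rows that lie on the simplex."""
    logx = np.log(x + eps)  # eps guards against log(0); an illustrative choice
    return logx - logx.mean(axis=1, keepdims=True)

def single_change_point(comp):
    """Return the split index t that minimizes the within-segment sum of
    squares of the CLR-transformed data; the segments are comp[:t] and comp[t:]."""
    z = clr(np.asarray(comp, dtype=float))
    n = len(z)
    best_t, best_cost = None, np.inf
    for t in range(2, n - 1):  # keep at least two samples in each segment
        left, right = z[:t], z[t:]
        cost = ((left - left.mean(axis=0)) ** 2).sum() \
             + ((right - right.mean(axis=0)) ** 2).sum()
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Synthetic example: 100 compositions from one Dirichlet, then 100 from another.
rng = np.random.default_rng(0)
before = rng.dirichlet([20, 5, 5], size=100)
after = rng.dirichlet([5, 5, 20], size=100)
series = np.vstack([before, after])
estimated = single_change_point(series)
```

In this synthetic series the true change occurs at index 100, so `estimated` should land close to that value. A scan of this kind handles one change point only; detecting several would need a recursive or penalized search.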
In applications of climate information, coarse-resolution climate projections commonly need to be downscaled to a finer grid. One challenge of this requirement is the modeling of sub-grid variability and of the spatial and temporal dependence at the finer scale. Here, a post-processing procedure is proposed for temperature projections that addresses this challenge. The procedure employs statistical bias correction and stochastic downscaling in two steps. First, errors related to spatial and temporal features of the first two moments of the temperature distribution at model scale are identified and corrected. Second, residual space-time dependence at the finer scale is analyzed using a statistical model, from which realizations are generated and then combined with the appropriate climate change signal to form the downscaled projection fields. Using a high-resolution observational gridded data product, the proposed approach is applied in a case study where projections of two regional climate models from the EURO-CORDEX ensemble are bias-corrected and downscaled to a 1x1 km grid in the Trondelag area of Norway. A cross-validation study shows that the proposed procedure generates results that better reflect the marginal distributional properties of the data product and are more consistent in space and time than those obtained with empirical quantile mapping.
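The contrast between correcting the first two moments and empirical quantile mapping can be sketched for a single grid cell as follows. This is a minimal illustration under simplifying assumptions, not the procedure of the abstract: it ignores the spatial and temporal structure entirely, and the function names, the number of quantiles, and the synthetic temperature samples are hypothetical.

```python
import numpy as np

def moment_correction(model, model_ref, obs_ref):
    """Shift and rescale model values so that, over the reference period,
    their mean and standard deviation match the observations."""
    return (model - model_ref.mean()) / model_ref.std() * obs_ref.std() + obs_ref.mean()

def empirical_quantile_mapping(model, model_ref, obs_ref, n_quantiles=99):
    """Map model values onto observed values through matching empirical quantiles.
    np.interp clamps values outside the reference quantile range to its endpoints."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    model_q = np.quantile(model_ref, probs)
    obs_q = np.quantile(obs_ref, probs)
    return np.interp(model, model_q, obs_q)

# Synthetic reference-period temperatures (illustrative values only).
rng = np.random.default_rng(42)
obs_ref = rng.normal(5.0, 3.0, size=1000)
model_ref = rng.normal(7.0, 2.0, size=1000)

corrected = moment_correction(model_ref, model_ref, obs_ref)
mapped = empirical_quantile_mapping(model_ref, model_ref, obs_ref)
```

Moment correction preserves the shape of the model distribution and changes only its location and scale, whereas quantile mapping reshapes the whole marginal distribution. Neither operation, applied cell by cell, says anything about dependence between cells or time steps.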
Rescaled spike and slab models are a new Bayesian variable selection method for linear regression models. In high-dimensional orthogonal settings, such models have been shown to possess optimal model selection properties. We review background theory and discuss applications of rescaled spike and slab models to prediction problems involving orthogonal polynomials. We first consider global smoothing and discuss its potential weaknesses. Some of these deficiencies are remedied by using local regression. The local regression approach relies on an intimate connection between local weighted regression and weighted generalized ridge regression. An important implication is that one can trace the effective degrees of freedom of a curve as a way to visualize and classify curvature. Several motivating examples are presented.
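For a weighted generalized ridge fit, the effective degrees of freedom are commonly defined as the trace of the hat matrix H = X (X'WX + Λ)^{-1} X'W, where W holds the observation weights and Λ is a diagonal penalty matrix. The sketch below computes this quantity for a Legendre polynomial basis. The function name `effective_df`, the choice of basis, and the penalty values are assumptions for illustration; the spike and slab machinery that would select the penalties is not shown.

```python
import numpy as np
from numpy.polynomial import legendre

def effective_df(X, penalty, weights=None):
    """Trace of the hat matrix H = X (X'WX + Lam)^{-1} X'W.
    `penalty` is a scalar or a length-p vector of ridge penalties;
    `weights` is an optional length-n vector of observation weights."""
    n, p = X.shape
    W = np.eye(n) if weights is None else np.diag(weights)
    Lam = np.diag(np.broadcast_to(penalty, (p,)).astype(float))
    XtW = X.T @ W
    H = X @ np.linalg.solve(XtW @ X + Lam, XtW)
    return np.trace(H)

# Polynomial basis of degree 4 on 50 equally spaced points (5 columns).
x = np.linspace(-1.0, 1.0, 50)
X = legendre.legvander(x, 4)

df_unpenalized = effective_df(X, 0.0)                  # equals the number of columns
df_heavy = effective_df(X, 1e8)                        # shrinks toward zero
df_mixed = effective_df(X, [0.0, 0.0, 1.0, 10.0, 100.0])  # term-specific penalties
```

With no penalty the trace equals the number of basis functions, and it decreases as the penalties grow. Passing kernel weights centered at a target point through `weights` gives the local version, so evaluating the trace across target points traces the effective degrees of freedom along the curve.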
Studying the neurological, genetic and evolutionary basis of human vocal communication mechanisms is an important field of neuroscience. In the absence of high-quality data on humans, mouse vocalization experiments in laboratory settings have proven useful in providing insights into mammalian vocal development and evolution, in particular the impact of certain genetic mutations. Data sets from mouse vocalization experiments usually consist of categorical syllable sequences along with continuous inter-syllable interval times for mice of different genotypes vocalizing under various contexts. Few statistical models have considered inference for both transition probabilities and inter-state intervals. The latter are of particular importance, as increased inter-state intervals can indicate possible vocal impairment. In this paper, we propose a class of novel Markov renewal mixed models that capture the stochastic dynamics of both state transitions and inter-state interval times. Specifically, we model the transition dynamics and the inter-state intervals using Dirichlet and gamma mixtures, respectively, allowing the mixture probabilities in both cases to vary flexibly with fixed covariate effects as well as random individual-specific effects. We apply our model to analyze the impact of a mutation in the Foxp2 gene on mouse vocal behavior. We find that genotypes and social contexts significantly affect the inter-state interval times but, compared to previous analyses, the influences of genotype and social context on the syllable transition dynamics are weaker.
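The data structure described here, a categorical state sequence paired with positive inter-state times, can be illustrated by simulating a basic Markov renewal process. The sketch below is far simpler than the mixed models of the abstract: it uses a single fixed transition matrix and one gamma distribution per transition, with no mixtures, covariates, or random effects. All names and parameter values are hypothetical.

```python
import numpy as np

def simulate_markov_renewal(P, shape, rate, n_steps, start=0, seed=0):
    """Simulate states and inter-state intervals.
    P[i, j] is the probability of moving from state i to state j;
    the interval before a move i -> j is Gamma(shape[i, j], rate[i, j])."""
    rng = np.random.default_rng(seed)
    P, shape, rate = (np.asarray(a, dtype=float) for a in (P, shape, rate))
    states, intervals = [start], []
    for _ in range(n_steps):
        cur = states[-1]
        nxt = int(rng.choice(P.shape[0], p=P[cur]))
        # NumPy parameterizes the gamma by scale, the reciprocal of the rate.
        intervals.append(rng.gamma(shape[cur, nxt], 1.0 / rate[cur, nxt]))
        states.append(nxt)
    return np.array(states), np.array(intervals)

def empirical_transitions(states, n_states):
    """Row-normalized counts of observed transitions."""
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (states[:-1], states[1:]), 1)
    return counts / counts.sum(axis=1, keepdims=True)

# Three hypothetical syllable types with illustrative parameters.
P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2]])
shape = np.full((3, 3), 2.0)
rate = np.full((3, 3), 10.0)

states, intervals = simulate_markov_renewal(P, shape, rate, n_steps=2000)
P_hat = empirical_transitions(states, 3)
```

Comparing `P_hat` with `P`, or the mean of `intervals` with shape divided by rate, gives a quick check that the simulated sequence behaves as intended.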
Mass cytometry technology enables the simultaneous measurement of over 40 proteins on single cells. This has helped immunologists to increase their understanding of the heterogeneity, complexity, and lineage relationships of white blood cells. Current statistical methods often collapse the rich single-cell data into summary statistics before proceeding with downstream analysis, discarding the information in these multivariate datasets. In this article, our aim is to exhibit the use of statistical analyses on the raw, uncompressed data, thus improving replicability and exposing multivariate patterns and their associated uncertainty profiles. We show that multivariate generative models are a valid alternative to univariate hypothesis testing. We propose two models: a multivariate Poisson log-normal mixed model and a logistic linear mixed model. We show that these models are complementary and that either model can account for different confounders. We use Hamiltonian Monte Carlo to provide Bayesian uncertainty quantification. Applied to a recent pregnancy study, our models successfully reproduce key findings while quantifying increased overall protein-to-protein correlations between the first and third trimesters.
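The generative idea behind a multivariate Poisson log-normal model can be sketched by forward simulation: latent log-intensities are drawn from a multivariate normal distribution, and counts are then drawn from Poisson distributions with those intensities. The example below only simulates from this structure; it includes no mixed effects and no Hamiltonian Monte Carlo fitting. The three "proteins" and every parameter value are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical latent mean and covariance for three proteins (log scale).
mu = np.array([1.0, 1.5, 0.5])
Sigma = np.array([[0.5, 0.3, 0.0],
                  [0.3, 0.5, 0.0],
                  [0.0, 0.0, 0.5]])

# Step 1: latent log-intensities, one row per simulated cell.
latent = rng.multivariate_normal(mu, Sigma, size=5000)

# Step 2: observed counts, conditionally independent Poisson given the latent layer.
counts = rng.poisson(np.exp(latent))

# Correlation in the latent layer carries over to the counts:
# proteins 0 and 1 are correlated, proteins 0 and 2 are not.
corr = np.corrcoef(counts, rowvar=False)
overdispersion = counts.var(axis=0) / counts.mean(axis=0)
```

The latent covariance is what lets such a model express protein-to-protein correlation, and the log-normal layer also makes the counts overdispersed relative to a plain Poisson, which shows up as variance-to-mean ratios above one.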