Clustering methods have led to a number of important discoveries in bioinformatics and beyond. A major challenge in their use is determining which clusters represent important underlying structure, as opposed to spurious sampling artifacts. This challenge is especially serious when the data are very high in dimension, a setting for which very few methods are available. Statistical Significance of Clustering (SigClust) is a recently developed cluster evaluation tool for high-dimensional, low-sample-size data. An important component of the SigClust approach is the very definition of a single cluster as a subset of data sampled from a multivariate Gaussian distribution. The implementation of SigClust requires estimating the eigenvalues of the covariance matrix of the null multivariate Gaussian distribution. We show that the original eigenvalue estimation can lead to a test that suffers from severe inflation of type-I error in the important case where there are huge single spikes in the eigenvalues. This paper addresses this critical challenge using a novel likelihood-based soft thresholding approach to estimate these eigenvalues, which leads to a much improved SigClust. These major improvements in SigClust performance are shown by both theoretical work and an extensive simulation study. Applications to some cancer genomic data further demonstrate the usefulness of these improvements.
We consider perfect simulation algorithms for locally stable point processes based on dominated coupling from the past, and apply these methods in two different contexts. A new version of the algorithm is developed which is feasible for processes that are neither purely attractive nor purely repulsive. Such processes include multiscale area-interaction processes, which are capable of modelling point patterns whose clustering structure varies across scales. The other topic considered is nonparametric regression using wavelets, where we use a suitable area-interaction process on the discrete space of indices of wavelet coefficients to model the notion that if one wavelet coefficient is non-zero, then neighbouring coefficients are more likely to be non-zero as well. A method based on perfect simulation within this model shows promising results compared to the standard methods, which threshold coefficients independently.
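For orientation, the baseline that this abstract compares against, thresholding each wavelet coefficient independently, can be sketched as follows. This is a minimal illustration and not the method of the paper: the Haar wavelet, the universal threshold sigma * sqrt(2 log n), soft thresholding, and all function names are assumptions chosen for brevity.

```python
import numpy as np

def haar_forward(signal):
    """Orthonormal Haar transform of a signal whose length is a power of two.

    Returns the coarsest approximation and a list of detail coefficient
    arrays ordered from the finest scale to the coarsest.
    """
    approx = np.asarray(signal, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))
        approx = (even + odd) / np.sqrt(2)
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward by rebuilding from the coarsest scale upward."""
    approx = np.asarray(approx, dtype=float)
    for d in reversed(details):
        out = np.empty(2 * len(d))
        out[0::2] = (approx + d) / np.sqrt(2)
        out[1::2] = (approx - d) / np.sqrt(2)
        approx = out
    return approx

def soft_threshold(coefs, thr):
    """Shrink each coefficient toward zero by thr, independently of the others."""
    return np.sign(coefs) * np.maximum(np.abs(coefs) - thr, 0.0)

def denoise(noisy, sigma):
    """Independent soft thresholding with the universal threshold (assumed choice)."""
    thr = sigma * np.sqrt(2.0 * np.log(len(noisy)))
    approx, details = haar_forward(noisy)
    details = [soft_threshold(d, thr) for d in details]
    return haar_inverse(approx, details)
```

Each coefficient is shrunk using only its own magnitude, so the rule cannot exploit the tendency of non-zero coefficients to occur near one another, which is the dependence the area-interaction prior is intended to capture.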
We introduce a new method of Bayesian wavelet shrinkage for reconstructing a signal when we observe a noisy version. Rather than making the common assumption that the wavelet coefficients of the signal are independent, we allow for the possibility that they are locally correlated in both location (time) and scale (frequency). This leads us to a prior structure which is analytically intractable, but it is possible to draw independent samples from a close approximation to the posterior distribution by an approach based on Coupling From The Past.
Mediation analysis is a statistical tool for testing hypotheses about whether the relationship between two variables is direct or indirect via a third variable. Assessing statistical significance has been an area of active research; however, assessment of statistical power has been hampered by the lack of closed-form calculations and the need for substantial amounts of computational simulation. The current work provides a detailed explanation of implementing large-scale simulation procedures within a shared computing cluster environment. In addition, all results and code for implementing these procedures are publicly available. The resulting power analyses compare the effects of sample size and of the strength and direction of the relationships between the three variables. Comparisons of three confidence interval calculation methods demonstrated that the bias-corrected method is optimal and requires approximately ten fewer participants than the percentile method to achieve equivalent power. Differing strengths of distal and proximal effects were compared and did not differentially affect the power to detect mediation effects. Suppression effects were also explored: when no relationship is observed between two variables, entering the mediating variable into the model can reveal a suppressed relationship. The power to detect suppression effects is similar to that for unsuppressed mediation. These results and their methods provide important information about the power of mediation models for study planning. Of greater importance, the methods lay the groundwork for assessing the statistical power of more complicated models involving multiple mediators and moderators.
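The kind of simulation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' published code: it assumes a single-mediator linear model with standard normal errors, estimates the indirect effect as the product of two least-squares slopes, and uses only the percentile bootstrap interval (the bias-corrected variant is not shown). All function names and parameter values are hypothetical.

```python
import numpy as np

def indirect_effect(x, m, y):
    """Estimate a*b, with a from M ~ X and b from Y ~ X + M (least squares)."""
    a = np.polyfit(x, m, 1)[0]
    design = np.column_stack([np.ones_like(x), x, m])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return a * coef[2]

def percentile_ci(x, m, y, rng, n_boot=200, alpha=0.05):
    """Percentile bootstrap confidence interval for the indirect effect."""
    n = len(x)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        estimates[i] = indirect_effect(x[idx], m[idx], y[idx])
    return np.quantile(estimates, [alpha / 2, 1 - alpha / 2])

def estimate_power(n, a, b, c_prime, n_sim=100, seed=0):
    """Fraction of simulated data sets whose interval excludes zero."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        x = rng.standard_normal(n)
        m = a * x + rng.standard_normal(n)
        y = c_prime * x + b * m + rng.standard_normal(n)
        lo, hi = percentile_ci(x, m, y, rng)
        hits += int(lo > 0 or hi < 0)
    return hits / n_sim
```

Calling `estimate_power` over a grid of sample sizes and effect sizes gives a power curve. Each grid cell is independent of the others, which is why this workload distributes naturally across the nodes of a shared cluster.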
Functional variables are often used as predictors in regression problems. A commonly used parametric approach, called scalar-on-function regression, uses the $\mathbb{L}^2$ inner product to map functional predictors into scalar responses. This method can perform poorly when predictor functions contain undesired phase variability, causing phases to have disproportionately large influence on the response variable. One past solution has been to perform phase-amplitude separation (as a pre-processing step) and then use only the amplitudes in the regression model. Here we propose a more integrated approach, termed elastic functional regression model (EFRM), where phase separation is performed inside the regression model, rather than as a pre-processing step. This approach generalizes the notion of phase in functional data, and is based on the norm-preserving time warping of predictors. Due to its invariance properties, this representation provides robustness to predictor phase variability and results in improved predictions of the response variable over traditional models. We demonstrate this framework using a number of datasets involving gait signals, NMR data, and stock market prices.
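The standard scalar-on-function model mentioned above, with response equal to an intercept plus the inner product of the predictor function and a coefficient function, can be sketched as follows. This illustrates only the traditional baseline, not the elastic model (EFRM). It assumes predictors sampled on a common grid over [0, 1], a trapezoidal approximation of the inner product, and a small Fourier basis for the coefficient function; the basis choice and all names are assumptions.

```python
import numpy as np

def trapezoid(values, t):
    """Trapezoidal rule along the last axis on the grid t."""
    return np.sum(0.5 * (values[..., 1:] + values[..., :-1]) * np.diff(t), axis=-1)

def fourier_basis(t, n_pairs):
    """Constant function plus n_pairs sine/cosine pairs; shape (K, len(t))."""
    rows = [np.ones_like(t)]
    for k in range(1, n_pairs + 1):
        rows.append(np.sin(2 * np.pi * k * t))
        rows.append(np.cos(2 * np.pi * k * t))
    return np.stack(rows)

def fit_scalar_on_function(X, y, t, n_pairs=2):
    """Least-squares fit of y_i = alpha + <x_i, beta>, with beta in the basis span.

    X has shape (n_samples, len(t)); returns alpha and beta evaluated on t.
    """
    B = fourier_basis(t, n_pairs)
    # Z[i, k] approximates the inner product of predictor i with basis function k.
    Z = trapezoid(X[:, None, :] * B[None, :, :], t)
    design = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:] @ B

def predict(X, t, alpha, beta):
    """Predicted responses alpha + <x_i, beta> for each row of X."""
    return alpha + trapezoid(X * beta, t)
```

Because the prediction depends on each predictor only through pointwise products with a fixed coefficient function, a time shift in the predictor changes the inner product even when its shape is unchanged. That sensitivity to phase is what motivates handling warping inside the model.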
Functional data registration is a necessary processing step for many applications. The observed data can be inherently noisy, often due to measurement error or natural process uncertainty, which most functional alignment methods cannot handle. A pair of functions can also have multiple optimal alignment solutions, which is not addressed in current literature. In this paper, a flexible Bayesian approach to functional alignment is presented, which appropriately accounts for noise in the data without any pre-smoothing required. Additionally, by running parallel MCMC chains, the method can account for multiple optimal alignments via the multi-modal posterior distribution of the warping functions. To sample the warping functions efficiently, the approach relies on a modification of standard Hamiltonian Monte Carlo that is well defined on the infinite-dimensional Hilbert space. This flexible Bayesian alignment method is applied to both simulated and real data sets to show its efficiency in handling noisy functions and in accounting for multiple optimal alignments in the posterior, thereby characterizing the uncertainty surrounding the warping functions.