No Arabic abstract
Biomedical researchers usually study the effects of certain exposures on disease risks among a well-defined population. To achieve this goal, the gold standard is to design a trial with an appropriate sample from that population. Due to the high cost of such trials, usually the sample size collected is limited and is not enough to accurately estimate some exposures effect. In this paper, we discuss how to leverage the information from external `big data (data with much larger sample size) to improve the estimation accuracy at the risk of introducing small bias. We proposed a family of weighted estimators to balance the bias increase and variance reduction when including the big data. We connect our proposed estimator to the established penalized regression estimators. We derive the optimal weights using both second order and higher order asymptotic expansions. Using extensive simulation studies, we showed that the improvement in terms of mean square error (MSE) for the regression coefficient can be substantial even with finite sample sizes and our weighted method outperformed the existing methods such as penalized regression and James Steins approach. Also we provide theoretical guarantee that the proposed estimators will never lead to asymptotic MSE larger than the maximum likelihood estimator using small data only in general. We applied our proposed methods to the Asia Cohort Consortium China cohort data to estimate the relationships between age, BMI, smoking, alcohol use and mortality.
Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. $mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1pmvarepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression.
This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the other class, often called controls). We first derive the asymptotic distribution of the maximum likelihood estimator (MLE) of the unknown parameter, which shows that the asymptotic variance convergences to zero in a rate of the inverse of the number of the events instead of the inverse of the full data sample size. This indicates that the available information in rare events data is at the scale of the number of events instead of the full data sample size. Furthermore, we prove that under-sampling a small proportion of the nonevents, the resulting under-sampled estimator may have identical asymptotic distribution to the full data MLE. This demonstrates the advantage of under-sampling nonevents for rare events data, because this procedure may significantly reduce the computation and/or data collection costs. Another common practice in analyzing rare events data is to over-sample (replicate) the events, which has a higher computational cost. We show that this procedure may even result in efficiency loss in terms of parameter estimation.
We want to analyse EEG recordings in order to investigate the phonemic categorization at a very early stage of auditory processing. This problem can be modelled by a supervised classification of functional data. Discrimination is explored via a logistic functional linear model, using a wavelet representation of the data. Different procedures are investigated, based on penalized likelihood and principal component reduction or partial least squares reduction.
For random field theory based multiple comparison corrections In brain imaging, it is often necessary to compute the distribution of the supremum of a random field. Unfortunately, computing the distribution of the supremum of the random field is not easy and requires satisfying many distributional assumptions that may not be true in real data. Thus, there is a need to come up with a different framework that does not use the traditional statistical hypothesis testing paradigm that requires to compute p-values. With this as a motivation, we can use a different approach called the logistic regression that does not require computing the p-value and still be able to localize the regions of brain network differences. Unlike other discriminant and classification techniques that tried to classify preselected feature vectors, the method here does not require any preselected feature vectors and performs the classification at each edge level.
We propose a versatile joint regression framework for count responses. The method is implemented in the R add-on package GJRM and allows for modelling linear and non-linear dependence through the use of several copulae. Moreover, the parameters of the marginal distributions of the count responses and of the copula can be specified as flexible functions of covariates. Motivated by a football application, we also discuss an extension which forces the regression coefficients of the marginal (linear) predictors to be equal via a suitable penalisation. Model fitting is based on a trust region algorithm which estimates simultaneously all the parameters of the joint models. We investigate the proposals empirical performance in two simulation studies, the first one designed for arbitrary count data, the other one reflecting football-specific settings. Finally, the method is applied to FIFA World Cup data, showing its competitiveness to the standard approach with regard to predictive performance.