We consider a general statistical estimation problem wherein binary labels across different observations are not independent conditioned on their feature vectors, but dependent, capturing settings where, e.g., the observations are collected on a spatial domain, a temporal domain, or a social network, each of which induces dependencies. We model these dependencies in the language of Markov Random Fields and, importantly, allow them to be substantial, i.e., we do not assume that the Markov Random Field capturing these dependencies is in the high-temperature regime. As our main contribution we provide algorithms and statistically efficient estimation rates for this model, giving several instantiations of our bounds for logistic regression, sparse logistic regression, and neural network settings with dependent data. Our estimation guarantees follow from novel results for estimating the parameters (i.e., external fields and interaction strengths) of Ising models from a \emph{single} sample. We evaluate our estimation approach on real networked data, showing that it outperforms standard regression approaches that ignore dependencies, across three text classification datasets: Cora, Citeseer and Pubmed.
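As a concrete illustration of the kind of single-sample estimation involved, the sketch below fits the interaction strength and regression weights of a logistic-regression model with Ising-dependent labels by maximum pseudo-likelihood. This is a standard estimator for this setting rather than necessarily the authors' exact procedure, and all names (fit_single_sample, the known interaction matrix A, etc.) are illustrative.

```python
# Minimal sketch (not the paper's exact estimator): maximum pseudo-likelihood
# for labels y_i in {-1, +1} with conditional law
#   P(y_i = 1 | y_{-i}) = sigmoid(2 * (beta * sum_j A_ij y_j + x_i @ theta)),
# where A is a known interaction matrix (zero diagonal), beta the interaction
# strength, and theta the regression weights (external field).
import numpy as np
from scipy.optimize import minimize

def neg_pseudo_loglik(params, X, y, A):
    beta, theta = params[0], params[1:]
    m = A @ y                                # neighbourhood sums of the single sample
    z = 2.0 * y * (beta * m + X @ theta)     # y_i times the conditional logit
    return np.sum(np.logaddexp(0.0, -z))     # -sum_i log sigmoid(z_i)

def fit_single_sample(X, y, A):
    """Estimate (beta, theta) from ONE dependent label vector y."""
    d = X.shape[1]
    res = minimize(neg_pseudo_loglik, x0=np.zeros(d + 1),
                   args=(X, y, A), method="L-BFGS-B")
    return res.x[0], res.x[1:]
```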
Acquisition of data is a difficult task in many applications of machine learning, and it is only natural to hope and expect the population risk to decrease (i.e., performance to improve) monotonically with the number of data points. It turns out, somewhat surprisingly, that this is not the case even for the most standard algorithms such as empirical risk minimization. Non-monotonic behaviour of the risk and instability in training have appeared in the popular deep learning paradigm under the name of double descent. These phenomena expose gaps in our understanding of learning algorithms and generalization. It is therefore crucial to pursue this concern and characterize such behaviour. In this paper, we derive the first consistent and risk-monotonic algorithms for a general statistical learning setting under weak assumptions, consequently resolving an open problem (Viering et al., 2019) on how to avoid non-monotonic behaviour of risk curves. Our work makes a significant contribution to the topic of risk-monotonicity, which may be key in resolving empirical phenomena such as double descent.
Estimating the data uncertainty in regression tasks is often done by learning a quantile function or a prediction interval of the true label conditioned on the input. It is frequently observed that quantile regression -- a vanilla algorithm for learning quantiles with asymptotic guarantees -- tends to \emph{under-cover} relative to the desired coverage level in practice. While various fixes have been proposed, a more fundamental understanding of why this under-coverage bias happens in the first place remains elusive. In this paper, we present a rigorous theoretical study of the coverage of uncertainty estimation algorithms in learning quantiles. We prove that quantile regression suffers from an inherent under-coverage bias in a vanilla setting where we learn a realizable linear quantile function and there is more data than parameters. More quantitatively, for $\alpha > 0.5$ and small $d/n$, the $\alpha$-quantile learned by quantile regression roughly achieves coverage $\alpha - (\alpha - 1/2)\cdot d/n$ regardless of the noise distribution, where $d$ is the input dimension and $n$ is the number of training examples. Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error that is not implied by existing theories on quantile regression. Experiments on simulated and real data verify our theory and further illustrate the effect of various factors such as sample size and model capacity on the under-coverage bias in more practical setups.
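The predicted coverage gap is easy to probe numerically. The sketch below is illustrative only (the values of alpha, d and n are arbitrary choices, not the paper's experiments): it fits a linear $\alpha$-quantile by vanilla quantile regression on realizable data and compares its empirical test coverage to $\alpha - (\alpha - 1/2)\cdot d/n$.

```python
# Simulation sketch of the under-coverage of vanilla quantile regression
# in a realizable linear model with i.i.d. noise.
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(0)
alpha, d, n, n_test, n_trials = 0.9, 40, 400, 20000, 20
w_star = rng.normal(size=d)

coverage = []
for _ in range(n_trials):
    X = np.c_[np.ones(n), rng.normal(size=(n, d - 1))]          # intercept + features
    y = X @ w_star + rng.normal(size=n)
    w_hat = QuantReg(y, X).fit(q=alpha).params                  # pinball-loss fit
    X_test = np.c_[np.ones(n_test), rng.normal(size=(n_test, d - 1))]
    y_test = X_test @ w_star + rng.normal(size=n_test)
    coverage.append(np.mean(y_test <= X_test @ w_hat))          # P(y <= learned alpha-quantile)

print("empirical coverage :", np.mean(coverage))
print("theory prediction  :", alpha - (alpha - 0.5) * d / n)
```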
Paired estimation of change in parameters of interest over a population plays a central role in several application domains, including the social sciences, epidemiology, medicine and biology. In these domains, the size of the population under study is often very large; however, the number of observations available per individual is very small (\emph{sparse observations}), which makes the problem challenging. Consider the setting with $N$ independent individuals, each with unknown parameters $(p_i, q_i)$ drawn from some unknown distribution on $[0, 1]^2$. We observe $X_i \sim \text{Bin}(t, p_i)$ before an event and $Y_i \sim \text{Bin}(t, q_i)$ after the event. Given these paired observations, $\{(X_i, Y_i)\}_{i=1}^N$, our goal is to accurately estimate the \emph{distribution of the change in parameters}, $\delta_i := q_i - p_i$, over the population, as well as properties of interest such as the \emph{$\ell_1$-magnitude of the change}, with sparse observations ($t \ll N$). We provide \emph{information-theoretic lower bounds} on the error in estimating the distribution of change and the $\ell_1$-magnitude of change. Furthermore, we show that the following two-step procedure achieves the optimal error bounds: first, estimate the full joint distribution of the paired parameters using the maximum likelihood estimator (MLE), and then estimate the distribution of change and the $\ell_1$-magnitude of change from the joint MLE. Notably, and perhaps surprisingly, these error bounds are of the same order as the minimax optimal error bounds for learning the \emph{full} joint distribution itself (in Wasserstein-1 distance); in other words, estimating the magnitude of the change of parameters over the population is, in a minimax sense, as difficult as estimating the full joint distribution itself.
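A minimal sketch of the two-step procedure, under the simplifying assumption that the joint MLE is computed over a finite grid on $[0,1]^2$ via a fixed number of EM iterations (the grid size and iteration count below are arbitrary illustrative choices, not the paper's):

```python
# Step 1: approximate the joint MLE of the distribution of (p_i, q_i) by a
# mixture over a grid, fitted with EM. Step 2: read off the distribution of
# delta = q - p and its l1-magnitude from the fitted joint distribution.
import numpy as np
from scipy.stats import binom

def two_step_estimate(X, Y, t, grid_size=20, em_iters=500):
    g = np.linspace(0.0, 1.0, grid_size)
    P, Q = np.meshgrid(g, g, indexing="ij")
    p, q = P.ravel(), Q.ravel()                       # candidate (p, q) grid points
    # per-individual likelihood of each grid point
    L = binom.pmf(X[:, None], t, p[None, :]) * binom.pmf(Y[:, None], t, q[None, :])
    w = np.full(p.size, 1.0 / p.size)                 # mixture weights over the grid
    for _ in range(em_iters):                         # EM iterations toward the MLE
        r = L * w
        r /= r.sum(axis=1, keepdims=True)             # responsibilities
        w = r.mean(axis=0)
    delta = q - p
    l1_magnitude = np.sum(w * np.abs(delta))          # estimate of E|q_i - p_i|
    return delta, w, l1_magnitude
```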
Due to recent advancements in wearables and sensing technology, health scientists are increasingly developing mobile health (mHealth) interventions. In mHealth interventions, mobile devices are used to deliver treatment to individuals as they go about their daily lives. These treatments are generally designed to impact a near-term, proximal outcome such as stress or physical activity. The mHealth intervention policies, often called just-in-time adaptive interventions, are decision rules that map an individual's current state (e.g., the individual's past behaviors as well as current observations of time, location, social activity, stress and urges to smoke) to a particular treatment at each of many time points. The vast majority of current mHealth interventions deploy expert-derived policies. In this paper, we provide an approach for conducting inference about the performance of one or more such policies using historical data collected under a possibly different policy. Our measure of performance is the average of proximal outcomes over a long time period were the particular mHealth policy to be followed. We provide an estimator as well as confidence intervals. This work is motivated by HeartSteps, an mHealth physical activity intervention.
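The abstract does not spell out the form of the estimator, so the sketch below only illustrates one standard off-policy approach, a self-normalized importance-weighted average of proximal outcomes; all function and argument names are hypothetical, and the paper's estimator and confidence intervals may differ substantially, in particular because the target is a long-run average outcome and the data exhibit within-person dependence.

```python
# Hedged sketch of a self-normalized importance-weighted estimate of the
# average proximal outcome under a target policy pi, from data logged under
# a behaviour policy mu. Illustrative only.
import numpy as np

def avg_proximal_outcome(states, actions, outcomes, pi, mu):
    """states, actions, outcomes: arrays over all decision points (pooled over users).
    pi(a, s), mu(a, s): target / behaviour probabilities of taking action a in state s."""
    w = np.array([pi(a, s) / mu(a, s) for a, s in zip(actions, states)])
    return np.sum(w * outcomes) / np.sum(w)    # self-normalized importance-weighted average
```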
Structured statistical estimation problems are often solved by Conditional Gradient (CG) type methods to avoid the computationally expensive projection operation. However, existing CG type methods are not robust to data corruption. To address this, we propose to robustify CG type methods against Huber's corruption model and heavy-tailed data. First, we show that two Pairwise CG methods are stable, i.e., do not accumulate error. Combined with robust mean gradient estimation techniques, we can therefore guarantee robustness for a wide class of problems, but now in a projection-free algorithmic framework. Next, we consider high-dimensional problems, where approaches based on robust mean estimation may have an unacceptably high sample complexity. When the constraint set is an $\ell_0$ norm ball, Iterative-Hard-Thresholding-based methods have been developed recently, yet extending them is non-trivial even to general sets with $O(d)$ extreme points. For settings where the feasible set has $O(\text{poly}(d))$ extreme points, we develop a novel robustness method based on a new condition we call the Robust Atom Selection Condition (RASC). When RASC is satisfied, our method converges linearly with a corresponding statistical error, with a sample complexity that scales with the sparsity of the problem rather than the ambient dimension, as would be required by any approach based on robust mean estimation.
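To make the first ingredient concrete, the sketch below runs a plain conditional-gradient loop over an $\ell_1$ ball while replacing the empirical mean gradient with a median-of-means estimate. This is only an illustration of the "robust gradient inside a projection-free method" idea, not the paper's Pairwise CG algorithm or its RASC-based method, and all names are hypothetical.

```python
# Conditional gradient (Frank-Wolfe) over the l1 ball with a robust
# (median-of-means) gradient estimate, so corrupted or heavy-tailed samples
# do not dominate the update. Illustrative sketch only.
import numpy as np

def median_of_means(G, n_blocks=10):
    """Coordinate-wise median of block means of per-sample gradients G (n x d)."""
    blocks = np.array_split(G, n_blocks)
    return np.median([b.mean(axis=0) for b in blocks], axis=0)

def robust_cg_l1(grad_samples, d, radius=1.0, iters=200):
    """grad_samples(w) -> (n x d) matrix of per-sample gradients at w."""
    w = np.zeros(d)
    for t in range(iters):
        g = median_of_means(grad_samples(w))
        j = np.argmax(np.abs(g))                 # linear minimization oracle on the l1 ball
        s = np.zeros(d)
        s[j] = -radius * np.sign(g[j])
        w += (2.0 / (t + 2.0)) * (s - w)         # standard CG step size
    return w
```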