Archetypal analysis is an unsupervised learning method that uses a convex polytope to summarize multivariate data. For fixed $k$, the method finds a convex polytope with $k$ vertices, called archetype points, such that the polytope is contained in the convex hull of the data and the mean squared distance between the data and the polytope is minimal. In this paper, we prove a consistency result showing that if the data is independently sampled from a probability measure with bounded support, then the archetype points converge to a solution of the continuum version of the problem, of which we identify and establish several properties. We also obtain the convergence rate of the optimal objective values under appropriate assumptions on the distribution. If the data is independently sampled from a distribution with unbounded support, we prove a consistency result for a modified method that penalizes the dispersion of the archetype points. Our analysis is supported by detailed computational experiments that compute the archetype points for data sampled from the uniform distribution in a disk, the normal distribution, an annular distribution, and a Gaussian mixture model.
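For orientation, the empirical objective commonly used for archetypal analysis can be written as follows; the weight matrices $A$ and $B$ below are illustrative notation rather than the paper's. Given data $x_1,\dots,x_n \in \mathbb{R}^d$ and a target number of archetypes $k$,
\[
\min_{A,\,B}\ \frac{1}{n}\sum_{i=1}^{n}\Big\| x_i - \sum_{j=1}^{k} A_{ij}\, z_j \Big\|^2,
\qquad
z_j = \sum_{\ell=1}^{n} B_{j\ell}\, x_\ell,
\]
where the rows of $A \in \mathbb{R}^{n\times k}$ and $B \in \mathbb{R}^{k\times n}$ are nonnegative and sum to one. The constraints on $B$ force the archetype points $z_j$ to lie in the convex hull of the data, while the term involving $A$ measures the squared distance of each data point to the polytope spanned by the archetypes.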
The knowledge gradient is a design principle for developing Bayesian sequential sampling policies to solve optimization problems. In this paper we consider the ranking and selection problem in the presence of covariates, where the best alternative is not universal but depends on the covariates. In this context, we prove that under minimal assumptions the sampling policy based on the knowledge gradient is consistent, in the sense that, under the policy, the best alternative as a function of the covariates is identified almost surely as the number of samples grows. We also propose a stochastic gradient ascent algorithm for computing the sampling policy and demonstrate its performance via numerical experiments.
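As a point of reference, the classical (covariate-free) knowledge-gradient factor for ranking and selection is the expected one-step improvement of the posterior maximum; the covariate-dependent version studied here additionally conditions on the covariate value, so the formula below is only a sketch under that caveat. With $\mu_i^n$ the posterior mean of alternative $i$ after $n$ samples,
\[
\nu_i^{\mathrm{KG},n} \;=\; \mathbb{E}_n\Big[\max_{j}\mu_j^{\,n+1} - \max_{j}\mu_j^{\,n} \,\Big|\, \text{the $(n+1)$-st sample is taken at alternative } i\Big],
\]
and the knowledge-gradient policy samples an alternative maximizing $\nu_i^{\mathrm{KG},n}$.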
In stochastic decision problems, one often wants to estimate the underlying probability measure statistically, and then to use this estimate as a basis for decisions. We shall consider how the uncertainty in this estimation can be explicitly and consistently incorporated in the valuation of decisions, using the theory of nonlinear expectations.
The classical asymptotic theory for parametric $M$-estimators guarantees that, in the limit of infinite sample size, the excess risk has a chi-square type distribution, even in the misspecified case. We demonstrate how self-concordance of the loss allows us to characterize the critical sample size sufficient to guarantee a chi-square type in-probability bound for the excess risk. Specifically, we consider two classes of losses: (i) self-concordant losses in the classical sense of Nesterov and Nemirovski, i.e., losses whose third derivative is uniformly bounded by the $3/2$ power of the second derivative; (ii) pseudo self-concordant losses, for which the power $3/2$ is removed. These classes contain losses corresponding to several generalized linear models, including the logistic loss and pseudo-Huber losses. Our basic result under minimal assumptions bounds the critical sample size by $O(d \cdot d_{\text{eff}})$, where $d$ is the parameter dimension and $d_{\text{eff}}$ is the effective dimension that accounts for model misspecification. In contrast to existing results, we only impose local assumptions concerning the population risk minimizer $\theta_*$. Namely, we assume that the calibrated design, i.e., the design scaled by the square root of the second derivative of the loss, is subgaussian at $\theta_*$. In addition, for type-(ii) losses we require boundedness of a certain measure of curvature of the population risk at $\theta_*$. Our improved result bounds the critical sample size from above by $O(\max\{d_{\text{eff}}, d \log d\})$ under slightly stronger assumptions. Namely, the local assumptions must hold in a neighborhood of $\theta_*$ given by the Dikin ellipsoid of the population risk. Interestingly, we find that, for logistic regression with Gaussian design, there is no actual restriction of conditions: the subgaussian parameter and the curvature measure remain near-constant over the Dikin ellipsoid. Finally, we extend some of these results to $\ell_1$-penalized estimators in high dimensions.
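To fix ideas, the two loss classes can be stated for a scalar argument $t \mapsto \ell(t)$ as follows (constant conventions vary across references, so this is indicative rather than the paper's exact definition):
\[
\text{(i) classical self-concordance:}\quad |\ell'''(t)| \le 2\,\ell''(t)^{3/2},
\qquad
\text{(ii) pseudo self-concordance:}\quad |\ell'''(t)| \le \ell''(t).
\]
For example, the logistic loss $\ell(t) = \log(1+e^{-t})$ satisfies the pseudo self-concordance bound (ii).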
Recently proposed numerical algorithms for solving high-dimensional nonlinear partial differential equations (PDEs) based on neural networks have shown remarkable performance. We review some of them and study their convergence properties. These methods rely on the probabilistic representation of PDEs by backward stochastic differential equations (BSDEs) and their iterated time discretization. Our proposed algorithm, called the deep backward multistep scheme (MDBDP), is a machine learning version of the LSMDP scheme of Gobet and Turkedjiev (Math. Comp. 85, 2016). It simultaneously estimates, by backward induction, the solution and its gradient with neural networks, through sequential minimizations of suitable quadratic loss functions performed by stochastic gradient descent. Our main theoretical contribution is an approximation error analysis of the MDBDP scheme as well as of the deep splitting (DS) scheme for semilinear PDEs designed in Beck, Becker, Cheridito, Jentzen, and Neufeld (2019). We also supplement the error analysis of the DBDP scheme of Huré, Pham, and Warin (Math. Comp. 89, 2020). This notably yields convergence rates in terms of the number of neurons for a class of deep Lipschitz continuous GroupSort neural networks, when the PDE is linear in the gradient of the solution for the MDBDP scheme, and in the semilinear case for the DBDP scheme. We illustrate our results with numerical tests that are compared with other machine learning algorithms from the literature.
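For context, here is a minimal sketch of the probabilistic representation underlying these schemes, with generic coefficients $\mu$, $\sigma$, driver $f$, and terminal condition $g$ (the precise setting in the paper may differ): if $u$ solves the semilinear PDE
\[
\partial_t u + \mu\cdot\nabla_x u + \tfrac12\,\mathrm{Tr}\big(\sigma\sigma^{\!\top}\nabla_x^2 u\big) + f\big(t,x,u,\sigma^{\!\top}\nabla_x u\big)=0,\qquad u(T,\cdot)=g,
\]
then $Y_t = u(t,X_t)$ and $Z_t = \sigma^{\!\top}(t,X_t)\nabla_x u(t,X_t)$ solve the BSDE
\[
Y_t = g(X_T) + \int_t^T f(s,X_s,Y_s,Z_s)\,ds - \int_t^T Z_s\cdot dW_s,
\]
where $dX_t = \mu(t,X_t)\,dt + \sigma(t,X_t)\,dW_t$. The schemes discretize this backward equation in time and fit $(Y,Z)$ at each time step with neural networks by minimizing quadratic losses.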
Archetypal analysis is an unsupervised learning method for exploratory data analysis. One major challenge that limits the applicability of archetypal analysis in practice is the inherent computational complexity of the existing algorithms. In this paper, we provide a novel approximation approach to partially address this issue. Utilizing probabilistic ideas from high-dimensional geometry, we introduce two preprocessing techniques to reduce the dimension and representation cardinality of the data, respectively. We prove that, provided the data is approximately embedded in a low-dimensional linear subspace and the convex hull of the corresponding representations is well approximated by a polytope with a few vertices, our method can effectively reduce the scaling of archetypal analysis. Moreover, the solution of the reduced problem is near-optimal in terms of prediction errors. Our approach can be combined with other acceleration techniques to further mitigate the intrinsic complexity of archetypal analysis. We demonstrate the usefulness of our results by applying our method to summarize several moderately large-scale datasets.
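As a rough illustration of the dimension-reduction idea (not the paper's specific construction), a Johnson-Lindenstrauss-style random projection reduces the ambient dimension before running archetypal analysis; the function and parameter names below are hypothetical.

    import numpy as np

    def reduce_dimension(X, target_dim, seed=None):
        """Project n x d data onto a random target_dim-dimensional subspace.

        A generic random-projection sketch of the dimension-reduction step;
        the paper's actual preprocessing may use a different construction.
        """
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        # Gaussian projection; scaling preserves squared norms in expectation.
        G = rng.standard_normal((d, target_dim)) / np.sqrt(target_dim)
        return X @ G

    # Example: 5,000 points in R^400 lying near a 15-dimensional subspace.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5000, 15)) @ rng.standard_normal((15, 400))
    X_low = reduce_dimension(X, target_dim=40, seed=1)  # shape (5000, 40)
    # Archetypal analysis can then be run on X_low, optionally after a
    # cardinality-reduction step that keeps only representative points.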