No Arabic abstract
In machine learning and optimization community there are two main approaches for convex risk minimization problem, namely, the Stochastic Approximation (SA) and the Sample Average Approximation (SAA). In terms of oracle complexity (required number of stochastic gradient evaluations), both approaches are considered equivalent on average (up to a logarithmic factor). The total complexity depends on the specific problem, however, starting from work cite{nemirovski2009robust} it was generally accepted that the SA is better than the SAA. Nevertheless, in case of large-scale problems SA may run out of memory as storing all data on one machine and organizing online access to it can be impossible without communications with other machines. SAA in contradistinction to SA allows parallel/distributed calculations. In this paper, we shed new light on the comparison of SA and SAA for particular problem of calculating the population (regularized) Wasserstein barycenter of discrete measures. The conclusion is valid even for non-parallel (non-decentralized) setup.
We investigate the feasibility of sample average approximation (SAA) for general stochastic optimization problems, including two-stage stochastic programming without the relatively complete recourse assumption. Instead of analyzing problems with specific structures, we utilize results from the Vapnik-Chervonenkis (VC) dimension and Probably Approximately Correct learning to provide a general framework that offers explicit feasibility bounds for SAA solutions under minimal structural or distributional assumption. We show that, as long as the hypothesis class formed by the feasbible region has a finite VC dimension, the infeasibility of SAA solutions decreases exponentially with computable rates and explicitly identifiable accompanying constants. We demonstrate how our bounds apply more generally and competitively compared to existing results.
In this thesis, we consider the Wasserstein barycenter problem of discrete probability measures from computational and statistical sides in two scenarios: (I) the measures are given and we need to compute their Wasserstein barycenter, and (ii) the measures are generated from a probability distribution and we need to calculate the population barycenter of the distribution defined by the notion of Frechet mean. The statistical focus is estimating the sample size of measures necessary to calculate an approximation for Frechet mean (barycenter) of a probability distribution with a given precision. For empirical risk minimization approaches, the question of the regularization is also studied together with proposing a new regularization which contributes to the better complexity bounds in comparison with quadratic regularization. The computational focus is developing algorithms for calculating Wasserstein barycenters: both primal and dual algorithms which can be executed in a decentralized manner. The motivation for dual approaches is closed-forms for the dual formulation of entropy-regularized Wasserstein distances and their derivatives, whereas the primal formulation has closed-form expression only in some cases, e.g., for Gaussian measures. Moreover, the dual oracle returning the gradient of the dual representation for entropy-regularized Wasserstein distance can be computed for a cheaper price in comparison with the primal oracle returning the gradient of the entropy-regularized Wasserstein distance. The number of dual oracle calls, in this case, will also be less, i.e., the square root of the number of primal oracle calls. This explains the successful application of the first-order dual approaches for the Wasserstein barycenter problem.
With the increasing penetration of high-frequency sensors across a number of biological and physical systems, the abundance of the resulting observations offers opportunities for higher statistical accuracy of down-stream estimates, but their frequency results in a plethora of computational problems in data assimilation tasks. The high-frequency of these observations has been traditionally dealt with by using data modification strategies such as accumulation, averaging, and sampling. However, these data modification strategies will reduce the quality of the estimates, which may be untenable for many systems. Therefore, to ensure high-quality estimates, we adapt stochastic approximation methods to address the unique challenges of high-frequency observations in data assimilation. As a result, we are able to produce estimates that leverage all of the observations in a manner that avoids the aforementioned computational problems and preserves the statistical accuracy of the estimates.
In this paper, we consider multi-stage stochastic optimization problems with convex objectives and conic constraints at each stage. We present a new stochastic first-order method, namely the dynamic stochastic approximation (DSA) algorithm, for solving these types of stochastic optimization problems. We show that DSA can achieve an optimal ${cal O}(1/epsilon^4)$ rate of convergence in terms of the total number of required scenarios when applied to a three-stage stochastic optimization problem. We further show that this rate of convergence can be improved to ${cal O}(1/epsilon^2)$ when the objective function is strongly convex. We also discuss variants of DSA for solving more general multi-stage stochastic optimization problems with the number of stages $T > 3$. The developed DSA algorithms only need to go through the scenario tree once in order to compute an $epsilon$-solution of the multi-stage stochastic optimization problem. As a result, the memory required by DSA only grows linearly with respect to the number of stages. To the best of our knowledge, this is the first time that stochastic approximation type methods are generalized for multi-stage stochastic optimization with $T ge 3$.
We consider stochastic optimization problems where a smooth (and potentially nonconvex) objective is to be minimized using a stochastic first-order oracle. These type of problems arise in many settings from simulation optimization to deep learning. We present Retrospective Approximation (RA) as a universal sequential sample-average approximation (SAA) paradigm where during each iteration $k$, a sample-path approximation problem is implicitly generated using an adapted sample size $M_k$, and solved (with prior solutions as warm start) to an adapted error tolerance $epsilon_k$, using a deterministic method such as the line search quasi-Newton method. The principal advantage of RA is that decouples optimization from stochastic approximation, allowing the direct adoption of existing deterministic algorithms without modification, thus mitigating the need to redesign algorithms for the stochastic context. A second advantage is the obvious manner in which RA lends itself to parallelization. We identify conditions on ${M_k, k geq 1}$ and ${epsilon_k, kgeq 1}$ that ensure almost sure convergence and convergence in $L_1$-norm, along with optimal iteration and work complexity rates. We illustrate the performance of RA with line-search quasi-Newton on an ill-conditioned least squares problem, as well as an image classification problem using a deep convolutional neural net.