
Deterministic Policy Gradient (DPG) removes a level of randomness from standard randomized-action Policy Gradient (PG), and demonstrates substantial empirical success for tackling complex dynamic problems involving Markov decision processes. At the same time, though, DPG loses its ability to learn in a model-free (i.e., actor-only) fashion, frequently necessitating the use of critics in order to obtain consistent estimates of the associated policy-reward gradient. In this work, we introduce Zeroth-order Deterministic Policy Gradient (ZDPG), which approximates policy-reward gradients via two-point stochastic evaluations of the $Q$-function, constructed by properly designed low-dimensional action-space perturbations. Exploiting the idea of random horizon rollouts for obtaining unbiased estimates of the $Q$-function, ZDPG lifts the dependence on critics and restores true model-free policy learning, while enjoying built-in and provable algorithmic stability. Additionally, we present new finite sample complexity bounds for ZDPG, which improve upon existing results by up to two orders of magnitude. Our findings are supported by several numerical experiments, which showcase the effectiveness of ZDPG in a practical setting, and its advantages over both PG and Baseline PG.
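
To make the two-point estimator concrete, here is a minimal Python sketch of a zeroth-order estimate of the action-gradient of the $Q$-function via paired action-space perturbations; the function names, the uniform-on-the-sphere perturbation, and the rollout oracle `q_fn` are illustrative assumptions, not the paper's exact construction. The deterministic policy gradient would then be assembled by chaining such an estimate with the Jacobian of the policy parameterization at the visited state.

```python
import numpy as np

def two_point_grad_estimate(q_fn, state, action, mu=1e-2, rng=None):
    """Two-point zeroth-order estimate of grad_a Q(state, a) at a = action.

    q_fn : callable (state, action) -> scalar Q-value estimate, e.g. obtained
           from a random-horizon rollout (assumed to be available).
    mu   : radius of the low-dimensional action-space perturbation.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(action.shape)
    u /= np.linalg.norm(u)                     # unit-norm perturbation direction
    q_plus = q_fn(state, action + mu * u)      # perturbed rollout evaluation
    q_minus = q_fn(state, action - mu * u)     # mirrored perturbation
    # Finite-difference estimate of the action-gradient of Q along u.
    return action.size * (q_plus - q_minus) / (2.0 * mu) * u
```
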
We study the best-arm identification problem in multi-armed bandits with stochastic, potentially private rewards, when the goal is to identify the arm with the highest quantile at a fixed, prescribed level. First, we propose a (non-private) successive elimination algorithm for strictly optimal best-arm identification, show that it is $\delta$-PAC, and characterize its sample complexity. Further, we provide a lower bound on the expected number of pulls, showing that the proposed algorithm is essentially optimal up to logarithmic factors. Both upper and lower complexity bounds depend on a special definition of the associated suboptimality gap, designed in particular for the quantile bandit problem; as we show, best-arm identification becomes impossible as the gap approaches zero. Second, motivated by applications where the rewards are private, we provide a differentially private successive elimination algorithm whose sample complexity is finite even for distributions with infinite support size, and we characterize its sample complexity. Our algorithms do not require prior knowledge of either the suboptimality gap or other statistical information related to the bandit problem at hand.
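
A minimal sketch of successive elimination driven by empirical quantiles is given below; the DKW-style confidence radius, the per-round pulling schedule, and the stopping rule are illustrative assumptions rather than the exact algorithm analyzed in the paper.

```python
import numpy as np

def quantile_successive_elimination(arms, tau, delta, max_rounds=10_000):
    """Sketch: successively eliminate arms by their empirical tau-quantiles.

    arms  : list of callables, each returning one stochastic reward per pull.
    tau   : quantile level of interest (e.g. 0.5 for the median).
    delta : target failure probability.
    """
    active = list(range(len(arms)))
    samples = {i: [] for i in active}
    for t in range(1, max_rounds + 1):
        for i in active:                          # pull every surviving arm once
            samples[i].append(arms[i]())
        n = t
        # Illustrative DKW-type confidence radius on the empirical CDF.
        rad = np.sqrt(np.log(4 * len(arms) * n * n / delta) / (2 * n))
        lo, hi = {}, {}
        for i in active:
            x = np.sort(samples[i])
            # Empirical quantiles at shifted levels give confidence bounds.
            lo[i] = x[int(np.floor(max(tau - rad, 0.0) * (n - 1)))]
            hi[i] = x[int(np.ceil(min(tau + rad, 1.0) * (n - 1)))]
        best_lo = max(lo.values())
        active = [i for i in active if hi[i] >= best_lo]
        if len(active) == 1:
            return active[0]                      # declared best arm
    return active
```
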
We propose a new risk-constrained reformulation of the standard Linear Quadratic Regulator (LQR) problem. Our framework is motivated by the fact that the classical (risk-neutral) LQR controller, although optimal in expectation, might be ineffective under relatively infrequent, yet statistically significant (risky) events. To effectively trade between average and extreme event performance, we introduce a new risk constraint, which explicitly restricts the total expected predictive variance of the state penalty by a user-prescribed level. We show that, under rather minimal conditions on the process noise (i.e., finite fourth-order moments), the optimal risk-aware controller can be evaluated explicitly and in closed form. In fact, it is affine relative to the state, and is always internally stable regardless of parameter tuning. Our new risk-aware controller: i) pushes the state away from directions where the noise exhibits heavy tails, by exploiting the third-order moment (skewness) of the noise; ii) inflates the state penalty in riskier directions, where both the noise covariance and the state penalty are simultaneously large. The properties of the proposed risk-aware LQR framework are also illustrated via indicative numerical examples.
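
One way to write down such a risk-constrained formulation is sketched below, for linear dynamics $x_{t+1} = A x_t + B u_t + w_t$ with state and input penalties $Q$ and $R$; the exact conditioning, the time-averaging, and the constraint level $\varepsilon$ are stated here only for illustration, not as the paper's precise problem statement.

```latex
\begin{aligned}
\min_{\pi}\quad & \limsup_{T \to \infty}\, \frac{1}{T}\sum_{t=0}^{T-1}
  \mathbb{E}\big[\, x_t^\top Q x_t + u_t^\top R u_t \,\big]\\
\text{s.t.}\quad & \limsup_{T \to \infty}\, \frac{1}{T}\sum_{t=0}^{T-1}
  \mathbb{E}\big[\, \mathrm{Var}\big( x_{t+1}^\top Q x_{t+1} \mid x_t, u_t \big) \big] \le \varepsilon .
\end{aligned}
```
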
We present Free-MESSAGEp, the first zeroth-order algorithm for convex mean-semideviation-based risk-aware learning, which is also the first three-level zeroth-order compositional stochastic optimization algorithm of any kind. Using a non-trivial extension of Nesterov's classical results on Gaussian smoothing, we develop the Free-MESSAGEp algorithm from first principles, and show that it essentially solves a smoothed surrogate to the original problem, the former being a uniform approximation of the latter, in a useful, convenient sense. We then present a complete analysis of the Free-MESSAGEp algorithm, which establishes convergence in a user-tunable neighborhood of the optimal solutions of the original problem, as well as explicit convergence rates for both convex and strongly convex costs. Orderwise, and for fixed problem parameters, our results demonstrate no sacrifice in convergence speed compared to existing first-order methods, while striking a certain balance among the condition of the problem, its dimensionality, as well as the accuracy of the obtained results, naturally extending previous results in zeroth-order risk-neutral learning.
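
The basic Gaussian-smoothing building block can be sketched as follows; this is the classical single-level two-point estimator, under the assumption that only zeroth-order evaluations of the objective are available, and it deliberately omits the three-level compositional structure that Free-MESSAGEp is designed to handle.

```python
import numpy as np

def gaussian_smoothed_grad(f, x, mu=1e-2, rng=None):
    """Two-point Gaussian-smoothing gradient estimate of f at x.

    Unbiased for the gradient of the smoothed surrogate
    f_mu(x) = E[f(x + mu * u)] with u ~ N(0, I), which uniformly
    approximates f for small mu (Nesterov-style smoothing).
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    # Forward-difference estimator along a Gaussian direction.
    return (f(x + mu * u) - f(x)) / mu * u
```
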
Despite the simplicity and intuitive interpretation of Minimum Mean Squared Error (MMSE) estimators, their effectiveness in certain scenarios is questionable. Indeed, minimizing squared errors on average does not provide any form of stability, as the volatility of the estimation error is left unconstrained. When this volatility is statistically significant, the realized performance of the MMSE estimator can deviate drastically from its average performance. To address this issue, we introduce a new risk-aware MMSE formulation which trades between mean performance and risk by explicitly constraining the expected predictive variance of the involved squared error. We show that, under mild moment boundedness conditions, the corresponding risk-aware optimal solution can be evaluated explicitly, and has the form of an appropriately biased nonlinear MMSE estimator. We further illustrate the effectiveness of our approach via several numerical examples, which also showcase the advantages of risk-aware MMSE estimation against risk-neutral MMSE estimation, especially in models involving skewed, heavy-tailed distributions.
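
In symbols, the constrained estimation problem described above can be sketched as follows, where $\hat{x}(\cdot)$ is the estimator of $X$ given observations $Y$; the constraint level $\varepsilon$ and the exact form of the predictive-variance constraint are illustrative assumptions.

```latex
\begin{aligned}
\min_{\hat{x}(\cdot)}\quad & \mathbb{E}\big[\big(X - \hat{x}(Y)\big)^{2}\big]\\
\text{s.t.}\quad & \mathbb{E}\big[\, \mathrm{Var}\big( \big(X - \hat{x}(Y)\big)^{2} \mid Y \big) \big] \le \varepsilon .
\end{aligned}
```
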
Learning optimal resource allocation policies in wireless systems can be effectively achieved by formulating finite dimensional constrained programs which depend on system configuration, as well as the adopted learning parameterization. The interest here is in cases where system models are unavailable, prompting methods that probe the wireless system with candidate policies, and then use observed performance to determine better policies. This generic procedure is difficult because of the need to cull accurate gradient estimates out of these limited system queries. This paper constructs and exploits smoothed surrogates of constrained ergodic resource allocation problems, the gradients of the former being representable exactly as averages of finite differences that can be obtained through limited system probing. Leveraging this unique property, we develop a new model-free primal-dual algorithm for learning optimal ergodic resource allocations, while we rigorously analyze the relationships between original policy search problems and their surrogates, in both primal and dual domains. First, we show that both primal and dual domain surrogates are uniformly consistent approximations of their corresponding original finite dimensional counterparts. Upon further assuming the use of near-universal policy parameterizations, we also develop explicit bounds on the gap between optimal values of initial, infinite dimensional resource allocation problems, and dual values of their parameterized smoothed surrogates. In fact, we show that this duality gap decreases at a linear rate relative to smoothing and universality parameters. Thus, it can be made arbitrarily small at will, also justifying our proposed primal-dual algorithmic recipe. Numerical simulations confirm the effectiveness of our approach.
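
A minimal model-free primal-dual loop of the flavor described above might look as follows; the `probe` oracle, the two-point Lagrangian differencing, the constraint encoding, and the step sizes are assumptions made only for this sketch, not the paper's exact algorithm.

```python
import numpy as np

def model_free_primal_dual(probe, theta, lam, n_iters=1_000,
                           eta_p=1e-2, eta_d=1e-2, mu=1e-2, rng=None):
    """Model-free primal-dual sketch with two-point finite-difference gradients.

    probe(theta) is assumed to return (utility, slacks) observed by running the
    wireless system under the policy with parameters theta, with constraints
    encoded as slacks >= 0; the Lagrangian is utility + lam . slacks.
    """
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(n_iters):
        u = rng.standard_normal(theta.shape)
        u /= np.linalg.norm(u)                          # perturbation direction
        f_p, s_p = probe(theta + mu * u)
        f_m, s_m = probe(theta - mu * u)
        diff = (f_p + lam @ s_p) - (f_m + lam @ s_m)    # Lagrangian difference
        theta = theta + eta_p * theta.size * diff / (2 * mu) * u   # primal ascent
        _, slack = probe(theta)
        lam = np.maximum(lam - eta_d * slack, 0.0)      # projected dual descent
    return theta, lam
```
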
We provide high probability finite sample complexity guarantees for hidden non-parametric structure learning of tree-shaped graphical models, whose hidden and observable nodes are discrete random variables with either finite or countable alphabets. We study a fundamental quantity called the (noisy) information threshold, which arises naturally from the error analysis of the Chow-Liu algorithm and, as we discuss, provides explicit necessary and sufficient conditions on sample complexity, by effectively summarizing the difficulty of the tree-structure learning problem. Specifically, we show that the finite sample complexity of the Chow-Liu algorithm for ensuring exact structure recovery from noisy data is inversely proportional to the information threshold squared (provided it is positive), and scales almost logarithmically relative to the number of nodes over a given probability of failure. Conversely, we show that, if the number of samples is less than an absolute constant times the inverse of information threshold squared, then no algorithm can recover the hidden tree structure with probability greater than one half. As a consequence, our upper and lower bounds match with respect to the information threshold, indicating that it is a fundamental quantity for the problem of learning hidden tree-structured models. Further, the Chow-Liu algorithm with noisy data as input achieves the optimal rate with respect to the information threshold. Lastly, as a byproduct of our analysis, we resolve the problem of tree structure learning in the presence of non-identically distributed observation noise, providing conditions for convergence of the Chow-Liu algorithm under this setting, as well.
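
For reference, a compact sketch of the Chow-Liu procedure operating directly on (possibly noisy) samples: estimate all pairwise mutual informations with a plug-in estimator and return a maximum-weight spanning tree. The plug-in estimator and the Prim-style tree construction are standard choices, used here purely for illustration.

```python
import numpy as np

def empirical_mutual_info(x, y):
    """Plug-in mutual information estimate for two discrete sample vectors."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])).sum())

def chow_liu_tree(samples):
    """Chow-Liu: maximum-weight spanning tree over pairwise mutual informations.

    samples : (n, p) array of (possibly noise-corrupted) discrete observations.
    Returns a list of undirected tree edges.
    """
    p = samples.shape[1]
    w = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            w[i, j] = w[j, i] = empirical_mutual_info(samples[:, i], samples[:, j])
    in_tree, edges = {0}, []
    while len(in_tree) < p:              # Prim's algorithm on the MI weights
        i, j = max(((a, b) for a in in_tree for b in range(p) if b not in in_tree),
                   key=lambda e: w[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges
```
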
While millimeter wave (mmWave) communications promise high data rates, their sensitivity to blockage and severe signal attenuation presents challenges in their deployment in urban settings. To overcome these effects, we consider a distributed cooperative beamforming system, which relies on static relays deployed in clusters with similar channel characteristics, and where, at every time instance, only one relay from each cluster is selected to participate in beamforming to the destination. To meet the quality-of-service guarantees of the network, a key prerequisite for beamforming is relay selection. However, as the channels change with time, relay selection becomes a resource demanding task. Indeed, estimation of channel state information for all candidate relays, essential for relay selection, is a process that takes up bandwidth, wastes power and introduces latency and interference in the network. We instead propose a unique, predictive scheme for resource efficient relay selection, which exploits the special propagation patterns of the mmWave medium, and can be executed distributively across clusters, and in parallel to optimal beamforming-based communication. The proposed predictive scheme efficiently exploits spatiotemporal channel correlations with current and past networkwide Received Signal Strength (RSS), the latter being invariant to relay cluster size, measured sequentially during the operation of the system. Our numerical results confirm that our proposed relay selection strategy outperforms any randomized selection policy that does not exploit channel correlations, whereas, at the same time, it performs very close to an ideal scheme that uses complete, cluster size dependent RSS, and offers significant savings in terms of channel estimation overhead, providing substantially better network utilization, especially in dense topologies, typical in mmWave networks.
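
As a toy illustration of the selection step only, the sketch below picks, within each cluster, the relay whose channel quality has the highest linear prediction from the stacked current and past networkwide RSS; the zero-mean linear-MMSE predictor and the covariance inputs are assumptions standing in for the paper's actual predictive scheme.

```python
import numpy as np

def select_relays(clusters, rss_history, C_xy, C_yy):
    """Per cluster, pick the relay with the highest predicted channel quality.

    clusters    : list of lists of relay indices (one list per cluster).
    rss_history : stacked vector of current and past networkwide RSS samples.
    C_xy, C_yy  : assumed (cross-)covariances between per-relay channel quality
                  and rss_history; zero-mean statistics assumed for brevity.
    """
    # Linear-MMSE prediction of each relay's channel quality from the RSS record.
    pred = C_xy @ np.linalg.solve(C_yy, rss_history)
    return [max(cluster, key=lambda r: pred[r]) for cluster in clusters]
```
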
The use of millimeter wave (mmWave) spectrum for commercial wireless communications is expected to offer data rates in the order of Gigabits-per-second, thus able to support future applications such as Vehicle-to-Vehicle or Vehicle-to-Infrastructure communication. However, especially in urban settings, mmWave signal propagation is sensitive to blockage by surrounding objects, resulting in significant signal attenuation. One approach to mitigate the effect of attenuation is through multi-hop communication with the help of relays. Leveraging the unique characteristics of the mmWave medium, we consider a single-source/destination $2$-hop system, where a cluster of spatially distributed and reconfigurable relays is used to cooperatively amplify-and-forward the source signal to the destination. Our system evolves in time slots, during which not only are optimal beamforming weights centrally determined, but also future relay positions for the subsequent time slot are optimally selected, jointly maximizing the expected signal-to-interference-plus-noise ratio (SINR) at the destination. Optimal predictive relay positioning is achieved by formulating a 2-stage stochastic programming problem, which is efficiently approximated via a conditional sample-average-approximation surrogate, and solved in a purely distributed fashion across relays. The efficacy of the proposed near-optimal positioning policy is corroborated by comparison against a randomized relay positioning policy, clearly confirming its superiority.
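
The conditional sample-average-approximation step can be sketched as follows: for each candidate next position, average SINR samples drawn from an assumed channel model conditioned on current observations, and move to the best candidate. The `sample_sinr` oracle and the scenario count are illustrative assumptions, not the paper's exact surrogate.

```python
import numpy as np

def saa_next_position(candidates, sample_sinr, n_scenarios=100, rng=None):
    """Pick the candidate relay position with the best sample-average SINR.

    candidates  : iterable of feasible positions for the next time slot.
    sample_sinr : callable (position, rng) -> one SINR sample drawn from an
                  assumed channel model conditioned on current observations.
    """
    rng = np.random.default_rng() if rng is None else rng
    def avg_sinr(pos):
        # Conditional sample-average approximation of the expected SINR.
        return np.mean([sample_sinr(pos, rng) for _ in range(n_scenarios)])
    return max(candidates, key=avg_sinr)
```
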
We provide high-probability sample complexity guarantees for exact structure recovery and accurate predictive learning using noise-corrupted samples from an acyclic (tree-shaped) graphical model. The hidden variables follow a tree-structured Ising model distribution, whereas the observable variables are generated by a binary symmetric channel taking the hidden variables as its input (flipping each bit independently with some constant probability $q \in [0,1/2)$). In the absence of noise, predictive learning on Ising models was recently studied by Bresler and Karzand (2020); this paper quantifies how noise in the hidden model impacts the tasks of structure recovery and marginal distribution estimation by proving upper and lower bounds on the sample complexity. Our results generalize state-of-the-art bounds reported in prior work, and they exactly recover the noiseless case ($q=0$). In fact, for any tree with $p$ vertices and probability of incorrect recovery $\delta>0$, the sufficient number of samples remains logarithmic as in the noiseless case, i.e., $\mathcal{O}(\log(p/\delta))$, while the dependence on $q$ is $\mathcal{O}\big(1/(1-2q)^{4}\big)$, for both aforementioned tasks. We also present a new equivalent of Isserlis' theorem for sign-valued tree-structured distributions, yielding a new low-complexity algorithm for higher-order moment estimation.
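
A quick way to see where the $(1-2q)$ factors come from: passing sign-valued variables through a binary symmetric channel with flip probability $q$ attenuates every pairwise correlation by exactly $(1-2q)^{2}$. The short Monte Carlo check below (with assumed values of $q$ and of the correlation $\rho$) confirms this attenuation, which is consistent with the $\mathcal{O}\big(1/(1-2q)^{4}\big)$ dependence quoted above.

```python
import numpy as np

def bsc_corrupt(x, q, rng):
    """Flip each +/-1 entry independently with probability q (binary symmetric channel)."""
    flips = rng.random(x.shape) < q
    return np.where(flips, -x, x)

rng = np.random.default_rng(0)
q, rho, n = 0.1, 0.6, 200_000                          # assumed illustrative values
x = rng.choice([-1, 1], size=n)
y = np.where(rng.random(n) < (1 + rho) / 2, x, -x)     # corr(x, y) = rho
xn, yn = bsc_corrupt(x, q, rng), bsc_corrupt(y, q, rng)
# Clean correlation, noisy correlation, and the predicted (1 - 2q)^2 * rho.
print(np.mean(x * y), np.mean(xn * yn), (1 - 2 * q) ** 2 * rho)
```
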