
126 - Jiatai Huang, Longbo Huang, 2021
We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback: delayed adversarial multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly optimal performance in all three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.
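For reference, the following is a minimal sketch of the classical OMD baseline that Banker-OMD generalizes: an exponential-weights (negative-entropy OMD, Exp3-style) learner for adversarial MAB with immediate feedback. The delayed-feedback bookkeeping that defines Banker-OMD itself is not reproduced, and the function and parameter names are illustrative.

```python
import numpy as np

def omd_exp3(losses, eta=0.1, rng=None):
    """Classical OMD with a negative-entropy regularizer (exponential weights /
    Exp3-style) for adversarial multi-armed bandits with immediate feedback.

    `losses` is a (T, K) array of per-round, per-arm losses in [0, 1]; only the
    loss of the pulled arm is revealed to the learner each round.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    T, K = losses.shape
    cum_loss_est = np.zeros(K)      # cumulative importance-weighted loss estimates
    total_loss = 0.0
    for t in range(T):
        # Mirror-descent step with the entropy mirror map = exponential weights.
        shifted = cum_loss_est - cum_loss_est.min()   # shift for numerical stability
        w = np.exp(-eta * shifted)
        p = w / w.sum()
        arm = rng.choice(K, p=p)
        loss = losses[t, arm]                         # bandit feedback
        total_loss += loss
        cum_loss_est[arm] += loss / p[arm]            # unbiased importance-weighted estimate
    return total_loss
```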
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process over $T$ episodes, with the goal of achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the losses are adversarial and simultaneously $\mathcal{O}(\text{polylog}(T))$ regret when the losses are (almost) stochastic. Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known, and leaves the case of unknown transition as a major open question. In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques. Specifically, we first propose a loss-shifting trick in the FTRL analysis, which greatly simplifies the approach of [Jin and Luo, 2020] and already improves their results for the known-transition case. Then, we extend this idea to the unknown-transition case and develop a novel analysis which upper bounds the transition estimation error by (a fraction of) the regret itself in the stochastic setting, a key property for ensuring $\mathcal{O}(\text{polylog}(T))$ regret.
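For intuition about the FTRL machinery involved, here is a minimal sketch of an FTRL update with a 1/2-Tsallis-entropy regularizer over the probability simplex, the kind of update underlying best-of-both-worlds bandit results; extending it to occupancy measures of an episodic MDP, and the loss-shifting analysis, are not shown. The bisection-based normalization is an implementation choice of this sketch.

```python
import numpy as np

def tsallis_ftrl_step(cum_loss_est, eta):
    """One FTRL step with the 1/2-Tsallis-entropy regularizer over the simplex:
        p = argmin_p  <p, L> + (1/eta) * sum_i (-2 * sqrt(p_i)).
    The first-order condition gives p_i = 1 / (eta * (L_i + lam))^2, with the
    normalization constant lam found by bisection so the p_i sum to one.
    """
    L = np.asarray(cum_loss_est, dtype=float)

    def prob(lam):
        return 1.0 / (eta * (L + lam)) ** 2

    lo = -L.min() + 1e-12           # need L_i + lam > 0 for every arm
    hi = lo + 1.0
    while prob(hi).sum() > 1.0:     # grow the bracket until the sum drops below 1
        hi = lo + 2.0 * (hi - lo)
    for _ in range(100):            # bisection: prob(.).sum() is decreasing in lam
        mid = 0.5 * (lo + hi)
        if prob(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    p = prob(0.5 * (lo + hi))
    return p / p.sum()              # tiny renormalization for numerical safety
```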
In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB), and conservative contextual combinatorial bandits (CCCB). Different from previous works, which consider high-probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees ($T$-independent additive regrets instead of $T$-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and variability. For this extended problem, we provide a novel algorithm with $O(1/T)$ normalized additive regrets ($T$-independent in the cumulative form) and validate this result through empirical evaluation.
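To illustrate the flavor of a sample-path constraint, the following is a minimal sketch of a safety check a conservative bandit learner might perform before an exploratory pull. The names and the pessimistic-estimate interface are assumptions of this sketch, not the paper's One-Size-Fits-All algorithm.

```python
def safe_to_explore(cum_reward, t, baseline_mean, alpha, explore_lcb):
    """Sample-path safety check before an exploratory pull: the reward actually
    collected must stay at least (1 - alpha) times the baseline's cumulative
    reward at every round.

    cum_reward    -- reward actually collected over the first t-1 rounds
    baseline_mean -- known per-round reward of the baseline/default action
    alpha         -- allowed relative shortfall with respect to the baseline
    explore_lcb   -- a pessimistic (lower-bound) estimate of the reward the
                     candidate exploratory action would return this round
    """
    worst_case_after = cum_reward + explore_lcb       # pessimistic outcome of exploring
    required_after = (1.0 - alpha) * baseline_mean * t
    return worst_case_after >= required_after         # True -> exploring cannot violate the constraint
```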
A widely used actor-critic reinforcement learning algorithm for continuous control, Deep Deterministic Policy Gradients (DDPG), suffers from the overestimation problem, which can negatively affect its performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can lead to a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous control. We first theoretically analyze the softmax operator in continuous action space. Then, we uncover an important property of the softmax operator in actor-critic algorithms, i.e., it helps to smooth the optimization landscape, which sheds new light on the benefits of the operator. We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator upon single and double estimators, which can effectively mitigate the overestimation and underestimation biases. We conduct extensive experiments on challenging continuous control tasks, and the results show that SD3 outperforms state-of-the-art methods.
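For concreteness, here is a minimal sketch of the Boltzmann softmax operator applied to the Q-values of sampled actions, as one would approximate it in a continuous action space; the full SD2/SD3 targets (twin critics, sampling around the target policy, importance correction) are not reproduced.

```python
import numpy as np

def softmax_value_estimate(q_values, beta):
    """Boltzmann softmax operator over the Q-values of sampled actions:
        softmax_beta(Q)(s) ~= sum_i w_i * Q(s, a_i),  w_i proportional to exp(beta * Q(s, a_i)).
    In continuous action spaces the operator is approximated by sampling candidate
    actions and weighting their Q-values; `q_values` holds Q(s, a_i) for those samples.
    """
    q = np.asarray(q_values, dtype=float)
    logits = beta * q
    logits -= logits.max()          # subtract the max for numerical stability
    w = np.exp(logits)
    w /= w.sum()
    return float(np.dot(w, q))
```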
We study multi-agent reinforcement learning (MARL) in a time-varying network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are static, fixed and local, e.g., between neighbors in a fixed, time-invariant underlying graph. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies can be non-local and time-varying, and provide a finite-time error bound that shows how the convergence rate depends on the speed of information spread in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation, which apply beyond the setting of RL in networked systems.
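As a small illustration of the byproduct result, here is a minimal sketch of TD(0) policy evaluation with state aggregation, where a single value is learned per cluster of states; the networked actor-critic machinery and the finite-time analysis are beyond this sketch, and `aggregate` is an assumed user-supplied clustering function.

```python
import numpy as np

def td0_with_state_aggregation(transitions, aggregate, n_clusters, gamma=0.99, alpha=0.1):
    """TD(0) policy evaluation with state aggregation: states are mapped to
    clusters by `aggregate(state)` and a single value is learned per cluster,
    so the value table stays small even when the raw state space is huge.
    `transitions` is an iterable of (state, reward, next_state) samples
    generated by the policy being evaluated.
    """
    v = np.zeros(n_clusters)                     # one value per aggregated state
    for s, r, s_next in transitions:
        i, j = aggregate(s), aggregate(s_next)
        td_error = r + gamma * v[j] - v[i]       # one-step bootstrapped error
        v[i] += alpha * td_error
    return v
```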
Exploration is essential for reinforcement learning (RL). To address the challenges of exploration, we consider a reward-free RL framework that completely separates exploration from exploitation and brings new challenges for exploration algorithms. In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment and collects a dataset of transitions by executing the policy. In the planning phase, the agent computes a good policy for any reward function based on the dataset without further interacting with the environment. This framework is suitable for the meta-RL setting where there are many reward functions of interest. In the exploration phase, we propose to maximize the Renyi entropy over the state-action space and justify this objective theoretically. The success of using Renyi entropy as the objective stems from its encouragement of exploring hard-to-reach state-action pairs. We further derive a policy gradient formulation for this objective and design a practical exploration algorithm that can handle complex environments. In the planning phase, we solve for good policies given arbitrary reward functions using a batch RL algorithm. Empirically, we show that our exploration algorithm is effective and sample-efficient, and yields superior policies for arbitrary reward functions in the planning phase.
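For concreteness, the following is a minimal sketch of computing the Renyi entropy of an empirical state-action visitation distribution, the quantity the exploration phase maximizes; the policy gradient formulation used to maximize it is not shown, and the counting-based estimate is an assumption of this sketch.

```python
import numpy as np

def renyi_entropy(visit_counts, alpha=0.5):
    """Renyi entropy of the empirical state-action visitation distribution:
        H_alpha(d) = (1 / (1 - alpha)) * log( sum_x d(x)^alpha ),   alpha != 1.
    `visit_counts` is a flat array of visit counts over (state, action) pairs.
    Smaller alpha places more weight on rarely visited state-action pairs, which
    is why maximizing H_alpha pushes the policy toward hard-to-reach ones.
    """
    d = np.asarray(visit_counts, dtype=float)
    d = d / d.sum()
    d = d[d > 0]                    # zero-probability entries contribute nothing
    return float(np.log(np.sum(d ** alpha)) / (1.0 - alpha))
```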
Recent years have witnessed a tremendous improvement of deep reinforcement learning. However, a challenging problem is that an agent may suffer from inefficient exploration, particularly for on-policy methods. Previous exploration methods either rely on complex structures to estimate the novelty of states, or incur sensitive hyper-parameters causing instability. We propose an efficient exploration method, Multi-Path Policy Optimization (MPPO), which does not incur high computation cost and ensures stability. MPPO maintains an efficient mechanism that effectively utilizes a population of diverse policies to enable better exploration, especially in sparse-reward environments. We also give a theoretical guarantee of stable performance. We build our scheme upon two widely adopted on-policy methods, the Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) algorithms. We conduct extensive experiments on several MuJoCo tasks and their sparsified variants to fairly evaluate the proposed method. Results show that MPPO significantly outperforms state-of-the-art exploration methods in terms of both sample efficiency and final performance.
Loyalty programs are important tools for sharing platforms seeking to grow supply. Online sharing platforms use loyalty programs to heavily subsidize resource providers, encouraging participation and boosting supply. As the sharing economy has evolved and competition has increased, the design of loyalty programs has begun to play a crucial role in the pursuit of maximal revenue. In this paper, we first characterize the optimal loyalty program for a platform with homogeneous users. We then show that optimal revenue in a heterogeneous market can be achieved by a class of multi-threshold loyalty programs (MTLP) which admits a simple, implementation-friendly structure. We also study the performance of loyalty programs in a setting with two competing sharing platforms, showing that the degree of heterogeneity is a crucial factor for both loyalty programs and pricing strategies. Our results show that sophisticated loyalty programs that reward suppliers via stepwise linear functions outperform simple sign-up bonuses, which give suppliers a one-time reward for participating.
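To make the MTLP structure concrete, here is a minimal sketch of a stepwise-linear payout with multiple thresholds; the thresholds and rates are illustrative numbers, not values derived from the paper's model.

```python
def mtlp_payout(quantity, thresholds, rates):
    """Stepwise-linear payout of a multi-threshold loyalty program (MTLP):
    the supplier is paid rates[k] per unit for the portion of `quantity`
    falling between thresholds[k] and thresholds[k+1] (the last tier is open-ended).
    """
    pay = 0.0
    for k, rate in enumerate(rates):
        lo = thresholds[k]
        hi = thresholds[k + 1] if k + 1 < len(thresholds) else float("inf")
        if quantity > lo:
            pay += rate * (min(quantity, hi) - lo)
    return pay

# Example with illustrative numbers: pay 1.0/unit up to 10 units,
# 1.5/unit between 10 and 50, and 2.0/unit beyond 50.
# mtlp_payout(60, thresholds=[0, 10, 50], rates=[1.0, 1.5, 2.0])  -> 90.0
```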
We investigate the problem of stochastic network optimization in the presence of imperfect state prediction and non-stationarity. Based on a novel distribution-accuracy curve prediction model, we develop the predictive learning-aided control (PLC) algorithm, which jointly utilizes historical and predicted network state information for decision making. PLC is an online algorithm that requires no a priori statistical information about the system, and consists of three key components, namely sequential distribution estimation and change detection, dual learning, and online queue-based control. Specifically, we show that PLC simultaneously achieves good long-term performance, short-term queue size reduction, accurate change detection, and fast algorithm convergence. In particular, for stationary networks, PLC achieves a near-optimal $[O(\epsilon), O(\log^2(1/\epsilon))]$ utility-delay tradeoff. For non-stationary networks, PLC obtains an $[O(\epsilon), O(\log^2(1/\epsilon) + \min(\epsilon^{c/2-1}, e_w/\epsilon))]$ utility-backlog tradeoff for distributions that last $\Theta(\frac{\max(\epsilon^{-c}, e_w^{-2})}{\epsilon^{1+a}})$ time, where $e_w$ is the prediction accuracy and $a=\Theta(1)>0$ is a constant (the Backpressure algorithm \cite{neelynowbook} requires an $O(\epsilon^{-2})$ length for the same utility performance with a larger backlog). Moreover, PLC detects distribution changes $O(w)$ slots faster with high probability ($w$ is the prediction size) and achieves an $O(\min(\epsilon^{-1+c/2}, e_w/\epsilon)+\log^2(1/\epsilon))$ convergence time. Our results demonstrate that state prediction (even imperfect) can help (i) achieve faster detection and convergence, and (ii) obtain better utility-delay tradeoffs.
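As a small illustration of the queue-based control component, here is a minimal sketch of a generic drift-plus-penalty decision of the kind PLC builds on; the interface (per-action cost, arrival, and service rates) and names are assumptions of this sketch, not the paper's formulation.

```python
def drift_plus_penalty_action(queue_lengths, actions, V):
    """Generic one-slot queue-based control decision (drift-plus-penalty style):
    pick the action minimizing  V * cost(a) + sum_i Q_i * (arrival_i(a) - service_i(a)).
    `actions` maps each action to a (cost, arrivals, services) tuple, where
    arrivals and services are per-queue rates induced by that action.
    """
    def objective(a):
        cost, arrivals, services = actions[a]
        drift = sum(q * (lam - mu)
                    for q, lam, mu in zip(queue_lengths, arrivals, services))
        return V * cost + drift
    return min(actions, key=objective)
```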
292 - Zhixuan Fang, Longbo Huang, 2016
In this paper, we investigate the effect of brand in market competition. Specifically, we propose a variant Hotelling model where companies and customers are represented by points in a Euclidean space, with axes being product features. $N$ companies compete to maximize their own profits by optimally choosing their prices, while each customer in the market, when choosing sellers, considers the sum of the product price, the discrepancy between the product feature and his preference, and the company's brand name, which is modeled by a function of its market area of the form $-\beta\cdot\text{(Market Area)}^q$, where $\beta$ captures the brand influence and $q$ captures how market share affects the brand. By varying the parameters $\beta$ and $q$, we derive existence results of Nash equilibrium and equilibrium market prices and shares. In particular, we prove that a pure Nash equilibrium always exists when $q=0$ for markets with either one or two dominating features, and it always exists in a single-dominating-feature market when the market area affects the brand linearly, i.e., $q=1$. Moreover, we show that at equilibrium, a company's price is proportional to its market area over the competition intensity with its neighbors, a result that quantitatively reconciles the common belief about a company's pricing power. We also study an interesting wipe-out phenomenon that only appears when $q>0$, which is similar to the undercut phenomenon in the Hotelling model, where companies may suddenly lose their entire market area after a small price increment. Our results offer novel insight into market pricing and positioning under competition with brand effect.
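For concreteness, the following is a minimal sketch of a customer's choice rule in the brand-augmented Hotelling model described above: each customer picks the company minimizing price plus feature mismatch plus the brand term $-\beta\cdot\text{(Market Area)}^q$. The Euclidean distance metric and the specific interface are illustrative assumptions of this sketch.

```python
import numpy as np

def choose_company(customer, prices, positions, market_areas, beta, q):
    """A customer in the brand-augmented Hotelling model picks the company j
    minimizing  price_j + ||x_customer - x_j|| - beta * (market_area_j)^q,
    i.e., price plus feature mismatch plus the brand term -beta*(Market Area)^q.
    """
    x = np.asarray(customer, dtype=float)
    costs = [
        prices[j]
        + float(np.linalg.norm(x - np.asarray(positions[j], dtype=float)))
        - beta * market_areas[j] ** q
        for j in range(len(prices))
    ]
    return int(np.argmin(costs))
```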
