Motivated by emerging applications such as live-streaming e-commerce, promotions and recommendations, we introduce a general class of multi-armed bandit problems with the following two features: (i) the decision maker can pull and collect rewards from at most $K$ out of $N$ different arms in each time period; (ii) the expected reward of an arm immediately drops after it is pulled, and then nonparametrically recovers as the idle time increases. With the objective of maximizing expected cumulative rewards over $T$ time periods, we propose, construct and prove performance guarantees for a class of Purely Periodic Policies. For the offline problem, in which all model parameters are known, our proposed policy obtains an approximation ratio of order $1-\mathcal O(1/\sqrt{K})$, which is asymptotically optimal as $K$ grows to infinity. For the online problem, in which the model parameters are unknown and need to be learned, we design an Upper Confidence Bound (UCB) based policy that has approximately $\widetilde{\mathcal O}(N\sqrt{T})$ regret against the offline benchmark. Our framework and policy design could potentially be adapted to other offline planning and online learning applications with non-stationary and recovering rewards.
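To make the setting of the abstract above concrete, here is a minimal Python sketch of how a fixed periodic schedule could be evaluated when the recovery functions are known. It is an illustration under stated assumptions, not the paper's construction of Purely Periodic Policies: the names `make_recovery` and `evaluate_periodic_schedule`, the capped-linear recovery shape, and the steady-state simplification (every pull of an arm with period p earns the reward recovered after idle time p) are all hypothetical.

```python
from typing import Callable, List


def make_recovery(cap: float, full_after: int) -> Callable[[int], float]:
    """Hypothetical recovery curve: the expected reward grows linearly with
    idle time d and saturates at `cap` once d >= full_after."""
    def recovery(d: int) -> float:
        return cap * min(d, full_after) / full_after
    return recovery


def evaluate_periodic_schedule(
    recovery: List[Callable[[int], float]],
    periods: List[int],
    offsets: List[int],
    K: int,
    T: int,
) -> float:
    """Total expected reward over T periods when arm i is pulled at times
    offsets[i], offsets[i] + periods[i], offsets[i] + 2 * periods[i], ...

    Assumption: each pull is credited with the steady-state reward
    recovery[i](periods[i]), including the first pull of each arm.
    """
    total = 0.0
    for t in range(T):
        pulled = [
            i for i, (p, o) in enumerate(zip(periods, offsets))
            if t >= o and (t - o) % p == 0
        ]
        # Enforce feature (i): at most K arms may be pulled per period.
        if len(pulled) > K:
            raise ValueError(f"{len(pulled)} arms pulled at t={t}, limit is {K}")
        total += sum(recovery[i](periods[i]) for i in pulled)
    return total


if __name__ == "__main__":
    # Three hypothetical arms, one pull allowed per period (K = 1).
    arms = [make_recovery(1.0, 2), make_recovery(0.8, 4), make_recovery(0.4, 4)]
    reward = evaluate_periodic_schedule(
        arms, periods=[2, 4, 4], offsets=[0, 1, 3], K=1, T=8
    )
    print(f"total expected reward over 8 periods: {reward:.2f}")
```

In the hypothetical schedule of the `__main__` block, arm 0 occupies the even periods and arms 1 and 2 alternate on the odd ones, so exactly one arm is pulled in every period and the K = 1 limit is respected.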
The prevalence of e-commerce has made detailed customers' personal information readily accessible to retailers, and this information has been widely used in pricing decisions. When personalized information is involved, how to protect the privacy of such information becomes a critical issue in practice. In this paper, we consider a dynamic pricing problem over $T$ time periods with an unknown demand function of posted price and personalized information. At each time $t$, the retailer observes an arriving customer's personal information and offers a price. The customer then makes the purchase decision, which is used by the retailer to learn the underlying demand function. There is potentially a serious privacy concern during this process: a third-party agent might infer the personalized information and purchase decisions from price changes in the pricing system. Using the fundamental framework of differential privacy from computer science, we develop a privacy-preserving dynamic pricing policy, which tries to maximize the retailer's revenue while avoiding leakage of individual customers' information and purchasing decisions. To this end, we first introduce a notion of anticipating $(\varepsilon, \delta)$-differential privacy that is tailored to the dynamic pricing problem. Our policy achieves both the privacy guarantee and the performance guarantee in terms of regret. Roughly speaking, for $d$-dimensional personalized information, our algorithm achieves expected regret of order $\tilde{O}(\varepsilon^{-1} \sqrt{d^3 T})$ when the customers' information is adversarially chosen. For stochastic personalized information, the regret bound can be further improved to $\tilde{O}(\sqrt{d^2T} + \varepsilon^{-2} d^2)$.
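As background for the $(\varepsilon, \delta)$ guarantee mentioned above, the sketch below shows the classical Gaussian mechanism, which releases a statistic after adding calibrated noise. This is the textbook mechanism, not the anticipating-privacy pricing policy of the paper. The function names and the example statistic are hypothetical, and the calibration $\sigma = \Delta \sqrt{2 \ln(1.25/\delta)} / \varepsilon$ is the standard one, which assumes $0 < \varepsilon < 1$.

```python
import math
import random
from typing import Optional


def gaussian_sigma(epsilon: float, delta: float, sensitivity: float) -> float:
    """Noise scale of the classical Gaussian mechanism for
    (epsilon, delta)-differential privacy."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the classical calibration assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize(
    value: float,
    epsilon: float,
    delta: float,
    sensitivity: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return `value` plus zero-mean Gaussian noise with the scale above."""
    rng = rng or random.Random()
    return value + rng.gauss(0.0, gaussian_sigma(epsilon, delta, sensitivity))


if __name__ == "__main__":
    # Hypothetical statistic: an average purchase indicator with
    # sensitivity 1/n for n = 1000 customers.
    noisy = privatize(0.37, epsilon=0.5, delta=1e-5, sensitivity=1.0 / 1000)
    print(f"noisy purchase rate: {noisy:.4f}")
```

The noise scale grows as $1/\varepsilon$, which is consistent in spirit with regret bounds that degrade as $\varepsilon$ shrinks.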
Motivated by the episodic version of the classical inventory control problem, we propose a new Q-learning-based algorithm, Elimination-Based Half-Q-Learning (HQL), that enjoys improved efficiency over existing algorithms for a wide variety of problems in the one-sided-feedback setting. We also provide a simpler variant of the algorithm, Full-Q-Learning (FQL), for the full-feedback setting. We establish that HQL incurs $\tilde{\mathcal{O}}(H^3\sqrt{T})$ regret and FQL incurs $\tilde{\mathcal{O}}(H^2\sqrt{T})$ regret, where $H$ is the length of each episode and $T$ is the total length of the horizon. The regret bounds are not affected by the possibly huge state and action spaces. Our numerical experiments demonstrate the superior efficiency of HQL and FQL, and the potential to combine reinforcement learning with richer feedback models.
In this paper, we study a revenue management problem with add-on discounts. The problem is motivated by a practice in the video game industry, where a retailer offers discounts on selected supportive products (e.g. video games) to customers who have also purchased the core products (e.g. video game consoles). We formulate this as an optimization problem that determines the prices of different products and the selection of products with add-on discounts. To overcome the computational challenge of this optimization problem, we propose an efficient fully polynomial-time approximation scheme (FPTAS) that solves the problem approximately to any desired accuracy. Moreover, we consider the revenue management problem in the setting where the retailer has no prior knowledge of the demand functions of different products. To address this setting, we propose a UCB-based learning algorithm that uses the FPTAS as a subroutine. We show that our learning algorithm converges to the optimal algorithm that has access to the true demand functions, and we prove that the convergence rate is tight up to a logarithmic term. In addition, we conduct numerical experiments with real-world transaction data that we collected from a popular video gaming brand's online store on Tmall.com. The experimental results illustrate our learning algorithm's robust performance and fast convergence in various scenarios. We also compare our algorithm with the optimal policy that does not use any add-on discount, and the results show the advantages of using the add-on discount strategy in practice.
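The UCB principle that the learning algorithm above builds on can be illustrated with the standard UCB1 index on a toy problem. This is generic UCB1 for Bernoulli arms, not the paper's algorithm, which couples confidence bounds on demand with the FPTAS subroutine. The arm means, the function names and the seed are hypothetical.

```python
import math
import random
from typing import List


def ucb1_index(mean: float, pulls: int, t: int) -> float:
    """UCB1 index: empirical mean plus the bonus sqrt(2 ln t / pulls)."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def run_ucb1(true_means: List[float], horizon: int, seed: int = 0) -> List[int]:
    """Play UCB1 on Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    n = len(true_means)
    pulls = [0] * n
    sums = [0.0] * n
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1  # initialization: pull every arm once
        else:
            # Optimism: pick the arm with the largest upper confidence bound.
            arm = max(
                range(n),
                key=lambda i: ucb1_index(sums[i] / pulls[i], pulls[i], t),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls


if __name__ == "__main__":
    counts = run_ucb1([0.2, 0.5, 0.8], horizon=2000)
    print("pull counts per arm:", counts)
```

The bonus shrinks as an arm accumulates pulls, so exploration of clearly inferior arms tapers off over time.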
Classically, the time complexity of a first-order method is estimated by its number of gradient computations. In this paper, we study a more refined complexity by taking into account the "lingering" of gradients: once a gradient is computed at $x_k$, the additional time to compute gradients at $x_{k+1},x_{k+2},\dots$ may be reduced. We show how this improves the running time of several first-order methods. For instance, if the "additional time" scales linearly with respect to the traveled distance, then the "convergence rate" of gradient descent can be improved from $1/T$ to $\exp(-T^{1/3})$. On the application side, we solve a hypothetical revenue management problem on the Yahoo! Front Page Today Module with 4.6 million users to $10^{-6}$ error using only 6 passes over the dataset, and we solve a real-life support vector machine problem to an accuracy that is two orders of magnitude better than that of the state-of-the-art algorithm.
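To get a feel for the gap between the two rates quoted above, the snippet below evaluates $1/T$ and $\exp(-T^{1/3})$ for a few values of $T$. It ignores all problem-dependent constants, which the abstract does not state, so the numbers only indicate orders of magnitude. The function name `rate_comparison` is hypothetical.

```python
import math
from typing import Tuple


def rate_comparison(T: int) -> Tuple[float, float]:
    """Return (1/T, exp(-T^(1/3))), with all constants set to 1."""
    return 1.0 / T, math.exp(-(T ** (1.0 / 3.0)))


if __name__ == "__main__":
    for T in (1_000, 100_000, 1_000_000):
        classical, lingering = rate_comparison(T)
        print(f"T={T:>9}: 1/T={classical:.3e}  exp(-T^(1/3))={lingering:.3e}")
```

Because the exponent grows only as $T^{1/3}$, the advantage over $1/T$ appears for moderately large $T$ and then widens quickly.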