It has recently been shown in the literature that sample averages from online learning experiments are biased when used to estimate the mean reward. To correct the bias, off-policy evaluation methods, including importance sampling and doubly robust estimators, typically require the propensity score, which is unavailable in this setting because the reward distribution is unknown and the policy is adaptive. This paper provides a bootstrap-based procedure to debias the samples, which requires no knowledge of the reward distribution. Numerical experiments demonstrate effective bias reduction for samples generated by popular multi-armed bandit algorithms such as Explore-Then-Commit (ETC), UCB, Thompson sampling, and $\epsilon$-greedy. We also analyze and provide theoretical justifications for the procedure under the ETC algorithm, including the asymptotic convergence of the bias decay rate in the real and bootstrap worlds.
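To make the bootstrap debiasing concrete, here is a minimal sketch for Explore-Then-Commit: the bandit is replayed in bootstrap worlds whose true arm means equal the observed sample means, and the average bias seen there is subtracted. The function names, the Gaussian reward model in the demo lines, and the exact estimator form are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def run_etc(pull, n_arms, horizon, m):
    """Explore-Then-Commit: pull every arm m times, then commit to the
    empirically best arm for the remaining rounds. Returns per-arm rewards."""
    rewards = [[pull(a) for _ in range(m)] for a in range(n_arms)]
    best = max(range(n_arms), key=lambda a: np.mean(rewards[a]))
    rewards[best] += [pull(best) for _ in range(horizon - n_arms * m)]
    return rewards

def bootstrap_debias(rewards, horizon, m, n_boot=200, seed=0):
    """Subtract the average bias observed when ETC is replayed in bootstrap
    worlds whose true arm means equal the real-world sample means."""
    rng = np.random.default_rng(seed)
    naive = np.array([np.mean(r) for r in rewards])
    bias = np.zeros(len(rewards))
    for _ in range(n_boot):
        # Bootstrap world: arm a's rewards are resampled with replacement
        # from the real samples, so its true mean is exactly naive[a].
        boot = run_etc(lambda a: rng.choice(rewards[a]),
                       len(rewards), horizon, m)
        bias += np.array([np.mean(r) for r in boot]) - naive
    return naive - bias / n_boot

# Illustrative run: two Gaussian arms with true means 0.5 and 0.6.
rng = np.random.default_rng(1)
real = run_etc(lambda a: rng.normal([0.5, 0.6][a]), 2, 500, 20)
print(bootstrap_debias(real, 500, 20))
```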
We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem that builds on spectral method-of-moments estimation for hidden Markov models, belief error control in POMDPs, and upper-confidence-bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed algorithm, where $T$ is the learning horizon. To the best of our knowledge, this is the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
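The belief error control mentioned above rests on the standard Bayesian belief filter for finite POMDPs; below is a minimal sketch of that one-step update. The tiny two-state instance is a made-up placeholder, and this is not the paper's spectral estimator or learning algorithm.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One step of the Bayesian belief filter in a finite POMDP.
    b: current belief over states, shape (S,)
    T[a][s, s']: transition probability P(s' | s, a)
    O[a][s', o]: observation probability P(o | s', a)
    Returns the posterior belief after taking action a and observing o."""
    pred = b @ T[a]             # predict: distribution over the next state
    post = pred * O[a][:, o]    # correct: weight by observation likelihood
    return post / post.sum()    # normalize (assumes o has positive probability)

# Made-up two-state, one-action, two-observation instance.
T = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}
O = {0: np.array([[0.7, 0.3], [0.4, 0.6]])}
print(belief_update(np.array([0.5, 0.5]), a=0, o=1, T=T, O=O))
```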
In prescriptive analytics, the decision-maker observes historical samples of $(X, Y)$, where $Y$ is the uncertain problem parameter and $X$ is the concurrent covariate, without knowing their joint distribution. Given an additional covariate observation $x$, the goal is to choose a decision $z$ conditional on this observation to minimize the cost $\mathbb{E}[c(z,Y) \mid X=x]$. This paper proposes a new distributionally robust approach under Wasserstein ambiguity sets, in which the nominal distribution of $Y \mid X=x$ is constructed from the historical data via the Nadaraya-Watson kernel estimator. We show that the nominal distribution converges to the true conditional distribution under the Wasserstein distance. We establish the out-of-sample guarantees and the computational tractability of the framework. Through synthetic and empirical experiments on the newsvendor problem and portfolio optimization, we demonstrate the strong performance and practical value of the proposed framework.
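As a sketch of how such a nominal distribution can be formed and used, the snippet below builds Nadaraya-Watson weights over the historical samples and plugs the weighted empirical distribution into a newsvendor decision. The Gaussian kernel, the bandwidth, and the plug-in (non-robustified) decision step are illustrative choices, not the paper's full Wasserstein framework.

```python
import numpy as np

def nw_weights(X, x, h):
    """Nadaraya-Watson weights with a Gaussian kernel and bandwidth h: a
    discrete nominal distribution of Y | X = x supported on the historical Y_i."""
    k = np.exp(-0.5 * ((X - x) / h) ** 2)
    return k / k.sum()

def newsvendor_order(Y, w, price, cost):
    """Order quantity minimizing expected newsvendor cost under the weighted
    empirical distribution: the critical-fractile quantile of demand."""
    idx = np.argsort(Y)
    cum = np.cumsum(w[idx])
    return Y[idx][np.searchsorted(cum, (price - cost) / price)]

# Synthetic data: demand Y correlated with the covariate X.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 200)
Y = 50.0 + 30.0 * X + rng.normal(0.0, 5.0, 200)
w = nw_weights(X, x=0.7, h=0.1)
print(newsvendor_order(Y, w, price=10.0, cost=4.0))
```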
We study the problem in which a firm sets prices for products based on transaction data, i.e., which product past customers chose from an assortment and the historical prices they observed. Our approach does not impose a model on the distribution of the customers' valuations; instead, it only assumes that purchase choices satisfy incentive-compatibility constraints. The individual valuation of each past customer can then be encoded as a polyhedral set, and our approach maximizes the worst-case revenue assuming that new customers' valuations are drawn from the empirical distribution implied by the collection of such polyhedra. We show that the optimal prices in this setting can be approximated to arbitrary precision by solving a compact mixed-integer linear program. Moreover, we study the single-product case and relate it to the traditional model-based approach. We also design three approximation strategies that have low computational complexity and are interpretable. Comprehensive numerical studies based on synthetic and real data suggest that our pricing approach is uniquely beneficial when the historical data is limited in size or susceptible to model misspecification.
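To illustrate the encoding step, the sketch below turns one observed transaction into a polyhedron $\{v : Av \le b\}$ of valuations consistent with incentive compatibility. The matrix construction and the zero outside-option utility are illustrative assumptions; the revenue-maximization MILP itself is not shown.

```python
import numpy as np

def valuation_polyhedron(prices, choice):
    """Encode one transaction as {v : A v <= b}, the set of valuations
    consistent with incentive compatibility.
    prices: posted prices; choice: index purchased, or None for no purchase.
    Assumes a zero outside-option utility."""
    n = len(prices)
    rows, rhs = [], []
    if choice is None:
        # No purchase: every product offered non-positive surplus, v_k <= p_k.
        for k in range(n):
            e = np.zeros(n); e[k] = 1.0
            rows.append(e); rhs.append(prices[k])
    else:
        # The chosen product had the largest surplus: v_k - p_k <= v_j - p_j.
        for k in range(n):
            if k == choice:
                continue
            e = np.zeros(n); e[k] = 1.0; e[choice] = -1.0
            rows.append(e); rhs.append(prices[k] - prices[choice])
        # ...and that surplus was non-negative: v_j >= p_j.
        e = np.zeros(n); e[choice] = -1.0
        rows.append(e); rhs.append(-prices[choice])
    return np.array(rows), np.array(rhs)

# A customer bought product 1 at prices (3, 5): v = (2, 6) is consistent.
A, b = valuation_polyhedron(np.array([3.0, 5.0]), choice=1)
print(np.all(A @ np.array([2.0, 6.0]) <= b))
```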
We consider the revenue maximization problem for an online retailer who plans to display a set of products differing in their prices and qualities and rank them in order. Consumers have random attention spans and view the products sequentially before purchasing a ``satisficing'' product or leaving the platform empty-handed when the attention span is exhausted. Our framework extends the cascade model in two directions: consumers have random attention spans instead of fixed ones, and the firm maximizes revenue instead of clicking probabilities. We show a nested structure of the optimal product ranking as a function of the attention span when the attention span is fixed, and accordingly design a $1/e$-approximation algorithm for random attention spans. When the conditional purchase probabilities are unknown and may depend on consumer and product features, we devise an online learning algorithm that achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret relative to the approximation algorithm, despite the censoring of information: the attention span of a customer who purchases an item is not observable. Numerical experiments demonstrate the outstanding performance of the approximation and online learning algorithms.
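As a sketch of the objective being optimized, the snippet below evaluates the expected revenue of a display order in the cascade model with a random attention span. The purchase probabilities, revenues, and span distribution are made-up inputs, and the brute-force enumeration stands in for the paper's $1/e$-approximation algorithm.

```python
import itertools
import numpy as np

def expected_revenue(ranking, q, r, span_tail):
    """Expected revenue of a display order under the cascade model with a
    random attention span M.
    q[i]: probability product i is purchased when viewed
    r[i]: revenue of product i
    span_tail[k]: P(M >= k + 1), i.e. position k (0-indexed) is reachable."""
    rev, no_buy = 0.0, 1.0
    for pos, i in enumerate(ranking):
        rev += span_tail[pos] * no_buy * q[i] * r[i]
        no_buy *= 1.0 - q[i]   # the consumer viewed product i but did not buy
    return rev

# Made-up instance: brute force over all rankings of three products.
q = np.array([0.3, 0.5, 0.2])
r = np.array([10.0, 4.0, 7.0])
span_tail = np.array([1.0, 0.6, 0.3])
best = max(itertools.permutations(range(3)),
           key=lambda s: expected_revenue(s, q, r, span_tail))
print(best, expected_revenue(best, q, r, span_tail))
```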