A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit

138 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Priyank Agrawal

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية الاحصاء الرياضي

والبحث باللغة English

تأليف Priyank Agrawal - Vashist Avadhanula - Theja Tulabandhula

التعلم الآلي التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where in every round a decision maker offers a subset (assortment) of products to a consumer, and observes their response. Consumers purchase products so as to maximize their utility. We assume that the products are described by a set of attributes and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior by means of the widely used Multinomial Logit (MNL) model, and consider the decision makers problem of dynamically learning the model parameters, while optimizing cumulative revenue over the selling horizon $T$. Though this problem has attracted considerable attention in recent times, many existing methods often involve solving an intractable non-convex optimization problem and their theoretical performance guarantees depend on a problem dependent parameter which could be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(sqrt{kappa d T})$, where $kappa$ is a problem dependent constant that can have exponential dependency on the number of attributes. In this paper, we propose an optimistic algorithm and show that the regret is bounded by $O(sqrt{dT} + kappa)$, significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step which allows for tractable decision-making while retaining the favourable regret guarantee.

قيم البحث

146 - Kefan Dong , Yingkai Li , Qin Zhang 2020

We study multinomial logit bandit with limited adaptivity, where the algorithms change their exploration actions as infrequently as possible when achieving almost optimal minimax regret. We propose two measures of adaptivity: the assortment switching cost and the more fine-grained item switching cost. We present an anytime algorithm (AT-DUCB) with $O(N log T)$ assortment switches, almost matching the lower bound $Omega(frac{N log T}{ log log T})$. In the fixed-horizon setting, our algorithm FH-DUCB incurs $O(N log log T)$ assortment switches, matching the asymptotic lower bound. We also present the ESUCB algorithm with item switching cost $O(N log^2 T)$.

التعلم الآلي التعلم الالي

Dynamic Learning with Frequent New Product Launches: A Sequential Multinomial Logit Bandit Problem

137 - Junyu Cao , Wei Sun 2019

Motivated by the phenomenon that companies introduce new products to keep abreast with customers rapidly changing tastes, we consider a novel online learning setting where a profit-maximizing seller needs to learn customers preferences through offeri ng recommendations, which may contain existing products and new products that are launched in the middle of a selling period. We propose a sequential multinomial logit (SMNL) model to characterize customers behavior when product recommendations are presented in tiers. For the offline version with known customers preferences, we propose a polynomial-time algorithm and characterize the properties of the optimal tiered product recommendation. For the online problem, we propose a learning algorithm and quantify its regret bound. Moreover, we extend the setting to incorporate a constraint which ensures every new product is learned to a given accuracy. Our results demonstrate the tier structure can be used to mitigate the risks associated with learning new products.

التعلم الآلي التعلم الالي

A Smoothed Analysis of Online Lasso for the Sparse Linear Contextual Bandit Problem

133 - Zhiyuan Liu , Huazheng Wang , Bo Waggoner 2020

We investigate the sparse linear contextual bandit problem where the parameter $theta$ is sparse. To relieve the sampling inefficiency, we utilize the perturbed adversary where the context is generated adversarilly but with small random non-adaptive perturbations. We prove that the simple online Lasso supports sparse linear contextual bandit with regret bound $mathcal{O}(sqrt{kTlog d})$ even when $d gg T$ where $k$ and $d$ are the number of effective and ambient dimension, respectively. Compared to the recent work from Sivakumar et al. (2020), our analysis does not rely on the precondition processing, adaptive perturbation (the adaptive perturbation violates the i.i.d perturbation setting) or truncation on the error set. Moreover, the special structures in our results explicitly characterize how the perturbation affects exploration length, guide the design of perturbation together with the fundamental performance limit of perturbation method. Numerical experiments are provided to complement the theoretical analysis.

التعلم الآلي التعلم الالي

Online Algorithm for Unsupervised Sequential Selection with Contextual Information

86 - Arun Verma , Manjesh K. Hanawal , Csaba Szepesvari 2020

In this paper, we study Contextual Unsupervised Sequential Selection (USS), a new variant of the stochastic contextual bandits problem where the loss of an arm cannot be inferred from the observed feedback. In our setup, arms are associated with fixe d costs and are ordered, forming a cascade. In each round, a context is presented, and the learner selects the arms sequentially till some depth. The total cost incurred by stopping at an arm is the sum of fixed costs of arms selected and the stochastic loss associated with the arm. The learners goal is to learn a decision rule that maps contexts to arms with the goal of minimizing the total expected loss. The problem is challenging as we are faced with an unsupervised setting as the total loss cannot be estimated. Clearly, learning is feasible only if the optimal arm can be inferred (explicitly or implicitly) from the problem structure. We observe that learning is still possible when the problem instance satisfies the so-called Contextual Weak Dominance (CWD) property. Under CWD, we propose an algorithm for the contextual USS problem and demonstrate that it has sub-linear regret. Experiments on synthetic and real datasets validate our algorithm.

التعلم الآلي التعلم الالي

Lessons from Contextual Bandit Learning in a Customer Support Bot

210 - Nikos Karampatziakis , Sebastian Kochman , Jade Huang 2019

In this work, we describe practical lessons we have learned from successfully using contextual bandits (CBs) to improve key business metrics of the Microsoft Virtual Agent for customer support. While our current use cases focus on single step einforc ement learning (RL) and mostly in the domain of natural language processing and information retrieval we believe many of our findings are generally applicable. Through this article, we highlight certain issues that RL practitioners may encounter in similar types of applications as well as offer practical solutions to these challenges.

التعلم الآلي التعلم الالي