ﻻ يوجد ملخص باللغة العربية
Many sequential decision-making tasks require choosing at each decision step the right action out of the vast set of possibilities by extracting actionable intelligence from high-dimensional data streams. Most of the times, the high-dimensionality of actions and data makes learning of the optimal actions by traditional learning methods impracticable. In this work, we investigate how to discover and leverage sparsity in actions and data to enable fast learning. As our learning model, we consider a structured contextual multi-armed bandit (CMAB) with high-dimensional arm (action) and context (data) sets, where the rewards depend only on a few relevant dimensions of the joint context-arm set, possibly in a non-linear way. We depart from the prior work by assuming a high-dimensional, continuum set of arms, and allow relevant context dimensions to vary for each arm. We propose a new online learning algorithm called {em CMAB with Relevance Learning} (CMAB-RL) and prove that its time-averaged regret asymptotically goes to zero when the expected reward varies smoothly in contexts and arms. CMAB-RL enjoys a substantially improved regret bound compared to classical CMAB algorithms whose regrets depend on dimensions $d_x$ and $d_a$ of the context and arm sets. Importantly, we show that when the learner has prior knowledge on sparsity, given in terms of upper bounds $overline{d}_x$ and $overline{d}_a$ on the number of relevant dimensions, then CMAB-RL achieves $tilde{O}(T^{1-1/(2+2overline{d}_x +overline{d}_a)})$ regret. Finally, we illustrate how CMAB algorithms can be used for optimal personalized blood glucose control in type 1 diabetes mellitus patients, and show that CMAB-RL outperforms other contextual MAB algorithms in this task.
In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence. Different from general sequential decision making process, the use cases have a simpler flow where customers per seei
We study the design of learning architectures for behavioural planning in a dense traffic setting. Such architectures should deal with a varying number of nearby vehicles, be invariant to the ordering chosen to describe them, while staying accurate a
Nowadays fairness issues have raised great concerns in decision-making systems. Various fairness notions have been proposed to measure the degree to which an algorithm is unfair. In practice, there frequently exist a certain set of variables we term
We consider online learning for minimizing regret in unknown, episodic Markov decision processes (MDPs) with continuous states and actions. We develop variants of the UCRL and posterior sampling algorithms that employ nonparametric Gaussian process p
We consider an online revenue maximization problem over a finite time horizon subject to lower and upper bounds on cost. At each period, an agent receives a context vector sampled i.i.d. from an unknown distribution and needs to make a decision adapt