We consider the stochastic contextual bandit problem under the high-dimensional linear model. We focus on the case where the action space is finite and random, with each action associated with a randomly generated contextual covariate. This setting has essential applications such as personalized recommendation, online advertisement, and personalized medicine. However, it is very challenging because we need to balance exploration and exploitation. We propose doubly growing epochs and estimating the parameter using the best subset selection method, which is easy to implement in practice. This approach achieves $\tilde{\mathcal{O}}(s\sqrt{T})$ regret with high probability, which is nearly independent of the ``ambient'' regression model dimension $d$. We further attain a sharper $\tilde{\mathcal{O}}(\sqrt{sT})$ regret by using the \textsc{SupLinUCB} framework and match the minimax lower bound of low-dimensional linear stochastic bandit problems. Finally, we conduct extensive numerical experiments to demonstrate the applicability and robustness of our algorithms empirically.
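The abstract does not spell out the procedure, so the following is a minimal Python sketch of the epoch-based idea under assumptions not stated above: Gaussian random contexts, i.i.d. Gaussian noise, a brute-force best-subset solver that is only feasible for small $d$, and a purely greedy action rule inside each epoch. The names best_subset_ols, sparse_epoch_greedy, and the parameter noise_sd are illustrative, not from the paper.
\begin{verbatim}
import itertools
import numpy as np

def best_subset_ols(X, y, s):
    """Exhaustive best subset selection: fit OLS on every support of size
    at most s and keep the fit with the smallest residual sum of squares.
    Only feasible for small d; stands in for the paper's estimator."""
    n, d = X.shape
    best_rss, best_beta = np.inf, np.zeros(d)
    for k in range(1, s + 1):
        for support in itertools.combinations(range(d), k):
            idx = list(support)
            coef, rss, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
            rss = rss[0] if rss.size else np.sum((y - X[:, idx] @ coef) ** 2)
            if rss < best_rss:
                best_rss, best_beta = rss, np.zeros(d)
                best_beta[idx] = coef
    return best_beta

def sparse_epoch_greedy(T, d, s, K, theta_star, noise_sd=0.1, seed=0):
    """Doubly growing epochs: epoch j has length 2**j; at the end of each
    epoch the parameter is re-estimated by best subset selection on all
    data collected so far, and actions in the next epoch are chosen
    greedily against that estimate (a simplification of the algorithm)."""
    rng = np.random.default_rng(seed)
    X_hist, y_hist = [], []
    theta_hat = np.zeros(d)
    regret, t, epoch_len = 0.0, 0, 1
    while t < T:
        for _ in range(min(epoch_len, T - t)):
            contexts = rng.normal(size=(K, d))   # K random candidate actions
            means = contexts @ theta_star
            a = int(np.argmax(contexts @ theta_hat)) if X_hist else rng.integers(K)
            y_obs = means[a] + noise_sd * rng.normal()
            X_hist.append(contexts[a]); y_hist.append(y_obs)
            regret += means.max() - means[a]
            t += 1
        theta_hat = best_subset_ols(np.array(X_hist), np.array(y_hist), s)
        epoch_len *= 2                            # epochs double in length
    return regret
\end{verbatim}
The greedy rule inside each epoch is a simplification; the sharper $\tilde{\mathcal{O}}(\sqrt{sT})$ guarantee stated above relies on the \textsc{SupLinUCB} framework, which this sketch does not implement.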
Stochastic linear bandits with high-dimensional sparse features are a practical model for a variety of domains, including personalized medicine and online advertising. We derive a novel $\Omega(n^{2/3})$ dimension-free minimax regret lower bound for sparse linear bandits
Stochastic sparse linear bandits offer a practical model for high-dimensional online decision-making problems and have a rich information-regret structure. In this work we explore the use of information-directed sampling (IDS), which naturally balances
The problem of inferring the direct causal parents of a response variable among a large set of explanatory variables is of high practical importance in many disciplines. Recent work in the field of causal discovery exploits invariance properties of m
We study a bandit version of phase retrieval where the learner chooses actions $(A_t)_{t=1}^n$ in the $d$-dimensional unit ball and the expected reward is $\langle A_t, \theta_\star\rangle^2$ where $\theta_\star \in \mathbb{R}^d$ is an unknown parameter vector
Motivated by a natural problem in online model selection with bandit information, we introduce and analyze a best arm identification problem in the rested bandit setting, wherein arm expected losses decrease with the number of times the arm has been