ترغب بنشر مسار تعليمي؟ اضغط هنا

Optimal Strategies for Graph-Structured Bandits

322   0   0.0 ( 0 )
 نشر من قبل Hassan Saber
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English
 تأليف Hassan Saber




اسأل ChatGPT حول البحث

We study a structured variant of the multi-armed bandit problem specified by a set of Bernoulli distributions $ u != !( u_{a,b})_{a in mathcal{A}, b in mathcal{B}}$ with means $(mu_{a,b})_{a in mathcal{A}, b in mathcal{B}}!in![0,1]^{mathcal{A}timesmathcal{B}}$ and by a given weight matrix $omega!=! (omega_{b,b})_{b,b in mathcal{B}}$, where $ mathcal{A}$ is a finite set of arms and $ mathcal{B} $ is a finite set of users. The weight matrix $omega$ is such that for any two users $b,b!in!mathcal{B}, text{max}_{ainmathcal{A}}|mu_{a,b} !-! mu_{a,b}| !leq! omega_{b,b} $. This formulation is flexible enough to capture various situations, from highly-structured scenarios ($omega!in!{0,1}^{mathcal{B}timesmathcal{B}}$) to fully unstructured setups ($omega!equiv! 1$).We consider two scenarios depending on whether the learner chooses only the actions to sample rewards from or both users and actions. We first derive problem-dependent lower bounds on the regret for this generic graph-structure that involves a structure dependent linear programming problem. Second, we adapt to this setting the Indexed Minimum Empirical Divergence (IMED) algorithm introduced by Honda and Takemura (2015), and introduce the IMED-GS$^star$ algorithm. Interestingly, IMED-GS$^star$ does not require computing the solution of the linear programming problem more than about $log(T)$ times after $T$ steps, while being provably asymptotically optimal. Also, unlike existing bandit strategies designed for other popular structures, IMED-GS$^star$ does not resort to an explicit forced exploration scheme and only makes use of local counts of empirical events. We finally provide numerical illustration of our results that confirm the performance of IMED-GS$^star$.



قيم البحث

اقرأ أيضاً

A stochastic multi-user multi-armed bandit framework is used to develop algorithms for uncoordinated spectrum access. In contrast to prior work, it is assumed that rewards can be non-zero even under collisions, thus allowing for the number of users t o be greater than the number of channels. The proposed algorithm consists of an estimation phase and an allocation phase. It is shown that if every user adopts the algorithm, the system wide regret is order-optimal of order $O(log T)$ over a time-horizon of duration $T$. The regret guarantees hold for both the cases where the number of users is greater than or less than the number of channels. The algorithm is extended to the dynamic case where the number of users in the system evolves over time, and is shown to lead to sub-linear regret.
This paper studies the adversarial graphical contextual bandits, a variant of adversarial multi-armed bandits that leverage two categories of the most common side information: emph{contexts} and emph{side observations}. In this setting, a learning ag ent repeatedly chooses from a set of $K$ actions after being presented with a $d$-dimensional context vector. The agent not only incurs and observes the loss of the chosen action, but also observes the losses of its neighboring actions in the observation structures, which are encoded as a series of feedback graphs. This setting models a variety of applications in social networks, where both contexts and graph-structured side observations are available. Two efficient algorithms are developed based on texttt{EXP3}. Under mild conditions, our analysis shows that for undirected feedback graphs the first algorithm, texttt{EXP3-LGC-U}, achieves the regret of order $mathcal{O}(sqrt{(K+alpha(G)d)Tlog{K}})$ over the time horizon $T$, where $alpha(G)$ is the average emph{independence number} of the feedback graphs. A slightly weaker result is presented for the directed graph setting as well. The second algorithm, texttt{EXP3-LGC-IX}, is developed for a special class of problems, for which the regret is reduced to $mathcal{O}(sqrt{alpha(G)dTlog{K}log(KT)})$ for both directed as well as undirected feedback graphs. Numerical tests corroborate the efficiency of proposed algorithms.
The problem of verifying whether a multi-component system has anomalies or not is addressed. Each component can be probed over time in a data-driven manner to obtain noisy observations that indicate whether the selected component is anomalous or not. The aim is to minimize the probability of incorrectly declaring the system to be free of anomalies while ensuring that the probability of correctly declaring it to be safe is sufficiently large. This problem is modeled as an active hypothesis testing problem in the Neyman-Pearson setting. Component-selection and inference strategies are designed and analyzed in the non-asymptotic regime. For a specific class of homogeneous problems, stronger (with respect to prior work) non-asymptotic converse and achievability bounds are provided.
We propose a supervised learning approach for predicting an underlying graph from a set of graph signals. Our approach is based on linear regression. In the linear regression model, we predict edge-weights of a graph as the output, given a set of sig nal values on nodes of the graph as the input. We solve for the optimal regression coefficients using a relevant optimization problem that is convex and uses a graph-Laplacian based regularization. The regularization helps to promote a specific graph spectral profile of the graph signals. Simulation experiments demonstrate that our approach predicts well even in presence of outliers in input data.
78 - Lihong Li , Yu Lu , Dengyong Zhou 2017
Contextual bandits are widely used in Internet services from news recommendation to advertising, and to Web search. Generalized linear models (logistical regression in particular) have demonstrated stronger performance than linear models in many appl ications where rewards are binary. However, most theoretical analyses on contextual bandits so far are on linear bandits. In this work, we propose an upper confidence bound based algorithm for generalized linear contextual bandits, which achieves an $tilde{O}(sqrt{dT})$ regret over $T$ rounds with $d$ dimensional feature vectors. This regret matches the minimax lower bound, up to logarithmic terms, and improves on the best previous result by a $sqrt{d}$ factor, assuming the number of arms is fixed. A key component in our analysis is to establish a new, sharp finite-sample confidence bound for maximum-likelihood estimates in generalized linear models, which may be of independent interest. We also analyze a simpler upper confidence bound algorithm, which is useful in practice, and prove it to have optimal regret for certain cases.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا