The multi-armed bandit (MAB) is a classical online optimization model for the trade-off between exploration and exploitation. The traditional MAB seeks the arm that minimizes the mean cost; minimizing the mean, however, ignores the risk inherent in the problem and does not accommodate risk-averse decision makers. In this work, we introduce a coherent risk measure as the criterion, yielding a risk-averse MAB. In particular, we derive an index-based online sampling framework for the risk-averse MAB and develop it in detail for three specific risk measures: the conditional value-at-risk, the mean-deviation, and the shortfall risk measures. Under each risk measure, we establish the convergence rate of the upper bound on the pseudo regret, defined as the difference between the expected empirical risk of the observation sequence and the true risk of the optimal arm.
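To make the index-based sampling idea concrete for the conditional value-at-risk case, the following Python sketch pairs an empirical CVaR estimator with a generic UCB-style exploration width c*sqrt(log t / n). Both the estimator and the width are illustrative assumptions, not the exact index derived in the paper.

import numpy as np

def empirical_cvar(costs, alpha=0.05):
    # Empirical CVaR_alpha of a cost sample: mean of the worst alpha-fraction.
    costs = np.sort(np.asarray(costs, dtype=float))[::-1]  # worst (largest) costs first
    k = max(1, int(np.ceil(alpha * len(costs))))
    return costs[:k].mean()

def risk_averse_index_sampling(arms, horizon, alpha=0.05, c=1.0):
    # Index-based sampling for a risk-averse MAB with a CVaR criterion.
    # `arms` is a list of callables, each returning one random cost draw.
    samples = [[arm()] for arm in arms]  # pull each arm once to initialize
    for t in range(len(arms) + 1, horizon + 1):
        # Costs are minimized, so optimism means subtracting the confidence width:
        # under-sampled arms get an optimistically low index and are explored.
        idx = [empirical_cvar(s, alpha) - c * np.sqrt(np.log(t) / len(s))
               for s in samples]
        a = int(np.argmin(idx))
        samples[a].append(arms[a]())
    return samples

# Demo with two hypothetical Gaussian cost arms: arm 1 has the lower mean but a
# much heavier tail, so a CVaR-minimizing learner should favor arm 0.
rng = np.random.default_rng(0)
arms = [lambda: rng.normal(1.0, 0.5), lambda: rng.normal(0.9, 2.0)]
history = risk_averse_index_sampling(arms, horizon=2000, alpha=0.1)
print([len(s) for s in history])

The demo illustrates the qualitative behavior the abstract describes: under a mean criterion arm 1 would win, whereas the risk-averse index concentrates pulls on arm 0, whose CVaR is smaller.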
Assistive multi-armed bandit problems can model team situations between a human and an autonomous system, such as a domestic service robot. To account for human biases such as the risk aversion described by Cumulative Prospect Theory, the …
The early sections of this paper present an analysis of a Markov decision model known as the multi-armed bandit, under the assumption that the decision maker's utility function is either linear or exponential. The analysis includes efficient …
Multistage risk-averse optimal control problems with nested conditional risk mappings are gaining popularity in various application domains. Risk-averse formulations interpolate between the classical expectation-based stochastic and minimax optimal control …
By exploiting the computing power and local data of distributed clients, federated learning (FL) offers desirable properties such as reduced communication overhead and preservation of data privacy. In each communication round of FL, the clients upload …
A sensing policy for the restless multi-armed bandit problem with stationary but unknown reward distributions is proposed. The work is presented in the context of cognitive radios, in which the bandit problem arises when deciding which parts of the spectrum …