A sensing policy is proposed for the restless multi-armed bandit problem with stationary but unknown reward distributions. The work is presented in the context of cognitive radios, in which the bandit problem arises when deciding which parts of the spectrum to sense and exploit. It is shown that the proposed policy attains an asymptotically logarithmic weak regret rate when the rewards are bounded and either independent and identically distributed or finite-state Markovian. Simulation results verifying uniformly logarithmic weak regret are also presented. The proposed policy is a centrally coordinated index policy in which the index of a frequency band consists of a sample-mean term and a confidence term. The sample-mean term promotes spectrum exploitation, whereas the confidence term encourages exploration. The confidence term is designed such that the time interval between consecutive sensing instances of any suboptimal band grows exponentially, and this exponential growth leads to logarithmically growing weak regret. Simulation results demonstrate that the proposed policy outperforms similar methods in the literature.
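To make the index structure concrete, the following is a minimal sketch of a sample-mean-plus-confidence sensing rule of the kind described above. The `sense` callback, the UCB1-style confidence term, and all variable names are illustrative assumptions, not the paper's actual confidence design.

```python
import math

def confidence_bound_sensing(bands, horizon, sense):
    """Sketch of a sample-mean-plus-confidence index policy for band sensing.

    `sense(band)` is a hypothetical callback returning the reward observed
    when sensing `band`. The confidence term below is the standard UCB1 form,
    used only as a stand-in for the paper's specific confidence design.
    """
    counts = {b: 0 for b in bands}    # how many times each band has been sensed
    means = {b: 0.0 for b in bands}   # sample-mean reward per band (exploitation term)
    chosen = []

    for t in range(1, horizon + 1):
        if t <= len(bands):
            b = bands[t - 1]          # sense every band once to initialize
        else:
            # Index = sample mean (exploitation) + confidence term (exploration).
            b = max(bands, key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]))
        r = sense(b)
        counts[b] += 1
        means[b] += (r - means[b]) / counts[b]   # running-mean update
        chosen.append(b)
    return chosen
```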
This paper proposes using the uncertainty of information (UoI), measured by Shannon's entropy, as a metric for information freshness. We consider a system in which a central monitor observes multiple binary Markov processes through a communication channel. The UoI of a Markov process corresponds to the monitor's uncertainty about its state. At each time step, only one Markov process can be selected to update its state to the monitor; hence there is a tradeoff among the UoIs of the processes that depends on the scheduling policy used to select the process to be updated. The age of information (AoI) of a process corresponds to the time since its last update. In general, the associated UoI can be a non-increasing function, or even an oscillating function, of its AoI, making the scheduling problem particularly challenging. This paper investigates scheduling policies that aim to minimize the average sum-UoI of the processes over the infinite time horizon. We formulate the problem as a restless multi-armed bandit (RMAB) problem and, after proving its indexability, develop a Whittle index policy that is near-optimal for the RMAB. We further provide an iterative algorithm to compute the Whittle index for the practical deployment of the policy. Although this paper focuses on UoI scheduling, our results apply to a general class of RMABs for which the UoI scheduling problem is a special case. Specifically, this paper's Whittle index policy is valid for any RMAB in which the bandits are binary Markov processes and the penalty is a concave function of the belief state of the Markov process. Numerical results demonstrate the excellent performance of the Whittle index policy for this class of RMABs.
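As a rough illustration of scheduling binary Markov processes by their belief states, the sketch below updates each process's belief and greedily selects the process whose update most reduces an entropy penalty. The myopic rule, the `chains` transition-probability tuples, and the function names are assumptions standing in for the paper's Whittle index policy and its iterative index computation.

```python
import math

def entropy(b):
    """Binary entropy of a belief b = P(state = 1): a UoI-style penalty."""
    if b in (0.0, 1.0):
        return 0.0
    return -(b * math.log2(b) + (1 - b) * math.log2(1 - b))

def evolve(b, p01, p11):
    """One-step belief update of a binary Markov chain when no update arrives."""
    return (1 - b) * p01 + b * p11

def myopic_schedule(beliefs, chains):
    """Select the process whose update most reduces the next-step penalty.

    This myopic rule is only an illustrative stand-in for the Whittle index
    policy; the paper computes the true index with an iterative algorithm.
    chains[i] = (p01, p11) are hypothetical transition probabilities.
    """
    def gain(i):
        b = beliefs[i]
        p01, p11 = chains[i]
        # Expected next-step penalty if process i is observed now:
        expected_if_updated = b * entropy(p11) + (1 - b) * entropy(p01)
        # Penalty next step if process i is left unobserved:
        return entropy(evolve(b, p01, p11)) - expected_if_updated
    return max(range(len(beliefs)), key=gain)
```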
By exploiting the computing power and local data of distributed clients, federated learning (FL) offers desirable properties such as reduced communication overhead and preserved data privacy. In each communication round of FL, the clients update local models based on their own data and upload their local updates via wireless channels. However, the latency caused by hundreds to thousands of communication rounds remains a bottleneck in FL. To minimize the training latency, this work provides a multi-armed bandit-based framework for online client scheduling (CS) in FL without knowing the wireless channel state information and statistical characteristics of the clients. First, we propose a CS algorithm based on the upper confidence bound policy (CS-UCB) for ideal scenarios in which the local datasets of the clients are independent and identically distributed (i.i.d.) and balanced. An upper bound on the expected performance regret of the proposed CS-UCB algorithm is provided, which indicates that the regret grows logarithmically over communication rounds. Then, to address non-ideal scenarios with non-i.i.d. and unbalanced local datasets and varying availability of clients, we further propose a CS algorithm based on the UCB policy and virtual queue technique (CS-UCB-Q). An upper bound is also derived, which shows that the expected performance regret of the proposed CS-UCB-Q algorithm can grow sub-linearly over communication rounds under certain conditions. In addition, the convergence performance of FL training is analyzed. Finally, simulation results validate the efficiency of the proposed algorithms.
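A minimal sketch of UCB-style client selection in the ideal i.i.d. setting is given below. The plain UCB1 index, the per-round "utility" abstraction, and the helper names are assumptions; the paper's CS-UCB and CS-UCB-Q policies use their own index definitions and additionally handle fairness and client availability.

```python
import math

def cs_ucb_select(t, counts, means, num_select):
    """One round of a UCB-style client-selection rule (illustrative sketch).

    counts[k] and means[k] hold how often client k has been scheduled and its
    sample-mean per-round utility (e.g., an observed transmission rate). The
    index below is plain UCB1, not the paper's exact CS-UCB definition.
    """
    def index(k):
        if counts[k] == 0:
            return float("inf")                      # schedule unseen clients first
        return means[k] + math.sqrt(2.0 * math.log(t) / counts[k])

    # Schedule the num_select clients with the largest indices this round.
    return sorted(range(len(counts)), key=index, reverse=True)[:num_select]

def cs_ucb_update(counts, means, client, utility):
    """Update a scheduled client's statistics with its observed utility."""
    counts[client] += 1
    means[client] += (utility - means[client]) / counts[client]
```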
The multi-armed bandit (MAB) is a classical online optimization model for the trade-off between exploration and exploitation. The traditional MAB is concerned with finding the arm that minimizes the mean cost. However, minimizing the mean does not take the risk of the problem into account, which matters to risk-averse decision makers. In this work, we introduce a coherent risk measure as the criterion to form a risk-averse MAB. In particular, we derive an index-based online sampling framework for the risk-averse MAB and develop it in detail for three specific risk measures, namely the conditional value-at-risk, the mean-deviation, and the shortfall risk measures. Under each risk measure, we establish the convergence rate of the upper bound on the pseudo regret, defined as the difference between the expected empirical risk of the observation sequence and the true risk of the optimal arm.
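The sketch below illustrates the conditional value-at-risk case only: each arm's index is an empirical CVaR of its observed costs minus a generic confidence width, and the arm with the smallest index is sampled. The Hoeffding-style width and the function names are assumptions rather than the paper's exact framework.

```python
import math

def empirical_cvar(costs, alpha):
    """Empirical conditional value-at-risk at level alpha: the average of the
    worst (1 - alpha) fraction of the observed costs (including the VaR)."""
    s = sorted(costs)
    k = int(math.ceil(alpha * len(s)))   # index of the empirical VaR
    tail = s[k - 1:]                     # VaR and everything above it
    return sum(tail) / len(tail)

def risk_averse_choice(t, arm_costs, alpha, c=1.0):
    """Pick the arm with the smallest optimistic (lower-confidence) CVaR index.

    arm_costs[i] is the list of costs observed from arm i so far. The width
    sqrt(c * log(t) / n) is a generic stand-in, not the paper's constants.
    """
    def index(i):
        n = len(arm_costs[i])
        if n == 0:
            return float("-inf")         # force initial exploration of unseen arms
        return empirical_cvar(arm_costs[i], alpha) - math.sqrt(c * math.log(t) / n)
    return min(range(len(arm_costs)), key=index)
```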
Restless Multi-Armed Bandits (RMABs) have been widely used to model limited-resource allocation problems and have recently been employed for health monitoring and intervention planning. However, existing approaches fail to account for the arrival of new patients and the departure of enrolled patients from a treatment program. To address this challenge, we formulate a streaming bandit (S-RMAB) framework, a generalization of RMABs in which heterogeneous arms arrive and leave under possibly random streams. We propose a new and scalable approach to computing index-based solutions. We start by proving that index values decrease for short residual lifetimes, a phenomenon we call index decay. We then provide algorithms designed to capture index decay without having to solve the costly finite-horizon problem, thereby lowering the computational complexity compared to existing methods. We evaluate our approach via simulations on real-world data from a tuberculosis intervention planning task as well as multiple synthetic domains. Our algorithms achieve a more than 150x speed-up over existing methods in these tasks without loss in performance, and these findings are robust across multiple domains.
We consider the problem of near-optimal arm identification in the fixed confidence setting of the infinitely armed bandit problem when nothing is known about the arm reservoir distribution. We (1) introduce a PAC-like framework within which to derive and cast results; (2) derive a sample complexity lower bound for near-optimal arm identification; (3) propose an algorithm that identifies a nearly-optimal arm with high probability and derive an upper bound on its sample complexity which is within a log factor of our lower bound; and (4) discuss whether our log^2(1/delta) dependence is inescapable for two-phase (select arms first, identify the best later) algorithms in the infinite setting. This work permits the application of bandit models to a broader class of problems where fewer assumptions hold.
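As a hedged illustration of the two-phase structure mentioned above (select arms first, identify the best later), the sketch below draws a finite set of arms from the reservoir and then pulls each a Hoeffding-determined number of times. The per-arm budget is the naive bound, not the paper's tighter sample complexity, and `draw_arm`/`pull` are hypothetical callbacks.

```python
import math

def two_phase_near_optimal_arm(draw_arm, pull, n_arms, eps, delta):
    """Two-phase sketch: sample arms from the reservoir, then identify a
    near-optimal one among them with uniform-confidence pulls.

    draw_arm() samples a fresh arm from the (unknown) reservoir; pull(a)
    returns a reward in [0, 1]. The per-arm budget below is the naive
    Hoeffding bound, not the paper's sample-complexity result.
    """
    # Phase 1: select a finite set of candidate arms from the reservoir.
    arms = [draw_arm() for _ in range(n_arms)]

    # Phase 2: pull each candidate enough times that all sample means are
    # within eps/2 of their true means with probability at least 1 - delta.
    pulls = int(math.ceil((2.0 / eps ** 2) * math.log(2 * n_arms / delta)))
    means = [sum(pull(a) for _ in range(pulls)) / pulls for a in arms]

    # Return the empirically best candidate, which is then eps-optimal
    # among the drawn arms with high probability.
    return arms[max(range(n_arms), key=lambda i: means[i])]
```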
Jan Oksanen, Visa Koivunen, H. Vincent Poor. (2012). "A Sensing Policy Based on Confidence Bounds and a Restless Multi-Armed Bandit Model".