No Arabic abstract
We study a decentralized channel allocation problem in an ad-hoc Internet of Things network underlaying on the spectrum licensed to a primary cellular network. In the considered network, the impoverished channel sensing/probing capability and computational resource on the IoT devices make them difficult to acquire the detailed Channel State Information (CSI) for the shared multiple channels. In practice, the unknown patterns of the primary users transmission activities and the time-varying CSI (e.g., due to small-scale fading or device mobility) also cause stochastic changes in the channel quality. Decentralized IoT links are thus expected to learn channel conditions online based on partial observations, while acquiring no information about the channels that they are not operating on. They also have to reach an efficient, collision-free solution of channel allocation with limited coordination. Our study maps this problem into a contextual multi-player, multi-armed bandit game, and proposes a purely decentralized, three-stage policy learning algorithm through trial-and-error. Theoretical analyses shows that the proposed scheme guarantees the IoT links to jointly converge to the social optimal channel allocation with a sub-linear (i.e., polylogarithmic) regret with respect to the operational time. Simulations demonstrate that it strikes a good balance between efficiency and network scalability when compared with the other state-of-the-art decentralized bandit algorithms.
Next-generation networks are expected to be ultra-dense with a very high peak rate but relatively lower expected traffic per user. For such scenario, existing central controller based resource allocation may incur substantial signaling (control communications) leading to a negative effect on the quality of service (e.g. drop calls), energy and spectrum efficiency. To overcome this problem, cognitive ad-hoc networks (CAHN) that share spectrum with other networks are being envisioned. They allow some users to identify and communicate in `free slots thereby reducing signaling load and allowing the higher number of users per base stations (dense networks). Such networks open up many interesting challenges such as resource identification, coordination, dynamic and context-aware adaptation for which Machine Learning and Artificial Intelligence framework offers novel solutions. In this paper, we discuss state-of-the-art multi-armed multi-player bandit based distributed learning algorithms that allow users to adapt to the environment and coordinate with other players/users. We also discuss various open research problems for feasible realization of CAHN and interesting applications in other domains such as energy harvesting, Internet of Things, and Smart grids.
In this paper, we propose a new multi-objective contextual multi-armed bandit (MAB) problem with two objectives, where one of the objectives dominates the other objective. Unlike single-objective MAB problems in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem, the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives and the distribution of the reward depends on the context that is provided to the learner at the beginning of each round. We call this problem contextual multi-armed bandit with a dominant objective (CMAB-DO). In CMAB-DO, the goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its total reward in the dominant objective. In this case, the optimal arm given a context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. First, we show that the optimal arm lies in the Pareto front. Then, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB), and define two performance measures: the 2-dimensional (2D) regret and the Pareto regret. We show that both the 2D regret and the Pareto regret of MOC-MAB are sublinear in the number of rounds. We also compare the performance of the proposed algorithm with other state-of-the-art methods in synthetic and real-world datasets. The proposed model and the algorithm have a wide range of real-world applications that involve multiple and possibly conflicting objectives ranging from wireless communication to medical diagnosis and recommender systems.
Setting up the future Internet of Things (IoT) networks will require to support more and more communicating devices. We prove that intelligent devices in unlicensed bands can use Multi-Armed Bandit (MAB) learning algorithms to improve resource exploitation. We evaluate the performance of two classical MAB learning algorithms, UCB1 and Thompson Sampling, to handle the decentralized decision-making of Spectrum Access, applied to IoT networks; as well as learning performance with a growing number of intelligent end-devices. We show that using learning algorithms does help to fit more devices in such networks, even when all end-devices are intelligent and are dynamically changing channel. In the studied scenario, stochastic MAB learning provides a up to 16% gain in term of successful transmission probabilities, and has near optimal performance even in non-stationary and non-i.i.d. settings with a majority of intelligent devices.
We introduce a framework for decentralized online learning for multi-armed bandits (MAB) with multiple cooperative players. The reward obtained by the players in each round depends on the actions taken by all the players. Its a team setting, and the objective is common. Information asymmetry is what makes the problem interesting and challenging. We consider three types of information asymmetry: action information asymmetry when the actions of the players cant be observed but the rewards received are common; reward information asymmetry when the actions of the other players are observable but rewards received are IID from the same distribution; and when we have both action and reward information asymmetry. For the first setting, we propose a UCB-inspired algorithm that achieves $O(log T)$ regret whether the rewards are IID or Markovian. For the second section, we offer an environment such that the algorithm given for the first setting gives linear regret. For the third setting, we show that a variation of the `explore then commit algorithm achieves almost log regret.
We consider a fully decentralized multi-player stochastic multi-armed bandit setting where the players cannot communicate with each other and can observe only their own actions and rewards. The environment may appear differently to different players, $textit{i.e.}$, the reward distributions for a given arm are heterogeneous across players. In the case of a collision (when more than one player plays the same arm), we allow for the colliding players to receive non-zero rewards. The time-horizon $T$ for which the arms are played is emph{not} known to the players. Within this setup, where the number of players is allowed to be greater than the number of arms, we present a policy that achieves near order-optimal expected regret of order $O(log^{1 + delta} T)$ for some $0 < delta < 1$ over a time-horizon of duration $T$. This paper is currently under review at IEEE Transactions on Information Theory.