Setting up future Internet of Things (IoT) networks will require supporting more and more communicating devices. We show that intelligent devices in unlicensed bands can use Multi-Armed Bandit (MAB) learning algorithms to improve resource exploitation. We evaluate the performance of two classical MAB learning algorithms, UCB1 and Thompson Sampling, for handling the decentralized decision-making of spectrum access in IoT networks, and study how learning performance evolves as the number of intelligent end-devices grows. We show that using learning algorithms does help fit more devices into such networks, even when all end-devices are intelligent and dynamically change channels. In the studied scenario, stochastic MAB learning provides up to a 16% gain in terms of successful transmission probability, and achieves near-optimal performance even in non-stationary and non-i.i.d. settings with a majority of intelligent devices.
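To make the two policies concrete, here is a minimal Python sketch of UCB1 and Thompson Sampling for channel selection, assuming each transmission returns a Bernoulli acknowledgment (1 on success, 0 on collision/failure); the class and variable names are illustrative, not taken from the paper.

```python
import math
import random

class UCB1:
    """UCB1 channel selection: each channel is an arm; the reward is
    1 for an acknowledged transmission and 0 for a collision/failure."""

    def __init__(self, n_channels):
        self.counts = [0] * n_channels     # transmissions per channel
        self.successes = [0] * n_channels  # acknowledged transmissions
        self.t = 0

    def select(self):
        self.t += 1
        for ch, n in enumerate(self.counts):
            if n == 0:                     # try every channel once first
                return ch
        # UCB1 index: empirical mean + exploration bonus
        return max(
            range(len(self.counts)),
            key=lambda ch: self.successes[ch] / self.counts[ch]
            + math.sqrt(2 * math.log(self.t) / self.counts[ch]),
        )

    def update(self, ch, acked):
        self.counts[ch] += 1
        self.successes[ch] += int(acked)


class ThompsonSampling:
    """Thompson Sampling with a Beta(1, 1) prior on each channel's
    success probability: sample from each posterior, play the argmax."""

    def __init__(self, n_channels):
        self.alpha = [1] * n_channels      # 1 + successes
        self.beta = [1] * n_channels       # 1 + failures

    def select(self):
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))

    def update(self, ch, acked):
        if acked:
            self.alpha[ch] += 1
        else:
            self.beta[ch] += 1
```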
Next-generation networks are expected to be ultra-dense, with a very high peak rate but relatively low expected traffic per user. In such scenarios, existing central-controller-based resource allocation may incur substantial signaling (control communications), degrading quality of service (e.g., dropped calls) as well as energy and spectrum efficiency. To overcome this problem, cognitive ad-hoc networks (CAHNs) that share spectrum with other networks are being envisioned. They allow some users to identify and communicate in "free slots", thereby reducing the signaling load and allowing a higher number of users per base station (dense networks). Such networks open up many interesting challenges, such as resource identification, coordination, and dynamic and context-aware adaptation, for which Machine Learning and Artificial Intelligence frameworks offer novel solutions. In this paper, we discuss state-of-the-art multi-player multi-armed bandit based distributed learning algorithms that allow users to adapt to the environment and coordinate with other players/users. We also discuss various open research problems for the feasible realization of CAHNs, as well as interesting applications in other domains such as energy harvesting, the Internet of Things, and smart grids.
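As one concrete flavor of the distributed policies surveyed here, the following is a minimal sketch in the spirit of rank-based multi-player bandit algorithms (e.g., rho-rand-style), assuming each user can detect collisions and knows an upper bound on the number of users; it is a hedged illustration, not a specific algorithm from the paper.

```python
import math
import random

class RankBasedPlayer:
    """One user's policy in a rank-based multi-player bandit scheme:
    the user targets the k-th best channel according to its own UCB1
    indices (k = its current rank) and re-draws a random rank upon
    collision. Assumes n_users <= n_channels."""

    def __init__(self, n_channels, n_users):
        self.counts = [0] * n_channels
        self.reward_sums = [0.0] * n_channels
        self.t = 0
        self.n_users = n_users
        self.rank = random.randrange(n_users)

    def select(self):
        self.t += 1
        idx = [float("inf") if n == 0
               else s / n + math.sqrt(2 * math.log(self.t) / n)
               for s, n in zip(self.reward_sums, self.counts)]
        order = sorted(range(len(idx)), key=lambda ch: -idx[ch])
        return order[self.rank]            # the rank-th best channel

    def update(self, ch, reward, collided):
        self.counts[ch] += 1
        self.reward_sums[ch] += reward
        if collided:                       # another user chose ch too
            self.rank = random.randrange(self.n_users)
```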
This paper investigates learning-based caching in small-cell networks (SCNs) when user preferences are unknown. The goal is to optimize the cache placement at each small base station (SBS) so as to minimize the system's long-term transmission delay. We model this sequential multi-agent decision-making problem from a multi-agent multi-armed bandit (MAMAB) perspective. Rather than estimating user preferences first and then optimizing the cache strategy, we propose several MAMAB-based algorithms that directly learn the cache strategy online, in both stationary and non-stationary environments. In the stationary environment, we first propose two high-complexity, agent-based collaborative MAMAB algorithms with performance guarantees. We then propose a low-complexity distributed MAMAB algorithm that ignores SBS coordination. To achieve a better balance between the SBS coordination gain and computational complexity, we develop an edge-based collaborative MAMAB algorithm with a coordination-graph edge-based reward assignment method. In the non-stationary environment, we modify the MAMAB-based algorithms proposed for the stationary environment by introducing a practical initialization method and designing new perturbation terms to adapt to the dynamic environment. Simulation results validate the effectiveness of the proposed algorithms, and the effects of different parameters on caching performance are also discussed.
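As a rough illustration of the lowest-complexity variant, the sketch below has a single SBS learn its cache placement with a combinatorial-UCB-style rule, treating each file as an arm and caching the files with the highest indices. The reward signal (per-file delay reduction) and all names are illustrative assumptions, not the paper's exact algorithm.

```python
import math

class DistributedCacheUCB:
    """One SBS learns its cache placement independently: each file is
    an arm, and the SBS caches the cache_size files with the highest
    UCB indices of the observed per-file delay reduction."""

    def __init__(self, n_files, cache_size):
        self.n_files = n_files
        self.cache_size = cache_size
        self.counts = [0] * n_files
        self.gain_sums = [0.0] * n_files   # cumulative delay reduction
        self.t = 0

    def choose_cache(self):
        self.t += 1

        def index(f):
            if self.counts[f] == 0:
                return float("inf")        # cache every file at least once
            return (self.gain_sums[f] / self.counts[f]
                    + math.sqrt(2 * math.log(self.t) / self.counts[f]))

        ranked = sorted(range(self.n_files), key=index, reverse=True)
        return ranked[:self.cache_size]

    def update(self, cached_files, delay_reduction):
        # delay_reduction maps each cached file to its observed gain.
        for f in cached_files:
            self.counts[f] += 1
            self.gain_sums[f] += delay_reduction[f]
```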
We study a decentralized channel allocation problem in an ad-hoc Internet of Things (IoT) network underlaid on spectrum licensed to a primary cellular network. In the considered network, the limited channel sensing/probing capability and computational resources of the IoT devices make it difficult for them to acquire detailed Channel State Information (CSI) for the shared channels. In practice, the unknown patterns of the primary users' transmission activities and the time-varying CSI (e.g., due to small-scale fading or device mobility) also cause stochastic changes in channel quality. Decentralized IoT links are thus expected to learn channel conditions online from partial observations, while acquiring no information about the channels they are not operating on. They also have to reach an efficient, collision-free channel allocation with limited coordination. Our study maps this problem onto a contextual multi-player, multi-armed bandit game and proposes a purely decentralized, three-stage policy learning algorithm based on trial-and-error. Theoretical analysis shows that the proposed scheme guarantees that the IoT links jointly converge to the socially optimal channel allocation with sub-linear (i.e., polylogarithmic) regret with respect to the operational time. Simulations demonstrate that it strikes a good balance between efficiency and network scalability when compared with other state-of-the-art decentralized bandit algorithms.
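The paper's three-stage algorithm is more involved, but the core trial-and-error mechanism can be sketched as a simple mood-based state machine per IoT link (in the spirit of Young-style trial-and-error learning). Utilities are assumed normalized to [0, 1], and all names and the acceptance rule below are illustrative, not the paper's exact design.

```python
import random

class TrialAndErrorLink:
    """Mood-based trial-and-error state machine for one IoT link: a
    content link mostly replays its benchmark channel; a discontent
    link experiments at random and accepts a new benchmark with a
    probability that increases with the received utility."""

    def __init__(self, n_channels, epsilon=0.05):
        self.n_channels = n_channels
        self.epsilon = epsilon             # experimentation rate
        self.content = False
        self.bench_channel = random.randrange(n_channels)
        self.bench_utility = 0.0

    def select(self):
        if self.content and random.random() > self.epsilon:
            return self.bench_channel      # exploit the benchmark
        return random.randrange(self.n_channels)  # experiment

    def update(self, channel, utility):
        # utility is assumed normalized to [0, 1].
        if not self.content:
            # Accept probability epsilon**(1 - u) grows with utility u.
            if random.random() < self.epsilon ** (1.0 - utility):
                self.content = True
                self.bench_channel, self.bench_utility = channel, utility
        elif channel == self.bench_channel and utility < self.bench_utility:
            self.content = False           # benchmark degraded, e.g. collision
        elif channel != self.bench_channel and utility > self.bench_utility:
            self.bench_channel, self.bench_utility = channel, utility
```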
Assistive multi-armed bandit problems can be used to model team situations between a human and an autonomous system such as a domestic service robot. To account for human biases such as the risk aversion described by Cumulative Prospect Theory, the setting is extended to use observable rewards. When robots leverage knowledge of the risk-averse human model, they can eliminate the bias and make more rational choices. We present an algorithm that increases the utility value of such human-robot teams. A brief evaluation indicates that arbitrary reward functions can be handled.
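To make the bias concrete, here is a simplified (gains-only, non-cumulative) CPT sketch: the human's perceived value of a gamble distorts both probabilities and utilities, so a risk-averse human may prefer a safe arm with lower expected value, while a robot that models the bias can still identify the rational choice. The parameter values are the classic Tversky-Kahneman estimates, used here only for illustration.

```python
def cpt_weight(p, gamma=0.61):
    """Tversky-Kahneman probability weighting: small probabilities are
    overweighted and moderate-to-large ones underweighted."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

def perceived_value(gamble, alpha=0.88):
    """Human's CPT-style value of a gamble over non-negative outcomes:
    concave utility x**alpha combined with distorted probabilities."""
    return sum(cpt_weight(p) * outcome ** alpha for outcome, p in gamble)

def expected_value(gamble):
    """Undistorted value the robot uses once the bias is modeled out."""
    return sum(p * outcome for outcome, p in gamble)

risky = [(1.0, 0.5), (0.0, 0.5)]   # arm paying 1 with probability 0.5
safe = [(0.4, 1.0)]                # arm paying 0.4 for sure

# The biased human prefers the safe arm (roughly 0.45 > 0.42), while
# the robot's rational comparison favors the risky arm (0.5 > 0.4).
print(perceived_value(risky), perceived_value(safe))
print(expected_value(risky), expected_value(safe))
```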
By exploiting the computing power and local data of distributed clients, federated learning (FL) offers attractive properties such as reduced communication overhead and data-privacy preservation. In each communication round of FL, the clients update their local models based on their own data and upload the local updates via wireless channels. However, the latency accumulated over hundreds to thousands of communication rounds remains a bottleneck in FL. To minimize the training latency, this work provides a multi-armed bandit based framework for online client scheduling (CS) in FL, without requiring wireless channel state information or statistical characteristics of the clients. First, we propose a CS algorithm based on the upper confidence bound policy (CS-UCB) for the ideal scenario where the local datasets of clients are independent and identically distributed (i.i.d.) and balanced. An upper bound on the expected performance regret of the proposed CS-UCB algorithm is provided, which indicates that the regret grows logarithmically over communication rounds. Then, to address non-ideal scenarios with non-i.i.d. and unbalanced local datasets and varying availability of clients, we further propose a CS algorithm based on the UCB policy and the virtual queue technique (CS-UCB-Q). An upper bound is also derived, which shows that the expected performance regret of the proposed CS-UCB-Q algorithm exhibits sub-linear growth over communication rounds under certain conditions. In addition, the convergence of FL training is analyzed. Finally, simulation results validate the efficiency of the proposed algorithms.
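A minimal sketch of UCB-based client scheduling in the spirit of CS-UCB is given below, assuming the per-round reward of a client is a normalized speed measure in [0, 1] (e.g., the fastest client's round time divided by this client's). The exploration constant and all names are illustrative rather than the paper's exact formulation.

```python
import math

class CSUCBScheduler:
    """UCB-style client scheduling for FL: each client is an arm, the
    reward is a normalized speed measure in [0, 1], and the server
    schedules the m clients with the highest UCB indices per round."""

    def __init__(self, n_clients, m):
        self.n_clients = n_clients
        self.m = m                         # clients scheduled per round
        self.counts = [0] * n_clients
        self.mean_reward = [0.0] * n_clients
        self.round = 0

    def schedule(self):
        self.round += 1

        def index(c):
            if self.counts[c] == 0:
                return float("inf")        # schedule every client once
            bonus = math.sqrt(2 * math.log(self.round) / self.counts[c])
            return self.mean_reward[c] + bonus

        return sorted(range(self.n_clients), key=index, reverse=True)[:self.m]

    def update(self, client, reward):
        # reward, e.g., (fastest round time) / (this client's round time)
        self.counts[client] += 1
        self.mean_reward[client] += (
            (reward - self.mean_reward[client]) / self.counts[client])
```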